The search for a very large number of documents does not work

August9299 · March 28, 2024, 4:45pm

Hello everybody! I’m new to Weaviate and recently started using it through Verba.

I installed it like this:

git clone https://github.com/weaviate/Verba
cd Verba
pip install -e .

I put my OpenAI key in a .env file, so Verba uses OpenAI.
I launched it successfully and started uploading documents relevant to my work. Each document is a .txt file, but its content is JSON describing the objects we want to search, like this:

{
  "serial-number": "123456",
  "company-name": "Some name",
  "owners": [
    "owner 1",
    "owner 2",
    "owner 3"
  ],
  "statements": [
    {
      "code": "123",
      "description": "some description"
    }
  ],
  "status": 123,
  "date": "2024-02-09"
}
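
For reference, one way to confirm that every .txt file really parses into this structure before uploading is a small loader like the one below. It is only a sketch: the `Record` and `Statement` classes and the `parse_record` helper are hypothetical names introduced for illustration, not part of Verba or Weaviate.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Statement:
    code: str
    description: str


@dataclass
class Record:
    # Field names are hypothetical Python-friendly versions of the JSON keys.
    serial_number: str
    company_name: str
    owners: list[str]
    statements: list[Statement]
    status: int
    record_date: date


def parse_record(text: str) -> Record:
    """Parse the JSON content of one .txt document into a Record.

    Raises KeyError or ValueError if a field is missing or malformed.
    """
    raw = json.loads(text)
    return Record(
        serial_number=raw["serial-number"],
        company_name=raw["company-name"],
        owners=list(raw["owners"]),
        statements=[Statement(s["code"], s["description"]) for s in raw["statements"]],
        status=int(raw["status"]),
        record_date=date.fromisoformat(raw["date"]),
    )
```

Running `parse_record` over each file before import would surface malformed documents early, independent of how the search behaves afterwards.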

After that, I tested Verba and saw that it easily finds records for my queries. For example, if I search for ‘a company “Some name”’, it shows me the document above.
The problem appeared after I uploaded a large number of documents (there are now ~700 thousand, and we plan to have ~4 million with the same structure).
With the same query, Verba no longer returns every existing document. Even when I can see a document on the Documents page, my search query does not find it.
I tried increasing QUERY_MAXIMUM_RESULTS to 1 million, but it didn’t help. I changed the line in Verba/goldenverba/verba_manager.py like this:
embedded_options=EmbeddedOptions(additional_env_vars={'QUERY_MAXIMUM_RESULTS': '1000000'})
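
The override above can be sketched as a small helper. Note that QUERY_MAXIMUM_RESULTS caps how many objects a query is allowed to return; as far as I understand, it does not influence how results are ranked, which may be why raising it changed nothing here. The function names `build_env_vars` and `make_embedded_options` are hypothetical; only `EmbeddedOptions` and its `additional_env_vars` argument come from the Weaviate Python client.

```python
def build_env_vars(max_results: int) -> dict[str, str]:
    """Build the extra environment variables for embedded Weaviate.

    Environment variable values must be strings, so the number is converted.
    """
    if max_results <= 0:
        raise ValueError("max_results must be positive")
    return {"QUERY_MAXIMUM_RESULTS": str(max_results)}


def make_embedded_options(max_results: int):
    """Return EmbeddedOptions carrying the override (needs weaviate-client)."""
    # Imported lazily so build_env_vars works without weaviate-client installed.
    from weaviate.embedded import EmbeddedOptions

    return EmbeddedOptions(additional_env_vars=build_env_vars(max_results))
```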

What can I do to overcome this problem?
Should I change my document structure or any of the program code?
Thank you very much in advance!

Server Setup Information

Weaviate Server Version: 1.21.1
Deployment Method: embedded
Multi Node? no
Client Language and Version: Python 3.10

Any additional Information

Please tell me if any more info is needed. I just followed the Verba guide (Verba/README.md in the weaviate/Verba GitHub repository).

verba_config.json was generated like this:
{
  "reader": "SimpleReader",
  "chunker": "TokenChunker",
  "embedder": "MiniLMEmbedder",
  "retriever": "WindowRetriever",
  "generator": "GPT4Generator"
}

DudaNogueira · April 1, 2024, 4:39pm

Hi! Welcome to our community!

Do you see the same results when you query Weaviate directly, bypassing Verba?
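
A rough sketch of such a direct check, using only the Python standard library against Weaviate's /v1/graphql endpoint, could look like this. The class name, property name, and base URL passed in are placeholders you would need to replace with the ones your instance actually uses (they are assumptions, not Verba's real schema), and `build_bm25_query` and `run_graphql` are hypothetical helper names.

```python
import json
import urllib.request


def build_bm25_query(class_name, search_text, properties, limit=5):
    """Build a GraphQL Get query that runs a BM25 keyword search."""
    props = " ".join(properties)
    # json.dumps produces a quoted, escaped string literal for the query text.
    return (
        "{ Get { %s(bm25: {query: %s}, limit: %d) { %s _additional { score } } } }"
        % (class_name, json.dumps(search_text), limit, props)
    )


def run_graphql(base_url, query):
    """POST the query to Weaviate's GraphQL endpoint and return the parsed JSON."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/graphql",
        data=json.dumps({"query": query}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, `run_graphql(base_url, build_bm25_query("YourChunkClass", "Some name", ["text"]))` would show whether the document is found by keyword search without Verba in between.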

Bear in mind that Verba is a demo project, not a product. It was built to show off what Weaviate can do.

For example, it runs Weaviate using the Embedded option, which is experimental and certainly not suited to this number of objects.

Thanks!
