Hello everybody! I'm new to Weaviate and only recently started using it, beginning with Verba.
I installed it as follows:
git clone https://github.com/weaviate/Verba
pip install -e .
I put my OpenAI key in a .env file, so Verba uses OpenAI.
I launched it successfully and started uploading the documents I need for my work. Each document is a .txt file, but its content is JSON like the following, describing the objects we want to be able to search:
{
  "serial-number": "123456",
  "company-name": "Some name",
  "owners": [
    "owner 1",
    "owner 2",
    "owner 3"
  ],
  "statements": [
    {
      "code": "123",
      "description": "some description"
    }
  ],
  "status": 123,
  "date": "2024-02-09"
}
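To make the structure concrete, here is a minimal sketch (not part of Verba) that parses one such file with Python's standard `json` module and summarizes the fields we care about. The `describe` helper and the inline `SAMPLE` string are only for illustration.

```python
import json

# Inline copy of the sample document shown above.
SAMPLE = """
{
  "serial-number": "123456",
  "company-name": "Some name",
  "owners": ["owner 1", "owner 2", "owner 3"],
  "statements": [{"code": "123", "description": "some description"}],
  "status": 123,
  "date": "2024-02-09"
}
"""


def describe(raw: str) -> str:
    """Return a one-line summary of a document in the format shown above."""
    doc = json.loads(raw)
    name = doc["company-name"]
    serial = doc["serial-number"]
    owners = ", ".join(doc["owners"])
    codes = ", ".join(s["code"] for s in doc["statements"])
    return f"{name} (serial {serial}): owners: {owners}; statement codes: {codes}"


print(describe(SAMPLE))
```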
After that, I tested Verba and saw that it easily finds records matching my queries. For example, if I search for 'a company "Some name"', it shows me the document above.
The problem appeared after I uploaded a large number of documents (there are now ~700 thousand, and our plan is to have ~4 million documents with the same structure).
With the same kind of query, Verba no longer returns every document that exists. Even when I can see a document on the Documents page, my search query does not find it.
I tried increasing QUERY_MAXIMUM_RESULTS to 1 million, but it didn't help. I changed the line in Verba/goldenverba/verba_manager.py (main branch of weaviate/Verba on GitHub) like this:
embedded_options=EmbeddedOptions(additional_env_vars={'QUERY_MAXIMUM_RESULTS': '1000000'})
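For context, this is roughly how that setting gets passed to embedded Weaviate. It is a simplified sketch rather than the actual Verba code: it assumes the v3 Python client (`weaviate.Client` with `embedded_options`), and the helper names `embedded_env_vars` and `make_client` are made up for illustration.

```python
def embedded_env_vars(max_results: int) -> dict[str, str]:
    """Build the extra environment variables for embedded Weaviate.

    Environment variable values must be strings, hence the str() conversion.
    """
    return {"QUERY_MAXIMUM_RESULTS": str(max_results)}


def make_client(max_results: int = 1_000_000):
    """Start embedded Weaviate with the raised limit (assumes weaviate-client v3)."""
    # Imported lazily so the helper above works without weaviate installed.
    import weaviate
    from weaviate.embedded import EmbeddedOptions

    return weaviate.Client(
        embedded_options=EmbeddedOptions(
            additional_env_vars=embedded_env_vars(max_results)
        )
    )
```

Calling `make_client()` would start the embedded instance with the same value I used above.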
Could you please advise what I can do to overcome this problem?
Should I change the structure of my documents or any of the program code?
Thank you very much in advance!
Server Setup Information
- Weaviate Server Version: 1.21.1
- Deployment Method: embedded
- Multi Node? no
- Client Language and Version: Python 3.10
Any additional Information
Please tell me if any other information is needed. I just followed the Verba guide (Verba/README.md on the main branch of weaviate/Verba on GitHub).
verba_config.json was generated like this:
{
  "reader": "SimpleReader",
  "chunker": "TokenChunker",
  "embedder": "MiniLMEmbedder",
  "retriever": "WindowRetriever",
  "generator": "GPT4Generator"
}