I’m currently dealing with Weaviate classes that are expected to contain over 100,000 objects, and I continuously add to and remove objects from these classes. Now, as I import objects into newly created classes, which again will accumulate more than 100,000 objects, the process seems inefficient. It takes upwards of a minute to import just a batch of 50 objects.
I have 24 GB RAM and 8 vCPUs allocated for Weaviate in ECS. Async indexing is enabled.
Given this context, I suspect that the significant delay is due to Weaviate rebuilding the HNSW index each time new objects (and their associated vectors) are imported, which is inherently costly.
With this in mind, I have a couple of questions:
- Is the HNSW indexing approach suboptimal for classes containing such a high volume of objects, where each object is represented by a 1024-dimensional vector?
- In terms of HNSW indexing, does it help Weaviate to know that there are 100,000 objects to index before it starts indexing, instead of feeding objects 50 at a time?
- I checked the mutable fields in the documentation, and vectorIndexConfig->skip is one of them. Does this mean I could set skip to true to avoid indexing during import, and then set it back to false afterwards? Would indexing start once it is set to false, while not slowing me down during uploads?
Any insights on managing large-scale imports more efficiently in Weaviate, especially where HNSW indexing is involved, would be greatly appreciated.
Hi @mnkasikci!
Great questions! Thanks!
I am assuming you are already running the latest Weaviate server version (1.23.8), and that you are using the Python client v4, which uses gRPC and greatly improves import performance.
The import process is very CPU-bound, so you may see the import rate drop over time when importing a lot of objects in a short period, as Weaviate needs to both receive new objects and index the ones already received.
The ASYNC INDEX is experimental. I have not played with it yet, but I plan on covering it soon in our recipes.
So you will need to find a balance here. Because there are a lot of differences in the data being indexed (size, properties, images, skip, etc.), there isn't a one-size-fits-all answer.
A good practice is to start with a small portion of the dataset, then increase the batch size and number of workers while monitoring resource consumption on both the client and the server.
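As a rough sketch of what that can look like with the Python client v4 (the "Document" collection name, the "title" property, and the shape of the input rows are placeholders I picked for illustration, not from this thread):

```python
from itertools import islice
from typing import Iterable, Iterator


def chunked(rows: Iterable[dict], size: int) -> Iterator[list]:
    """Yield lists of up to `size` rows, so a large dataset streams in pieces."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def import_rows(rows: Iterable[dict], batch_size: int = 200) -> None:
    """Batch-import rows that carry their own vectors, using the v4 client.

    Assumptions: a local Weaviate instance, a "Document" collection, and
    rows shaped like {"title": str, "vector": list[float]} (1024-dim).
    """
    import weaviate  # weaviate-client >= 4.x (gRPC-backed)

    client = weaviate.connect_to_local()
    try:
        docs = client.collections.get("Document")
        # fixed_size lets you dial the batch size while you watch server CPU;
        # start small and raise it gradually.
        with docs.batch.fixed_size(batch_size=batch_size) as batch:
            for chunk in chunked(rows, batch_size):
                for row in chunk:
                    batch.add_object(
                        properties={"title": row["title"]},
                        vector=row["vector"],  # your own 1024-dim vector
                    )
        if docs.batch.failed_objects:
            print(f"{len(docs.batch.failed_objects)} objects failed to import")
    finally:
        client.close()
```

The context manager flushes the batch on exit, and checking `failed_objects` afterwards is how you catch objects that were silently rejected.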
Regarding your questions:
- 100,000 objects isn't a high volume at all; Weaviate can go way beyond that. At 1024 float32 dimensions, the raw vectors are only about 0.4 GB (100,000 × 1024 × 4 bytes), so something around 2 GB of memory is probably enough to serve those 100,000 objects.
- Not sure it does, as the index is built based on those vectors and on the amount of them. Considering the ASYNC INDEX, it may be better to fill in as many objects as you can through a bigger batch.
- Note that there are two different `skip` settings. The `skip` under a property's `moduleConfig` only controls whether that property is used when generating the vectorization; if you are providing the vectors yourself (not sure that is your case), Weaviate will not vectorize that object for you anyway. The `skip` under `vectorIndexConfig` is the one you are asking about: it disables building the vector index for the class altogether.
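To make the toggle concrete, here is a sketch of the schema fragments involved. The "Document" class name is a placeholder, and whether flipping `skip` back to false re-indexes already-imported vectors is something worth verifying on a small dataset first:

```python
# Class config used while importing: vectorIndexConfig.skip = true means
# no vector index is built, so uploads are not slowed down by indexing.
schema_during_import = {
    "class": "Document",  # placeholder class name
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {"skip": True},
}

# After the upload finishes, send the mutated config back via a schema
# update (e.g. PUT /v1/schema/Document) with skip flipped off:
schema_after_import = {
    **schema_during_import,
    "vectorIndexConfig": {
        **schema_during_import["vectorIndexConfig"],
        "skip": False,
    },
}
```

Since the docs list `vectorIndexConfig.skip` among the mutable fields, the update itself should be accepted; the open question is only when indexing of the existing vectors kicks in.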
Let me know if this helps