Optimizing Imports between number of nodes & pods

Lakshya_Bakshi · October 13, 2023, 9:29pm

Hi everyone,

I’m working on a script to import roughly 200 million records and want it to run about 6 times faster. I have 4 Weaviate instances (one instance per node, each node with 4 threads). I only replicate my data once, and have 4 shards in total (1 shard per instance). I’m tuning the parallelism level of my data loading script; so far the best performance is at 12 threads. The best-performing batch size is about 5000 records per batch.
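
For readers who want to reproduce this kind of tuning, here is a minimal sketch of a loader that splits records into batches and sends them from a thread pool, so that batch size and thread count can be swept and timed. It is not the actual script from this thread: `send_batch` is a hypothetical placeholder for whatever Weaviate batch call you use, and `chunked` and `run_import` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def chunked(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def send_batch(batch):
    """Hypothetical placeholder: swap in the real Weaviate batch call here."""
    return len(batch)


def run_import(records, batch_size=5000, workers=12, send=send_batch):
    """Send all batches from a thread pool; return (records sent, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map submits every batch up front, so benchmark on a
        # sample of the data rather than the full 200 million records.
        total = sum(pool.map(send, chunked(records, batch_size)))
    return total, time.perf_counter() - start


if __name__ == "__main__":
    sample = range(100_000)  # stand-in for a sample of real records
    for workers in (4, 8, 12, 16):
        sent, secs = run_import(sample, batch_size=5000, workers=workers)
        print(f"workers={workers} sent={sent} seconds={secs:.4f}")
```

Running the same sweep over `batch_size` as well gives a small grid of timings to compare against the 12 threads / 5000 records setting mentioned above.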

To back this, I also have 1 GPU running 1 instance of the inference model.

If I want this to run faster, what are the obvious bottlenecks in this setup? I was surprised that when I bumped my number of Weaviate instances (nodes) to 8 and added 8 shards, performance worsened.

Is there a benefit to having more than 1 shard per node (as it relates to imports)?

DudaNogueira · October 16, 2023, 3:04pm

Hi @Lakshya_Bakshi

I assume you have seen this doc already:

There are some upcoming features that will help improve import time, like async batch and others.

I believe you pretty much covered the options here. Also consider that 200 million is a lot, and after importing that data Weaviate still needs to index it into a vector space, so there is a lot going on here.

Let me know if that helps.
