High Query latency in Weaviate

Description

Dimensions: 1536; number of objects: 500k.
We are observing high latency: 490 ms (p50), 640 ms (p90), 1.4 s (p99) at a throughput of 32 RPS. The pod has enough CPU and memory available during the test, but it is not fully utilized. These results are not even close to the benchmarks published by Weaviate. Any suggestions to reduce the latency?

Server Setup Information

  • Weaviate Server Version: 1.22.8
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: single node
  • Client Language and Version: python v3
  • Multitenancy?: No

Any additional Information

Hi @hanumanhuda!

Curious, is there a reason to use 1.22.8?

Have you tried the latest 1.26? There are A LOT of changes that may fix whatever is causing this.

Have you used our official helm chart?
Are there any limits in place for this cluster at k8s?
Have you changed any environment variables regarding resource planning?

Thanks!


No, we haven’t used the helm chart for the above tests, but we can try that with the latest version, 1.26. We haven’t changed any environment variables and are going with the default configuration. Is there any specific recommendation for this use case to make it faster?

Hi @hanumanhuda,

Building on @DudaNogueira’s recommendation, I wanted to share some insights that could help improve your setup:

  1. Running multiple nodes can lead to noticeable improvements, since queries are distributed across the nodes.
  2. With a cluster setup (minimum 3 nodes), you’ll be able to set a replication factor of 3. The default consistency level for this is Quorum, which should work well for most cases.

To boost performance, you can set the consistency level to ONE for queries. While this trades some consistency for speed, Weaviate handles consistency in the background via repairs, so you don’t need to worry.
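If it helps, here is a minimal sketch of what a consistency-level-ONE read looks like with the Python v4 client. The collection name "Docs", the query vector, and the local connection are illustrative assumptions, and this presumes replication is enabled on the collection:

```python
# Minimal sketch: reading with consistency level ONE (Python v4 client).
# "Docs", query_vec, and connect_to_local() are illustrative assumptions.

def quorum_size(replicas: int) -> int:
    """Replicas that must answer under the default QUORUM level: floor(n/2) + 1."""
    return replicas // 2 + 1

def query_with_one(query_vec, host: str = "localhost"):
    import weaviate
    from weaviate.classes.config import ConsistencyLevel

    client = weaviate.connect_to_local(host=host)
    try:
        docs = client.collections.get("Docs").with_consistency_level(
            consistency_level=ConsistencyLevel.ONE  # return after the first replica answers
        )
        return docs.query.near_vector(near_vector=query_vec, limit=5)
    finally:
        client.close()
```

With replication factor 3, QUORUM waits on quorum_size(3) == 2 replicas per read, while ONE returns as soon as any single replica responds.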

Here is a detailed technical explanation:

Let us know if you have any other questions or if there’s anything else we can help with!


Hi @hanumanhuda!

Given enough resources, 500k objects will run lightning fast.

In order to get better latencies, as mentioned by my friend and colleague Mohamad, moving to a newer version than the one you are using is important, as it leverages gRPC.

The version you are using only exposes REST/HTTP endpoints. gRPC will help tremendously here, as it is much faster.

On top of that, there are a lot of other improvements.

Here you can find more information on gRPC:

And here you can find all the information needed to run Weaviate using our official helm chart:

The best way to migrate, considering you did not use our helm chart in the first place, is to spin up a new Weaviate cluster with that chart and migrate your data over using this migration guide:
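As a rough illustration, the cursor-style copy that migration guides like this describe can be sketched with the v4 client as below. The collection name, batch size, and the "default" vector key are assumptions for the sketch, not a definitive implementation:

```python
# Sketch: cursor-style migration between two clusters (Python v4 client).
# Assumes the collection already exists on both sides with the same schema.

def migrate(source_client, target_client, name: str = "Docs"):
    source = source_client.collections.get(name)
    target = target_client.collections.get(name)

    with target.batch.fixed_size(batch_size=200) as batch:
        for obj in source.iterator(include_vector=True):
            batch.add_object(
                properties=obj.properties,
                vector=obj.vector["default"],  # keep vectors to avoid re-embedding
                uuid=obj.uuid,                 # keep IDs stable across clusters
            )
```

Keeping the original UUIDs and vectors means the target cluster ends up query-identical to the source without re-running any vectorizer.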

Let me know if this helps!


We tested the latest version 1.26.1 with 3 shards on a collection containing 500k objects, using the default settings for GOMEMLIMIT and GOMAXPROCS. The results were as follows:

  • Latency for 32 RPS:
    • P50: 300 ms
    • P90: 490 ms
    • P99: 740 ms
  • Latency for 2 RPS:
    • P50: 76 ms
    • P90: 87 ms
    • P99: 92 ms

We haven’t enabled replication since our primary goal is to reduce latency, not just increase availability. Although we are currently working with 500k objects, our target is to scale a single collection to 5 million objects.

One key observation during these tests was that none of the nodes fully utilized the available compute resources. As RPS increased, latency also increased, even though there was ample memory and compute available across the nodes. This raises the question: Is there a bottleneck preventing the utilization of compute resources across multiple queries?

hi @hanumanhuda !

I believe you should try tweaking those parameters (GOMEMLIMIT and GOMAXPROCS) according to your deployment.

Also, consider that replication will also give you better room for higher QPS: Use Cases (Motivation) | Weaviate

Other index settings you can tune for better QPS are ef, efConstruction, and maxConnections. More on those options here: Vector indexes | Weaviate
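For illustration, those HNSW knobs can be set at collection creation with the v4 client as sketched below. The numeric values are illustrative starting points, not recommendations; the dynamic_ef helper mirrors how Weaviate picks ef when it is left at -1 (dynamic), using the default dynamicEfMin/Max/Factor values:

```python
# Sketch: tuning HNSW parameters at collection creation (Python v4 client).
# All numeric values are illustrative, not tuned recommendations.

def dynamic_ef(limit: int, ef_min: int = 100, ef_max: int = 500, factor: int = 8) -> int:
    """Approximate Weaviate's dynamic ef: limit * factor, clamped to [ef_min, ef_max]."""
    return min(max(limit * factor, ef_min), ef_max)

def create_tuned_collection(client, name: str = "Docs"):
    from weaviate.classes.config import Configure

    return client.collections.create(
        name=name,
        vector_index_config=Configure.VectorIndex.hnsw(
            ef=128,               # search-time beam width: higher = better recall, slower queries
            ef_construction=256,  # build-time beam width: affects index quality and import time
            max_connections=32,   # graph degree: higher = better recall, more memory
        ),
    )
```

Raising ef is the usual first lever for recall-vs-latency trade-offs at query time, since it can be changed without rebuilding the index, while efConstruction and maxConnections are fixed at build time.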

Let me know if this helps.

Thanks!

We have tried replication factor 2 with consistency level ONE during reads, but the response time was considerably higher than with the replication factor 1 setup, which surprised us. Memory usage doubled while compute usage remained the same.

Any further update on this?

Did you get the same results with replication factor 3?

No, we didn’t try RF 3. If it isn’t improving with 2 replicas, it didn’t make sense for us to go to 3 replicas. We used consistency level ONE with the 2 replicas.

We are seeing interesting memory-usage behavior for 1.5M chunks (D: 1536): it uses 28 GB during queries and 15 GB otherwise. As per our calculations, it should use at most 12 GB.
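For reference, the raw float32 arithmetic behind such an estimate can be sketched as follows; the 2x multiplier for index and GC overhead is an assumption drawn from Weaviate's resource-planning guidance, not a measurement of this cluster:

```python
# Back-of-the-envelope memory estimate for raw float32 vectors.
# The 2x overhead multiplier is a rule-of-thumb assumption, not a measurement.

def raw_vector_gib(objects: int, dims: int, bytes_per_float: int = 4) -> float:
    """GiB needed just to hold the raw float32 vectors in memory."""
    return objects * dims * bytes_per_float / 2**30

raw = raw_vector_gib(1_500_000, 1536)  # ~8.6 GiB of raw vectors
with_overhead = 2 * raw                # ~17.2 GiB with HNSW graph and GC headroom
```

The gap between a raw-vector estimate and observed usage during queries is typically the HNSW graph, Go garbage-collector headroom, and per-query working memory, which is why GOMEMLIMIT matters here.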

Also, is there a guideline for how many shards to use for best performance, based on the number of chunks?

Hey,

I saw that you started with 1.22.6 + the Python v3 client. Did you update to the Python v4 client when you updated the Weaviate version?

Yes, we did try the v4 client (gRPC). Dense query latency improved by 40%, but sparse query performance worsened by 40% at P99. This was a surprise for us, since all we changed was moving from the v3 to the v4 client and enabling gRPC on the server. Is this expected with v4 clients?