Explosive growth (to ~10 s) of request latency when one cluster node fails

Description

We have a Weaviate cluster with 3 nodes.
Average request latency (CRUD / vector search with QUORUM consistency) is 10–70 ms.
When one node fails (or its pod restarts), e.g. "weaviate-2", latency increases to ~10 s for all requests directed to "weaviate-1" (visible in the log), while all requests directed to "weaviate-0" stay fast (10–70 ms). Regardless of which node is down, one of the remaining nodes is "slow" and the other is "fast". This holds both for single requests and under high load.
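
For illustration, a rough per-node timing check could look like this (hostnames, class name, and object id are placeholders; the calls assume the v3 Python client, so adjust if your client version differs):

import time
import weaviate
from weaviate.data.replication import ConsistencyLevel

# Placeholder per-pod hostnames (e.g. via a headless service); adjust to your setup.
NODES = ["weaviate-0.weaviate-headless", "weaviate-1.weaviate-headless", "weaviate-2.weaviate-headless"]
OBJECT_UUID = "00000000-0000-0000-0000-000000000000"  # id of any existing object

for host in NODES:
    client = weaviate.Client(f"http://{host}:8080")
    start = time.monotonic()
    client.data_object.get_by_id(
        OBJECT_UUID,
        class_name="Article",                       # placeholder class name
        consistency_level=ConsistencyLevel.QUORUM,  # same level we use in production
    )
    print(f"{host}: {(time.monotonic() - start) * 1000:.0f} ms")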

Server Setup Information

  • Weaviate Server Version: 1.25.6
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3 nodes (repl.factor=3)
  • Client Language and Version: Python 3, Python client v3
  • Multitenancy?: No

Any additional Information

"replicationConfig": {
  "factor": 3
},

env:
  - name: RAFT_JOIN
    value: weaviate-0,weaviate-1,weaviate-2
  - name: RAFT_BOOTSTRAP_EXPECT
    value: '3'

resources:
  limits:
    cpu: '50'
    memory: 500Gi
  requests:
    cpu: '50'
    memory: 500Gi

Hi @wvuser,

Thanks for the report. You are correct: this is definitely unexpected. If a QUORUM can still be achieved and you only send your requests to nodes that report ready, there is no reason for a delay, especially not such a massive one.

I’ll ping a few folks from our DB Core team to help narrow this down. They will probably ask you a few more detailed questions about the setup.

Is the problem you are describing reproducible? If so, a minimal reproducing example would be highly appreciated, as that can speed up a potential fix.

Best,
Etienne

Hello @wvuser,

This is José Luis, QA at Weaviate. Thanks a lot for reporting this issue. I did manage to reproduce it with the steps you provided.

The good news is that this issue has already been fixed by one of our core developers and is on its way into the next release, 1.25.12 (I managed to reproduce it in 1.25.11, but couldn't reproduce it on the release candidate for 1.25.12). So updating to the upcoming 1.25.12 (it will be released within the week) should get rid of the high latencies during pod restarts.

Once more, thanks for your feedback and for reporting the issue; it's very valuable to us and the whole community.

Regards,
José Luis


Hi!

Updating to v1.25.12 fixed the problem, thanks. But I have some comments/questions:
These env variables must be explicitly specified for the cluster's fault tolerance (1.25.6 works fine without them; different default values?):
DISABLE_LAZY_LOAD_SHARDS: 'true' (mandatory)
HNSW_STARTUP_WAIT_FOR_VECTOR_CACHE: 'true'
Otherwise the system does not respond to any requests (infinite read timeout?). All pods restart quickly in sequence (~1 min), report 'ready' to k8s, and begin loading shards/logs. But during that time (all shards in the 'hnsw_deserialization' state?) requests hang with no response.
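
For reference, a rough sketch of how one could hold traffic off a restarted pod until it actually answers and reports its shards as loaded (hostnames are placeholders; whether /v1/nodes?output=verbose includes a per-shard loaded flag depends on the Weaviate version):

import time
import requests

# Placeholder per-pod endpoints; adjust hostnames/port to your k8s setup.
NODES = ["weaviate-0.weaviate-headless", "weaviate-1.weaviate-headless", "weaviate-2.weaviate-headless"]

def node_ready(host: str) -> bool:
    # Ask the node for verbose cluster status. While it is still deserializing HNSW
    # it may simply not answer, so treat any request failure as "not ready yet".
    try:
        resp = requests.get(f"http://{host}:8080/v1/nodes", params={"output": "verbose"}, timeout=5)
        resp.raise_for_status()
    except requests.RequestException:
        return False
    for node in resp.json().get("nodes", []):
        if node.get("status") != "HEALTHY":
            return False
        for shard in node.get("shards") or []:
            if shard.get("loaded") is False:  # field may be absent on some versions
                return False
    return True

for host in NODES:
    while not node_ready(host):
        time.sleep(5)  # keep traffic away until shard/log loading has finished
print("all nodes answer and report their shards as loaded")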

Hello @Jose_Luis_Franco!

Is this fix available in version 1.26.1 (Jul 23)?
We have a problem with replica synchronization (deletes are not repaired when nodes fail/restart), and a manual 'async-read-all' does not help. We are thinking about upgrading to 1.26.