We have a Weaviate cluster with 3 nodes.
Average request latency (CRUD/vector search at QUORUM consistency) is 10-70 ms.
When one node fails (or a pod restarts), e.g. “weaviate-2”, request latency increases to ~10 s for all requests directed to “weaviate-1” (visible in the log), while all requests directed to “weaviate-0” remain fast (10-70 ms). This happens regardless of which node is down: one remaining node is “slow” and the other is “fast”. It holds both for single requests and under high load.
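For reference, the requests in question are vector searches (plus CRUD) issued at QUORUM consistency, roughly like this minimal sketch with the v4 Python client (the collection name, query vector, and connection details are placeholders, not our actual schema):

```python
# Minimal sketch of the kind of request we measure; names and the
# connection are placeholders, not our real setup.
import weaviate
from weaviate.classes.config import ConsistencyLevel

client = weaviate.connect_to_local()  # in our case the k8s service in front of the pods

try:
    articles = client.collections.get("Article").with_consistency_level(
        ConsistencyLevel.QUORUM
    )
    # Vector search ("vsearch"); CRUD goes through articles.data.*
    result = articles.query.near_vector(
        near_vector=[0.1] * 384,  # placeholder query vector
        limit=5,
    )
    for obj in result.objects:
        print(obj.uuid, obj.properties)
finally:
    client.close()
```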
Thanks for the report. You are correct, this is definitely unexpected. If a QUORUM can still be achieved and you only send your requests to nodes that report ready, there is no reason for a delay – especially not such a massive one.
I’ll ping a few folks from our DB Core team to help narrow this down. They will probably ask you a few more detailed questions about the setup.
Is the problem you are describing reproducible? If so, a minimal reproducing example would be highly appreciated, as it can speed up a potential fix.
This is José Luis, QA at Weaviate. Thanks a lot for reporting this issue. I did manage to reproduce it with the steps you provided.
The good news is that this issue has already been fixed by one of our core developers and it’s on its way into the next release, 1.25.12 (I managed to reproduce it in 1.25.11, but could not reproduce it on the release candidate for 1.25.12). So updating to the upcoming 1.25.12 (it will be released within the week) should get rid of the high latencies during pod restarts.
Once more, thanks for your feedback and for reporting the issue; it’s very valuable to us and to the whole community.
Updating to v1.25.12 fixed the problem, thanks. But we have some comments/questions:
These environment variables must be explicitly specified for the cluster’s fault tolerance (1.25.6 works fine without them; different default values?):
DISABLE_LAZY_LOAD_SHARDS: ‘true’ (mandatory)
HNSW_STARTUP_WAIT_FOR_VECTOR_CACHE: ‘true’
Otherwise the system does not respond to any requests (infinite read timeout?). All pods restart quickly in sequence (~1 min), report ‘ready’ to k8s, and begin loading shards/logs. But during this time (all in the ‘hnsw_deserialization’ state?) requests hang with no response.
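One way to watch the nodes and their shards while the pods come back is to poll the cluster nodes endpoint, roughly like this sketch with the v4 Python client (connection details are placeholders; adjust to your setup):

```python
# Sketch: poll node status while the pods restart (v4 Python client).
# Connection details are placeholders.
import time
import weaviate

client = weaviate.connect_to_local()

try:
    while True:
        nodes = client.cluster.nodes(output="verbose")
        for node in nodes:
            # With verbose output each node also lists its shards and stats.
            print(node.name, node.status, f"{len(node.shards)} shards")
        if all(node.status == "HEALTHY" for node in nodes):
            break
        time.sleep(5)
finally:
    client.close()
```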
Is this fix available in version 1.26.1 (Jul 23)?
We also have a problem with replica synchronization (deletes are not repaired when nodes fail or restart), and a manual ‘async-read-all’ does not help. We are considering an upgrade to 1.26.