Node Desync and Cluster Inconsistencies After OOM on Weaviate-0

Hello,

We are running a Kubernetes cluster with 3 nodes. During a stress test, in which 4 processes were performing dynamic batch imports against the weaviate-0 node, that node was OOM killed, which is not surprising. The main issue, however, arose during the node's recovery.
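For context, each import process looked roughly like the sketch below. It assumes the v4 Python client's dynamic batching; the hostname, collection name, and load_documents() helper are placeholders, and the exact API may differ for the client version we actually run.

import weaviate

# Sketch of one of the 4 import processes (hostnames are placeholders).
client = weaviate.connect_to_custom(
    http_host="weaviate-0.weaviate-headless",
    http_port=8080,
    http_secure=False,
    grpc_host="weaviate-0.weaviate-headless",
    grpc_port=50051,
    grpc_secure=False,
)

collection = client.collections.get("DocumentationLocalDemo")

# Dynamic batching lets the client size batches based on server feedback.
with collection.batch.dynamic() as batch:
    for doc in load_documents():  # hypothetical generator yielding property dicts
        batch.add_object(properties=doc)

client.close()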

A few hours after weaviate-0 came back online, we observed a discrepancy in the number of objects in the collection compared to the other two nodes (weaviate-1 and weaviate-2). Attempts to query the collection resulted in the following errors:

POST objects:

{
    "error": [
        {
            "message": "put object: import into index documentationlocaldemo: replicate insertion: shard=\"u2an3dWDzMUF\": broadcast: cannot reach enough replicas"
        }
    ]
}

READ objects:

{
    "error": [
        {
            "message": "cannot achieve consistency level \"QUORUM\": read error" 
        }
    ]
}

GET /v1/nodes on weaviate-0:

{
    "error": [
        {
            "message": "node: weaviate-2: unexpected status code 401 ()"
        }
    ]
}
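For reference, the failing writes and reads were issued roughly like the sketch below (v4 Python client; the collection name is inferred from the index name in the error message, the connection is a placeholder, and the import path for ConsistencyLevel may vary by client version).

import weaviate
from weaviate.classes.config import ConsistencyLevel

client = weaviate.connect_to_local()  # placeholder connection
collection = client.collections.get("DocumentationLocalDemo")

# Write that fails with "cannot reach enough replicas".
collection.data.insert(properties={"title": "example"})  # hypothetical property

# Read that fails with 'cannot achieve consistency level "QUORUM"'.
quorum = collection.with_consistency_level(consistency_level=ConsistencyLevel.QUORUM)
result = quorum.query.fetch_objects(limit=10)
print(len(result.objects))

client.close()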

It seems that when weaviate-0 came back online, it did not have permission to communicate with the rest of the Raft cluster. We attempted to restart weaviate-0, without success. However, after also restarting weaviate-1 and weaviate-2, the cluster appeared to stabilize.
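While the cluster was in this state, we checked node status from outside the pods with something like the sketch below. The per-pod URLs and the API key are placeholders for our setup (drop the Authorization header if anonymous access is enabled).

import requests

# Placeholder per-pod endpoints (e.g. via port-forward or the headless service).
NODE_URLS = [
    "http://weaviate-0.weaviate-headless:8080",
    "http://weaviate-1.weaviate-headless:8080",
    "http://weaviate-2.weaviate-headless:8080",
]
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder; omit if auth is disabled

for url in NODE_URLS:
    resp = requests.get(f"{url}/v1/nodes", headers=HEADERS, timeout=10)
    print(url, resp.status_code)
    if resp.ok:
        for node in resp.json().get("nodes", []):
            print("  ", node.get("name"), node.get("status"))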

Because of this weaviate-0 state, the collection metadata is now desynchronized across the nodes, and the objects imported while the cluster was unstable do not show up in the node check route: weaviate-1 and weaviate-2 each report 893 objects, while weaviate-0 reports only 709.
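This is roughly how the per-node object counts can be compared with the v4 Python client (a sketch; attribute names such as object_count are assumed and may differ for other client versions):

import weaviate

client = weaviate.connect_to_local()  # placeholder connection

# output="verbose" includes per-shard statistics, including object counts.
nodes = client.cluster.nodes(collection="DocumentationLocalDemo", output="verbose")

for node in nodes:
    total = sum(shard.object_count for shard in node.shards)
    print(f"{node.name}: status={node.status}, objects={total}")

client.close()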

Issue:

  • Node weaviate-0 was OOM killed during stress testing.
  • After coming back online, the node was unable to communicate with the cluster (401 from its peers).
  • Querying data resulted in replica/quorum errors, and the collection metadata became desynchronized.
  • Restarting the entire cluster temporarily resolved the communication issue, but the collection is still desynchronized.

Question:
How can we prevent the 401 issue when a node restarts, so that it re-joins the cluster properly and we avoid collection metadata desynchronization?

Server Setup Information

  • Weaviate Server Version: Weaviate 1.25.19
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3
  • Client Language and Version: Python 1.6.7
  • Multitenancy?: No

Any additional Information

Collection config

"replicationConfig": {
     "factor": 3,
      "objectDeletionConflictResolution": "PermanentDeletion"
},
"shardingConfig": {
       "actualCount": 1,
       "actualVirtualCount": 128,
       "desiredCount": 1,
       "desiredVirtualCount": 128,
       "function": "murmur3",
       "key": "_id",
       "strategy": "hash",
       "virtualPerPhysical": 128
}
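For completeness, this configuration corresponds roughly to a collection created like the sketch below (v4 Python client; property definitions are omitted, the connection and class name are placeholders, and the deletion-conflict setting is not shown because its client-side parameter name varies by version):

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # placeholder connection

# Roughly equivalent to the replicationConfig / shardingConfig shown above.
client.collections.create(
    name="DocumentationLocalDemo",
    replication_config=Configure.replication(factor=3),
    sharding_config=Configure.sharding(
        desired_count=1,
        virtual_per_physical=128,
    ),
)

client.close()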
