Node Desync and Cluster Inconsistencies After OOM on Weaviate-0

Hello,

We are running a Kubernetes cluster with 3 nodes. During a stress test, in which 4 processes were performing dynamic batch imports against the weaviate-0 node, that node was OOM killed, which is not surprising. The main issue, however, arose during the node's recovery.
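For context, each import process looked roughly like the sketch below. It assumes the v4 Python client's dynamic batching; the hostname, collection name, and load_documents() helper are placeholders, and the exact API may differ for the client version we actually run.

import weaviate

# Sketch of one of the 4 import processes (hostnames are placeholders).
client = weaviate.connect_to_custom(
    http_host="weaviate-0.weaviate-headless",
    http_port=8080,
    http_secure=False,
    grpc_host="weaviate-0.weaviate-headless",
    grpc_port=50051,
    grpc_secure=False,
)

collection = client.collections.get("DocumentationLocalDemo")

# Dynamic batching lets the client size batches based on server feedback.
with collection.batch.dynamic() as batch:
    for doc in load_documents():  # hypothetical generator yielding property dicts
        batch.add_object(properties=doc)

client.close()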

A few hours after weaviate-0 came back online, we observed a discrepancy in the number of objects in the collection compared to the other two nodes (weaviate-1 and weaviate-2). Attempts to query the collection resulted in the following errors:

POST objects:

{
    "error": [
        {
            "message": "put object: import into index documentationlocaldemo: replicate insertion: shard=\"u2an3dWDzMUF\": broadcast: cannot reach enough replicas"
        }
    ]
}

READ objects:

{
    "error": [
        {
            "message": "cannot achieve consistency level \"QUORUM\": read error" 
        }
    ]
}

GET /v1/nodes on weaviate-0:

{
    "error": [
        {
            "message": "node: weaviate-2: unexpected status code 401 ()"
        }
    ]
}
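For reference, the failing writes and reads were issued roughly like the sketch below (v4 Python client; the collection name is inferred from the index name in the error message, the connection is a placeholder, and the import path for ConsistencyLevel may vary by client version).

import weaviate
from weaviate.classes.config import ConsistencyLevel

client = weaviate.connect_to_local()  # placeholder connection
collection = client.collections.get("DocumentationLocalDemo")

# Write that fails with "cannot reach enough replicas".
collection.data.insert(properties={"title": "example"})  # hypothetical property

# Read that fails with 'cannot achieve consistency level "QUORUM"'.
quorum = collection.with_consistency_level(consistency_level=ConsistencyLevel.QUORUM)
result = quorum.query.fetch_objects(limit=10)
print(len(result.objects))

client.close()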

It seems that when weaviate-0 came back online, it did not have permission to communicate with the rest of the Raft cluster. We attempted to restart weaviate-0, without success. However, after also restarting weaviate-1 and weaviate-2, the cluster appeared to stabilize.
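While the cluster was in this state, we checked node status from outside the pods with something like the sketch below. The per-pod URLs and the API key are placeholders for our setup (drop the Authorization header if anonymous access is enabled).

import requests

# Placeholder per-pod endpoints (e.g. via port-forward or the headless service).
NODE_URLS = [
    "http://weaviate-0.weaviate-headless:8080",
    "http://weaviate-1.weaviate-headless:8080",
    "http://weaviate-2.weaviate-headless:8080",
]
HEADERS = {"Authorization": "Bearer <api-key>"}  # placeholder; omit if auth is disabled

for url in NODE_URLS:
    resp = requests.get(f"{url}/v1/nodes", headers=HEADERS, timeout=10)
    print(url, resp.status_code)
    if resp.ok:
        for node in resp.json().get("nodes", []):
            print("  ", node.get("name"), node.get("status"))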

Because of this weaviate-0 state, the collection metadata is now desynchronized across the nodes, and the objects imported while the cluster was unstable do not show up in the node check route: weaviate-1 and weaviate-2 each report 893 objects, while weaviate-0 reports only 709.
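This is roughly how the per-node object counts can be compared with the v4 Python client (a sketch; attribute names such as object_count are assumed and may differ for other client versions):

import weaviate

client = weaviate.connect_to_local()  # placeholder connection

# output="verbose" includes per-shard statistics, including object counts.
nodes = client.cluster.nodes(collection="DocumentationLocalDemo", output="verbose")

for node in nodes:
    total = sum(shard.object_count for shard in node.shards)
    print(f"{node.name}: status={node.status}, objects={total}")

client.close()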

Issue:

  • Node weaviate-0 was OOM killed during stress testing.
  • After coming back online, the node was unable to communicate with the cluster (401 from its peers).
  • Querying data resulted in replica/quorum errors, and the collection metadata became desynchronized.
  • Restarting the entire cluster temporarily resolved the communication issue, but the collection is still desynchronized.

Question:
How can we prevent the 401 issue when a node restarts, so that it re-joins the cluster properly and we avoid collection metadata desynchronization?

Server Setup Information

  • Weaviate Server Version: Weaviate 1.25.19
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3
  • Client Language and Version: Python 1.6.7
  • Multitenancy?: No

Any additional Information

Collection config

"replicationConfig": {
     "factor": 3,
      "objectDeletionConflictResolution": "PermanentDeletion"
},
"shardingConfig": {
       "actualCount": 1,
       "actualVirtualCount": 128,
       "desiredCount": 1,
       "desiredVirtualCount": 128,
       "function": "murmur3",
       "key": "_id",
       "strategy": "hash",
       "virtualPerPhysical": 128
}
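For completeness, this configuration corresponds roughly to a collection created like the sketch below (v4 Python client; property definitions are omitted, the connection and class name are placeholders, and the deletion-conflict setting is not shown because its client-side parameter name varies by version):

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # placeholder connection

# Roughly equivalent to the replicationConfig / shardingConfig shown above.
client.collections.create(
    name="DocumentationLocalDemo",
    replication_config=Configure.replication(factor=3),
    sharding_config=Configure.sharding(
        desired_count=1,
        virtual_per_physical=128,
    ),
)

client.close()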
