Hello,
We are running a Kubernetes cluster with 3 nodes. During a stress test in which 4 processes were performing dynamic batch imports on the weaviate-0 node, the node was OOM killed, which is not surprising. However, the main issue occurred during the node restoration process.
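For context, each of the 4 import processes was doing roughly the following. This is a minimal sketch assuming the v4 Python client; the connection call, collection name, and data source are placeholders for our setup:

import weaviate

# Sketch of one import process (v4 Python client assumed).
# Adjust the connection call to your Kubernetes service.
client = weaviate.connect_to_local()

try:
    collection = client.collections.get("DocumentationLocalDemo")  # assumed collection name

    # Dynamic batching: the client adjusts batch size based on server feedback.
    with collection.batch.dynamic() as batch:
        for record in load_records():  # hypothetical data source
            batch.add_object(properties=record)

    # Surface objects that failed to import, e.g. "cannot reach enough replicas".
    for failed in collection.batch.failed_objects:
        print(failed.message)
finally:
    client.close()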
A few hours after weaviate-0 came back online, we observed a discrepancy in the number of objects in the collection compared to the other two nodes (weaviate-1 and weaviate-2). Attempts to query the collection resulted in the following errors:
POST objects:
{
  "error": [
    {
      "message": "put object: import into index documentationlocaldemo: replicate insertion: shard=\"u2an3dWDzMUF\": broadcast: cannot reach enough replicas"
    }
  ]
}
READ objects:
{
  "error": [
    {
      "message": "cannot achieve consistency level \"QUORUM\": read error"
    }
  ]
}
/v1/nodes on weaviate-0:
{
  "error": [
    {
      "message": "node: weaviate-2: unexpected status code 401 ()"
    }
  ]
}
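For diagnosis, reads can be issued at consistency level ONE directly against each node, so that a single reachable replica is allowed to answer instead of requiring a quorum. A minimal sketch against the REST API; the service URLs, collection name, and object UUID are assumptions for our setup:

import requests

# Per-node service URLs inside the cluster (placeholders for our setup).
NODES = {
    "weaviate-0": "http://weaviate-0.weaviate-headless:8080",
    "weaviate-1": "http://weaviate-1.weaviate-headless:8080",
    "weaviate-2": "http://weaviate-2.weaviate-headless:8080",
}

OBJECT_UUID = "00000000-0000-0000-0000-000000000000"  # placeholder UUID

for name, url in NODES.items():
    # consistency_level=ONE lets any single reachable replica serve the read,
    # bypassing the QUORUM requirement that was failing above.
    resp = requests.get(
        f"{url}/v1/objects/DocumentationLocalDemo/{OBJECT_UUID}",
        params={"consistency_level": "ONE"},
        timeout=10,
    )
    print(name, resp.status_code)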
It seems that when weaviate-0 came back online, it did not have permission to communicate with the Raft cluster. We attempted to restart weaviate-0 without success. However, after restarting both weaviate-1 and weaviate-2, the cluster appeared to stabilize.
Because of this weaviate-0 state, the collection metadata is now desynchronized across the nodes: the objects imported during the unstable period do not show up on the check route for weaviate-0, which reports 709 objects while weaviate-1 and weaviate-2 both report 893.
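For reference, per-node object counts can be compared via the /v1/nodes endpoint with verbose output; a sketch, reusing the same placeholder service URLs as above:

import requests

NODES = {
    "weaviate-0": "http://weaviate-0.weaviate-headless:8080",
    "weaviate-1": "http://weaviate-1.weaviate-headless:8080",
    "weaviate-2": "http://weaviate-2.weaviate-headless:8080",
}

for name, url in NODES.items():
    # output=verbose includes per-shard statistics alongside the node totals.
    resp = requests.get(f"{url}/v1/nodes", params={"output": "verbose"}, timeout=10)
    resp.raise_for_status()
    for node in resp.json().get("nodes", []):
        stats = node.get("stats") or {}
        print(f"asked {name}: node={node.get('name')} objectCount={stats.get('objectCount')}")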
Issue:
- Node weaviate-0 was OOM killed during stress testing.
- Upon restoration, the node wasn't able to communicate with the cluster.
- Querying data results in replica or quorum errors and collection metadata desync.
- Restarting the entire cluster temporarily resolved the communication issue, but there is still a desync in the collection.
Question:
How can we prevent the 401 issue when restarting a node to ensure proper cluster re-join and avoid collection metadata desynchronization?
Server Setup Information
- Weaviate Server Version: Weaviate 1.25.19
- Deployment Method: k8s
- Multi Node? Number of Running Nodes: 3
- Client Language and Version: python 1.6.7
- Multitenancy?: No
Any additional Information
Collection config
"replicationConfig": {
"factor": 3,
"objectDeletionConflictResolution": "PermanentDeletion"
},
"shardingConfig": {
"actualCount": 1,
"actualVirtualCount": 128,
"desiredCount": 1,
"desiredVirtualCount": 128,
"function": "murmur3",
"key": "_id",
"strategy": "hash",
"virtualPerPhysical": 128
}
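For reference, a minimal sketch of how a collection with this replication factor can be created with the v4 Python client (the collection name is assumed; vectorizer and property settings are omitted, and sharding is left at its defaults):

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()
try:
    # Replication factor 3 matches the "replicationConfig" shown above.
    client.collections.create(
        name="DocumentationLocalDemo",
        replication_config=Configure.replication(factor=3),
    )
finally:
    client.close()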