Multi-Node Weaviate EKS Cluster - Raft consensus data corrupted

Description

I am running a multi-node Weaviate (v1.26.3) deployment on EKS with 4 nodes. After performing infrastructure changes, object queries no longer return data, though schemas and tenants are visible.

Steps performed before the issue:

  1. Scaled down the Weaviate StatefulSet to 0.
  2. Updated the EKS node group for Weaviate to use encrypted root volumes; nodes were replaced.
  3. Created snapshots of existing Weaviate data volumes in AWS, then created encrypted volumes. Deleted PVCs in the cluster and updated PVs to use the new encrypted volumes.
  4. Scaled up the StatefulSet to 4.
  5. Volumes attached correctly to pods.
  6. GET /schema and GET /schema/{class}/tenants return correct results.
  7. All object queries return no data (a rough curl reproduction is sketched after this list).
  8. Attempting to revert to old volumes did not resolve the issue.
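
For reference, this is roughly how the symptom reproduces from inside the cluster. The collection and tenant names (MyCollection, tenantA) are placeholders, and the service name/port may differ depending on your chart values:

# In one terminal: port-forward the Weaviate service
kubectl port-forward svc/weaviate 8080:80 -n multi-node-weaviate

# Schema and tenants come back as expected
curl -s http://localhost:8080/v1/schema | jq '.classes[].class'
curl -s http://localhost:8080/v1/schema/MyCollection/tenants | jq '.[].name'

# Object listings come back empty for every tenant
curl -s "http://localhost:8080/v1/objects?class=MyCollection&tenant=tenantA&limit=5" | jq '.objects'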

Server Setup Information

  • Weaviate Server Version: 1.26.3
  • Deployment Method: EKS
  • Multi Node? Number of Running Nodes: 4
  • Client Language and Version: python, weaviate-client 4.14.4
  • Multitenancy?: yes

Any additional Information

kubectl logs weaviate-0 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)
{"action":"raft","backoff time":10000000,"build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp 10.0.2.59:8300: connect: no route to host","level":"error","msg":"raft failed to heartbeat to","peer":"peer-ip","time":"2025-12-17T16:28:50Z"}
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp 10.0.2.59:8300: connect: no route to host","level":"error","msg":"raft failed to appendEntries to","peer":{"Suffrage":0,"ID":"weaviate-1","Address":"10.0.2.59:8300"},"time":"2025-12-17T16:28:53Z"}
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp peer-ip: connect: no route to host","level":"error","msg":"raft failed to make requestVote RPC","target":{"Suffrage":0,"ID":"weaviate-1","Address":"peer-ip"},"term":475,"time":"2025-12-17T16:28:53Z"}
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","level":"error","msg":"raft peer has newer term, stopping replication","peer":{"Suffrage":0,"ID":"weaviate-1","Address":"10.0.2.53:8300"},"time":"2025-12-17T16:28:53Z"}

kubectl logs weaviate-1 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"log not found","last-index":140008,"level":"warning","msg":"raft failed to get previous log","previous-index":140014,"time":"2025-12-17T16:28:56Z"}

kubectl logs weaviate-2 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)

kubectl logs weaviate-3 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"log not found","last-index":140008,"level":"warning","msg":"raft failed to get previous log","previous-index":140011,"time":"2025-12-17T16:28:44Z"}

Question / Request:

  • After scaling down, replacing nodes, and re-attaching volumes, the Raft state appears broken and object data is inaccessible.

  • Schema and tenants are still visible.

  • Is there a supported way to recover the object data from these existing PVs?

  • Should I rebuild the cluster from scratch using backup / export?

Any guidance on safe recovery of multi-node clusters after node replacement or volume migration would be appreciated.
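
If rebuilding via backup / export ends up being the answer, I assume the flow would look roughly like this through the backup API; the backend name (s3) and backup id are placeholders, and the corresponding backup module would need to be enabled on the cluster:

# Create a backup on the configured backend
curl -s -X POST http://localhost:8080/v1/backups/s3 \
  -H 'Content-Type: application/json' \
  -d '{"id": "pre-rebuild-backup"}'

# Poll until the backup reports SUCCESS
curl -s http://localhost:8080/v1/backups/s3/pre-rebuild-backup | jq '.status'

# On the rebuilt cluster, restore from the same backend
curl -s -X POST http://localhost:8080/v1/backups/s3/pre-rebuild-backup/restore \
  -H 'Content-Type: application/json' \
  -d '{}'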

Hi @jinx !!

1.26.3 is quite an old version, and it is even marked as broken in our releases.

So I believe your best bet here is to take a snapshot, upgrade to at least 1.26.latest (for instance v1.26.18), and restart.

You should then either migrate step by step up to 1.35.5 (1.27.latest → 1.28.latest → and so on) or copy your data over to a new cluster on the latest version.
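
For illustration, assuming you deployed with the official weaviate/weaviate Helm chart and a release named weaviate, each upgrade hop could look roughly like this (the image.tag values are placeholders; use the latest patch of each minor):

# Upgrade one minor version at a time and let the StatefulSet settle in between
helm repo update
helm upgrade weaviate weaviate/weaviate -n multi-node-weaviate --set image.tag=1.26.18
kubectl rollout status statefulset/weaviate -n multi-node-weaviate

# Then repeat the same two commands with the latest patch of 1.27, 1.28, and so on, up to your target version.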

Let me know if this helps!

Hi @DudaNogueira, thanks for the response.

I have a couple of follow-up questions:

  1. Upgrade cadence
    How frequently should we expect major or breaking releases that require planned upgrades? Is there a recommended upgrade policy for long-running production clusters?

  2. Raft metadata behavior
    Just to confirm my understanding: in scenarios such as infrastructure node replacement where node identities change, Raft metadata (including shard ownership) can become stale and continue to reference nodes that no longer exist. In this situation, the cluster can still load schema and tenants, but object queries return empty results because no active node considers itself responsible for the shards. Is that an accurate characterization of what’s happening?
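
If that reading is correct, I assume the mismatch would also show up in the verbose nodes output, with shards still attributed to node identities that no longer exist. Something along these lines is what I would check (again via a port-forward to any node):

# Per-node status plus the shards each node believes it owns
curl -s "http://localhost:8080/v1/nodes?output=verbose" | jq '.nodes[] | {name, status, shardCount: (.shards | length)}'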

Thanks again.