Multi-Node Weaviate EKS Cluster - Raft consensus data corrupted

Description

I am running a multi-node Weaviate (v1.26.3) deployment on EKS with 4 nodes. After performing infrastructure changes, object queries no longer return data, though schemas and tenants are visible.

Steps performed before the issue:

  1. Scaled down the Weaviate StatefulSet to 0.
  2. Updated the EKS node group for Weaviate to use encrypted root volumes; nodes were replaced.
  3. Created snapshots of existing Weaviate data volumes in AWS, then created encrypted volumes. Deleted PVCs in the cluster and updated PVs to use the new encrypted volumes.
  4. Scaled up the StatefulSet to 4.
  5. Volumes attached correctly to pods.
  6. GET /schema and GET /schema/{class}/tenants return correct results.
  7. All object queries return no data (a rough curl reproduction is sketched after this list).
  8. Attempting to revert to old volumes did not resolve the issue.
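
For reference, this is roughly how the symptom reproduces from inside the cluster. The collection and tenant names (MyCollection, tenantA) are placeholders, and the service name/port may differ depending on your chart values:

# In one terminal: port-forward the Weaviate service
kubectl port-forward svc/weaviate 8080:80 -n multi-node-weaviate

# Schema and tenants come back as expected
curl -s http://localhost:8080/v1/schema | jq '.classes[].class'
curl -s http://localhost:8080/v1/schema/MyCollection/tenants | jq '.[].name'

# Object listings come back empty for every tenant
curl -s "http://localhost:8080/v1/objects?class=MyCollection&tenant=tenantA&limit=5" | jq '.objects'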

Server Setup Information

  • Weaviate Server Version: 1.26.3
  • Deployment Method: EKS
  • Multi Node? Number of Running Nodes: 4
  • Client Language and Version: python, weaviate-client 4.14.4
  • Multitenancy?: yes

Any additional Information

kubectl logs weaviate-0 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)
{"action":"raft","backoff time":10000000,"build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp 10.0.2.59:8300: connect: no route to host","level":"error","msg":"raft failed to heartbeat to","peer":"peer-ip","time":"2025-12-17T16:28:50Z"}
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp 10.0.2.59:8300: connect: no route to host","level":"error","msg":"raft failed to appendEntries to","peer":{"Suffrage":0,"ID":"weaviate-1","Address":"10.0.2.59:8300"},"time":"2025-12-17T16:28:53Z"}
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp peer-ip: connect: no route to host","level":"error","msg":"raft failed to make requestVote RPC","target":{"Suffrage":0,"ID":"weaviate-1","Address":"peer-ip"},"term":475,"time":"2025-12-17T16:28:53Z"}
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","level":"error","msg":"raft peer has newer term, stopping replication","peer":{"Suffrage":0,"ID":"weaviate-1","Address":"10.0.2.53:8300"},"time":"2025-12-17T16:28:53Z"}

kubectl logs weaviate-1 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"log not found","last-index":140008,"level":"warning","msg":"raft failed to get previous log","previous-index":140014,"time":"2025-12-17T16:28:56Z"}

kubectl logs weaviate-2 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)

kubectl logs weaviate-3 -n multi-node-weaviate --tail=100 | grep -iE "corrupt|recover|error"
Defaulted container "weaviate" out of: weaviate, configure-sysctl (init)
{"action":"raft","build_git_commit":"git-id","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"log not found","last-index":140008,"level":"warning","msg":"raft failed to get previous log","previous-index":140011,"time":"2025-12-17T16:28:44Z"}

Question / Request:

  • After scaling down, replacing nodes, and re-attaching volumes, the Raft state appears broken and object data is inaccessible.

  • Schema and tenants are still visible.

  • Is there a supported way to recover the object data from these existing PVs?

  • Should I rebuild the cluster from scratch using backup / export?

Any guidance on safe recovery of multi-node clusters after node replacement or volume migration would be appreciated.
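
If rebuilding via backup / export ends up being the answer, I assume the flow would look roughly like this through the backup API; the backend name (s3) and backup id are placeholders, and the corresponding backup module would need to be enabled on the cluster:

# Create a backup on the configured backend
curl -s -X POST http://localhost:8080/v1/backups/s3 \
  -H 'Content-Type: application/json' \
  -d '{"id": "pre-rebuild-backup"}'

# Poll until the backup reports SUCCESS
curl -s http://localhost:8080/v1/backups/s3/pre-rebuild-backup | jq '.status'

# On the rebuilt cluster, restore from the same backend
curl -s -X POST http://localhost:8080/v1/backups/s3/pre-rebuild-backup/restore \
  -H 'Content-Type: application/json' \
  -d '{}'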

Hi @jinx !!

1.26.3 is quite an old version, and it is even marked as broken in our releases.

So I believe your best bet here is to take a snapshot, upgrade to at least 1.26.latest (for instance v1.26.18), and restart.

You should then either migrate step by step up to 1.35.5 (1.27.latest → 1.28.latest → and so on) or copy your data over to a new cluster on the latest version.
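
For illustration, assuming you deployed with the official weaviate/weaviate Helm chart and a release named weaviate, each upgrade hop could look roughly like this (the image.tag values are placeholders; use the latest patch of each minor):

# Upgrade one minor version at a time and let the StatefulSet settle in between
helm repo update
helm upgrade weaviate weaviate/weaviate -n multi-node-weaviate --set image.tag=1.26.18
kubectl rollout status statefulset/weaviate -n multi-node-weaviate

# Then repeat the same two commands with the latest patch of 1.27, 1.28, and so on, up to your target version.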

Let me know if this helps!

Hi @DudaNogueira, thanks for the response.

I have a couple of follow-up questions:

  1. Upgrade cadence
    How frequently should we expect major or breaking releases that require planned upgrades? Is there a recommended upgrade policy for long-running production clusters?

  2. Raft metadata behavior
    Just to confirm my understanding: in scenarios such as infrastructure node replacement where node identities change, Raft metadata (including shard ownership) can become stale and continue to reference nodes that no longer exist. In this situation, the cluster can still load schema and tenants, but object queries return empty results because no active node considers itself responsible for the shards. Is that an accurate characterization of what’s happening?
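
If that reading is correct, I assume the mismatch would also show up in the verbose nodes output, with shards still attributed to node identities that no longer exist. Something along these lines is what I would check (again via a port-forward to any node):

# Per-node status plus the shards each node believes it owns
curl -s "http://localhost:8080/v1/nodes?output=verbose" | jq '.nodes[] | {name, status, shardCount: (.shards | length)}'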

Thanks again.