Downtime in replicated two-node cluster when one node is restarting

Description

We recently switched our Weaviate database from a single node to a replicated two-node cluster. The idea is to achieve high availability, i.e. zero-downtime upgrades and configuration changes. Unfortunately, so far we always see downtime (~1 min) whenever one of the pods is restarted (e.g. when we delete it for testing).

Server Setup Information

  • Weaviate Server Version: 1.30.2
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 2
  • Multitenancy?: no

Any additional Information

Shortly after removing weaviate-1, the other pod, weaviate-0, is also marked as unready by Kubernetes. This leads to downtime of the whole service, because no pod is marked as ready anymore.

OpenShift reports the following problem (the default probe from the Helm chart is used: /v1/.well-known/ready):

Readiness probe failed: HTTP probe failed with statuscode: 503
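
To put a number on the downtime independently of the Kubernetes events, the same readiness endpoint can be polled from outside the pod. The following is only a minimal sketch: the URL in the usage note (a port-forward to localhost:8080) and the function names `probe`, `poll` and `longest_unready_window` are assumptions made for illustration, not part of Weaviate or the Helm chart.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=1.0):
    """Return the HTTP status code of a GET on url, or 0 if the request failed outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the node reports itself as not ready
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def longest_unready_window(samples):
    """samples: list of (timestamp_seconds, status_code) in time order.

    Returns the longest span, in seconds, from the first non-200 sample of an
    outage to the next 200 sample (or to the last sample if it never recovered).
    """
    longest = 0.0
    start = None
    for ts, status in samples:
        if status != 200:
            if start is None:
                start = ts
        elif start is not None:
            longest = max(longest, ts - start)
            start = None
    if start is not None and samples:
        longest = max(longest, samples[-1][0] - start)
    return longest


def poll(url, duration_s=120.0, interval_s=1.0):
    """Probe url every interval_s seconds for duration_s seconds and return the samples."""
    samples = []
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        samples.append((time.monotonic(), probe(url)))
        time.sleep(interval_s)
    return samples
```

Usage, assuming the HTTP port of weaviate-0 has been forwarded to localhost:8080 (a hypothetical setup): start `samples = poll("http://localhost:8080/v1/.well-known/ready")`, delete weaviate-1 while it runs, then call `longest_unready_window(samples)`.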

Here is the start of the log from weaviate-0 when the downtime started (I’ll post the rest in a separate message because the maximum message size was reached):

{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Initiating push/pull sync with: weaviate-1 10.128.16.144:7000","time":"2025-05-07T15:02:57Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","command":3,"level":"info","msg":"updating configuration","server-addr":"","server-id":"weaviate-1","servers":"[[{Suffrage:Voter ID:weaviate-0 Address:10.129.39.158:8300}]]","time":"2025-05-07T15:03:07Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","last-index":1311,"level":"info","msg":"removed peer, stopping replication","peer":"weaviate-1","time":"2025-05-07T15:03:07Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":"aborting pipeline replication","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.16.144:8300"},"time":"2025-05-07T15:03:07Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Failed UDP ping: weaviate-1 (timeout reached)","time":"2025-05-07T15:03:08Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":" memberlist: Suspect weaviate-1 has failed, no acks received","time":"2025-05-07T15:03:09Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Failed UDP ping: weaviate-1 (timeout reached)","time":"2025-05-07T15:03:09Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":" memberlist: Suspect weaviate-1 has failed, no acks received","time":"2025-05-07T15:03:11Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Failed UDP ping: weaviate-1 (timeout reached)","time":"2025-05-07T15:03:12Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":" memberlist: Marking weaviate-1 as failed, suspect timeout reached (0 peer confirmations)","time":"2025-05-07T15:03:13Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":" memberlist: Suspect weaviate-1 has failed, no acks received","time":"2025-05-07T15:03:15Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Stream connection from=10.128.34.52:33362","time":"2025-05-07T15:03:17Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:17Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"error","msg":" memberlist: Conflicting address for weaviate-1. Mine: 10.128.16.144:7000 Theirs: 10.128.34.52:7000 Old state: 2","time":"2025-05-07T15:03:18Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","command":1,"level":"info","msg":"updating configuration","server-addr":"10.128.34.52:8300","server-id":"weaviate-1","servers":"[[{Suffrage:Voter ID:weaviate-0 Address:10.129.39.158:8300} {Suffrage:Nonvoter ID:weaviate-1 Address:10.128.34.52:8300}]]","time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":"added peer, starting replication","peer":"weaviate-1","time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","backoff time":10000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","backoff time":10000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","backoff time":10000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","backoff time":20000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:19Z"}
{"action":"raft","backoff time":40000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:19Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:19Z"}
{"action":"raft","backoff time":80000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:20Z"}
{"action":"raft","backoff time":160000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:20Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:20Z"}
{"action":"raft","backoff time":320000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:20Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:21Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:21Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:21Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:22Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:22Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:23Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:23Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:24Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:25Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:25Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:25Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:26Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:27Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:27Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:28Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:29Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:29Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:30Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:30Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:30Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:31Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:32Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:32Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:33Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:34Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:35Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:35Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:36Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:36Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:37Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:38Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:38Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:39Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:40Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:40Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:40Z"}
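
Because the excerpt is dominated by a few repeated messages, it can help to condense it before reading it line by line. The sketch below assumes the log has been saved one JSON object per line, as pasted above; the function name `summarize` and the file name in the usage note are hypothetical.

```python
import json
from collections import Counter


def summarize(lines):
    """Group log entries by (level, msg).

    Returns (counts, spans): counts maps each group to its number of entries,
    spans maps it to the (first, last) "time" value seen, in input order.
    """
    counts = Counter()
    spans = {}
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip prose or blank lines mixed into the paste
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated lines
        # memberlist messages carry a leading space in "msg"; strip it for grouping
        key = (entry.get("level", "?"), entry.get("msg", "").strip())
        counts[key] += 1
        ts = entry.get("time", "")
        first, _ = spans.get(key, (ts, ts))
        spans[key] = (first, ts)
    return counts, spans
```

Usage, assuming the log was saved as `weaviate-0.log` (a hypothetical file name): `counts, spans = summarize(open("weaviate-0.log"))`, then iterate over `counts.most_common()` and print each group together with its entry in `spans`.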

I’ll try disabling async replication on the server side, as I suspect that process might have something to do with the problem we are seeing.

Here’s the rest of the weaviate-0 log during the downtime until it became available again:

{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:41Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:42Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:42Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:43Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:44Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:44Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"FeedbackGC_v3","hosts":["10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"exiR1xBmWBBO","time":"2025-05-07T15:03:45Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:45Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:45Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"FeedbackGC_v3","hosts":["10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"odmAjZzBYSNB","time":"2025-05-07T15:03:46Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:46Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"Page_v10","hosts":["10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"4rHKzeRZC3SG","time":"2025-05-07T15:03:46Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:47Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"OnboardingStatus_v2","hosts":["10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"uj270iEN9oiL","time":"2025-05-07T15:03:47Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"PageNode_v10","hosts":["10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"9KkF0kFnsVUH","time":"2025-05-07T15:03:47Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:47Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:48Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:49Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:49Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:50Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:03:50Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:51Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:51Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:52Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:53Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"OnboardingStatus_v2","hosts":["10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"Sl7ubxfYpHrz","time":"2025-05-07T15:03:53Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:53Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:54Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:55Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:55Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:56Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:57Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:57Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:58Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:59Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:03:59Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:00Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:01Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to appendEntries to","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:04:01Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:01Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:02Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:03Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:03Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:04Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:04Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:05Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:06Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","error":"dial tcp: address 99999999: invalid port","level":"error","msg":"failed to heartbeat to","peer":"10.128.34.52:8300","time":"2025-05-07T15:04:06Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Stream connection from=10.128.34.52:35086","time":"2025-05-07T15:04:07Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"warning","msg":"appendEntries rejected, sending older logs","next":1312,"peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:04:11Z"}
{"action":"raft","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"info","msg":"pipelining replication","peer":{"Suffrage":1,"ID":"weaviate-1","Address":"10.128.34.52:8300"},"time":"2025-05-07T15:04:11Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"BannerMessage_v2","level":"warning","msg":"hashbeat iteration failed: collecting differences: \"10.128.34.52:7001\": status code: 500, error: hashtree level: local index \"BannerMessage_v2\" not found\n: context deadline exceeded","shard_name":"9yblTdorKgQO","time":"2025-05-07T15:04:17Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"BannerMessage_v2","level":"warning","msg":"hashbeat iteration failed: collecting differences: \"10.128.34.52:7001\": status code: 500, error: hashtree level: local index \"BannerMessage_v2\" not found\n: context deadline exceeded","shard_name":"acFwc8LnjV4E","time":"2025-05-07T15:04:18Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"Page_v10","level":"warning","msg":"hashbeat iteration failed: collecting differences: \"10.128.34.52:7001\": status code: 500, error: hashtree level: local index \"Page_v10\" not found\n: context deadline exceeded","shard_name":"3dl9aCV1Kdxe","time":"2025-05-07T15:04:18Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"PageNode_v10","level":"warning","msg":"hashbeat iteration failed: collecting differences: \"10.128.34.52:7001\": status code: 500, error: hashtree level: local index \"PageNode_v10\" not found\n: context deadline exceeded","shard_name":"W6EfEjOg1Kxa","time":"2025-05-07T15:04:19Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Initiating push/pull sync with: weaviate-1 10.128.34.52:7000","time":"2025-05-07T15:04:27Z"}
{"build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","level":"debug","msg":" memberlist: Stream connection from=10.128.34.52:53458","time":"2025-05-07T15:04:37Z"}
{"action":"async_replication","build_git_commit":"80dac5a","build_go_version":"go1.24.2","build_image_tag":"v1.30.2","build_wv_version":"1.30.2","class_name":"Page_v10","hosts":["10.128.34.52:7001","10.129.39.158:7001"],"level":"info","msg":"hashbeat iteration successfully completed: no differences were found","shard_name":"4rHKzeRZC3SG","time":"2025-05-07T15:04:46Z"}

Wondering if this is a consequence of us using only two replicas with replication factor two instead of three replicas with replication factor three?

In a 2-node setup, on the other hand, no node failures can be tolerated while still reaching consensus across nodes.

Cluster Architecture | Weaviate

What exactly does this mean, is it just about data writes? Shouldn’t a two-replica setup stay available when one node fails - at least concerning accessing the data that was already present before the node failure?

Hey @andrewisplinghoff, How’s it going? :hugs:

The issue you’re experiencing with your two-node Weaviate cluster makes sense and relates to how Weaviate uses the Raft consensus protocol. In a 2-node setup, no node failures can be tolerated while still maintaining consensus. When one of the nodes goes down, the remaining node cannot form a quorum (majority), causing the cluster to become unavailable.

In Raft, a quorum requires a majority of nodes. For a 2-node cluster, that means both nodes must be up: quorum is 2 of 2. When one is down, there’s no quorum, and operations behave unexpectedly.

To ensure high availability and support for zero-downtime upgrades, I strongly recommend increasing your cluster to at least 3 nodes. In that setup, quorum is 2 of 3, so one node can go down without impacting availability.
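The majority arithmetic behind this can be sketched in a few lines of Python. This is only the generic Raft majority rule, not code taken from Weaviate; the function names `raft_quorum` and `tolerated_failures` are made up for illustration.

```python
def raft_quorum(voters: int) -> int:
    """Smallest majority among `voters` voting nodes."""
    if voters < 1:
        raise ValueError("a cluster needs at least one voter")
    return voters // 2 + 1


def tolerated_failures(voters: int) -> int:
    """How many voters can be down while a majority is still reachable."""
    return voters - raft_quorum(voters)


# Compare the cluster sizes discussed in this thread.
for n in (2, 3, 5, 7):
    print(f"{n} nodes: quorum={raft_quorum(n)}, tolerated failures={tolerated_failures(n)}")
```

Going from 2 to 3 nodes is the first step that buys any failure tolerance, and an even node count tolerates no more failures than the odd count just below it.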

Try this Weaviate Cluster community tool I built to visualize cluster status and Raft stats:

e.g.

I would say:

  1. Re-create your cluster with at least 3 nodes (5 or 7 are common for larger operations).
  2. Ensure replication factor aligns with your fault tolerance goals (typically equals the number of nodes).

Let me know if you have more questions or need a hand with anything.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)

Thanks a lot @Mohamed_Shahin, readiness is looking better with three replicas. Is it normal that curl weaviate/v1/cluster/statistics fails while a rolling update is being performed? I saw the following errors:

{"error":[{"message":"node: weaviate-0: unexpected status code 401 ()"}]}
{"error":[{"message":"node: weaviate-0: send http request: Get \"http://10.131.38.102:7001/nodes/statistics\": dial tcp 10.131.38.102:7001: connect: connection refused"}]}

@Mohamed_Shahin I found the following statement in the Weaviate blog, is it incorrect?

And for more cost-sensitive applications, even 2 would introduce high availability and robustness to the system.

Achieve Zero-Downtime Upgrades with Weaviate’s Multi-Node Setup | Weaviate

I tried performing a rolling update while performing data ingestion with the 3 node cluster. Unfortunately, the process aborted, because batch insertion API calls failed with the following errors:

{'error': [{'message': 'resolve node name "weaviate-2" to host'}]}
[{'message': 'connect: Post "http://10.130.2.194:7001/indices/PageNode_v10/shards/u0uTLFb4fccR/objects?schema_version=0": dial tcp 10.130.2.194:7001: i/o timeout'}]}

Any ideas?

I’m glad to hear everything is okay now. Yes, the pods need to be up and ready before you can read the Raft statistics and the inter-node communication status.

I think what the statement in the blog was trying to say is that while having three nodes is optimal for both consistency and availability, having two can still offer some availability benefits compared to no replication at all, though it does compromise consistency during certain failure scenarios.

It looks like there’s a connection timeout error when trying to reach the Weaviate pod at the specified IP address. There might be something wrong with that pod.

Do the Raft statistics look fine, like in the example I’m sharing here?


When you added the pod, you just scaled the cluster rather than re-creating it, am I right?

@Mohamed_Shahin Yes, I just updated the number of replicas in the Helm configuration and kept the existing data on the PVCs.

While writing to Weaviate via Batch Import, we are not configuring consistency_level, so it should use the default QUORUM setting. I would expect this setting to work correctly with a rolling update where two out of three replicas are always ready.
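As a rough sketch of that expectation: the number of replica acknowledgements needed at each consistency level follows from the collection's replication factor (ONE needs 1, QUORUM needs a majority, ALL needs every replica). The helper names `required_acks` and `write_can_succeed` are hypothetical, and `live_replicas` means reachable replicas of the shard being written, which is not necessarily the same as the number of ready pods.

```python
def required_acks(replication_factor: int, consistency_level: str = "QUORUM") -> int:
    """Replica acknowledgements needed for one write at the given level."""
    levels = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def write_can_succeed(replication_factor: int, live_replicas: int,
                      consistency_level: str = "QUORUM") -> bool:
    """True if enough replicas of the shard are reachable for the write."""
    return live_replicas >= required_acks(replication_factor, consistency_level)


print(write_can_succeed(3, 2))  # factor 3, one replica restarting
print(write_can_succeed(2, 1))  # factor 2, one replica restarting
print(write_can_succeed(1, 0))  # factor 1, the only replica is restarting
```

Only the first case succeeds under QUORUM, so the expectation holds only if the collection itself has replication factor 3; three pods alone do not guarantee that.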

Cluster status is looking OK according to weaviate/v1/cluster/statistics (I don’t have your nice UI, though):

{
  "statistics": [
    {
      "candidates": {},
      "dbLoaded": true,
      "initialLastAppliedIndex": 1361,
      "isVoter": true,
      "leaderAddress": "10.131.5.3:8300",
      "leaderId": "weaviate-2",
      "name": "weaviate-0",
      "open": true,
      "raft": {
        "appliedIndex": "1366",
        "commitIndex": "1366",
        "fsmPending": "0",
        "lastContact": "6.833907ms",
        "lastLogIndex": "1366",
        "lastLogTerm": "253",
        "lastSnapshotIndex": "630",
        "lastSnapshotTerm": "220",
        "latestConfiguration": [
          {
            "address": "10.129.40.222:8300",
            "id": "weaviate-0",
            "suffrage": 0
          },
          {
            "address": "10.131.5.3:8300",
            "id": "weaviate-2",
            "suffrage": 0
          },
          {
            "address": "10.131.3.50:8300",
            "id": "weaviate-1",
            "suffrage": 0
          }
        ],
        "latestConfigurationIndex": "0",
        "numPeers": "2",
        "protocolVersion": "3",
        "protocolVersionMax": "3",
        "protocolVersionMin": "0",
        "snapshotVersionMax": "1",
        "snapshotVersionMin": "0",
        "state": "Follower",
        "term": "253"
      },
      "ready": true,
      "status": "HEALTHY"
    },
    {
      "candidates": {},
      "dbLoaded": true,
      "initialLastAppliedIndex": 1361,
      "isVoter": true,
      "leaderAddress": "10.131.5.3:8300",
      "leaderId": "weaviate-2",
      "name": "weaviate-1",
      "open": true,
      "raft": {
        "appliedIndex": "1366",
        "commitIndex": "1366",
        "fsmPending": "0",
        "lastContact": "29.274184ms",
        "lastLogIndex": "1366",
        "lastLogTerm": "253",
        "lastSnapshotIndex": "630",
        "lastSnapshotTerm": "220",
        "latestConfiguration": [
          {
            "address": "10.129.40.222:8300",
            "id": "weaviate-0",
            "suffrage": 0
          },
          {
            "address": "10.131.5.3:8300",
            "id": "weaviate-2",
            "suffrage": 0
          },
          {
            "address": "10.131.3.50:8300",
            "id": "weaviate-1",
            "suffrage": 0
          }
        ],
        "latestConfigurationIndex": "0",
        "numPeers": "2",
        "protocolVersion": "3",
        "protocolVersionMax": "3",
        "protocolVersionMin": "0",
        "snapshotVersionMax": "1",
        "snapshotVersionMin": "0",
        "state": "Follower",
        "term": "253"
      },
      "ready": true,
      "status": "HEALTHY"
    },
    {
      "candidates": {},
      "dbLoaded": true,
      "initialLastAppliedIndex": 1361,
      "isVoter": true,
      "leaderAddress": "10.131.5.3:8300",
      "leaderId": "weaviate-2",
      "name": "weaviate-2",
      "open": true,
      "raft": {
        "appliedIndex": "1366",
        "commitIndex": "1366",
        "fsmPending": "0",
        "lastContact": "0",
        "lastLogIndex": "1366",
        "lastLogTerm": "253",
        "lastSnapshotIndex": "630",
        "lastSnapshotTerm": "220",
        "latestConfiguration": [
          {
            "address": "10.129.40.222:8300",
            "id": "weaviate-0",
            "suffrage": 0
          },
          {
            "address": "10.131.5.3:8300",
            "id": "weaviate-2",
            "suffrage": 0
          },
          {
            "address": "10.131.3.50:8300",
            "id": "weaviate-1",
            "suffrage": 0
          }
        ],
        "latestConfigurationIndex": "0",
        "numPeers": "2",
        "protocolVersion": "3",
        "protocolVersionMax": "3",
        "protocolVersionMin": "0",
        "snapshotVersionMax": "1",
        "snapshotVersionMin": "0",
        "state": "Leader",
        "term": "253"
      },
      "ready": true,
      "status": "HEALTHY"
    }
  ],
  "synchronized": true
}
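Without the UI, a response like the one above can be checked with a small script. The sketch below relies only on field names visible in the output above (`statistics`, `name`, `ready`, `status`, `raft.state`, `raft.commitIndex`, `synchronized`); the function name `summarize_cluster` and the trimmed `sample` dictionary are illustrative, not part of any Weaviate client.

```python
def summarize_cluster(stats: dict) -> dict:
    """Condense a /v1/cluster/statistics response into a few health signals."""
    nodes = stats.get("statistics", [])
    leaders = sorted(
        n["name"] for n in nodes if n.get("raft", {}).get("state") == "Leader"
    )
    not_ready = sorted(
        n["name"] for n in nodes
        if not n.get("ready") or n.get("status") != "HEALTHY"
    )
    commit_indexes = {n.get("raft", {}).get("commitIndex") for n in nodes}
    return {
        "nodes": len(nodes),
        "leaders": leaders,              # a settled cluster has exactly one
        "not_ready": not_ready,
        "commit_indexes_agree": len(commit_indexes) == 1,
        "synchronized": bool(stats.get("synchronized")),
    }


# Trimmed-down version of the response shown above.
sample = {
    "statistics": [
        {"name": "weaviate-0", "ready": True, "status": "HEALTHY",
         "raft": {"state": "Follower", "commitIndex": "1366"}},
        {"name": "weaviate-1", "ready": True, "status": "HEALTHY",
         "raft": {"state": "Follower", "commitIndex": "1366"}},
        {"name": "weaviate-2", "ready": True, "status": "HEALTHY",
         "raft": {"state": "Leader", "commitIndex": "1366"}},
    ],
    "synchronized": True,
}
print(summarize_cluster(sample))
```

To use it against a live cluster, the dictionary would come from `json.loads` applied to the curl output. Commit indexes can differ briefly while entries are being replicated, so a single mismatch is not necessarily a problem.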

Hm, it seems the collection's replication is configured incorrectly, which I think would explain the problem:

# curl weaviate/v1/schema
....
"replicationConfig": {
        "asyncEnabled": false,
        "deletionStrategy": "NoAutomatedResolution",
        "factor": 1
      },
...

I will recreate the collection and make sure it is correctly configured.
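To catch this for every collection at once, the schema response can be scanned for low replication factors. This is a sketch under assumptions: it expects the /v1/schema response to hold a top-level `classes` list whose entries have a `class` name next to the `replicationConfig` shown above, and it treats a missing factor as 1. The function name `under_replicated` and the sample values are hypothetical.

```python
def under_replicated(schema: dict, wanted_factor: int) -> dict:
    """Map collection name -> replication factor for collections below the target."""
    result = {}
    for cls in schema.get("classes", []):
        # Assumption: a missing replicationConfig means a single replica.
        factor = cls.get("replicationConfig", {}).get("factor", 1)
        if factor < wanted_factor:
            result[cls.get("class", "<unnamed>")] = factor
    return result


# Made-up schema excerpt for illustration only.
sample_schema = {
    "classes": [
        {"class": "CollectionA", "replicationConfig": {"factor": 1}},
        {"class": "CollectionB", "replicationConfig": {"factor": 3}},
    ]
}
print(under_replicated(sample_schema, wanted_factor=3))
```

Any collection this reports has fewer replicas than intended, so writes and reads for its shards depend on specific nodes being up.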

Yes, I mentioned the node scaling because currently, we don’t support shard movement or redistribution across nodes — although this is something we are actively working on. As a result, when you add new nodes, existing collections are not automatically replicated to them. However, any new collections created after adding nodes will be properly distributed.

So yes, when you update the replication factor and recreate the collection, it will work as expected.

No need to worry too much about the UI — I only referred to it as a way to confirm that everything is in sync and that the statistics look correct overall.

Thanks @Mohamed_Shahin, writing now works reliably with the three-node setup and replication enabled, even when one of the nodes is unavailable. The only remaining issue for us is this one: Node Desync and Cluster Inconsistencies After OOM on Weaviate-0 - Support - Weaviate Community Forum

Awesome news! Thank you for confirming that.

I have not seen this thread yet! I will need to read and follow up as soon as I can.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)