Node crash loop in K8s deployment

Description

I upgraded my Weaviate version to 1.26.1, but all three of my Weaviate replicas are now in crash loops, despite a rolling update that replaces only one pod at a time. They are all crashing with the same error; only the peer IDs change depending on which node I'm looking at.

{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":792,"time":"2024-07-31T23:10:31Z"}
{"action":"raft-net","error":"could not resolve server id weaviate-2","fallback":"10.0.20.103:8300","id":"weaviate-2","level":"warning","msg":"raft-net unable to get address for server, using fallback address","time":"2024-07-31T23:10:31Z"}
{"action":"raft-net","error":"could not resolve server id weaviate-1","fallback":"10.0.3.69:8300","id":"weaviate-1","level":"warning","msg":"raft-net unable to get address for server, using fallback address","time":"2024-07-31T23:10:31Z"}
{"action":"raft","error":"dial tcp 10.0.20.103:8300: connect: connection refused","level":"error","msg":"raft failed to make requestVote RPC","target":{"Suffrage":0,"ID":"weaviate-2","Address":"10.0.20.103:8300"},"term":792,"time":"2024-07-31T23:10:31Z"}
{"action":"raft","error":"dial tcp 10.0.3.69:8300: connect: connection refused","level":"error","msg":"raft failed to make requestVote RPC","target":{"Suffrage":0,"ID":"weaviate-1","Address":"10.0.3.69:8300"},"term":792,"time":"2024-07-31T23:10:31Z"}
{"level":"warning","msg":"raft Election timeout reached, restarting election","time":"2024-07-31T23:10:32Z"}

Update: I deleted all the pods to see whether recreating them all together would solve the issue, but now I'm seeing a different error in all of them. The first two lines below are repeated thousands of times (I assume once for every object I had in every collection).

{"action":"lsm_recover_from_active_wal","class":"Chunk_89043da9b84a4a8a9ef299c9f608602a","index":"chunk_89043da9b84a4a8a9ef299c9f608602a","level":"warning","msg":"empty write-ahead-log found. Did weaviate crash prior to this or the tenant on/loaded from the cloud? Nothing to recover from this file.","path":"/var/lib/weaviate/chunk_89043da9b84a4a8a9ef299c9f608602a/J0VmCeiHB40Q/lsm/property__lastUpdateTimeUnix/segment-1722465036274445156","shard":"J0VmCeiHB40Q","time":"2024-07-31T23:26:07Z"}
{"action":"lsm_recover_from_active_wal","class":"Chunk_89043da9b84a4a8a9ef299c9f608602a","index":"chunk_89043da9b84a4a8a9ef299c9f608602a","level":"warning","msg":"empty write-ahead-log found. Did weaviate crash prior to this or the tenant on/loaded from the cloud? Nothing to recover from this file.","path":"/var/lib/weaviate/chunk_89043da9b84a4a8a9ef299c9f608602a/J0VmCeiHB40Q/lsm/property__lastUpdateTimeUnix/segment-1722468246818239897","shard":"J0VmCeiHB40Q","time":"2024-07-31T23:26:07Z"}
{"action":"hnsw_prefill_cache_async","level":"info","msg":"not waiting for vector cache prefill, running in background","time":"2024-07-31T23:26:07Z","wait_for_cache_prefill":false}
{"level":"info","msg":"Completed loading shard chunk_89043da9b84a4a8a9ef299c9f608602a_J0VmCeiHB40Q in 12.150606ms","time":"2024-07-31T23:26:07Z"}
{"checkpoint":0,"component":"index_queue","count":0,"last_stored_id":1,"level":"info","msg":"enqueued vectors from last indexed checkpoint","shard_id":"chunk_89043da9b84a4a8a9ef299c9f608602a_J0VmCeiHB40Q","target_vector":"","time":"2024-07-31T23:26:07Z","took":58961}
{"action":"hnsw_vector_cache_prefill","count":3000,"index_id":"main","level":"info","limit":1000000000000,"msg":"prefilled vector cache","time":"2024-07-31T23:26:07Z","took":369480}
{"action":"bootstrap","error":"could not join a cluster from [10.0.5.66:8300 10.0.28.8:8300 10.0.27.24:8300]","level":"warning","msg":"failed to join cluster, will notify next if voter","servers":["10.0.5.66:8300","10.0.28.8:8300","10.0.27.24:8300"],"time":"2024-07-31T23:26:07Z","voter":true}
{"action":"bootstrap","error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded","level":"error","msg":"notify all peers","servers":["10.0.5.66:8300","10.0.28.8:8300","10.0.27.24:8300"],"time":"2024-07-31T23:26:07Z"}
{"action":"startup","error":"bootstrap: context deadline exceeded","level":"fatal","msg":"could not open cloud meta store","time":"2024-07-31T23:26:07Z"}

Once they hit that last line, after ~30 seconds or so, the pods get killed and they restart (startup logs below, then how I was watching this from the Kubernetes side):

{"action":"config_load","config_file_path":"/weaviate-config/conf.yaml","level":"info","msg":"Usage of the weaviate.conf.json file is deprecated and will be removed in the future. Please use environment variables.","time":"2024-07-31T23:33:48Z"}
{"deprecation":{"apiType":"Configuration","id":"config-files","locations":["--config-file=\"\""],"mitigation":"Configure Weaviate using environment variables.","msg":"use of deprecated command line argument --config-file","sinceTime":"2020-09-08T09:46:00.000Z","sinceVersion":"0.22.16","status":"deprecated"},"level":"warning","msg":"use of deprecated command line argument --config-file","time":"2024-07-31T23:33:48Z"}
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-07-31T23:33:48Z"}
{"action":"startup","auto_schema_enabled":false,"level":"info","msg":"auto schema enabled setting is set to \"false\"","time":"2024-07-31T23:33:48Z"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-07-31T23:33:48Z"}
{"level":"info","msg":"async indexing enabled","time":"2024-07-31T23:33:48Z"}
{"level":"info","msg":"module offload-s3 is enabled","time":"2024-07-31T23:33:48Z"}
{"level":"info","msg":"open cluster service","servers":{"weaviate-0":8300,"weaviate-1":8300,"weaviate-2":8300},"time":"2024-07-31T23:33:50Z"}
{"address":"10.0.5.66:8301","level":"info","msg":"starting cloud rpc server ...","time":"2024-07-31T23:33:50Z"}
{"level":"info","msg":"starting raft sub-system ...","time":"2024-07-31T23:33:50Z"}
{"address":"10.0.5.66:8300","level":"info","msg":"tcp transport","tcpMaxPool":3,"tcpTimeout":10000000000,"time":"2024-07-31T23:33:50Z"}
{"level":"info","msg":"loading local db","time":"2024-07-31T23:33:50Z"}
{"level":"info","msg":"local DB successfully loaded","time":"2024-07-31T23:33:50Z"}
{"level":"info","msg":"schema manager loaded","n":0,"time":"2024-07-31T23:33:50Z"}
{"level":"info","metadata_only_voters":false,"msg":"construct a new raft node","name":"weaviate-0","time":"2024-07-31T23:33:50Z"}
{"action":"raft","index":616,"level":"info","msg":"raft initial configuration","servers":"[[{Suffrage:Voter ID:weaviate-2 Address:10.0.27.24:8300} {Suffrage:Voter ID:weaviate-1 Address:10.0.28.8:8300} {Suffrage:Voter ID:weaviate-0 Address:10.0.5.66:8300}]]","time":"2024-07-31T23:33:50Z"}
{"last_snapshot_index":0,"last_store_applied_index":0,"last_store_log_applied_index":553,"level":"info","msg":"raft node constructed","raft_applied_index":0,"raft_last_index":616,"time":"2024-07-31T23:33:50Z"}
{"action":"raft","follower":{},"leader-address":"","leader-id":"","level":"info","msg":"raft entering follower state","time":"2024-07-31T23:33:50Z"}
{"action":"raft","last-leader-addr":"","last-leader-id":"","level":"warning","msg":"raft heartbeat timeout reached, starting election","time":"2024-07-31T23:33:52Z"}
{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":917,"time":"2024-07-31T23:33:52Z"}
{"docker_image_tag":"1.26.1","level":"info","msg":"configured versions","server_version":"1.26.1","time":"2024-07-31T23:33:52Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-07-31T23:33:52Z"}
{"action":"restapi_management","docker_image_tag":"1.26.1","level":"info","msg":"Serving weaviate at http://[::]:8080","time":"2024-07-31T23:33:52Z"}
{"level":"warning","msg":"raft Election timeout reached, restarting election","time":"2024-07-31T23:33:53Z"}
{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":918,"time":"2024-07-31T23:33:53Z"}
{"level":"warning","msg":"raft Election timeout reached, restarting election","time":"2024-07-31T23:33:54Z"}
{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":919,"time":"2024-07-31T23:33:54Z"}
{"level":"warning","msg":"raft Election timeout reached, restarting election","time":"2024-07-31T23:33:56Z"}
{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":920,"time":"2024-07-31T23:33:56Z"}
{"action":"raft","follower":{},"leader-address":"","leader-id":"","level":"info","msg":"raft entering follower state","time":"2024-07-31T23:33:57Z"}
{"action":"raft","error":"log not found","last-index":616,"level":"warning","msg":"raft failed to get previous log","previous-index":620,"time":"2024-07-31T23:33:57Z"}
{"address":"10.0.28.8:8300","level":"info","msg":"current Leader","time":"2024-07-31T23:33:58Z"}

Server Setup Information

  • Weaviate Server Version: 1.26.1
  • Deployment Method: Kubernetes
  • Multi Node? Number of Running Nodes: 3
  • Client Language and Version:
  • Multitenancy?: No

Hi!

Which version did you upgrade from?

There are some specific instructions for migrating from 1.24 and onwards. Let me know if you have followed them:
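
In case it helps, on Kubernetes that migration roughly boils down to something like the steps below; the release name, namespace, and chart source here are assumptions based on a default Helm install, so please follow the actual guide for your setup:

# Delete the old StatefulSet (the PVCs, and therefore your data, are kept by default)
kubectl delete statefulset weaviate -n weaviate
# Re-create it with the Helm chart version that matches the target Weaviate version
helm upgrade weaviate weaviate/weaviate -n weaviate -f values.yaml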

Thanks!

I upgraded from 1.25.8, which was the initial version of the cluster.

Looking at the first set of logs, it seems the nodes were not able to communicate with each other.

The first log is probably from weaviate-0. This was probably a network issue that might have been solved by the new pods :person_shrugging:

About the second log: the first two lines are just Weaviate trying to recover from a previous crash. Nothing to worry about.

The last lines are the ones to look at. It seems the node again cannot join the cluster.

How many objects are there? One possibility is that there aren't enough resources to spin up the nodes.

Do you see any outstanding k8s events?
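
For example, something along these lines (namespace assumed):

# Recent events in the namespace, oldest first
kubectl get events -n weaviate --sort-by=.lastTimestamp
# Only warnings (OOM kills, failed probes, scheduling problems, ...)
kubectl get events -n weaviate --field-selector type=Warning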

I have since resolved this by deleting all the persistent volumes and claims attached to my replicas and starting fresh, so I don't think I can debug this any further.

But at the time, I know it was not a networking issue in the cluster, because I verified connectivity between the pods themselves before posting here. I also deleted the pods several times and hit the same error on each pod every time.
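
For reference, the checks I ran looked roughly like this (the headless service name is the chart default, the peer IP and Raft port 8300 are taken from my logs, and I'm assuming the image ships nslookup/nc; otherwise a kubectl debug container works):

# DNS resolution of a peer through the headless service
kubectl exec -n weaviate weaviate-0 -- nslookup weaviate-1.weaviate-headless.weaviate.svc.cluster.local
# TCP reachability of the Raft port on a peer
kubectl exec -n weaviate weaviate-0 -- nc -zv 10.0.28.8 8300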

If you have a cluster of nodes, should the cluster still come up even if all three nodes go down at once? Again, none of the replicas were alive, but based on the logs it seems each replica assumed that at least one other node would already be running (or at least that another node would become the leader, which none of them did).
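
For context, my Raft settings were just whatever the Helm chart generated; this is roughly how one can check what the StatefulSet expects at bootstrap (name, namespace, and the exact env var names are assumptions on my part):

# List the Raft-related env vars the chart sets on the StatefulSet
kubectl get statefulset weaviate -n weaviate -o yaml | grep -A 1 RAFT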

There were no noteworthy K8s events.

Are there minimum resource requirements to spin up a node? I was under the impression that the replicas would use whatever the host node makes available. My replicas run on nodes with 32 GiB of RAM, and I had nowhere near that amount of data.
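
For what it's worth, next time I'd also check actual usage and pin requests/limits explicitly; something like this (metrics-server assumed for the first command, and the resources key and values are placeholders based on how the chart usually exposes them):

# Actual memory/CPU usage per pod (requires metrics-server)
kubectl top pods -n weaviate
# Set explicit requests/limits through the Helm values (key name assumed)
helm upgrade weaviate weaviate/weaviate -n weaviate \
  --set resources.requests.memory=4Gi \
  --set resources.limits.memory=8Gi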