Whole cluster hangs on pod termination

Description

When we remove or restart any pod, the whole cluster hangs: requests fail with a timeout (60 sec).
API requests to “/v1/nodes?output=verbose” freeze too.
Pods are forcibly terminated by k8s after waiting 10 min (terminationGracePeriodSeconds: 600).
There were no such errors in versions 1.23/24/25.
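
To make the hang easier to reproduce and time, here is a minimal sketch of a probe for the nodes endpoint mentioned above. It is only an illustration: the function name `probe_nodes`, the base URL, and the short timeout are assumptions, and it assumes the REST API is reachable without authentication.

```python
import http.client
import time
import urllib.request


def probe_nodes(base_url: str, timeout: float = 5.0):
    """Call GET /v1/nodes?output=verbose and return (responded_ok, elapsed_seconds).

    responded_ok is False when the request times out, is refused,
    or returns a non-2xx status.
    """
    url = f"{base_url.rstrip('/')}/v1/nodes?output=verbose"
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
            ok = 200 <= resp.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    return ok, time.monotonic() - start
```

Calling something like `probe_nodes("http://localhost:8080")` in a loop while a pod is terminating should show when the endpoint stops answering and for how long (the host and port here are placeholders for your own service address).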

Server Setup Information

  • Weaviate Server Version: 1.26.4
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: yes, 3
  • Client Language and Version: Python3, WeaviateClient3.26
  • Multitenancy?: no

Any additional Information

Logs (internal IPs are masked):
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:20:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:44766",“time”:“2024-09-17T09:20:17Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:20:37Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:60348",“time”:“2024-09-17T09:20:47Z”}
{“action”:“restapi_management”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“docker_image_tag”:“1.26.4”,“level”:“info”,“msg”:"Shutting down… ",“time”:“2024-09-17T09:20:49Z”}
{“action”:“restapi_management”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“docker_image_tag”:“1.26.4”,“level”:“info”,“msg”:“Stopped serving weaviate at http://[::]:8080”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“info”,“msg”:“closing raft FSM store …”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“info”,“msg”:“shutting down raft sub-system …”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“info”,“msg”:“closing raft-net …”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“info”,“msg”:“closing log store …”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“info”,“msg”:“closing data store …”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“info”,“msg”:“closing loaded database …”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“component”:“index_queue”,“level”:“debug”,“msg”:“index queue closed”,“shard_id”:“myclass_nwkAu2cZVXzg”,“time”:“2024-09-17T09:20:49Z”}
{“action”:“hnsw_delete_vector_cache”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:“deleting full vector cache”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“component”:“index_queue”,“level”:“debug”,“msg”:“index queue closed”,“shard_id”:“myclass_uhCGgHQxXawb”,“time”:“2024-09-17T09:20:49Z”}
{“action”:“hnsw_delete_vector_cache”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:“deleting full vector cache”,“time”:“2024-09-17T09:20:49Z”}
{“action”:“raft-net”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“error”:“transport shutdown”,“level”:“error”,“msg”:“raft-net failed to decode incoming command”,“time”:“2024-09-17T09:20:49Z”}
{“action”:“raft-net”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“error”:“transport shutdown”,“level”:“error”,“msg”:“raft-net failed to decode incoming command”,“time”:“2024-09-17T09:20:49Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:21:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:21:37Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:48332",“time”:“2024-09-17T09:21:47Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:22:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:41448",“time”:“2024-09-17T09:22:17Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.ccc.ddd:41304",“time”:“2024-09-17T09:22:34Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:22:37Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:23:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:37156",“time”:“2024-09-17T09:23:17Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.ccc.ddd:36930",“time”:“2024-09-17T09:23:34Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:23:37Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.ccc.ddd:42362",“time”:“2024-09-17T09:24:04Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:24:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:24:37Z”}
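
In the excerpt above, the “Shutting down” record is at 09:20:49 and the memberlist records are still being written at 09:24:37, which is 228 seconds later. A rough sketch for measuring that gap on a full log dump is below. It assumes raw JSON lines with straight quotes (as the process writes them, not as the forum renders them) and timestamps in the format shown above; the function name is made up for this example.

```python
import json
from datetime import datetime


def seconds_logged_after_shutdown(lines):
    """Return the seconds between the first 'Shutting down' record and the
    last record seen, or None if no shutdown record is present."""
    shutdown_at = None
    last_seen = None
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log record
        if not isinstance(record, dict) or "time" not in record:
            continue
        # Assumes second-resolution UTC timestamps like 2024-09-17T09:20:49Z.
        ts = datetime.strptime(record["time"], "%Y-%m-%dT%H:%M:%SZ")
        if shutdown_at is None and str(record.get("msg", "")).startswith("Shutting down"):
            shutdown_at = ts
        last_seen = ts
    if shutdown_at is None:
        return None
    return (last_seen - shutdown_at).total_seconds()
```

Feeding it the lines of a saved pod log gives a single number that is easy to compare between a hanging shutdown and a clean one.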

hi @wvuser !!

Do you know if this will also happen on a clean, fresh cluster or only with this specific one?

Also, does this log loop for the entire terminationGracePeriodSeconds?

The cluster, schema, and data import are all new (fresh), not upgraded from a previous version.

Yes, after the “raft-net” error records, the “sync” log records repeat in a loop…

Hello, @DudaNogueira!

Some more detailed information about the cluster/Weaviate configuration:

According to our observations and experiments, the problem is in the ‘async replication’ processes (or other internal gRPC communications?).
When pods hang on termination, the synchronization logs look like this:

{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:21:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.xxx.aaa.bbb:7000",“time”:“2024-09-17T09:21:37Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:48332",“time”:“2024-09-17T09:21:47Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:22:07Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.aaa.bbb:41448",“time”:“2024-09-17T09:22:17Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.xxx.ccc.ddd:41304",“time”:“2024-09-17T09:22:34Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:22:37Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-2 10.xxx.ccc.ddd:7000",“time”:“2024-09-17T09:23:07Z”}

But if the logs look like the ones below, pods restart normally and quickly, without freezing:

{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.aaa.bbb.ccc:59830",“time”:“2024-09-19T11:09:27Z”}
{“action”:“async_replication”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“class_name”:“MyClass”,“diff_calculation_took”:“26.801µs”,“hashbeat_iteration”:107,“hosts”:[“10.aaa.ddd.eee:7001”,“10.aaa.fff.ggg:7001”,“10.aaa.bbb.ccc:7001”],“level”:“info”,“local_objects”:0,“msg”:“hashbeat iteration successfully completed”,“object_progation_took”:“0s”,“objects_propagated”:0,“remote_objects”:0,“shard_name”:“2q4qSTnnbzwk”,“time”:“2024-09-19T11:09:34Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Initiating push/pull sync with: weaviate-1 10.aaa.bbb.ccc:7000",“time”:“2024-09-19T11:09:38Z”}
{“action”:“async_replication”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“class_name”:“MyClass”,“diff_calculation_took”:“24.166µs”,“hashbeat_iteration”:107,“hosts”:[“10.aaa.fff.ggg:7001”,“10.aaa.ddd.eee:7001”,“10.aaa.bbb.ccc:7001”],“level”:“info”,“local_objects”:0,“msg”:“hashbeat iteration successfully completed”,“object_progation_took”:“0s”,“objects_propagated”:0,“remote_objects”:0,“shard_name”:“s5zw0VLZnClB”,“time”:“2024-09-19T11:09:39Z”}
{“action”:“async_replication”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“class_name”:“MyClass”,“diff_calculation_took”:“23.906µs”,“hashbeat_iteration”:107,“hosts”:[“10.aaa.fff.ggg:7001”,“10.aaa.ddd.eee:7001”,“10.aaa.bbb.ccc:7001”],“level”:“info”,“local_objects”:0,“msg”:“hashbeat iteration successfully completed”,“object_progation_took”:“0s”,“objects_propagated”:0,“remote_objects”:0,“shard_name”:“qUdhGTwM36c5”,“time”:“2024-09-19T11:09:42Z”}
{“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“level”:“debug”,“msg”:" memberlist: Stream connection from=10.aaa.fff.ggg:39338",“time”:“2024-09-19T11:09:50Z”}
{“action”:“async_replication”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“class_name”:“MyClass”,“diff_calculation_took”:“24.117µs”,“hashbeat_iteration”:108,“hosts”:[“10.aaa.bbb.ccc:7001”,“10.aaa.ddd.eee:7001”,“10.aaa.fff.ggg:7001”],“level”:“info”,“local_objects”:0,“msg”:“hashbeat iteration successfully completed”,“object_progation_took”:“0s”,“objects_propagated”:0,“remote_objects”:0,“shard_name”:“2q4qSTnnbzwk”,“time”:“2024-09-19T11:09:54Z”}
{“action”:“async_replication”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“class_name”:“MyClass”,“diff_calculation_took”:“22.584µs”,“hashbeat_iteration”:108,“hosts”:[“10.aaa.ddd.eee:7001”,“10.aaa.fff.ggg:7001”,“10.aaa.bbb.ccc:7001”],“level”:“info”,“local_objects”:0,“msg”:“hashbeat iteration successfully completed”,“object_progation_took”:“0s”,“objects_propagated”:0,“remote_objects”:0,“shard_name”:“s5zw0VLZnClB”,“time”:“2024-09-19T11:09:59Z”}
{“action”:“async_replication”,“build_git_commit”:“584532a”,“build_go_version”:“go1.21.13”,“build_image_tag”:“1.26.4”,“build_wv_version”:“1.26.4”,“class_name”:“MyClass”,“diff_calculation_took”:“25.238µs”,“hashbeat_iteration”:108,“hosts”:[“10.aaa.fff.ggg:7001”,“10.aaa.bbb.ccc:7001”,“10.aaa.ddd.eee:7001”],“level”:“info”,“local_objects”:0,“msg”:“hashbeat iteration successfully completed”,“object_progation_took”:“0s”,“objects_propagated”:0,“remote_objects”:0,“shard_name”:“qUdhGTwM36c5”,“time”:“2024-09-19T11:10:02Z”}

We also noticed that “async replication” sometimes stops working (/schema still reports “asyncEnabled”: true); only setting this parameter to ‘true’ again reactivates async replication.
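
For reference, the re-apply workaround could be scripted roughly as below. This is only a sketch under assumptions: it reads the class definition with GET /v1/schema/{className}, sets replicationConfig.asyncEnabled to true, and sends the whole definition back with PUT to the same path. The helper names are invented for this example, no authentication is assumed, and whether the update has to carry the full class definition may depend on the server version.

```python
import copy
import json
import urllib.request


def with_async_enabled(class_def: dict) -> dict:
    """Return a copy of a class definition with replicationConfig.asyncEnabled = True."""
    updated = copy.deepcopy(class_def)
    replication = dict(updated.get("replicationConfig") or {})
    replication["asyncEnabled"] = True
    updated["replicationConfig"] = replication
    return updated


def reapply_async_replication(base_url: str, class_name: str, timeout: float = 10.0) -> dict:
    """Fetch the class definition and PUT it back with asyncEnabled set to true."""
    url = f"{base_url.rstrip('/')}/v1/schema/{class_name}"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        class_def = json.load(resp)
    body = json.dumps(with_async_enabled(class_def)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

The pure helper `with_async_enabled` does not modify its input, so the original definition can be kept for comparison with what the server returns.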