Description
When Kubernetes triggers a pod restart, the pod frequently fails to shut down cleanly. The error message (with varying indexes and shards) is:
panic: close database: shutdown index "index_3880": shutdown shard "4rhTw8Ju9HzC": stop lsmkv store: shutdown bucket "property_source_searchable" of store "/var/lib/weaviate/index_3880/4rhTw8Ju9HzC/lsm":
long-running compaction in progress: context deadline exceeded
Since this issue occurs exactly 60 seconds after the Kubernetes event reason=Killing msg="Stopping container weaviate", some timeout seems to be reached. However, I have configured terminationGracePeriodSeconds: 600 within Kubernetes.
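To illustrate my mental model (this is only my assumption, not Weaviate's actual code): terminationGracePeriodSeconds only bounds how long Kubernetes waits after SIGTERM before sending SIGKILL. If the process applies its own, shorter shutdown deadline, that deadline expires first, no matter how generous the grace period is. A minimal Go sketch of that interaction:

```go
// Toy sketch only -- the 60s deadline below is an assumption based on what I
// observe, not a value I found in Weaviate's code.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// shutdownStore simulates a store shutdown that has to wait for a compaction
// which needs longer than the caller's deadline allows.
func shutdownStore(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Minute): // pretend the compaction finishes eventually
		return nil
	case <-ctx.Done():
		return fmt.Errorf("long-running compaction in progress: %w", ctx.Err())
	}
}

func main() {
	// Kubernetes sends SIGTERM, then waits terminationGracePeriodSeconds
	// (600 in my case) before sending SIGKILL.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	// Hypothetical internal shutdown deadline, independent of the grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	if err := shutdownStore(ctx); err != nil {
		panic(err) // fires after 60s, regardless of the 600s grace period
	}
}
```

The observed 60 seconds would then simply be that internal deadline, and the grace period never comes into play.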
What might be the issue here? Is there a parameter to override the timeout waiting for the long-running compaction to finish?
Server Setup Information
- Weaviate Server Version: 1.28.8
- Deployment Method: k8s with Helm
- Multi Node? Number of Running Nodes: 3
- Client Language and Version: -
- Multitenancy?: No
Any additional Information
I set two timeouts as container arguments for the StatefulSet (see the sketch below for how I understand they relate to shutdown):
- --read-timeout=60s
- --write-timeout=60s
The RAFT bootstrap timeout is set to 600 via an environment variable. Other than that, neither the environment variables nor the --config-file contain anything timeout-related.
For HA, a minimum replication factor of 3 was configured via environment variables.
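For completeness, here is how I picture those two flags. My assumption is that they map to the embedded HTTP server's per-request read/write deadlines, which would make them unrelated to the shutdown path:

```go
// Sketch of my assumption: --read-timeout / --write-timeout look like per-request
// HTTP deadlines, which are separate from whatever deadline governs shutdown.
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":8080",
		ReadTimeout:  60 * time.Second, // --read-timeout=60s (assumed mapping)
		WriteTimeout: 60 * time.Second, // --write-timeout=60s (assumed mapping)
	}
	go func() { _ = srv.ListenAndServe() }()

	// Shutting down is governed by a completely separate deadline, carried by
	// the context passed to Shutdown; the request timeouts above play no role.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err) // ctx.Err() if shutdown outlives the deadline
	}
}
```

If that assumption is correct, these two flags cannot influence the compaction shutdown at all.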
Hello @akerkau,
I have come across the long-running compaction in progress: context deadline exceeded error before. There have been improvements related to compaction in more recent versions of Weaviate, and the panic-related issues seen in version 1.28 have been resolved in newer releases.
I don't think adjusting the StatefulSet timeouts or the RAFT bootstrap timeout would help at all.
I'd recommend upgrading to the latest Weaviate version and restarting the cluster.
Best regards,
Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)
Thanks for your prompt response.
I was wondering how the context in this source section gets its deadline, but got lost in the code analysis:
$ less adapters/repos/db/lsmkv/segment_group.go
[…]
func (sg *SegmentGroup) shutdown(ctx context.Context) error {
	if err := sg.compactionCallbackCtrl.Unregister(ctx); err != nil {
		return fmt.Errorf("long-running compaction in progress: %w", ctx.Err())
	}
[…]
If neither of the mentioned timeout parameters has any impact, I assume the shutdown uses some default, hard-coded Go context timeout.
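To make sure I understand the mechanism, here is a toy reproduction of the failure mode I suspect. The names and the locking are made up for illustration and are not Weaviate's real cycle-callback implementation: if Unregister has to wait for an in-flight compaction and the shutdown context expires first, ctx.Err() is context.DeadlineExceeded, which is exactly what ends up wrapped in the panic message.

```go
// Hypothetical stand-in for the compaction callback controller; not Weaviate's code.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type callbackCtrl struct{ mu sync.Mutex }

// Unregister blocks until the in-flight compaction releases the lock,
// or until the caller's context expires, whichever happens first.
func (c *callbackCtrl) Unregister(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		c.mu.Lock() // waits for the compaction to finish
		c.mu.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctrl := &callbackCtrl{}
	ctrl.mu.Lock() // simulate a long-running compaction holding the lock
	go func() {
		time.Sleep(5 * time.Minute)
		ctrl.mu.Unlock()
	}()

	// Short deadline stands in for whatever the shutdown path uses.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := ctrl.Unregister(ctx); err != nil {
		// prints: long-running compaction in progress: context deadline exceeded
		fmt.Printf("long-running compaction in progress: %v\n", err)
	}
}
```

So unless the deadline on that context is configurable somewhere, no amount of Kubernetes-side grace period would help.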
Regarding the Weaviate versions, I had already glanced over the release notes but couldn’t find concrete hints on compaction improvements.
Nonetheless, I had already upgraded to v1.31.4 when I created my post. Based on your feedback, I have upgraded to v1.32.1 now.
I can confirm that the mentioned errors have not occurred during my restarts since. I am still worried, though, that the behavior might reappear during high-load situations or incidents… i.e. precisely at the moment when I need it the least.