Description
When Kubernetes triggers a pod restart, the pod frequently fails to shut down cleanly. The error message (with varying indexes and shards) is:
panic: close database: shutdown index "index_3880": shutdown shard "4rhTw8Ju9HzC": stop lsmkv store: shutdown bucket "property_source_searchable" of store "/var/lib/weaviate/index_3880/4rhTw8Ju9HzC/lsm":
long-running compaction in progress: context deadline exceeded
Since this issue occurs exactly 60 seconds after the Kubernetes event reason=Killing msg="Stopping container weaviate", some timeout seems to be reached. However, I have configured terminationGracePeriodSeconds: 600 within Kubernetes.
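To illustrate my mental model (this is only my assumption, not Weaviate's actual code): terminationGracePeriodSeconds only bounds how long Kubernetes waits after SIGTERM before sending SIGKILL. If the process applies its own, shorter shutdown deadline, that deadline expires first, no matter how generous the grace period is. A minimal Go sketch of that interaction:

```go
// Toy sketch only -- the 60s deadline below is an assumption based on what I
// observe, not a value I found in Weaviate's code.
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// shutdownStore simulates a store shutdown that has to wait for a compaction
// which needs longer than the caller's deadline allows.
func shutdownStore(ctx context.Context) error {
	select {
	case <-time.After(10 * time.Minute): // pretend the compaction finishes eventually
		return nil
	case <-ctx.Done():
		return fmt.Errorf("long-running compaction in progress: %w", ctx.Err())
	}
}

func main() {
	// Kubernetes sends SIGTERM, then waits terminationGracePeriodSeconds
	// (600 in my case) before sending SIGKILL.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	// Hypothetical internal shutdown deadline, independent of the grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()

	if err := shutdownStore(ctx); err != nil {
		panic(err) // fires after 60s, regardless of the 600s grace period
	}
}
```

The observed 60 seconds would then simply be that internal deadline, and the grace period never comes into play.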
What might be the issue here? Is there a parameter to override the timeout waiting for the long-running compaction to finish?
Server Setup Information
- Weaviate Server Version: 1.28.8
- Deployment Method: k8s with Helm
- Multi Node? Number of Running Nodes: 3
- Client Language and Version: -
- Multitenancy?: No
Any additional Information
I set two timeouts as container arguments for the StatefulSet (see the sketch below for how I understand they relate to shutdown):
- --read-timeout=60s
- --write-timeout=60s
The RAFT bootstrap timeout is set to 600 via an environment variable. Other than that, neither the environment variables nor the --config-file contain anything timeout-related.
For HA, a minimum replication factor of 3 was configured via environment variables.
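For completeness, here is how I picture those two flags. My assumption is that they map to the embedded HTTP server's per-request read/write deadlines, which would make them unrelated to the shutdown path:

```go
// Sketch of my assumption: --read-timeout / --write-timeout look like per-request
// HTTP deadlines, which are separate from whatever deadline governs shutdown.
package main

import (
	"context"
	"log"
	"net/http"
	"time"
)

func main() {
	srv := &http.Server{
		Addr:         ":8080",
		ReadTimeout:  60 * time.Second, // --read-timeout=60s (assumed mapping)
		WriteTimeout: 60 * time.Second, // --write-timeout=60s (assumed mapping)
	}
	go func() { _ = srv.ListenAndServe() }()

	// Shutting down is governed by a completely separate deadline, carried by
	// the context passed to Shutdown; the request timeouts above play no role.
	ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err) // ctx.Err() if shutdown outlives the deadline
	}
}
```

If that assumption is correct, these two flags cannot influence the compaction shutdown at all.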
Hello @akerkau,
I have come across the long-running compaction in progress: context deadline exceeded error before. There have been improvements related to compaction in more recent versions of Weaviate, and the panic-related issues seen in version 1.28 have been resolved in newer releases.
I don't think adjusting the StatefulSet timeouts or the RAFT bootstrap timeout would help at all.
I'd recommend upgrading to the latest Weaviate version and restarting the cluster.
Best regards,
Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)
Thanks for your prompt response.
I was wondering how the context in this source section gets its deadline, but got lost in the code analysis:
$ less adapters/repos/db/lsmkv/segment_group.go
[…]
func (sg *SegmentGroup) shutdown(ctx context.Context) error {
	if err := sg.compactionCallbackCtrl.Unregister(ctx); err != nil {
		return fmt.Errorf("long-running compaction in progress: %w", ctx.Err())
	}
[…]
If neither of the mentioned timeout parameters has any impact, I assume the shutdown uses some default, hard-coded Go context timeout.
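To make sure I understand the mechanism, here is a toy reproduction of the failure mode I suspect. The names and the locking are made up for illustration and are not Weaviate's real cycle-callback implementation: if Unregister has to wait for an in-flight compaction and the shutdown context expires first, ctx.Err() is context.DeadlineExceeded, which is exactly what ends up wrapped in the panic message.

```go
// Hypothetical stand-in for the compaction callback controller; not Weaviate's code.
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

type callbackCtrl struct{ mu sync.Mutex }

// Unregister blocks until the in-flight compaction releases the lock,
// or until the caller's context expires, whichever happens first.
func (c *callbackCtrl) Unregister(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		c.mu.Lock() // waits for the compaction to finish
		c.mu.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	ctrl := &callbackCtrl{}
	ctrl.mu.Lock() // simulate a long-running compaction holding the lock
	go func() {
		time.Sleep(5 * time.Minute)
		ctrl.mu.Unlock()
	}()

	// Short deadline stands in for whatever the shutdown path uses.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := ctrl.Unregister(ctx); err != nil {
		// prints: long-running compaction in progress: context deadline exceeded
		fmt.Printf("long-running compaction in progress: %v\n", err)
	}
}
```

So unless the deadline on that context is configurable somewhere, no amount of Kubernetes-side grace period would help.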
Regarding the Weaviate versions, I had already glanced over the release notes but couldn’t find concrete hints on compaction improvements.
Nonetheless, I had already upgraded to v1.31.4 when I created my post. Based on your feedback, I have upgraded to v1.32.1 now.
I can confirm that the mentioned errors have not occurred during my restarts since. I am still worried, though, that the behavior might reappear during high-load situations or incidents… i.e. precisely at the moment when I need it the least.