[How to recover from Weaviate cluster crash due to memory limit?]

Description

Hi Weaviate community, I have a Weaviate cluster running on AWS EKS. It crashed earlier while importing data because it hit the allocated memory limit.
Here are some error logs:

“No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true”

“active write-ahead-log found. Did weaviate crash prior to this? Trying to recover…”
It seems to be a deadlock: EKS keeps restarting the service and Weaviate keeps crashing, so I cannot connect from the client side to delete some data and bring memory usage back under the limit. Deploying a new resource configuration to the Weaviate pod (managed through a Helm chart) doesn't work either, because the pod keeps crashing before ArgoCD can sync the new config.
Is there a way to solve this issue? Any suggestion would be very appreciated, thank you!

PS: I added LIMIT_RESOURCES: true as an env variable in the Helm chart after this crash, but again, the new config change is never synced to the Weaviate cluster through ArgoCD because the pod keeps crashing.
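
For reference, the change I'm trying to deploy is roughly the following values.yaml override. This is a minimal sketch assuming the chart's standard `env` map and `resources` block; the exact key names and the memory figures are placeholders that depend on your chart version and dataset size.

```yaml
# Sketch of a Helm values.yaml override. Key names assume the
# standard Weaviate chart layout; memory figures are examples only.
env:
  LIMIT_RESOURCES: true   # tell Weaviate to respect the pod's limits
resources:
  requests:
    memory: "4Gi"
  limits:
    memory: "6Gi"
```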

Server Setup Information

  • Weaviate Server Version: 1.23.7
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: 3.21.0

Any additional Information

Hi!

Depending on the amount of data you have, it may take some time for Weaviate to start up, and the liveness and readiness probes from K8s may time out, forcing a restart of the nodes.

Can you try increasing those values?
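
For example, something along these lines. These are the standard Kubernetes probe fields; where they live in the chart's values.yaml, and the numbers themselves, depend on your chart version and data volume, so treat this as a sketch:

```yaml
# Standard Kubernetes probe fields; placement in the Weaviate chart's
# values.yaml depends on the chart version. Numbers are examples.
livenessProbe:
  initialDelaySeconds: 900   # allow a long WAL-recovery window on startup
  periodSeconds: 30
  failureThreshold: 30
readinessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  failureThreshold: 30
```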

Let me know if this helps, otherwise I can ask for help from our SRE team :slight_smile:

I didn't find a solution to break the ‘deadlock’ state through config changes alone.
I ended up deleting the pod completely and recreating it, which unblocked things.
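
For anyone who hits the same loop, the unblocking step was roughly the following; the pod, StatefulSet, and namespace names are placeholders for whatever your Helm release uses:

```bash
# Names below are placeholders; adjust to your release and namespace.
# Deleting the pod lets the StatefulSet recreate it, picking up the
# config that ArgoCD could not sync while it was crash-looping.
kubectl delete pod weaviate-0 -n weaviate

# Alternatively, stop the crash loop first, let the new config sync,
# then bring the node back:
kubectl scale statefulset weaviate --replicas=0 -n weaviate
kubectl scale statefulset weaviate --replicas=1 -n weaviate
```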