Weaviate Cluster OOMs and Recovery

I’m the admin of a Weaviate 1.13 cluster deployed on Kubernetes that is used by a few different teams in my organisation.

Several of the replicas in the cluster have recently gone down due to OOM errors, and now it seems like they’re unable to recover and are continuously crashlooping. The logs for each of the instances have nothing of note, even when set to DEBUG level. Several classes are now inaccessible due to data loss.

What’s the best way to recover from this scenario? I assume we need to delete the missing classes and re-index them? Is there some configuration I can set in Weaviate that will stop it from indexing new content when it’s close to its memory limit?


Hi @Lewiky. I’ll pass that on to the team internally and someone will get back to you here.

Your Weaviate cluster is on v1.13.x, yes?

That’s right. I’ve been trying to get authorisation internally to upgrade to a newer version.

If the solution is to just upgrade because this problem is fixed - that’s music to my ears!

With later versions of Weaviate you are able to set the GOMEMLIMIT value, which should be set 10-20% below the total memory available to Weaviate. This setting greatly helps with OOM-kills. Besides that we have made numerous improvements like:

  • backup API for backing up your DB
  • roaring bitmaps, which greatly improve filtering performance
  • PQ compression
  • filters for BM25 and hybrid search
  • replication

and much much more. I would suggest upgrading :slight_smile: but of course before doing so please make a backup :slight_smile:
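To make the GOMEMLIMIT advice concrete, here is a minimal sketch of the relevant part of a Kubernetes pod spec. The 16Gi limit and the container name are just placeholders for illustration; the point is setting GOMEMLIMIT 10-20% below whatever memory limit your pods actually have:

```yaml
# Illustrative fragment of a Weaviate StatefulSet pod spec (names/values are examples).
containers:
  - name: weaviate
    resources:
      limits:
        memory: "16Gi"
    env:
      # Soft memory limit for the Go runtime, set ~12% below the container
      # limit so the GC works harder before the kernel OOM-kills the pod.
      - name: GOMEMLIMIT
        value: "14GiB"
```

GOMEMLIMIT accepts a byte count with an optional unit suffix (KiB, MiB, GiB, etc.), as documented for the Go runtime.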
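And for the backup-before-upgrading part, a rough sketch of what a backup call looks like via the REST backup API (available from v1.15, so this only works once you are on a newer version). The host, backend, and backup id here are placeholders, and the `filesystem` backend assumes the `backup-filesystem` module is enabled with a `BACKUP_FILESYSTEM_PATH` configured:

```shell
# Start a backup to the filesystem backend (id is a placeholder):
curl -X POST "http://localhost:8080/v1/backups/filesystem" \
  -H "Content-Type: application/json" \
  -d '{"id": "pre-upgrade-backup"}'

# Poll the backup's status until it reports SUCCESS:
curl "http://localhost:8080/v1/backups/filesystem/pre-upgrade-backup"
```

There are also S3/GCS backup backends if you would rather not keep backups on local disk.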