I’m the admin of a Weaviate 1.13 cluster deployed on Kubernetes that is used by a few different teams in my organisation.
Several of the replicas in the cluster have recently gone down due to OOM errors, and now it seems like they’re unable to recover and are continuously crashlooping. The logs for each of the instances have nothing of note, even when set to DEBUG level. Several classes are now inaccessible due to data loss.
What’s the best way to recover from this scenario? I assume we need to delete the missing classes and re-index them? Is there some configuration I can set in Weaviate that will stop it from indexing new content when it’s close to its memory limit?
With later versions of Weaviate you can set the GOMEMLIMIT environment variable, which should be set 10-20% below the total memory available to Weaviate (that is, to roughly 80-90% of it). This setting greatly helps with OOM kills. Besides that, we have made numerous improvements, such as:
a backup API for backing up your database
roaring bitmaps, which greatly improve performance
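As a rough illustration of sizing GOMEMLIMIT, here is a minimal Python sketch. It assumes the goal is to keep the Go runtime's soft limit somewhat below the container's memory limit, so the garbage collector gets more aggressive before Kubernetes OOM-kills the pod. The function name `gomemlimit_for` and the default headroom are hypothetical choices for this example, not part of Weaviate; the `MiB` suffix is one of the unit suffixes the Go runtime accepts for GOMEMLIMIT.

```python
def gomemlimit_for(container_limit_mib: int, headroom_pct: int = 15) -> str:
    """Return a GOMEMLIMIT value that leaves `headroom_pct` percent of the
    container memory limit free (hypothetical helper, for illustration only)."""
    if container_limit_mib <= 0:
        raise ValueError("container_limit_mib must be positive")
    if not 0 < headroom_pct < 100:
        raise ValueError("headroom_pct must be between 1 and 99")
    # Integer arithmetic avoids floating-point rounding surprises.
    limit_mib = container_limit_mib * (100 - headroom_pct) // 100
    return f"{limit_mib}MiB"


if __name__ == "__main__":
    # Example: a pod with a 4096 MiB memory limit and 20% headroom.
    print(gomemlimit_for(4096, headroom_pct=20))
```

The resulting string would then go into the GOMEMLIMIT environment variable of the Weaviate container. Note that GOMEMLIMIT is a soft limit: it reduces the chance of an OOM kill but does not stop Weaviate from accepting new data when memory is tight.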