Hi community!
I’m deploying Weaviate on k8s with 5 replicas. I’ve seen inter-replica communication errors where one replica fails its readiness probe and logs the following message:
{"level":"info","msg":" memberlist: Suspect weaviate-3 has failed, no acks received","time":"2023-11-02T22:27:28Z"}
However, the liveness probe still passes, so the pod is not recreated automatically and I have to recreate it manually.
Today I saw a split-brain situation where 2 replicas could communicate with each other but not with the other 3. The error was the same as above, but because some replicas could still reach each other, the readiness probe passed.
- Is there a way to monitor the cluster membership status?
- Are those issues tracked somewhere?
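
To make the first question concrete, here is a rough sketch of the kind of check I have in mind. It assumes each replica can be queried individually at its own URL and that the REST endpoint `/v1/nodes` returns a `nodes` list with `name` and `status` fields. The per-pod URLs (including the `weaviate-headless` service name) are hypothetical and depend on the deployment, and I'm not sure how a partitioned replica reports the peers it cannot reach, so the code only compares the set of peers each replica reports as `HEALTHY`.

```python
import json
from urllib.request import urlopen


def fetch_view(base_url, timeout=5.0):
    """Return {node_name: status} as reported by one replica.

    Assumes GET <base_url>/v1/nodes answers with {"nodes": [{"name", "status", ...}]}.
    """
    with urlopen(f"{base_url}/v1/nodes", timeout=timeout) as resp:
        payload = json.load(resp)
    return {n["name"]: n.get("status", "UNKNOWN") for n in payload.get("nodes", [])}


def find_partitions(views):
    """Group replicas by the set of peers each one reports as HEALTHY.

    views maps replica name -> {node_name: status}.
    More than one group means the replicas disagree about membership,
    which is what a split brain would look like from the outside.
    """
    groups = {}
    for replica, view in views.items():
        healthy = frozenset(name for name, status in view.items() if status == "HEALTHY")
        groups.setdefault(healthy, []).append(replica)
    return {peers: sorted(members) for peers, members in groups.items()}


def check_cluster(replica_urls):
    """Query every replica and return the membership groups.

    replica_urls maps replica name -> base URL. An unreachable replica
    is recorded with an empty view instead of aborting the whole check.
    """
    views = {}
    for name, url in replica_urls.items():
        try:
            views[name] = fetch_view(url)
        except (OSError, ValueError):
            views[name] = {}
    return find_partitions(views)


# Hypothetical usage; the DNS names below are placeholders, not taken from my setup:
# urls = {f"weaviate-{i}": f"http://weaviate-{i}.weaviate-headless:8080" for i in range(5)}
# groups = check_cluster(urls)
# if len(groups) > 1:
#     print("replicas disagree on membership:", groups)
```

In the situation described above I would expect two groups (one with 2 replicas, one with 3), which the readiness probe alone does not reveal. If there is a built-in metric or endpoint that already exposes this, I'd much rather use that.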