Memberlist: Suspect weaviate-x has failed, no acks received

Hi community!

I’m deploying Weaviate on Kubernetes with 5 replicas. I’ve seen communication errors between replicas where one replica fails its readiness probe, and the following error appears in the log:

{"level":"info","msg":" memberlist: Suspect weaviate-3 has failed, no acks received","time":"2023-11-02T22:27:28Z"}

However, the liveness probe still passes, so the pod is not recreated and I have to recreate it manually.

Today I saw a split-brain situation where 2 replicas could communicate with each other but not with the other 3. The error was the same as above, but because some replicas could still reach each other in this case, the readiness probe passed.

  • Is there a way to monitor the cluster membership status?
  • Are those issues tracked somewhere?
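
For the split-brain case, one rough way to detect it would be to collect each replica's own view of the membership and check whether the views agree. This is only a sketch: the `views` mapping and the function name `group_by_view` are hypothetical, and how the per-replica views are collected (for example, by querying each pod individually) is an assumption and not something Weaviate provides in this form.

```python
from collections import defaultdict


def group_by_view(views: dict[str, set[str]]) -> list[list[str]]:
    """Group replicas that report the same set of live members.

    `views` maps a replica name to the set of members it currently
    sees as alive (hypothetical input, collected per pod).
    More than one group means the replicas disagree about membership.
    """
    groups: dict[frozenset[str], list[str]] = defaultdict(list)
    for replica, seen in views.items():
        groups[frozenset(seen)].append(replica)
    # Largest group first, names sorted for stable output.
    return sorted(
        (sorted(members) for members in groups.values()),
        key=lambda g: (-len(g), g),
    )


if __name__ == "__main__":
    minority = {"weaviate-0", "weaviate-1"}
    majority = {"weaviate-2", "weaviate-3", "weaviate-4"}
    views = {name: minority for name in minority}
    views.update({name: majority for name in majority})
    partitions = group_by_view(views)
    if len(partitions) > 1:
        print("membership views disagree:", partitions)
```

With the 2/3 split described above, this would report two groups, which could be used to raise an alert even when every readiness probe passes.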

Hi @Natasha_Tomattis

I will check internally if we have seen this kind of issue.

Thanks for reporting.

Oh, by the way, to check the status of the nodes, you can monitor this endpoint:
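
A minimal sketch of polling node status, assuming the endpoint meant here is Weaviate's `GET /v1/nodes` and that its response contains a `nodes` list whose entries have `name` and `status` fields, with `HEALTHY` as the good state. The base URL and the helper names `fetch_nodes` and `unhealthy_nodes` are hypothetical.

```python
import json
import urllib.request


def unhealthy_nodes(payload: dict) -> list[str]:
    """Return names of nodes whose status is anything other than HEALTHY."""
    return [
        node.get("name", "<unknown>")
        for node in payload.get("nodes", [])
        if node.get("status") != "HEALTHY"
    ]


def fetch_nodes(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/v1/nodes and decode the JSON body (assumed endpoint)."""
    with urllib.request.urlopen(f"{base_url}/v1/nodes", timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Hypothetical address; replace with your Weaviate service URL.
    bad = unhealthy_nodes(fetch_nodes("http://localhost:8080"))
    if bad:
        print("nodes not healthy:", ", ".join(bad))
```

Running this on a schedule against each replica, and not only against the load-balanced service, may help catch cases where replicas disagree about which nodes are up.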

Thanks! Are those metrics exported to Prometheus?

Another question: I’m getting a lot of errors like this one:

{"level":"error","msg":"\"10.227.22.182:7001\": connect: Post \"http://10.227.22.182:7001/replicas/indices/Embedding_e4103da1e6f34f3e9fea894d530f7733/shards/koVJ5jkRMZfE/objects?request_id=weaviate-2-01-18baa1f81df-1\": context canceled","op":"broadcast","time":"2023-11-07T14:11:13Z"}

These errors are constant. Do you have any idea what could be causing them?