Node Desync and Cluster Inconsistencies After OOM on Weaviate-0

Hi @andrewisplinghoff!

From our knowledge base:

When you see the error “hashbeat iteration failed: collecting differences”, it typically indicates one of these situations:

  • The nodes are having trouble communicating with each other
  • One or more nodes are down or are failing to synchronize
  • If async replication is enabled, the cluster may be struggling to keep shards consistent across replicas

In most cases, this error is a symptom rather than the root problem. If you’re seeing this error, you should:

  • Check if all nodes are up and running properly
  • Verify network connectivity between nodes
  • Check disk space and resource utilization (CPU and memory) on each node
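
For the first check, here is a minimal sketch in Python that asks the cluster which nodes it considers healthy. It assumes the REST API is reachable at http://localhost:8080 (for example through a port-forward) and that GET /v1/nodes returns a "nodes" list whose entries carry "name" and "status" fields. The helper names fetch_nodes and unhealthy_nodes are made up for this example, so adjust them and the URL to your setup.

```python
import json
from urllib.request import urlopen


def fetch_nodes(base_url="http://localhost:8080", timeout=5.0):
    """Call GET /v1/nodes and return the parsed JSON body.

    base_url is an assumption; point it at any reachable node.
    """
    with urlopen(f"{base_url}/v1/nodes", timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_nodes(payload):
    """Return (name, status) for every node whose status is not HEALTHY."""
    return [
        (node.get("name", "<unknown>"), node.get("status", "<missing>"))
        for node in payload.get("nodes", [])
        if node.get("status") != "HEALTHY"
    ]
```

Calling unhealthy_nodes(fetch_nodes()) returns an empty list when every node reports HEALTHY. Running it against each pod in turn can also show whether the nodes disagree about who is in the cluster.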

If you are using async replication, the node may be getting overwhelmed by async replication operations while it bootstraps, leaving it without enough resources to respond to Raft elections.

In that case, you can try tweaking the ASYNC_ env vars, especially ASYNC_REPLICATION_PROPAGATION_CONCURRENCY. Lowering it reduces the number of concurrent async replication operations and leaves some resources free for Raft.

Let me know if that helps!

Thanks!