Cannot Restart Individual Nodes in Multi-Node Setup

Description

I’m running into a connection issue: when a single node running on AWS ECS Fargate is restarted due to an error or maintenance, the other nodes cannot reconnect to it because its IP address changes, which ends up impacting the whole cluster. The only logs I saw were memberlist attempting to connect to what I would guess is the old IP of the node. This leaves the cluster unusable until I restart the other nodes so they pick up the new IP address after resolving the FQDN defined in CLUSTER_JOIN.

Unfortunately I’m unable to get a static IP for this setup and didn’t see any alternatives besides controlling restarts of the cluster.

Is there a strategy for restarting individual nodes, or does my setup require a full restart of the entire cluster for nodes to connect to each other? I would like a way to simply clear out the stale IP and use the FQDN to reconnect in these cases.

Server Setup Information

  • Weaviate Server Version: 1.24.21
  • Deployment Method: Docker on AWS ECS Fargate
  • Multi Node? Number of Running Nodes: 5
  • Client Language and Version: Python 4.6.5
  • Multitenancy?: True

Any additional Information

I attempted to upgrade to 1.25.29 to use RAFT_ENABLE_FQDN_RESOLVER but didn’t have any luck with resolving this issue.
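
For context, a minimal sketch of the kind of check I run from inside a task to see what the CLUSTER_JOIN FQDN currently resolves to, so I can compare it against the old IP memberlist keeps dialing (the hostname and port are placeholders for my setup):

import socket

# Placeholders: whatever FQDN CLUSTER_JOIN points at in the task definition,
# and the gossip port used in my setup.
cluster_join_fqdn = "weaviate-founder.internal.example"
gossip_port = 7100

# Resolve the name the same way a node would and print every address returned,
# to compare against the IP memberlist is still trying to reach.
for family, _, _, _, sockaddr in socket.getaddrinfo(
    cluster_join_fqdn, gossip_port, proto=socket.IPPROTO_TCP
):
    print(family.name, sockaddr[0])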

Hi!

Those are old versions :grimacing:

1.25+ will get you RAFT, so if you are on that version, that’s better than 1.24.

Can you try increasing RAFT_BOOTSTRAP_TIMEOUT as per this doc?
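
As a rough sketch of what that could look like (the values are only examples and the FQDN is a placeholder, shown here simply as a mapping of the container environment in your task definition):

# Sketch only: environment entries for the Weaviate container in the ECS task
# definition. Values are illustrations, not recommendations for your dataset.
weaviate_environment = {
    "CLUSTER_JOIN": "weaviate-founder.internal.example:7100",  # placeholder FQDN
    "RAFT_BOOTSTRAP_TIMEOUT": "600",  # seconds to wait for the cluster to bootstrap
    "RAFT_ENABLE_FQDN_RESOLVER": "true",
}

for key, value in weaviate_environment.items():
    print(f"{key}={value}")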

Not sure about the size of your dataset, but back in 1.24 or 1.25 the load process could make it hard for a node to get into the cluster in time, so by the time it was able to communicate, the voting pool could have already ended, leading to a never-ending cycle: after some time, k8s can kill the pod, restarting the process all over again.

Also, you can try enabling lazy loading, though I am not sure it was on by default in those versions.

Let me know if this helps.

Thanks!

I tried a few variations of these values without much success.

Here are the logs I get:

memberlist: Failed UDP ping: node0 (timeout reached)
memberlist: Marking node0 as failed, suspect timeout reached (0 peer confirmations)
memberlist: Failed UDP ping: node0 (timeout reached)

Is there a way to increase the timeout for a node to join the cluster, or to implement retry logic with more than 2 connection attempts? RAFT_BOOTSTRAP_TIMEOUT seems to be on the right track, but it doesn’t cover the node itself connecting.
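
As a stopgap on my side, the sketch below is roughly what I have in mind: keep polling the restarted node’s readiness endpoint with a generous number of attempts instead of the couple of quick tries I see in the logs (the URL and timings are placeholders for my setup):

import time
import urllib.error
import urllib.request

# Placeholders: the restarted node's HTTP endpoint and how patient to be.
node_ready_url = "http://weaviate-node-3.internal.example:8080/v1/.well-known/ready"
max_attempts = 30
delay_seconds = 10

# Poll the standard readiness endpoint until the node reports ready.
for attempt in range(1, max_attempts + 1):
    try:
        with urllib.request.urlopen(node_ready_url, timeout=5) as response:
            if response.status == 200:
                print(f"node ready after {attempt} attempt(s)")
                break
    except (urllib.error.URLError, OSError) as exc:
        print(f"attempt {attempt}: not ready yet ({exc})")
    time.sleep(delay_seconds)
else:
    print("node never became ready; falling back to restarting the peers")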

My suggestion is to upgrade to at least 1.28.latest, as a lot of those issues were fixed with the usage of RAFT instead of memberlist alone.

RAFT was implemented somewhere in 1.25, so I am not sure your version has it :thinking:

Appreciate the help. I didn’t get a solution working even after upgrading all the way to 1.28. It seems to be a connection issue where a node attempts to get the status of the founding node and times out after 30 seconds. That timeout appears to be hardcoded in the code base, but I’m uncertain whether this is the actual issue or whether the FQDN is not resolving properly.
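
For reference, a minimal sketch of the kind of check I run after a restart to confirm every node rejoined, using the v4 Python client (host names and ports are placeholders for my setup):

import weaviate

# Placeholders: any reachable node's HTTP/gRPC endpoints in my cluster.
client = weaviate.connect_to_custom(
    http_host="weaviate-node-1.internal.example",
    http_port=8080,
    http_secure=False,
    grpc_host="weaviate-node-1.internal.example",
    grpc_port=50051,
    grpc_secure=False,
)

try:
    # List every node the cluster currently knows about and its status,
    # to confirm the restarted node actually came back.
    for node in client.cluster.nodes(output="verbose"):
        print(node.name, node.status)
finally:
    client.close()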

Thanks again.