Cannot Restart Individual Nodes in Multi-Node Setup

Description

I’m running into a connection issue: when a single node running on AWS ECS Fargate is restarted due to an error or maintenance, the other nodes cannot reconnect to it because its IP address changes, which ends up impacting the whole cluster. The only logs I saw were memberlist attempting to connect to what I would guess is the old IP of the node. This leaves the cluster unusable until I restart the other nodes so they pick up the new IP address after resolving the FQDN defined in CLUSTER_JOIN.

Unfortunately I’m unable to get a static IP for this setup and didn’t see any alternatives besides controlling restarts of the cluster.

Is there a strategy for restarting individual nodes, or does my setup require a full restart of the entire cluster for nodes to connect to each other? I would like a way to simply clear out the stale IP and use the FQDN to reconnect in these cases.

Server Setup Information

  • Weaviate Server Version: 1.24.21
  • Deployment Method: Docker on AWS ECS Fargate
  • Multi Node? Number of Running Nodes: 5
  • Client Language and Version: Python 4.6.5
  • Multitenancy?: True

Any additional Information

I attempted to upgrade to 1.25.29 to use RAFT_ENABLE_FQDN_RESOLVER but didn’t have any luck with resolving this issue.
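
For context, a minimal sketch of the kind of check I run from inside a task to see what the CLUSTER_JOIN FQDN currently resolves to, so I can compare it against the old IP memberlist keeps dialing (the hostname and port are placeholders for my setup):

import socket

# Placeholders: whatever FQDN CLUSTER_JOIN points at in the task definition,
# and the gossip port used in my setup.
cluster_join_fqdn = "weaviate-founder.internal.example"
gossip_port = 7100

# Resolve the name the same way a node would and print every address returned,
# to compare against the IP memberlist is still trying to reach.
for family, _, _, _, sockaddr in socket.getaddrinfo(
    cluster_join_fqdn, gossip_port, proto=socket.IPPROTO_TCP
):
    print(family.name, sockaddr[0])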

Hi!

Those are old versions :grimacing:

1.25+ will get you RAFT, so if you are on that version, that’s better than 1.24.

Can you try increasing RAFT_BOOTSTRAP_TIMEOUT as per this doc?
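
As a rough sketch of what that could look like (the values are only examples and the FQDN is a placeholder, shown here simply as a mapping of the container environment in your task definition):

# Sketch only: environment entries for the Weaviate container in the ECS task
# definition. Values are illustrations, not recommendations for your dataset.
weaviate_environment = {
    "CLUSTER_JOIN": "weaviate-founder.internal.example:7100",  # placeholder FQDN
    "RAFT_BOOTSTRAP_TIMEOUT": "600",  # seconds to wait for the cluster to bootstrap
    "RAFT_ENABLE_FQDN_RESOLVER": "true",
}

for key, value in weaviate_environment.items():
    print(f"{key}={value}")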

Not sure about the size of your dataset, but back in 1.24 or 1.25 the load process could make it hard for a node to get into the cluster in time, so by the time it was able to communicate, the voting pool could have already ended, leading to a never-ending cycle: after some time, k8s can kill the pod, restarting the process all over again.

Also, you can try enabling lazy loading, though I am not sure it was on by default in those versions.

Let me know if this helps.

Thanks!

I tried a few variations of these values without much success.

Here are the logs I get:

memberlist: Failed UDP ping: node0 (timeout reached)
memberlist: Marking node0 as failed, suspect timeout reached (0 peer confirmations)
memberlist: Failed UDP ping: node0 (timeout reached)

Is there a way to increase the timeout for a node to join the cluster, or to implement retry logic with more than 2 connection attempts? RAFT_BOOTSTRAP_TIMEOUT seems to be on the right track, but it doesn’t cover the node itself connecting.
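
As a stopgap on my side, the sketch below is roughly what I have in mind: keep polling the restarted node’s readiness endpoint with a generous number of attempts instead of the couple of quick tries I see in the logs (the URL and timings are placeholders for my setup):

import time
import urllib.error
import urllib.request

# Placeholders: the restarted node's HTTP endpoint and how patient to be.
node_ready_url = "http://weaviate-node-3.internal.example:8080/v1/.well-known/ready"
max_attempts = 30
delay_seconds = 10

# Poll the standard readiness endpoint until the node reports ready.
for attempt in range(1, max_attempts + 1):
    try:
        with urllib.request.urlopen(node_ready_url, timeout=5) as response:
            if response.status == 200:
                print(f"node ready after {attempt} attempt(s)")
                break
    except (urllib.error.URLError, OSError) as exc:
        print(f"attempt {attempt}: not ready yet ({exc})")
    time.sleep(delay_seconds)
else:
    print("node never became ready; falling back to restarting the peers")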

My suggestion is to upgrade to at least 1.28.latest, as a lot of those issues were fixed with the usage of RAFT instead of memberlist alone.

RAFT was implemented somewhere in 1.25, so I am not sure your version has it :thinking:

Appreciate the help. I didn’t get a solution working even after upgrading all the way to 1.28. It seems to be a connection issue where a node attempts to get the status of the founding node and times out after 30 seconds. That timeout appears to be hardcoded in the code base, but I’m uncertain whether this is the actual issue or whether the FQDN is not resolving properly.
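
For reference, a minimal sketch of the kind of check I run after a restart to confirm every node rejoined, using the v4 Python client (host names and ports are placeholders for my setup):

import weaviate

# Placeholders: any reachable node's HTTP/gRPC endpoints in my cluster.
client = weaviate.connect_to_custom(
    http_host="weaviate-node-1.internal.example",
    http_port=8080,
    http_secure=False,
    grpc_host="weaviate-node-1.internal.example",
    grpc_port=50051,
    grpc_secure=False,
)

try:
    # List every node the cluster currently knows about and its status,
    # to confirm the restarted node actually came back.
    for node in client.cluster.nodes(output="verbose"):
        print(node.name, node.status)
finally:
    client.close()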

Thanks again.