Should founding node be able to rejoin cluster in multinode setup?

Hello,

I’m trying to set up a 3-node Weaviate cluster on ECS. I followed the Docker tutorial (Docker | Weaviate) and other related topics on this forum. Everything works fine, but when I restart the founding node (to simulate a crash), it doesn’t reconnect to the existing cluster with the other nodes. Instead, it creates a new cluster.

Is this the expected behavior, or should the founding node be able to rejoin the original cluster? I’m using Weaviate version 1.31.0.

Thanks!

Good morning @Dominik_Doberski,

Welcome to our community! It’s lovely to have you here, mate :blush: I hope you’re having a great week. :star_struck:

What you’re experiencing isn’t the expected behavior. In a correctly configured Weaviate cluster, a node should be able to rejoin after a restart rather than forming a new cluster. Cluster membership and state should persist across restarts—assuming the configuration is intact.

Could you please share your configuration file with me? I’d be happy to take a closer look and, if possible, try to replicate the issue on my side. Any additional details you can provide would also be helpful.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)

I noticed the same behaviour. On the founding node, CLUSTER_JOIN is not being set and without that, the existing memberlist cluster will not be joined (code doesn’t go into this conditional):


@andrewisplinghoff Thank you so much! Absolutely — @Dominik_Doberski, in your configuration, please make sure to review the following documentation for multi-node setup:

See the CLUSTER_JOIN parameter. It must be set to the service name of the founding node in the cluster. This ensures proper cluster formation and rejoining behavior after restarts.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)

Thank you for the reply. The documentation you referenced states that the CLUSTER_JOIN variable should only be set for nodes other than the founding node. However, my issue is that the founding node does not reconnect to the cluster after a restart. The rest of the nodes are functioning as expected.

Below is the configuration of environment variables we use for all nodes (we are deploying using Pulumi):

environment: [
  { name: 'QUERY_DEFAULTS_LIMIT', value: '25' },
  { name: 'AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED', value: 'true' },
  { name: 'PERSISTENCE_DATA_PATH', value: '/var/lib/weaviate' },
  { name: 'DEFAULT_VECTORIZER_MODULE', value: 'none' },
  { name: 'ENABLE_API_BASED_MODULES', value: 'true' },
  { name: 'LOG_LEVEL', value: 'debug' },
  { name: 'CLUSTER_HOSTNAME', value: nodeName },
  { name: 'CLUSTER_GOSSIP_BIND_PORT', value: gossipBindPort },
  { name: 'CLUSTER_DATA_BIND_PORT', value: dataBindPort },
  { name: 'RAFT_JOIN', value: nodeJoinList }, // "node1,node2,node3"
  {
    name: 'RAFT_BOOTSTRAP_EXPECT',
    value: WeaviateService.NODE_NAMES.length.toString(),
  },
  {
    name: 'REPLICATION_MINIMUM_FACTOR',
    value: WeaviateService.NODE_NAMES.length.toString(),
  },
  { name: 'DEFAULT_SHARD_COUNT', value: '1' },
  // CLUSTER_JOIN is added only for nodes other than the first (founding) node.
  ...(isFirstNode
    ? []
    : [
        {
          name: 'CLUSTER_JOIN',
          value: `node1.weaviate.local:${WeaviateService.BASE_GOSSIP_PORT}`,
        },
      ]),
],

We are using AWS Cloud Map to ensure proper IP resolution between nodes. This is why node1.weaviate.local:${WeaviateService.BASE_GOSSIP_PORT} is used for the CLUSTER_JOIN variable.
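For anyone who wants to experiment with the configuration above: the conditional spread means node1 never receives CLUSTER_JOIN, on first boot or after a restart. Below is a minimal sketch of a variant that gives every node, including the founding one, a join list of its peers. It rests on two assumptions that are not confirmed anywhere in this thread: that CLUSTER_JOIN accepts a comma-separated list of host:port addresses, and that setting it on the founding node does not break the very first bootstrap. The helper name `buildClusterJoinEnv`, the `EnvVar` shape, and the port 7100 are hypothetical placeholders, not part of the original setup.

```typescript
// Hypothetical shape of one container environment entry.
interface EnvVar {
  name: string;
  value: string;
}

// Hypothetical helper: build a CLUSTER_JOIN entry listing every peer of `self`.
// Assumption (unverified here): CLUSTER_JOIN accepts comma-separated host:port addresses.
function buildClusterJoinEnv(
  self: string,
  allNodes: string[],
  domain: string,
  gossipPort: number,
): EnvVar[] {
  const peers = allNodes
    .filter((node) => node !== self)
    .map((node) => `${node}.${domain}:${gossipPort}`);
  // With no peers (single-node setup) no CLUSTER_JOIN entry is emitted.
  return peers.length > 0
    ? [{ name: 'CLUSTER_JOIN', value: peers.join(',') }]
    : [];
}

// Placeholder node names, domain, and port for illustration only.
const nodes = ['node1', 'node2', 'node3'];
const node1Join = buildClusterJoinEnv('node1', nodes, 'weaviate.local', 7100);
console.log(node1Join);
```

In the Pulumi snippet this would replace the `...(isFirstNode ? [] : [...])` spread with something like `...buildClusterJoinEnv(nodeName, WeaviateService.NODE_NAMES, 'weaviate.local', WeaviateService.BASE_GOSSIP_PORT)`, so a restarted node1 would have node2 and node3 to contact. Whether that actually resolves the rejoin problem would still need to be tested.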

I’m not entirely sure how to replicate this yet. Let’s see if someone in the community with a similar environment can help.

Would you mind raising this as an issue on the Weaviate GitHub repo? It might be RAFT-related and worth the team’s attention:

Best regards,
Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)

Adding to this thread, I’ve had issues with the founding node crashing on AWS ECS because of this constant (AWS ECS is slow to deploy nodes):

Founding nodes that take longer than 30 seconds to restart end up spinning up without any other nodes joining, and then just hang until the rest of the nodes are restarted. I worked around this partly by writing a custom restart job, but that can’t handle unexpected crashes of the node without manual intervention.

Is there a plan to make this value configurable?

Thanks for the replies. I created an issue: Founding node is unable to rejoin cluster on ECS · Issue #8423 · weaviate/weaviate · GitHub

Hey @Dominik_Doberski,

Thanks for reporting the issue.

I’m curious—what value does CLUSTER_JOIN have for the node after it restarts? From the snippet you shared, it looks like CLUSTER_JOIN might be empty. One possible reason could be the logic behind isFirstNode.

If you get a chance, could you debug the node’s environment variables and check what value is being set for CLUSTER_JOIN? The behavior you’re seeing does seem consistent with it being empty.

Thanks!
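One way to run the check suggested above before deploying is to inspect the generated environment list for each node and flag any variable that is absent or blank. The sketch below is only illustrative: `missingEnvVars` and the `EnvVar` shape are hypothetical names, and the sample list is a trimmed stand-in for what the conditional spread produces when isFirstNode is true.

```typescript
// Hypothetical shape of one container environment entry.
interface EnvVar {
  name: string;
  value: string;
}

// Returns the names from `required` that are absent or blank in `env`.
function missingEnvVars(env: EnvVar[], required: string[]): string[] {
  return required.filter((name) => {
    const entry = env.find((e) => e.name === name);
    return entry === undefined || entry.value.trim() === '';
  });
}

// Trimmed example of the founding node's environment (isFirstNode === true),
// where the conditional spread contributes no CLUSTER_JOIN entry.
const foundingNodeEnv: EnvVar[] = [
  { name: 'CLUSTER_HOSTNAME', value: 'node1' },
  { name: 'RAFT_JOIN', value: 'node1,node2,node3' },
];

const missing = missingEnvVars(foundingNodeEnv, ['CLUSTER_HOSTNAME', 'CLUSTER_JOIN']);
console.log(missing); // [ 'CLUSTER_JOIN' ]
```

This only inspects what the deployment code generates. Confirming the value inside the running container (for example via the ECS task definition or the container's actual environment) is still the more reliable check.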
