Should founding node be able to rejoin cluster in multinode setup?

Hello,

I’m trying to set up a 3-node Weaviate cluster on ECS. I followed the Docker tutorial (Docker | Weaviate) and other related topics on this forum. Everything works fine, but when I restart the founding node (to simulate a crash), it doesn’t reconnect to the existing cluster with the other nodes. Instead, it creates a new cluster.

Is this the expected behavior, or should the founding node be able to rejoin the original cluster? I’m using Weaviate version 1.31.0.

Thanks!

Good morning @Dominik_Doberski,

Welcome to our community! It’s lovely to have you here, mate :blush: I hope you’re having a great week. :star_struck:

What you’re experiencing isn’t the expected behavior. In a correctly configured Weaviate cluster, a node should be able to rejoin after a restart rather than forming a new cluster. Cluster membership and state should persist across restarts—assuming the configuration is intact.

Could you please share your configuration file with me? I’d be happy to take a closer look and, if possible, try to replicate the issue on my side. Any additional details you can provide would also be helpful.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)

I noticed the same behaviour. On the founding node, CLUSTER_JOIN is not being set and without that, the existing memberlist cluster will not be joined (code doesn’t go into this conditional):


@andrewisplinghoff Thank you so much! Absolutely — @Dominik_Doberski, in your configuration, please make sure to review the following documentation for multi-node setup:

See the CLUSTER_JOIN parameter. It must be set to the service name of the founding node in the cluster. This ensures proper cluster formation and rejoining behavior after restarts.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, GMT/UTC timezone)

Thank you for the reply. The documentation you referenced states that the CLUSTER_JOIN variable should only be set for nodes other than the founding node. However, my issue is that the founding node does not reconnect to the cluster after a restart. The rest of the nodes are functioning as expected.

Below is the configuration of environment variables we use for all nodes (we are deploying using Pulumi):

environment: [
  { name: 'QUERY_DEFAULTS_LIMIT', value: '25' },
  { name: 'AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED', value: 'true' },
  { name: 'PERSISTENCE_DATA_PATH', value: '/var/lib/weaviate' },
  { name: 'DEFAULT_VECTORIZER_MODULE', value: 'none' },
  { name: 'ENABLE_API_BASED_MODULES', value: 'true' },
  { name: 'LOG_LEVEL', value: 'debug' },
  { name: 'CLUSTER_HOSTNAME', value: nodeName },
  { name: 'CLUSTER_GOSSIP_BIND_PORT', value: gossipBindPort },
  { name: 'CLUSTER_DATA_BIND_PORT', value: dataBindPort },
  { name: 'RAFT_JOIN', value: nodeJoinList }, // "node1,node2,node3"
  {
    name: 'RAFT_BOOTSTRAP_EXPECT',
    value: WeaviateService.NODE_NAMES.length.toString(),
  },
  {
    name: 'REPLICATION_MINIMUM_FACTOR',
    value: WeaviateService.NODE_NAMES.length.toString(),
  },
  { name: 'DEFAULT_SHARD_COUNT', value: '1' },
  ...(isFirstNode
    ? []
    : [
        {
          name: 'CLUSTER_JOIN',
          value: `node1.weaviate.local:${WeaviateService.BASE_GOSSIP_PORT}`,
        },
      ]),
],

We are using AWS Cloud Map to ensure proper IP resolution between nodes. This is why node1.weaviate.local:${WeaviateService.BASE_GOSSIP_PORT} is used for the CLUSTER_JOIN variable.
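For anyone who wants to experiment with the observation above that the founding node never attempts a join: one untested variation is to give every node, including node1, a CLUSTER_JOIN value that lists its peers. The sketch below only builds that value. The helper `clusterJoinEnv`, the `Peer` type, and port 7100 are hypothetical placeholders and not part of Weaviate or Pulumi. It also assumes CLUSTER_JOIN accepts a comma-separated list of host:port entries, and I have not verified how Weaviate behaves on first boot when none of the listed peers are reachable yet.

```typescript
// Hypothetical helper for illustration only; not a Weaviate or Pulumi API.
interface EnvVar {
  name: string;
  value: string;
}

interface Peer {
  name: string;       // Cloud Map service name, e.g. "node1"
  gossipPort: number; // that node's CLUSTER_GOSSIP_BIND_PORT
}

// Builds a CLUSTER_JOIN entry that points at every node except `self`.
// Assumption: CLUSTER_JOIN accepts a comma-separated list of host:port pairs.
function clusterJoinEnv(self: string, peers: Peer[], domain: string): EnvVar {
  const targets = peers
    .filter((p) => p.name !== self)
    .map((p) => `${p.name}.${domain}:${p.gossipPort}`);
  return { name: 'CLUSTER_JOIN', value: targets.join(',') };
}

// Placeholder node list; 7100 is an assumed gossip port.
const examplePeers: Peer[] = [
  { name: 'node1', gossipPort: 7100 },
  { name: 'node2', gossipPort: 7100 },
  { name: 'node3', gossipPort: 7100 },
];

// The founding node would now get a join list too, instead of no entry at all.
const node1Join = clusterJoinEnv('node1', examplePeers, 'weaviate.local');
```

In the configuration above, this would replace the `isFirstNode` conditional with something like `clusterJoinEnv(nodeName, peers, 'weaviate.local')`, so a restarted node1 has existing members to contact.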

I’m not entirely sure how to replicate this yet. Let’s see whether someone in the community with a similar environment can help.

Would you mind raising this as an issue on the Weaviate GitHub repo? It might be RAFT-related and worth the team’s attention:

Best regards,
Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)