Issue: Weaviate Cluster Setup with Docker on Different Servers Failing

Description

I’m trying to set up a Weaviate cluster consisting of 3 nodes, each running on a different server, using Docker. However, I’m encountering an issue where the nodes fail to join the cluster. The error message I’m seeing is:

{"action":"bootstrap","error":"could not join a cluster from [172.18.0.2:8300]","level":"warning","msg":"failed to join cluster, will notify next if voter","servers":["172.18.0.2:8300"],"time":"2024-08-08T08:02:08Z","voter":true}
{"action":"bootstrap","candidates":[{"Suffrage":0,"ID":"192.168.1.52","Address":"172.18.0.2:8300"}],"level":"info","msg":"starting cluster bootstrapping","time":"2024-08-08T08:02:08Z"}
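
One detail worth noting in the log above: the bootstrap candidate has ID 192.168.1.52 (presumably the host IP used as CLUSTER_HOSTNAME) but Address 172.18.0.2:8300, which looks like a container-internal address rather than one the other servers can reach. The Python sketch below flags that kind of mismatch in a log line. It assumes the line is valid JSON, and the 172.16.0.0/12 range used for Docker's default bridge networks is an assumption about a default Docker install, not something stated in this thread. The helper name suspicious_candidates is made up for illustration.

```python
import ipaddress
import json

# Assumption: Docker's default bridge networks are usually allocated from 172.16.0.0/12.
DOCKER_BRIDGE_RANGE = ipaddress.ip_network("172.16.0.0/12")


def suspicious_candidates(log_line: str) -> list:
    """Return bootstrap candidates whose Address falls inside the assumed Docker range."""
    entry = json.loads(log_line)
    flagged = []
    for cand in entry.get("candidates", []):
        # Address has the form "host:port"; keep only the host part.
        host = cand["Address"].rsplit(":", 1)[0]
        if ipaddress.ip_address(host) in DOCKER_BRIDGE_RANGE:
            flagged.append({"id": cand["ID"], "address": cand["Address"]})
    return flagged


# Second log line from the post above.
line = (
    '{"action":"bootstrap","candidates":[{"Suffrage":0,"ID":"192.168.1.52",'
    '"Address":"172.18.0.2:8300"}],"level":"info",'
    '"msg":"starting cluster bootstrapping","time":"2024-08-08T08:02:08Z"}'
)
print(suspicious_candidates(line))
```

If the list is non-empty, the node may be advertising an address that only exists inside its own Docker network.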

Below is the docker-compose.yml file I’m using for each node:

version: '3.7'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.4
    environment:
      - AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true
      - ASYNC_INDEXING=true
      - PERSISTENCE_DATA_PATH=/var/lib/weaviate
      - ENABLE_MODULES=text2vec-ollama,generative-ollama
      - RAFT_JOIN=IP1:8300,IP2:8300,IP3:8300
      - CLUSTER_HOSTNAME=IP1
      - CLUSTER_GOSSIP_BIND_PORT=7100
      - CLUSTER_GOSSIP_JOIN=IP2:7100,IP3:7100
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data1:/var/lib/weaviate
    networks:
      - weaviate-cluster
networks:
  weaviate-cluster:
    driver: bridge
volumes:
  weaviate_data1:

Each server has a unique IP address (IP1, IP2, IP3). Despite correctly setting the RAFT_JOIN and CLUSTER_GOSSIP_JOIN environment variables, the nodes aren’t able to join the cluster, and the error above persists.

Has anyone experienced this issue before or can provide insights on how to resolve it? Any help would be greatly appreciated!

Server Setup Information

  • Weaviate Server Version: v1.25.4
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 3
  • Client Language and Version: python v4
  • Multitenancy?:

Hi @Mariam!

Have you seen this Docker Compose setup for multi-node?

Check here for a working docker-compose.yaml.

Note that you will have to change it to use your servers' IPs and also publish the ports defined for each node, since the nodes are not on the same Docker network.
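
Because the nodes run on different servers, every cluster port has to be reachable on each host's own address. A minimal Python sketch for checking TCP reachability from one server to the others is below. The function names are hypothetical and nothing in it is Weaviate-specific; which hosts and ports to check depends on your setup.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_nodes(hosts, ports, timeout: float = 2.0) -> dict:
    """Check every host/port combination and return {(host, port): reachable}."""
    return {
        (host, port): tcp_port_open(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

Calling check_nodes with the three server IPs and the published cluster ports (for example 7100 and 8300 from the compose file above) shows which combinations are closed. A successful TCP connection says nothing about UDP, which needs a separate check.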

Also note: this is not the best way to run a multi-node Weaviate server. For multi-node deployments, we suggest Kubernetes.

Let me know if this helps!

Thanks!

Hi @DudaNogueira,

I am still facing issues. Could you check whether I did anything wrong in the Docker configuration?
Description:
I am configuring a Weaviate cluster with one master node and two worker nodes, but the nodes fail to communicate with each other. The details of my setup and the errors I am seeing are below.
Master Node Configuration:
version: '3.7'
services:
  weaviate-node-1:
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    ports:
      - 8080:8080
      - 6060:6060
      - 50051:50051
      - 7100:7100
      - 7101:7101
      - 8300:8300
    restart: on-failure:0
    volumes:
      - ./data-node-1:/var/lib/weaviate
    environment:
      LOG_LEVEL: 'debug'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface,text2vec-ollama,generative-ollama'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node1'
      CLUSTER_GOSSIP_BIND_PORT: '7100'
      CLUSTER_DATA_BIND_PORT: '7101'
      RAFT_JOIN: '192.168.1.52:8300,192.168.1.23:8300,192.168.1.24:8300'
      RAFT_BOOTSTRAP_EXPECT: 3

Master Node Error Log:
{"action":"raft-net","error":"unknown rpc type 255","level":"error","msg":"raft-net failed to decode incoming command","time":"2024-08-10T06:37:12Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:21Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:31Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"level":"debug","msg":" memberlist: Stream connection from=192.168.1.23:36206","time":"2024-08-10T06:37:40Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:41Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:41Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:41Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:43Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:44Z"}
{"level":"info","msg":" memberlist: Marking node2 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-10T06:37:45Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:46Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:48Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:51Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:38:01Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:38:11Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
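
To see at a glance which peers the gossip layer is giving up on, a debug log like the one above can be summarised with a few lines of Python. This is only a sketch: it assumes one JSON object per line and converts curly quotes back to straight ones in case the log was pasted through a forum editor. The function name failed_udp_pings is hypothetical.

```python
import json
import re
from collections import Counter

# Matches the memberlist message seen in the log, capturing the peer name.
PING_RE = re.compile(r"memberlist: Failed UDP ping: (\S+)")

# Map curly double and single quotes to their straight equivalents.
QUOTE_FIX = str.maketrans("\u201c\u201d\u2018\u2019", "\"\"''")


def failed_udp_pings(log_text: str) -> Counter:
    """Count 'Failed UDP ping' messages per peer in a JSON-lines log."""
    counts = Counter()
    for raw in log_text.splitlines():
        line = raw.translate(QUOTE_FIX).strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON object
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are still malformed
        match = PING_RE.search(entry.get("msg", ""))
        if match:
            counts[match.group(1)] += 1
    return counts


# Three lines taken from the master node log above.
sample = "\n".join([
    '{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:41Z"}',
    '{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:41Z"}',
    '{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:43Z"}',
])
print(failed_udp_pings(sample))
```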

Worker Node Configuration
version: '3.7'
services:
  weaviate-node-2:
    init: true
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    ports:
      - 8081:8080
      - 6061:6060
      - 50052:50051
      - 7102:7102
      - 7103:7103
      - 8300:8300
    restart: on-failure:0
    volumes:
      - ./data-node-2:/var/lib/weaviate
    environment:
      LOG_LEVEL: 'debug'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface,text2vec-ollama,generative-ollama'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node2'
      CLUSTER_GOSSIP_BIND_PORT: '7102'
      CLUSTER_DATA_BIND_PORT: '7103'
      CLUSTER_JOIN: '192.168.1.52:7100'
      RAFT_JOIN: '192.168.1.52:8300,192.168.1.23:8300,192.168.1.24:8300'
      RAFT_BOOTSTRAP_EXPECT: 3

Worker Node Error Log:
{"action":"inverted filter2search migration","level":"debug","msg":"starting switching fallback mode","time":"2024-08-10T06:37:42Z"}
{"action":"inverted filter2search migration","level":"debug","msg":"no missing filterable indexes, fallback mode skipped","time":"2024-08-10T06:37:42Z"}
{"docker_image_tag":"1.26.1","level":"info","msg":"configured versions","server_version":"1.26.1","time":"2024-08-10T06:37:42Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-10T06:37:42Z"}
{"address":"172.20.0.2:8300","level":"info","msg":"current Leader","time":"2024-08-10T06:37:42Z"}
{"level":"info","msg":"starting migration from old schema","time":"2024-08-10T06:37:42Z"}
{"level":"info","msg":"legacy schema is empty, nothing to migrate","time":"2024-08-10T06:37:42Z"}
{"level":"info","msg":"migration from the old schema has been successfully completed","time":"2024-08-10T06:37:42Z"}
{"action":"restapi_management","docker_image_tag":"1.26.1","level":"info","msg":"Serving weaviate at http://[::]:8080","time":"2024-08-10T06:37:42Z"}
{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:b6f038ed-5bac-4f0d-8b9e-be97ac935689 Type:INIT Version:1.26.1 NumObjects:0 OS:linux Arch:amd64 UsedModules:}","time":"2024-08-10T06:37:42Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-10T06:37:43Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-10T06:37:45Z"}
{"level":"info","msg":" memberlist: Marking node1 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-10T06:37:46Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-10T06:37:46Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-10T06:37:49Z"}

Connectivity is available. A telnet from node 2 to node 1 succeeds:

telnet 192.168.1.52 8300
Trying 192.168.1.52...
Connected to 192.168.1.52.
Escape character is '^]'.
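
Note that telnet only exercises TCP, while the memberlist messages in both logs report failed UDP pings. Docker publishes a port mapping such as 7100:7100 as TCP only unless /udp is specified, so UDP reachability may be worth testing separately. The sketch below is a generic UDP echo test in Python, not a Weaviate tool. It does not speak the memberlist protocol, so the echo side would have to run on the target server in place of the Weaviate container (or on a spare published port). All function names are hypothetical.

```python
import socket
import threading


def udp_echo_once(sock: socket.socket) -> None:
    """Answer a single datagram on an already-bound UDP socket."""
    data, addr = sock.recvfrom(1024)
    sock.sendto(data, addr)


def udp_probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """Send one datagram and report whether the same payload is echoed back in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(b"ping", (host, port))
            data, _ = sock.recvfrom(1024)
            return data == b"ping"
        except OSError:
            # Timeouts and ICMP "port unreachable" errors both end up here.
            return False


def demo_loopback() -> bool:
    """Run the echo side and the probe on the same machine as a self-check."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_once, args=(server,), daemon=True)
    worker.start()
    ok = udp_probe("127.0.0.1", port)
    worker.join(timeout=1)
    server.close()
    return ok


if __name__ == "__main__":
    print(demo_loopback())
```

In a real test, the echo side would bind to 0.0.0.0 on the gossip port of one server and udp_probe would be run from another server against that host's IP. If the TCP check passes but this one fails, the UDP side of the port mapping or a firewall is a likely place to look.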

Hi @Mariam!

There is a very similar issue here:

Let's move this discussion there, as it looks like you are hitting the same issue.

Thanks!