Weaviate Cluster Setup with Docker on Different Servers Failing

Hello everyone,
I’m trying to set up a Weaviate cluster consisting of 3 nodes, each running on a different server, using Docker. However, I’m encountering an issue where the nodes fail to join the cluster. The error message I’m seeing is:
{"action":"bootstrap","error":"could not join a cluster from [172.18.0.2:8300]","level":"warning","msg":"failed to join cluster, will notify next if voter","servers":["172.18.0.2:8300"],"time":"2024-08-08T08:02:08Z","voter":true}
{"action":"bootstrap","candidates":[{"Suffrage":0,"ID":"192.168.1.52","Address":"172.18.0.2:8300"}],"level":"info","msg":"starting cluster bootstrapping","time":"2024-08-08T08:02:08Z"}

Below is the docker-compose.yml file I’m using for each node:
version: '3.7'
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.4
    environment:
      - AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED=true
      - ASYNC_INDEXING=true
      - PERSISTENCE_DATA_PATH=/var/lib/weaviate
      - ENABLE_MODULES=text2vec-ollama,generative-ollama
      - RAFT_JOIN=192.168.1.52:8300,192.168.1.23:8300,192.168.1.24:8300
      - CLUSTER_HOSTNAME=192.168.1.52
      - CLUSTER_GOSSIP_BIND_PORT=7100
      - CLUSTER_GOSSIP_JOIN=192.168.1.23:7100,192.168.1.24:7100
    ports:
      - 8080:8080
      - 50051:50051
    volumes:
      - weaviate_data1:/var/lib/weaviate
    networks:
      - weaviate-cluster
networks:
  weaviate-cluster:
    driver: bridge
volumes:
  weaviate_data1:

Each server has a unique IP address (192.168.1.52, 192.168.1.23, 192.168.1.24). Despite correctly setting the RAFT_JOIN and CLUSTER_GOSSIP_JOIN environment variables, the nodes aren’t able to join the cluster, and the error above persists.
Has anyone experienced this issue before or can provide insights on how to resolve it? Any help would be greatly appreciated!
Thank you!

Hi!

Are you mapping the Weaviate communication ports, for example 7100 and 8300?

Otherwise, the nodes cannot communicate with each other.
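
As a rough illustration of what that mapping could look like (a sketch, not a confirmed fix): the service name and image come from the compose file in the first post, while the data port 7101 is an assumption based on Weaviate's usual default. The memberlist gossip layer uses both TCP and UDP on the gossip port, and a Compose entry like `7100:7100` publishes TCP only, so the UDP mapping is listed separately.

```yaml
# Sketch only; port values other than 7100 and 8300 are assumptions.
services:
  weaviate:
    image: cr.weaviate.io/semitechnologies/weaviate:1.25.4
    ports:
      - 8080:8080        # REST API
      - 50051:50051      # gRPC API
      - 7100:7100        # gossip over TCP (CLUSTER_GOSSIP_BIND_PORT)
      - 7100:7100/udp    # gossip over UDP; Compose publishes TCP unless /udp is given
      - 7101:7101        # node-to-node data port (assumed default)
      - 8300:8300        # Raft port referenced in RAFT_JOIN
```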

Thanks for your reply.

I am currently configuring a Weaviate cluster with one master node and two worker nodes, but I am encountering issues with node communication. Below are the details of my setup and the errors I am seeing.
Master Node Configuration:

version: '3.7'
services:
  weaviate-node-1:
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    ports:
    - 8080:8080
    - 6060:6060
    - 50051:50051
    - 7100:7100
    - 7101:7101
    - 8300:8300
    restart: on-failure:0
    volumes:
      - ./data-node-1:/var/lib/weaviate
    environment:
      LOG_LEVEL: 'debug'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface,text2vec-ollama,generative-ollama'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node1'
      CLUSTER_GOSSIP_BIND_PORT: '7100'
      CLUSTER_DATA_BIND_PORT: '7101'
      RAFT_JOIN: '192.168.1.52:8300,192.168.1.23:8300,192.168.1.24:8300'
      RAFT_BOOTSTRAP_EXPECT: 3

Master Node Error Log:

{"action":"raft-net","error":"unknown rpc type 255","level":"error","msg":"raft-net failed to decode incoming command","time":"2024-08-10T06:37:12Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:21Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:31Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"level":"debug","msg":" memberlist: Stream connection from=192.168.1.23:36206","time":"2024-08-10T06:37:40Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:41Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:41Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:41Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:43Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:44Z"}
{"level":"info","msg":" memberlist: Marking node2 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-10T06:37:45Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-10T06:37:46Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-10T06:37:48Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:37:51Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:38:01Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-10T06:38:11Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}

Worker Node Configuration

version: '3.7'
services:
  weaviate-node-2:
    init: true
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    ports:
      - 8081:8080
      - 6061:6060
      - 50052:50051
      - 7102:7102
      - 7103:7103
      - 8300:8300
    restart: on-failure:0
    volumes:
      - ./data-node-2:/var/lib/weaviate
    environment:
      LOG_LEVEL: 'debug'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface,text2vec-ollama,generative-ollama'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node2'
      CLUSTER_GOSSIP_BIND_PORT: '7102'
      CLUSTER_DATA_BIND_PORT: '7103'
      CLUSTER_JOIN: '192.168.1.52:7100'
      RAFT_JOIN: '192.168.1.52:8300,192.168.1.23:8300,192.168.1.24:8300'
      RAFT_BOOTSTRAP_EXPECT: 3

Worker Node Error Log:

{"action":"inverted filter2search migration","level":"debug","msg":"starting switching fallback mode","time":"2024-08-10T06:37:42Z"}
{"action":"inverted filter2search migration","level":"debug","msg":"no missing filterable indexes, fallback mode skipped","time":"2024-08-10T06:37:42Z"}
{"docker_image_tag":"1.26.1","level":"info","msg":"configured versions","server_version":"1.26.1","time":"2024-08-10T06:37:42Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-10T06:37:42Z"}
{"address":"172.20.0.2:8300","level":"info","msg":"current Leader","time":"2024-08-10T06:37:42Z"}
{"level":"info","msg":"starting migration from old schema","time":"2024-08-10T06:37:42Z"}
{"level":"info","msg":"legacy schema is empty, nothing to migrate","time":"2024-08-10T06:37:42Z"}
{"level":"info","msg":"migration from the old schema has been successfully completed","time":"2024-08-10T06:37:42Z"}
{"action":"restapi_management","docker_image_tag":"1.26.1","level":"info","msg":"Serving weaviate at http://[::]:8080","time":"2024-08-10T06:37:42Z"}
{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:b6f038ed-5bac-4f0d-8b9e-be97ac935689 Type:INIT Version:1.26.1 NumObjects:0 OS:linux Arch:amd64 UsedModules:[]}","time":"2024-08-10T06:37:42Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-10T06:37:43Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-10T06:37:45Z"}
{"level":"info","msg":" memberlist: Marking node1 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-10T06:37:46Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-10T06:37:46Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-10T06:37:49Z"}

Connectivity is available. This is from node 2 to node 1:

telnet 192.168.1.52 8300
Trying 192.168.1.52...
Connected to 192.168.1.52.
Escape character is '^]'.

hi!!

Here is a working example of a multi-node setup running in Docker:

I believe you should not include the port in RAFT_JOIN. The example uses only:

RAFT_JOIN: 'node1,node2,node3'

Let me know if this helps.

Thanks!

Thanks for your suggestion. We tried everything we could think of, but we are still unable to solve it. This is the issue we are currently facing.

master node
version: '3.7'
services:
  node1:
    container_name: node1
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    ports:
    - 8080:8080
    - 6060:6060
    - 50051:50051
    - 7100:7100
    - 7101:7101
    - 8300:8300
    restart: on-failure:0
    volumes:
      - ./data-node-1:/var/lib/weaviate
    environment:
      LOG_LEVEL: 'debug'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface,text2vec-ollama,generative-ollama'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node1'
      CLUSTER_GOSSIP_BIND_PORT: '7100'
      CLUSTER_DATA_BIND_PORT: '7101'
      RAFT_JOIN: 'node1,node2,node3'
      RAFT_BOOTSTRAP_EXPECT: 3
    extra_hosts:
      - "node1:192.168.1.52"
      - "node2:192.168.1.23"
      - "node3:192.168.1.24"

worker node
version: '3.7'
services:
  node2:
    container_name: node2
    command:
    - --host
    - 0.0.0.0
    - --port
    - '8080'
    - --scheme
    - http
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    ports:
    - 8081:8080
    - 6061:6060
    - 50052:50051
    - 7102:7102
    - 7103:7103
    - 8300:8300
    restart: on-failure:0
    volumes:
      - ./data-node-2:/var/lib/weaviate
    environment:
      LOG_LEVEL: 'debug'
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      ENABLE_MODULES: 'text2vec-openai,text2vec-cohere,text2vec-huggingface,text2vec-ollama,generative-ollama'
      DEFAULT_VECTORIZER_MODULE: 'none'
      CLUSTER_HOSTNAME: 'node2'
      CLUSTER_GOSSIP_BIND_PORT: '7102'
      CLUSTER_DATA_BIND_PORT: '7103'
      CLUSTER_JOIN: 'node1:7100'
      RAFT_JOIN: 'node1,node2,node3'
      RAFT_BOOTSTRAP_EXPECT: 3
    extra_hosts:
      - "node1:192.168.1.52"
      - "node2:192.168.1.23"
      - "node3:192.168.1.24"

master node error
{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:be8c92ab-97fa-476a-a41d-6ce7172d97f6 Type:INIT Version:1.26.1 NumObjects:0 OS:linux Arch:amd64 UsedModules:[]}","time":"2024-08-13T10:24:36Z"}
{"level":"debug","msg":" memberlist: Stream connection from=192.168.1.23:59404","time":"2024-08-13T10:24:38Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-13T10:24:40Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-13T10:24:41Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-13T10:24:41Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-13T10:24:42Z"}
{"level":"debug","msg":" memberlist: Stream connection from=192.168.1.24:35312","time":"2024-08-13T10:24:42Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-13T10:24:44Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node2 (timeout reached)","time":"2024-08-13T10:24:44Z"}
{"level":"info","msg":" memberlist: Marking node2 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-13T10:24:45Z"}
{"level":"info","msg":" memberlist: Suspect node2 has failed, no acks received","time":"2024-08-13T10:24:47Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node3 (timeout reached)","time":"2024-08-13T10:24:47Z"}
{"level":"info","msg":" memberlist: Suspect node3 has failed, no acks received","time":"2024-08-13T10:24:51Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-13T10:24:51Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"level":"debug","msg":" memberlist: Failed UDP ping: node3 (timeout reached)","time":"2024-08-13T10:24:52Z"}
{"level":"info","msg":" memberlist: Marking node3 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-13T10:24:55Z"}
{"level":"info","msg":" memberlist: Suspect node3 has failed, no acks received","time":"2024-08-13T10:24:57Z"}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-13T10:25:01Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}
{"action":"restapi_request","level":"debug","method":"GET","msg":"received HTTP request","time":"2024-08-13T10:25:11Z","url":{"Scheme":"","Opaque":"","User":null,"Host":"","Path":"/metrics","RawPath":"","OmitHost":false,"ForceQuery":false,"RawQuery":"","Fragment":"","RawFragment":""}}

worker node error
{"action":"bootstrap","error":"could not join a cluster from [172.20.0.2:8300 172.20.0.2:8300]","level":"warning","msg":"failed to join cluster, will notify next if voter","servers":["172.20.0.2:8300","172.20.0.2:8300"],"time":"2024-08-13T10:20:29Z","voter":true}
{"action":"bootstrap","expect":3,"got":{"node2":"172.20.0.2:8300"},"level":"debug","msg":"number of candidates lower than bootstrap expect param, stopping notify","time":"2024-08-13T10:20:29Z"}
{"action":"bootstrap","expect":3,"got":{"node2":"172.20.0.2:8300"},"level":"debug","msg":"number of candidates lower than bootstrap expect param, stopping notify","time":"2024-08-13T10:20:29Z"}
{"action":"bootstrap","level":"info","msg":"notified peers this node is ready to join as voter","servers":["172.20.0.2:8300","172.20.0.2:8300"],"time":"2024-08-13T10:20:29Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-13T10:20:30Z"}
{"action":"raft","last-leader-addr":"","last-leader-id":"","level":"warning","msg":"raft heartbeat timeout reached, starting election","time":"2024-08-13T10:20:30Z"}
{"action":"raft","level":"info","msg":"raft entering candidate state","node":{},"term":9,"time":"2024-08-13T10:20:30Z"}
{"action":"raft","id":"node2","level":"debug","msg":"raft voting for self","term":9,"time":"2024-08-13T10:20:30Z"}
{"action":"raft","level":"debug","msg":"raft calculated votes needed","needed":1,"term":9,"time":"2024-08-13T10:20:30Z"}
{"action":"raft","from":"node2","level":"debug","msg":"raft vote granted","tally":1,"term":9,"time":"2024-08-13T10:20:30Z"}
{"action":"raft","level":"info","msg":"raft election won","tally":1,"term":9,"time":"2024-08-13T10:20:30Z"}
{"action":"raft","leader":{},"level":"info","msg":"raft entering leader state","time":"2024-08-13T10:20:30Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-13T10:20:30Z"}
{"action":"inverted filter2search migration","level":"debug","msg":"migration skip flag set, skipping migration","time":"2024-08-13T10:20:30Z"}
{"action":"inverted filter2search migration","level":"debug","msg":"starting switching fallback mode","time":"2024-08-13T10:20:30Z"}
{"action":"inverted filter2search migration","level":"debug","msg":"no missing filterable indexes, fallback mode skipped","time":"2024-08-13T10:20:30Z"}
{"docker_image_tag":"1.26.1","level":"info","msg":"configured versions","server_version":"1.26.1","time":"2024-08-13T10:20:30Z"}
{"action":"grpc_startup","level":"info","msg":"grpc server listening at [::]:50051","time":"2024-08-13T10:20:30Z"}
{"address":"172.20.0.2:8300","level":"info","msg":"current Leader","time":"2024-08-13T10:20:30Z"}
{"level":"info","msg":"starting migration from old schema","time":"2024-08-13T10:20:30Z"}
{"level":"info","msg":"legacy schema is empty, nothing to migrate","time":"2024-08-13T10:20:30Z"}
{"level":"info","msg":"migration from the old schema has been successfully completed","time":"2024-08-13T10:20:30Z"}
{"action":"restapi_management","docker_image_tag":"1.26.1","level":"info","msg":"Serving weaviate at http://[::]:8080","time":"2024-08-13T10:20:30Z"}
{"action":"bootstrap","level":"info","msg":"node reporting ready, node has probably recovered cluster from raft config. Exiting bootstrap process","time":"2024-08-13T10:20:30Z"}
{"action":"telemetry_push","level":"info","msg":"telemetry started","payload":"\u0026{MachineID:12cdb98a-1363-468d-9eb3-830c2b0b4d58 Type:INIT Version:1.26.1 NumObjects:0 OS:linux Arch:amd64 UsedModules:[]}","time":"2024-08-13T10:20:31Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-13T10:20:32Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-13T10:20:33Z"}
{"level":"info","msg":" memberlist: Marking node1 as failed, suspect timeout reached (0 peer confirmations)","time":"2024-08-13T10:20:34Z"}
{"level":"debug","msg":" memberlist: Failed UDP ping: node1 (timeout reached)","time":"2024-08-13T10:20:35Z"}
{"level":"info","msg":" memberlist: Suspect node1 has failed, no acks received","time":"2024-08-13T10:20:37Z"}

Can you make sure there is no firewall between those servers?

The error logs seem to show timeouts.

Can you run the 3 node cluster in the same docker compose?

Maybe you can run it like this first, then try separating a node from the same docker compose.

On a single server, the cluster of 3 works fine, but when we use different servers we get this issue. There is no firewall blocking the connection:

telnet 192.168.1.52 8300
Trying 192.168.1.52...
Connected to 192.168.1.52.
Escape character is '^]'.

this is from node 2 to node 1
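
Two details in the logs above may be worth checking; this is a hedged reading, not a confirmed diagnosis. The telnet test only proves TCP reachability, while the failing memberlist pings are UDP, and a Compose entry such as `7100:7100` publishes TCP only. The worker also reports its leader at 172.20.0.2:8300, which looks like a Docker bridge address that the other servers cannot route to. One way to rule out both, assuming Linux hosts, is to run each node with host networking so the container uses the server's own IP. A sketch for node1, reusing the settings already posted:

```yaml
# Sketch for node1 on 192.168.1.52; not verified against this cluster.
services:
  node1:
    container_name: node1
    image: cr.weaviate.io/semitechnologies/weaviate:1.26.1
    command:
      - --host
      - 0.0.0.0
      - --port
      - '8080'
      - --scheme
      - http
    network_mode: host   # replaces the ports section; the host's TCP and UDP ports are used directly
    restart: on-failure:0
    volumes:
      - ./data-node-1:/var/lib/weaviate
    environment:
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: 'true'
      PERSISTENCE_DATA_PATH: '/var/lib/weaviate'
      CLUSTER_HOSTNAME: 'node1'
      CLUSTER_GOSSIP_BIND_PORT: '7100'
      CLUSTER_DATA_BIND_PORT: '7101'
      RAFT_JOIN: 'node1,node2,node3'
      RAFT_BOOTSTRAP_EXPECT: 3
    extra_hosts:
      - "node1:192.168.1.52"
      - "node2:192.168.1.23"
      - "node3:192.168.1.24"
```

The other two nodes would follow the same pattern with their own CLUSTER_HOSTNAME and a CLUSTER_JOIN pointing at node1's gossip address. If staying on bridge networking is preferred, adding `/udp` mappings for the gossip port is the smaller change to try first.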