Cluster nodes not aware of each other

Lewiky · June 22, 2023, 3:02pm

Hi folks,

I’m seeing an issue where the nodes in my Weaviate 1.18.3 cluster are not aware of each other. I have 3 replicas running, but when I issue a v1/cluster request, each node I hit only seems to know about itself (running the same curl 3 times round-robins across the nodes).

Initially the CLUSTER_JOIN environment variable was set to the wrong service name. I’ve now set it to $HEADLESS_SERVICE_NAME.$NAMESPACE.svc.cluster.local (I had to rename the weaviate-headless service), which I believe should be correct, and the pods have been restarted with this value in their environment. However, this hasn’t fixed anything. Is cluster node state stored in the persistent volume?

Does anyone have any advice on how to go about debugging this issue?

Thanks!

antas-marcin · June 26, 2023, 8:55am

Have you used our Helm charts to deploy Weaviate to your K8s cluster? As far as I remember, cluster creation should work out of the box with our Helm charts, and the service names should already be correct, so there’s no need to rename any service.

Could you share some logs with us? More specifically, do you happen to see any errors coming from memberlist?

Lewiky · October 6, 2023, 1:18pm

Hi! Sorry for the radio silence on this. We dug back into the problem and traced the root cause to the interaction between:

The version of Kubernetes we were running
The musl DNS resolver used in the Alpine base image of Weaviate
The lack of FQDNs in the Weaviate Helm chart for inter-node communication

I’ve created a GitHub issue to discuss possible fixes: “Intra-cluster hostnames need to be FQDNs for musl to resolve DNS in all configurations” (Issue #175, weaviate/weaviate-helm).

Lewiky · October 6, 2023, 1:22pm

In short, if anyone finds this issue occurring:

Weaviate images are built on Alpine Linux, which uses musl libc and its DNS resolver rather than the more common glibc one.

musl’s resolver can behave unexpectedly when DNS in the K8s environment isn’t configured with it in mind.
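One way to check whether this is what you are hitting is to compare how the short service name and the fully qualified name resolve from inside a pod. The following is only a rough sketch, not something from the Weaviate project: the helper names are made up for illustration, and the hostnames in the usage comment assume a hypothetical `default` namespace. It also only exercises musl if it is run from an Alpine-based Python image, since Python uses the resolver of whatever libc the image ships.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve_ips(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted, de-duplicated IPs that `hostname` resolves to.

    An empty list means the lookup failed. `resolver` is injectable so the
    helper can be exercised without real DNS.
    """
    try:
        infos = resolver(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address for both IPv4 and IPv6.
    return sorted({info[4][0] for info in infos})


def compare_names(names: Iterable[str],
                  resolver: Callable = socket.getaddrinfo) -> Dict[str, List[str]]:
    """Resolve each candidate name and report what came back."""
    return {name: resolve_ips(name, resolver) for name in names}


# Example (hypothetical names; adjust service and namespace to your setup):
# compare_names([
#     "weaviate-headless",
#     "weaviate-headless.default.svc.cluster.local",
#     "weaviate-headless.default.svc.cluster.local.",
# ])
```

For a headless service you would expect one IP per ready pod. If only the name with the trailing dot returns the full set, that points at the resolver behaviour described here.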

The solution is for Weaviate to use only FQDNs for communication between nodes, which in principle means changing all the CLUSTER_JOIN environment variables to weaviate-headless.{{release.namespace}}.svc.cluster.local. << The extra . on the end is the fix: it marks the name as fully qualified, so the resolver looks it up as-is instead of applying search domains.
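To make the shape of that value concrete, here is a small illustrative helper that builds the name with the trailing dot. It is a sketch only: the function names are hypothetical, `default` in the usage line is a placeholder namespace, and `cluster.local` is assumed because it is the usual Kubernetes default cluster domain (yours may differ).

```python
def cluster_join_fqdn(service: str, namespace: str,
                      cluster_domain: str = "cluster.local") -> str:
    """Build a fully qualified service name, including the trailing dot.

    `cluster_domain` defaults to "cluster.local", which is an assumption
    about the cluster; pass your own value if it was customised.
    """
    parts = [service, namespace, "svc", cluster_domain.strip(".")]
    # The trailing dot marks the name as absolute, so resolvers skip
    # search-domain expansion.
    return ".".join(parts) + "."


def is_absolute(name: str) -> bool:
    """True if `name` already ends with the root dot."""
    return name.endswith(".")


print(cluster_join_fqdn("weaviate-headless", "default"))
```

Checking an existing CLUSTER_JOIN value with `is_absolute` is a quick way to spot a name that is missing the final dot.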

Landon_Edwards · December 5, 2023, 1:01am

Hi, when you experienced this issue, did you have the Istio proxy on the pods? I am experiencing a similar issue, but the fix suggested here did not resolve it. I am wondering if it could be due to the Istio sidecar proxy on the pods interfering with calls between pods. I’m not sure how to get around this either way. Any help would be appreciated.

kbenayed · January 29, 2024, 10:15pm

I am experiencing this issue with the Istio proxy on the pods, and none of the fixes I found helped solve the problem. I have Istio with mTLS PeerAuthentication in PERMISSIVE mode, and I am seeing very inconsistent results: sometimes the nodes in the cluster find each other and sometimes they don’t. Were you able to find a solution?
