Cluster nodes not aware of each other

hi folks,

I'm seeing an issue with my Weaviate 1.18.3 cluster: the nodes are not aware of each other. I have 3 replicas running, but when I issue a v1/cluster request, each node I hit only seems to know about itself (running the same curl 3 times round-robins across the replicas).
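A minimal sketch of how the check described above could be automated, assuming you have already collected the list of node names that each pod reports (for example by port-forwarding to each pod in turn and querying its cluster/nodes endpoint). The function name `find_partial_views` and the pod names are hypothetical, not part of Weaviate.

```python
def find_partial_views(views, expected_nodes):
    """Return the pods whose reported membership list is smaller than expected.

    views: dict mapping a pod name to the node names that pod reports.
    expected_nodes: number of replicas that should be in the cluster.
    """
    return sorted(
        pod for pod, names in views.items() if len(set(names)) < expected_nodes
    )


# Hypothetical data matching the symptom: every pod only knows about itself.
views = {
    "weaviate-0": ["weaviate-0"],
    "weaviate-1": ["weaviate-1"],
    "weaviate-2": ["weaviate-2"],
}

print(find_partial_views(views, expected_nodes=3))
```

If every pod shows up in the result, no node has joined any other, which points at the join address or DNS rather than at a single unhealthy pod.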

Initially the CLUSTER_JOIN environment variable was set to the wrong service name, but I’ve now set it to $HEADLESS_SERVICE_NAME.$NAMESPACE.svc.cluster.local (I had to rename the weaviate-headless service), which I believe should be correct, and the pods have been restarted with this value in their environment. However, this hasn’t fixed anything. Is cluster node state stored in the persistent volume?

Does anyone have any advice on how to go about debugging this issue?

Thanks!

Have you used our Helm charts to deploy Weaviate to your K8s cluster? As far as I remember, cluster creation should work out of the box and the service names should already be correct when using our Helm charts, so there should be no need to rename any service.

Could you share some logs with us? More specifically, do you see any errors coming from memberlist?

Hi! Sorry for the radio silence on this. We dug back into the problem and traced the root cause to the interaction between:

  • The version of Kubernetes we were running
  • The musl DNS resolver used in the Alpine base image of Weaviate
  • The lack of FQDNs in the Weaviate Helm chart for inter-node communication

I’ve created a GitHub issue to discuss possible fixes: "Intra-cluster hostnames need to be FQDNs for musl to resolve DNS in all configurations" (Issue #175, weaviate/weaviate-helm).

In short, if anyone finds this issue occurring:

Weaviate images are built on Alpine Linux, which uses the musl DNS resolver rather than the more common glibc one.

musl can behave weirdly when DNS isn’t configured specifically for it in the K8s environment.

The solution is for Weaviate to use only FQDNs for communication between nodes, which in principle means changing every CLUSTER_JOIN environment variable to weaviate-headless.{{release.namespace}}.svc.cluster.local. (with the trailing dot). The extra . at the end is the fix.
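As a rough illustration of the fix above, here is a small sketch that builds the fully qualified CLUSTER_JOIN value, including the trailing dot, and optionally checks whether it resolves. The helper names `cluster_join_fqdn` and `resolve` are hypothetical, and the defaults (`weaviate-headless`, `cluster.local`) are assumptions taken from this thread; adjust them if your service or cluster domain differs.

```python
import socket


def cluster_join_fqdn(namespace, service="weaviate-headless",
                      cluster_domain="cluster.local"):
    """Build the absolute DNS name for the headless service.

    The trailing dot marks the name as fully qualified, so the resolver
    does not need to apply search domains to it.
    """
    return f"{service}.{namespace}.svc.{cluster_domain}."


def resolve(hostname):
    """Return the sorted set of IP addresses the name resolves to, or []."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# "default" is a placeholder namespace for illustration.
print(cluster_join_fqdn("default"))
```

Calling `resolve(cluster_join_fqdn("<your namespace>"))` from a pod inside the cluster should return one address per Weaviate replica if the headless service is resolving as intended; an empty list suggests the DNS problem described above. Note that this uses the resolver of whatever image the script runs in, so results from a glibc-based image may differ from what the Alpine-based Weaviate container sees.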

Hi, when you experienced this issue, did you have the Istio proxy on the pods? I am experiencing a similar issue, but the fix suggested here did not resolve it. I wonder whether the Istio sidecar proxy on the pods could be interfering with calls between pods. I'm not sure how to get around this either way. Any help would be appreciated.

I am experiencing this issue with the Istio proxy on the pods, and none of the fixes I found solved the problem. I have Istio with mTLS PeerAuthentication in PERMISSIVE mode and I am seeing very inconsistent results: sometimes the nodes in the cluster find each other and sometimes they don’t. Were you able to find a solution?