Cluster nodes not aware of each other

Hi folks,

I'm seeing an issue with my Weaviate 1.18.3 cluster: the nodes are not aware of each other. I have 3 replicas running, but when I issue a v1/cluster request, each node I hit only seems to know about itself (running the same curl three times round-robins across the nodes, and each one reports only itself).

Initially the CLUSTER_JOIN environment variable was set to the wrong service name. I’ve now set it to $HEADLESS_SERVICE_NAME.$NAMESPACE.svc.cluster.local (I had to rename the weaviate-headless service), which I believe should be correct, and the pods have been restarted with this value in their environment. However, this hasn’t fixed anything. Is cluster node state stored in the persistent volume?
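
As a sanity check, a value like this can be resolved from inside the cluster to see whether it returns one address per pod. Below is a minimal Python sketch, assuming it runs in a debug container in the same namespace (the Weaviate image may not ship Python). The helper names headless_fqdn and resolve_all are hypothetical, and the service name and namespace in the usage comment are placeholders. Note that Python uses the resolver of whatever image it runs in, so a result here does not prove the Weaviate container resolves the name the same way.

```python
import socket


def headless_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a headless service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolve_all(hostname: str) -> list[str]:
    """Return the sorted, de-duplicated IP addresses a hostname resolves to."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # the name did not resolve at all
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


# Usage (placeholder service name and namespace, substitute your own):
#   name = headless_fqdn("weaviate-headless", "default")
#   print(name, "->", resolve_all(name))
```

A headless service normally resolves to one address per ready pod, so with 3 replicas an empty or single-entry list would point at DNS rather than at Weaviate itself.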

Does anyone have any advice on how to go about debugging this issue?


Have you used our Helm charts to deploy Weaviate to your K8s cluster? As far as I remember, cluster creation should work out of the box, and the service names should also be correct when using our Helm charts, so there’s no need to rename any services.

Could you share some logs with us? More specifically, do you happen to see any errors coming from memberlist?

Hi! Sorry for the radio silence on this. We dug back into the problem and traced the root cause to the interaction between:

  • The version of Kubernetes we were running
  • The musl DNS resolver used in the Alpine base image of Weaviate
  • The lack of FQDNs in the Weaviate Helm chart for inter-node communication

I’ve created a GitHub issue to discuss possible fixes: "Intra-cluster hostnames need to be FQDNs for musl to resolve DNS in all configurations" (Issue #175, weaviate/weaviate-helm).

In short, for anyone else who runs into this issue:

Weaviate images are built on Alpine Linux, which uses the musl DNS resolver rather than the more common glibc one.

musl can behave unexpectedly when DNS in the K8s environment isn’t configured with it in mind.

The solution is for Weaviate to use only FQDNs for communication between nodes, which in practice means changing all the CLUSTER_JOIN environment variables to weaviate-headless.{{release.namespace}}.svc.cluster.local. (including the final dot). That extra . on the end is the fix: it marks the name as fully qualified, so the resolver queries it exactly as written instead of applying search domains.
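
To illustrate why the trailing dot matters, here is a simplified Python model of the search/ndots rules described in resolv.conf(5). It is a sketch for illustration only, not the actual musl or glibc implementation, and real resolvers differ in the details (for example in whether they fall back to the search list). The function name candidate_queries is hypothetical, and the search list and ndots value in the usage comment are assumptions based on what a pod in the default namespace typically gets.

```python
def candidate_queries(name: str, search: list[str], ndots: int = 5) -> list[str]:
    """Simplified model of the names a stub resolver may query, in order."""
    if name.endswith("."):
        # Absolute name: queried exactly as written, search list ignored.
        return [name]
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as-is first, then the search list.
        return [name + "."] + expanded
    # Too few dots: the search list is tried before the name itself.
    return expanded + [name + "."]


# Usage (assumed search list for a pod in the "default" namespace):
#   search = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
#   candidate_queries("weaviate-headless.default.svc.cluster.local", search)
#   candidate_queries("weaviate-headless.default.svc.cluster.local.", search)
```

In this model, the name without the trailing dot has only 4 dots, which is below ndots of 5, so search-domain variants are tried first. The name with the trailing dot produces exactly one query.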