Horizontal Scaling or Upgrade issue - Weaviate cluster

Hi Team,

We are seeing the errors below when we try to upgrade the Weaviate cluster (changing the image tag version) or scale it (changing the replicas value).

{"deprecation":{"apiType":"Configuration","id":"config-files","locations":["--config-file=\"\""],"mitigation":"Configure Weaviate using environment variables.","msg":"use of deprecated command line argument --config-file","sinceTime":"2020-09-08T09:46:00.000Z","sinceVersion":"0.22.16","status":"deprecated"},"level":"warning","msg":"use of deprecated command line argument --config-file","time":"2024-04-22T03:18:12Z"}
{"action":"startup","default_vectorizer_module":"none","level":"info","msg":"the default vectorizer modules is set to \"none\", as a result all new schema classes without an explicit vectorizer setting, will use this vectorizer","time":"2024-04-22T03:18:12Z"}
{"action":"startup","auto_schema_enabled":true,"level":"info","msg":"auto schema enabled setting is set to \"true\"","time":"2024-04-22T03:18:12Z"}
{"level":"info","msg":"No resource limits set, weaviate will use all available memory and CPU. To limit resources, set LIMIT_RESOURCES=true","time":"2024-04-22T03:18:12Z"}
{"action":"broadcast_abort_transaction","error":"host \"****:7001\": unexpected status code 401: ","id":"97fec38d-b981-40db-a038-7b70e72595f0","level":"error","msg":"broadcast tx abort failed","time":"2024-04-22T03:18:12Z"}
{"action":"startup","error":"could not load or initialize schema: sync schema with other nodes in the cluster: read schema: open transaction: broadcast open transaction: host \"****:7001\": unexpected status code 401 ()","level":"fatal","msg":"could not initialize schema manager","time":"2024-04-22T03:18:12Z"}

A new controller revision gets created automatically and tries to perform the operation in a rolling fashion.

But the pods don't come up because of the above errors.

Note:

When I completely delete the StatefulSet, Weaviate scaling or upgrade works fine! But we are looking for a rolling update. Let me know if any changes need to be made to values.yaml.
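
For reference, my understanding is that the rolling behaviour is governed by the StatefulSet's updateStrategy; below is a sketch of the relevant Kubernetes fields (assumed, not copied from our manifest), and whether the chart exposes this through values.yaml is part of what I'm asking:

    # Sketch of the Kubernetes-level setting (other required StatefulSet fields omitted).
    # RollingUpdate replaces pods one at a time from the highest ordinal down;
    # OnDelete only replaces a pod when it is deleted manually, which is close to
    # the delete-the-StatefulSet workaround we want to avoid.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: weaviate
    spec:
      updateStrategy:
        type: RollingUpdate        # or OnDelete
        rollingUpdate:
          partition: 0             # ordinals >= partition get the new revision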

Regards,
Adithya

hi @adithya.ch ! I have changed the category of this thread to Support

This happens both when you upgrade and when you try to scale? Not sure I understood this part.

Can you reproduce this on a test environment?
What versions are you upgrading from and to?

I assume you are just changing the version or the replicas in the values of our helm chart, right?

Let me know that info so we can figure this out.

Thanks!

Hello @DudaNogueira

Yes, we are seeing the same error when we upgrade and when we try to scale.

We just changed the image tag from 1.24.3 to 1.24.10 for the upgrade, and the replicas parameter from 3 to 5 for scaling.
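
To be concrete, the change was roughly the following in values.yaml (key names assumed from the standard weaviate-helm chart; worth double-checking against the chart's default values.yaml):

    # values.yaml sketch (assumed key names)
    image:
      tag: 1.24.10     # was 1.24.3 (upgrade)

    replicas: 5        # was 3 (horizontal scaling)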

I am testing all of the above in a test k8s cluster.

Error

Back-off restarting failed container weaviate in pod weaviate-5_vector(a4619704-4d70-4951-9afc-995601ad0045)

{"action":"broadcast_abort_transaction","error":"host \"10.36.6.100:7001\": unexpected status code 401: ","id":"6b95cb70-2d84-45c3-acd7-bfd6a17c3b55","level":"error","msg":"broadcast tx abort failed","time":"2024-04-22T13:08:48Z"}

{"action":"startup","error":"could not load or initialize schema: sync schema with other nodes in the cluster: read schema: open transaction: broadcast open transaction: host \"*****:7001\": unexpected status code 401 ()","level":"fatal","msg":"could not initialize schema manager","time":"2024-04-22T13:08:48Z"}

Regards,
Adithya

Also, the node status does not show the newly added nodes.

nodes_status = client.cluster.get_nodes_status()
print(nodes_status)
[{'batchStats': {'queueLength': 0, 'ratePerSecond': 0}, 'gitHash': '86660ba', 'name': 'weaviate-0', 'shards': None, 'status': 'HEALTHY', 'version': '1.24.10'},
 {'batchStats': {'queueLength': 0, 'ratePerSecond': 0}, 'gitHash': '86660ba', 'name': 'weaviate-1', 'shards': None, 'status': 'HEALTHY', 'version': '1.24.10'},
 {'batchStats': {'queueLength': 0, 'ratePerSecond': 0}, 'gitHash': '86660ba', 'name': 'weaviate-2', 'shards': None, 'status': 'HEALTHY', 'version': '1.24.10'},
 {'batchStats': {'queueLength': 0, 'ratePerSecond': 0}, 'gitHash': '86660ba', 'name': 'weaviate-3', 'shards': None, 'status': 'HEALTHY', 'version': '1.24.10'},
 {'batchStats': {'queueLength': 0, 'ratePerSecond': 0}, 'gitHash': '86660ba', 'name': 'weaviate-4', 'shards': None, 'status': 'HEALTHY', 'version': '1.24.10'}]

Here I have changed replicas from 5 to 7.

Ideally, weaviate-5 and weaviate-6 should show up as unhealthy in the output above.

Regards,
Adithya

Hello @DudaNogueira

Any suggestions on how to fix the issue? I see multiple old posts with the same error but don't see a solution.

Regards,
Adithya

Hi @DudaNogueira, let me know if there is any update on the above-mentioned issue.

Thank you

hi @adithya.ch !

Does it persist?

Not sure how to fix this.

Can you provide a step-by-step to reproduce? Then I can try to reproduce this situation myself and explore more.

Thanks!

Hello @DudaNogueira

Yes, the issue still exists.

It’s similar to Rolling Update Not Working

We use ArgoCD to deploy the resources to an OpenShift cluster, and we deployed using the Helm chart after changing the variables in the values.yaml file for vertical/horizontal scaling.

  1. Downloaded the Helm chart (templates / Chart.yaml / values.yaml)

In the GitOps config:

targetRevision: develop
applicationConfig:
  path: vector-rcdn
  name: vector-rcdn
  namespace: vector
  helm:
    valueFiles:
      - values.yaml
    releaseName: weaviate-helm-rcdn

Based on this config, ArgoCD deploys the code changes to the k8s cluster.
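
For context, our applicationConfig wrapper renders roughly a plain ArgoCD Application like the sketch below (repo URL, destination, and sync policy are placeholders, not our real values):

    apiVersion: argoproj.io/v1alpha1
    kind: Application
    metadata:
      name: vector-rcdn
      namespace: openshift-gitops        # placeholder: namespace where ArgoCD runs
    spec:
      project: default                   # placeholder
      source:
        repoURL: https://example.com/our-gitops-repo.git   # placeholder
        targetRevision: develop
        path: vector-rcdn
        helm:
          releaseName: weaviate-helm-rcdn
          valueFiles:
            - values.yaml
      destination:
        server: https://kubernetes.default.svc
        namespace: vector
      syncPolicy:
        automated: {}                    # placeholder: our actual sync policy may differ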

Changed the value of replicas for horizontal scaling.

Attached the list of steps with errors.



Regards,
Adithya

Hey @adithya.ch , @DudaNogueira !

Not sure I can add any new context to this issue, but I'm basically having the exact same problem, also using ArgoCD and the Weaviate Helm chart with rolling update enabled.

Whenever I make a change to the Helm chart values that triggers a new deployment, Weaviate goes into an infinite crash loop. It can only recover with manual intervention: killing all (2 in my case) Weaviate pods manually helps.

Did anyone manage to solve this?

Best regards,
Andrii

hi @andrii !

Is this a new deployment using the Helm chart, or one migrated from before 1.25?

There are some changes related to this:

Let me know if this helps.

Thanks @DudaNogueira !

I have updated to Weaviate 1.26.3 and chart version 17.1.1, and increased the replicas to 3.

I am still having an issue with multiple replicas. Here is an example:
I have a weaviate-0 (leader) pod, and weaviate-1 and weaviate-2 (follower) pods.

When Kubernetes kills weaviate-2 to move it to another node, weaviate-2 fails to rejoin the cluster. I am seeing this in the leader's logs:

{"action":"raft-net","build_git_commit":"9a4ea6d","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"could not resolve server id weaviate-2","fallback":"10.0.174.130:8300","id":"weaviate-2","level":"warning","msg":"raft-net unable to get address for server, using fallback address","time":"2024-09-02T09:57:46Z"}
{"action":"raft","backoff time":500000000,"build_git_commit":"9a4ea6d","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","error":"dial tcp 10.0.173.129:8300: connect: connection refused","level":"error","msg":"raft failed to heartbeat to","peer":"10.0.174.130:8300","time":"2024-09-02T09:59:35Z"}

This can be resolved by deleting and recreating the StatefulSet.

Appreciate any feedback
Best regards

Here is the list of RAFT_- and CLUSTER_-related env variables that the chart sets for the StatefulSet. Is there anything missing that I need to add manually?


            - name: CLUSTER_DATA_BIND_PORT
              value: '7001'
            - name: CLUSTER_GOSSIP_BIND_PORT
              value: '7000'
            - name: RAFT_JOIN
              value: 'weaviate-0,weaviate-1,weaviate-2'
            - name: RAFT_BOOTSTRAP_EXPECT
              value: '3'
            - name: CLUSTER_JOIN
              value: weaviate-headless.flux-k8s.svc.cluster.local.
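
For reference, these are the raft port defaults I believe apply when they are not set explicitly (an assumption on my side, please correct me if they are off); the 8300 in the raft logs above matches the default raft port:

            # Assumed defaults, not set by the chart in my case; listed only for comparison
            - name: RAFT_PORT
              value: '8300'
            - name: RAFT_INTERNAL_RPC_PORT
              value: '8301'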

Hi! Is this IP actually pointing to a node?

It seems it is not able to connect to that :thinking:

Hey @DudaNogueira,

Sorry for going silent, and thank you for your assistance. At the risk of sounding a bit ridiculous, I can no longer reproduce the issue. I’m not sure why it happened in the first place or why it stopped, but everything seems fine now.

Oh, glad to hear that, @andrii !!

In case you run into any other issues, we'll be here to help you :slight_smile:

Thanks!

There is now what looks like a different issue, caused by the same process of replacing one pod in the cluster.
Still the same setup: 3 replicas, a collection with a replication factor of 3, and async replication enabled. When pod weaviate-1 got replaced, it started showing this message in its logs:

{"action":"async_replication","build_git_commit":"9a4ea6d","build_go_version":"go1.21.13","build_image_tag":"1.26.3","build_wv_version":"1.26.3","class_name":"AndriiTest","hashbeat_iteration":51,"level":"warning","msg":"hashbeat iteration failed: collecting differences: \"10.0.198.209:7001\": status code: 401, error: ","shard_name":"554JMVbS6e2m","time":"2024-09-05T17:25:51Z"}

When I try to call the v1/cluster/statistics endpoint, I get this error back:

{
    "error": [
        {
            "message": "node: weaviate-1: unexpected status code 401 ()"
        }
    ]
}

Restarting the StatefulSet helps, but it does not seem to be able to recover on its own.