Cluster inconsistency and cpu usage

KCog · October 13, 2023, 12:09pm

Hi, we ran into problems with a cluster with 3 replicas.

The 3rd replica failed to start with the following error logs:

{"action":"startup_cluster_schema_sync","diff":["local has 322 classes, but cluster has 321 classes","class C_651fdeebf176f871a80562de_enUS exists in local, but not in cluster"],"level":"error","msg":"mismatch between local schema and remote (other nodes consensus) schema","time":"2023-10-06T10:47:47Z"}
{"action":"startup","error":"could not load or initialize schema: sync schema with other nodes in the cluster: corrupt cluster: other nodes have consensus on schema, but local node has a different (non-null) schema","level":"fatal","msg":"could not initialize schema manager","time":"2023-10-06T10:47:47Z"}
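For anyone who wants to reproduce this comparison by hand, here is a minimal sketch in Python. It assumes you have already saved the schema from the failing node and from a healthy node as JSON shaped like the `GET /v1/schema` response (a top-level `classes` list whose entries carry a `class` name). The function names are hypothetical, and the message wording only imitates the log above; this is not Weaviate's own sync code.

```python
def class_names(schema):
    """Collect class names from a dict shaped like a GET /v1/schema response."""
    return {c["class"] for c in schema.get("classes", [])}


def diff_schemas(local, cluster):
    """Describe how the local node's classes differ from the cluster's.

    The message format is modeled on the startup log above (an assumption,
    not Weaviate's implementation).
    """
    local_names = class_names(local)
    cluster_names = class_names(cluster)
    diff = []
    if len(local_names) != len(cluster_names):
        diff.append(
            f"local has {len(local_names)} classes, "
            f"but cluster has {len(cluster_names)} classes"
        )
    # Classes only the local node knows about.
    for name in sorted(local_names - cluster_names):
        diff.append(f"class {name} exists in local, but not in cluster")
    # Classes the local node is missing.
    for name in sorted(cluster_names - local_names):
        diff.append(f"class {name} exists in cluster, but not in local")
    return diff
```

An empty list means both nodes agree on the set of class names. The sketch does not compare properties or settings inside each class.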

Scaling down to 2 replicas and scaling up again didn’t resolve the issue. We ended up deleting the volume of the 3rd replica, and everything seemed to be fine afterwards.

Related or not, some time later we hit another problem in this cluster: the first 2 replicas sit at a constant 100% CPU usage (with CPU throttling), while the 3rd replica has very low CPU usage. Updates and search still work.

In all 3 replicas we see a lot of “context canceled” errors, e.g.:

2023-10-10T09:35:08+02:00 {"level":"error","msg":"\"10.7.64.14:7001\": connect: Patch \"http://10.7.64.14:7001/replicas/indices/C_6524ea3823742848c7547e53_enUS/shards/TIhpJZ5a7rXp/objects/85478e9d-3a5a-4a16-ab27-c6c0535b5fa1?request_id=weaviate-1-02-18b1882d2b5-164\": context canceled","op":"broadcast","time":"2023-10-10T07:35:08Z"}
2023-10-10T09:57:44+02:00 {"level":"error","msg":"\"10.7.64.14:7001\": connect: Post \"http://10.7.64.14:7001/replicas/indices/C_6524f7c7e1f5116d8f8ff154_enUS/shards/FYmnc47ubeYR/objects?request_id=weaviate-1-01-18b18978205-1\": context canceled","op":"broadcast","time":"2023-10-10T07:57:44Z"}
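To see whether these errors all point at the same node, a small tally over saved log lines can help. This is a minimal sketch in Python that assumes each line has the shape shown above (a timestamp prefix followed by a JSON object with `msg` and `op` fields). The function name and the regular expression are hypothetical and were written only against these two sample lines.

```python
import json
import re
from collections import Counter

# Matches the target URL inside "msg", e.g.
# Patch "http://<host>/replicas/indices/<class>/shards/<shard>/...
TARGET_RE = re.compile(
    r'[A-Z][a-z]+ "http://(?P<host>[^/"]+)/replicas/indices/'
    r'(?P<cls>[^/"]+)/shards/(?P<shard>[^/"?]+)'
)


def tally_canceled_broadcasts(lines):
    """Count "context canceled" broadcast errors per target host."""
    counts = Counter()
    for line in lines:
        start = line.find("{")  # skip the timestamp prefix
        if start == -1:
            continue
        try:
            entry = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # not a JSON log line
        msg = entry.get("msg", "")
        if entry.get("op") != "broadcast" or "context canceled" not in msg:
            continue
        match = TARGET_RE.search(msg)
        if match:
            counts[match["host"]] += 1
    return counts
```

Running this separately over each replica's logs shows which targets the canceled requests were addressed to. Keying the counter on `match["cls"]` or `match["shard"]` instead would group the same errors by class or shard.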

Any advice on how we can solve these problems?

We also noticed a misconfiguration on our end: we use Helm chart v16.1.0, but image tag 1.19.0.
We are not sure whether this causes the issues.

Thanks!

DudaNogueira · October 17, 2023, 1:21pm

Hi @KCog! Welcome to our community!

I am not sure, but scaling down may lead to other issues. We have a nice repo that tries to replicate those scenarios:

I have asked internally what one should do when faced with a class mismatch, and I hope to get back here with more info when possible.

Thanks!
