Cluster inconsistency and CPU usage

Hi, we ran into problems with a cluster with 3 replicas.

The 3rd replica failed with the following error logs:

{"action":"startup_cluster_schema_sync","diff":["local has 322 classes, but cluster has 321 classes","class C_651fdeebf176f871a80562de_enUS exists in local, but not in cluster"],"level":"error","msg":"mismatch between local schema and remote (other nodes consensus) schema","time":"2023-10-06T10:47:47Z"}
{"action":"startup","error":"could not load or initialize schema: sync schema with other nodes in the cluster: corrupt cluster: other nodes have consensus on schema, but local node has a different (non-null) schema","level":"fatal","msg":"could not initialize schema manager","time":"2023-10-06T10:47:47Z"}

Scaling down to 2 replicas and scaling up again didn’t resolve the issue. We ended up deleting the volume of the 3rd replica. Everything seemed to be fine afterwards.

Related or not, some time later we had the following problem in this cluster: the first 2 replicas show constant CPU usage of 100% (with CPU throttling), while the 3rd replica has very low CPU usage. Updates and search still work.

In all 3 replicas we see a lot of “context canceled” errors, e.g.:

2023-10-10T09:35:08+02:00 {"level":"error","msg":"\"10.7.64.14:7001\": connect: Patch \"http://10.7.64.14:7001/replicas/indices/C_6524ea3823742848c7547e53_enUS/shards/TIhpJZ5a7rXp/objects/85478e9d-3a5a-4a16-ab27-c6c0535b5fa1?request_id=weaviate-1-02-18b1882d2b5-164\": context canceled","op":"broadcast","time":"2023-10-10T07:35:08Z"}
2023-10-10T09:57:44+02:00 {"level":"error","msg":"\"10.7.64.14:7001\": connect: Post \"http://10.7.64.14:7001/replicas/indices/C_6524f7c7e1f5116d8f8ff154_enUS/shards/FYmnc47ubeYR/objects?request_id=weaviate-1-01-18b18978205-1\": context canceled","op":"broadcast","time":"2023-10-10T07:57:44Z"}
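
To see whether these errors concentrate on one target node or class, a rough Python sketch like the following can tally them. It assumes the line format shown above (an optional timestamp prefix followed by a JSON object, with the target URL embedded in `msg`); the function name `tally_broadcast_errors` and the regular expression are illustrative assumptions, not anything provided by Weaviate.

```python
import json
import re
from collections import Counter

# Matches the replica URL embedded in the "msg" field of the logs above.
_TARGET = re.compile(
    r'http://(?P<host>[^/]+)/replicas/indices/(?P<cls>[^/]+)/shards/(?P<shard>[^/?"]+)'
)


def tally_broadcast_errors(log_lines):
    """Count 'context canceled' broadcast errors per (target host, class)."""
    counts = Counter()
    for line in log_lines:
        start = line.find("{")  # skip any timestamp prefix before the JSON
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or record.get("op") != "broadcast":
            continue
        msg = record.get("msg", "")
        if "context canceled" not in msg:
            continue
        match = _TARGET.search(msg)
        if match:
            counts[(match.group("host"), match.group("cls"))] += 1
    return counts
```

Calling `tally_broadcast_errors(lines).most_common(10)` on the collected logs of each replica lists the most frequently failing targets first.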

Any advice on how we can solve these problems?

We noticed a misconfiguration on our end: we use Helm chart v16.1.0, but image tag 1.19.0.
We are not sure whether this causes the issues.

Thanks!

Hi @KCog ! Welcome to our community! :hugs:

I am not sure, but scaling down may lead to other issues. We have a nice repo that tries to replicate those scenarios:

I have asked internally what one should do when faced with a class mismatch, and I hope to get back here with more info when possible.

Thanks!