Issue with Weaviate shards and ingestion

Description

I noticed that when we ingest data into our Weaviate cluster, from time to time some nodes/pods will lose connectivity and I see messages like this:

{"level":"error","msg":"\"10.89.161.62:7001\": connect: Post \"http://10.89.161.62:7001/replicas/indices/..../shards/CdbqEZkp7uPZ/objects?request_id=weaviate-0-64-191a5978c3f-ce00c\u0026schema_version=34\": dial tcp 10.89.161.62:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:20:58Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/8x9pGGpLfof6/objects?request_id=weaviate-0-64-191a5978c3f-ce002\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:20:58Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/05i1dRqNYiPq/objects?request_id=weaviate-0-64-191a597aa92-cfa26\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:21:05Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/URlrEqICLG0s/objects?request_id=weaviate-0-64-191a597aa91-cfa06\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:21:05Z"}
{"action":"raft","fields.time":514781564,"level":"warning","msg":"raft failed to contact","server-id":"weaviate-1","time":"2024-08-30T23:21:11Z"}
{"action":"raft","fields.time":512243048,"level":"warning","msg":"raft failed to contact","server-id":"weaviate-2","time":"2024-08-30T23:21:17Z"}

After some time things normalize. During ingestion our resource utilization is normal; we are not hitting any resource limits on the pods, so for now there is no need to scale them up.

Questions:

  1. Is there any useful tuning option or optimization we can apply to Weaviate that might help with this situation? So far we are running a pretty default config, and we have already upgraded our clients to v4 and gRPC. We’ll see what we can improve in the ingestion pipeline, but I was curious whether I can tune something on Weaviate itself too.

  2. On a separate topic, which Prometheus/Grafana metrics can I use to monitor shards and connections in Weaviate? I can’t seem to find any. What does Weaviate expose, and where can I check what’s available to me?

Server Setup Information

  • Weaviate Server Version: 1.25.0
  • Deployment Method: Helm on Kubernetes (AWS EKS)
  • Multi Node? Number of Running Nodes: 7 nodes (replication factor of 3 for our collection; multiple shards)
  • Client Language and Version: Python, client v3 and v4 (gRPC)

Any additional Information

hi @ivan075 !!

Welcome to our community :hugs:

Regarding tuning and optimization, there are some options described here: Resource Planning | Weaviate

Ingesting data is a heavily CPU-bound process, so it is worth keeping an eye on that resource usage and making sure you ingest at a rate that doesn’t overwhelm the resources you have.
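
One thing you can do purely on the client side is throttle the ingestion rate with the v4 batching helpers, so the cluster keeps headroom for indexing and replication. Here is a minimal sketch; the collection name "Articles", the sample objects, and the rate values are placeholders you would replace with your own:

  # Throttle ingestion on the client side (Python client v4).
  import weaviate

  objects = [{"title": "example"}]  # placeholder payloads; use your real data here

  client = weaviate.connect_to_local()  # replace with your own connection settings
  try:
      articles = client.collections.get("Articles")  # placeholder collection name

      # rate_limit() caps how fast the client sends objects; an alternative is
      # fixed_size(batch_size=..., concurrent_requests=...) to bound batch size
      # and parallelism instead.
      with articles.batch.rate_limit(requests_per_minute=600) as batch:
          for obj in objects:
              batch.add_object(properties=obj)

      # Inspect anything that failed instead of silently dropping it.
      if articles.batch.failed_objects:
          print(f"{len(articles.batch.failed_objects)} objects failed")
  finally:
      client.close()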

Keep in mind that at ingestion time, Weaviate not only receives the data and writes it to the database, but also builds the vector index (for similarity search) and the inverted index (for keyword search).
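
If most of the load comes from building the HNSW vector index, one server-side knob is the collection’s vector index configuration, e.g. ef_construction and max_connections, which trade indexing cost against recall. A small sketch with the v4 client; the collection name and the numeric values are placeholders you should benchmark for your own data:

  # Create a collection with a cheaper HNSW build (placeholder values).
  # Lower ef_construction / max_connections reduce indexing CPU per object,
  # at some cost to recall.
  import weaviate
  from weaviate.classes.config import Configure

  client = weaviate.connect_to_local()  # replace with your own connection settings
  try:
      client.collections.create(
          "Articles",  # placeholder collection name
          vector_index_config=Configure.VectorIndex.hnsw(
              ef_construction=96,
              max_connections=32,
          ),
          replication_config=Configure.replication(factor=3),
      )
  finally:
      client.close()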

Looking at your error logs, it seems that node is having a hard time connecting to nodes weaviate-1 and weaviate-2.

Do you have any resource readings for those nodes at that timestamp?
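
To correlate the errors with cluster state, you can also poll node status (including per-shard object counts) from the client while ingestion is running. A small sketch with the v4 client, assuming the default connection settings:

  # Check node and shard status while ingestion is running (Python client v4).
  import weaviate

  client = weaviate.connect_to_local()  # replace with your own connection settings
  try:
      for node in client.cluster.nodes(output="verbose"):
          print(node.name, node.status)
          for shard in node.shards or []:
              print("  ", shard.collection, shard.name, shard.object_count)
  finally:
      client.close()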

Regarding monitoring, here is the documentation on the metrics we can currently scrape with Prometheus:
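
If monitoring is enabled on your deployment (the server reads the PROMETHEUS_MONITORING_ENABLED environment variable and, by default, serves metrics on port 2112), a quick way to see what is exposed is to hit the /metrics endpoint directly. A minimal sketch, assuming you have port-forwarded a pod’s metrics port to localhost:

  # Quick overview of the metrics a Weaviate pod exposes.
  # Assumes PROMETHEUS_MONITORING_ENABLED=true on the pods and a port-forward,
  # e.g.: kubectl port-forward pod/weaviate-0 2112:2112
  import requests

  resp = requests.get("http://localhost:2112/metrics", timeout=10)
  resp.raise_for_status()

  # Print only the HELP lines to list the available metric names.
  for line in resp.text.splitlines():
      if line.startswith("# HELP"):
          print(line)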

Let me know if this helps!

Thanks!