Issue with Weaviate shards and ingestion

Description

I noticed that when we ingest data into our Weaviate cluster, from time to time some nodes/pods will lose connectivity and I see messages like this:

{"level":"error","msg":"\"10.89.161.62:7001\": connect: Post \"http://10.89.161.62:7001/replicas/indices/..../shards/CdbqEZkp7uPZ/objects?request_id=weaviate-0-64-191a5978c3f-ce00c\u0026schema_version=34\": dial tcp 10.89.161.62:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:20:58Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/8x9pGGpLfof6/objects?request_id=weaviate-0-64-191a5978c3f-ce002\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:20:58Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/05i1dRqNYiPq/objects?request_id=weaviate-0-64-191a597aa92-cfa26\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:21:05Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/URlrEqICLG0s/objects?request_id=weaviate-0-64-191a597aa91-cfa06\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:21:05Z"}
{"action":"raft","fields.time":514781564,"level":"warning","msg":"raft failed to contact","server-id":"weaviate-1","time":"2024-08-30T23:21:11Z"}
{"action":"raft","fields.time":512243048,"level":"warning","msg":"raft failed to contact","server-id":"weaviate-2","time":"2024-08-30T23:21:17Z"}

After some time things normalize. During ingestion our resource utilization is normal; we are not hitting any resource limits on the pods, so for now there is no need to scale them up.

Questions:

  1. Is there any useful tuning option or optimization we can apply to Weaviate that might help with this situation? So far we are running a pretty default config, and we have already upgraded our clients to v4 and gRPC. We’ll see what we can improve in the ingestion pipeline, but I was curious whether I can tune something on Weaviate itself too.

  2. On a separate topic, which Prometheus/Grafana metrics can I use to monitor shards and connections in Weaviate? I can’t seem to find any. What does Weaviate expose, and where can I check what’s available to me?

Server Setup Information

  • Weaviate Server Version: 1.25.0
  • Deployment Method: Helm on Kubernetes (AWS EKS)
  • Multi Node? Number of Running Nodes: 7 nodes (replication factor of 3 for our collection; multiple shards)
  • Client Language and Version: Python, client v3 and v4 (gRPC)

Any additional Information

hi @ivan075 !!

Welcome to our community :hugs:

Regarding tuning and optimization, there are some options described here: Resource Planning | Weaviate

Ingesting data is a heavily CPU-bound process, so it is worth keeping an eye on that resource usage and making sure you ingest at a rate that doesn’t overwhelm the resources you have.
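
One thing you can do purely on the client side is throttle the ingestion rate with the v4 batching helpers, so the cluster keeps headroom for indexing and replication. Here is a minimal sketch; the collection name "Articles", the sample objects, and the rate values are placeholders you would replace with your own:

  # Throttle ingestion on the client side (Python client v4).
  import weaviate

  objects = [{"title": "example"}]  # placeholder payloads; use your real data here

  client = weaviate.connect_to_local()  # replace with your own connection settings
  try:
      articles = client.collections.get("Articles")  # placeholder collection name

      # rate_limit() caps how fast the client sends objects; an alternative is
      # fixed_size(batch_size=..., concurrent_requests=...) to bound batch size
      # and parallelism instead.
      with articles.batch.rate_limit(requests_per_minute=600) as batch:
          for obj in objects:
              batch.add_object(properties=obj)

      # Inspect anything that failed instead of silently dropping it.
      if articles.batch.failed_objects:
          print(f"{len(articles.batch.failed_objects)} objects failed")
  finally:
      client.close()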

Keep in mind that at ingestion time, Weaviate not only receives the data and writes it to the database, but also builds the vector index (for similarity search) and the inverted index (for keyword search).
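
If most of the load comes from building the HNSW vector index, one server-side knob is the collection’s vector index configuration, e.g. ef_construction and max_connections, which trade indexing cost against recall. A small sketch with the v4 client; the collection name and the numeric values are placeholders you should benchmark for your own data:

  # Create a collection with a cheaper HNSW build (placeholder values).
  # Lower ef_construction / max_connections reduce indexing CPU per object,
  # at some cost to recall.
  import weaviate
  from weaviate.classes.config import Configure

  client = weaviate.connect_to_local()  # replace with your own connection settings
  try:
      client.collections.create(
          "Articles",  # placeholder collection name
          vector_index_config=Configure.VectorIndex.hnsw(
              ef_construction=96,
              max_connections=32,
          ),
          replication_config=Configure.replication(factor=3),
      )
  finally:
      client.close()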

Looking at your error logs, it seems that node is having a hard time connecting to nodes weaviate-1 and weaviate-2.

Do you have any resource readings for those nodes at that timestamp?
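
To correlate the errors with cluster state, you can also poll node status (including per-shard object counts) from the client while ingestion is running. A small sketch with the v4 client, assuming the default connection settings:

  # Check node and shard status while ingestion is running (Python client v4).
  import weaviate

  client = weaviate.connect_to_local()  # replace with your own connection settings
  try:
      for node in client.cluster.nodes(output="verbose"):
          print(node.name, node.status)
          for shard in node.shards or []:
              print("  ", shard.collection, shard.name, shard.object_count)
  finally:
      client.close()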

Regarding monitoring, here is the documentation on the metrics we can currently scrape with Prometheus:
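
If monitoring is enabled on your deployment (the server reads the PROMETHEUS_MONITORING_ENABLED environment variable and, by default, serves metrics on port 2112), a quick way to see what is exposed is to hit the /metrics endpoint directly. A minimal sketch, assuming you have port-forwarded a pod’s metrics port to localhost:

  # Quick overview of the metrics a Weaviate pod exposes.
  # Assumes PROMETHEUS_MONITORING_ENABLED=true on the pods and a port-forward,
  # e.g.: kubectl port-forward pod/weaviate-0 2112:2112
  import requests

  resp = requests.get("http://localhost:2112/metrics", timeout=10)
  resp.raise_for_status()

  # Print only the HELP lines to list the available metric names.
  for line in resp.text.splitlines():
      if line.startswith("# HELP"):
          print(line)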

Let me know if this helps!

Thanks!