Description
I noticed that when we ingest data into our Weaviate cluster, from time to time some nodes/pods lose connectivity and I see messages like this:
{"level":"error","msg":"\"10.89.161.62:7001\": connect: Post \"http://10.89.161.62:7001/replicas/indices/..../shards/CdbqEZkp7uPZ/objects?request_id=weaviate-0-64-191a5978c3f-ce00c\u0026schema_version=34\": dial tcp 10.89.161.62:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:20:58Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/8x9pGGpLfof6/objects?request_id=weaviate-0-64-191a5978c3f-ce002\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:20:58Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/05i1dRqNYiPq/objects?request_id=weaviate-0-64-191a597aa92-cfa26\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:21:05Z"}
{"level":"error","msg":"\"10.89.139.73:7001\": connect: Post \"http://10.89.139.73:7001/replicas/indices/..../shards/URlrEqICLG0s/objects?request_id=weaviate-0-64-191a597aa91-cfa06\u0026schema_version=34\": dial tcp 10.89.139.73:7001: connect: connection refused","op":"broadcast","time":"2024-08-30T23:21:05Z"}
{"action":"raft","fields.time":514781564,"level":"warning","msg":"raft failed to contact","server-id":"weaviate-1","time":"2024-08-30T23:21:11Z"}
{"action":"raft","fields.time":512243048,"level":"warning","msg":"raft failed to contact","server-id":"weaviate-2","time":"2024-08-30T23:21:17Z"}
After some time things normalize. During ingestion our resource utilization is normal; we are not hitting any resource limits on the pods, so for now there is no need to scale them up.
Questions:
- Is there any useful tuning option or optimization we can do on Weaviate that might help with this situation? So far we are running with a pretty default config, and we have already upgraded our clients to v4 and gRPC. We'll see what we can improve in the ingestion pipeline (the first sketch below shows the direction I have in mind), but I was curious whether there is something I can tune on Weaviate itself too.
- On a separate topic: which Prometheus/Grafana metrics can I use to monitor Weaviate shards and connections? I can't seem to find any. What does Weaviate expose, and where can I see what's available to me? (The second sketch below is how I'm planning to check.)
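In case it helps to have something concrete to react to, here is a minimal sketch of the client-side change I'm considering for the ingestion pipeline, using the v4 Python client's fixed-size batching to cap concurrent requests and collect failed objects for a retry pass. The hostnames, collection name, properties, and batch numbers are placeholders, not our real values:

```python
import weaviate

# All names below (hosts, collection, properties, batch sizes) are placeholders.
client = weaviate.connect_to_custom(
    http_host="weaviate.example.internal",
    http_port=80,
    http_secure=False,
    grpc_host="weaviate-grpc.example.internal",
    grpc_port=50051,
    grpc_secure=False,
)


def load_objects():
    """Hypothetical stand-in for our real data source."""
    yield {"title": "example", "body": "..."}


try:
    collection = client.collections.get("Articles")

    # fixed_size batching caps the number of in-flight requests, which should be
    # gentler on the cluster than fully dynamic batching while nodes are
    # intermittently refusing connections
    with collection.batch.fixed_size(batch_size=200, concurrent_requests=2) as batch:
        for obj in load_objects():
            batch.add_object(properties=obj)

    # objects that could not be written during the error windows land here
    # and can be retried in a second pass
    failed = collection.batch.failed_objects
    if failed:
        print(f"{len(failed)} objects failed; retrying them separately")
finally:
    client.close()
```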
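For the monitoring question, this is roughly how I plan to poke at what's available. I'm assuming the behaviour where setting PROMETHEUS_MONITORING_ENABLED=true exposes metrics on port 2112 at /metrics; the hostnames are again placeholders:

```python
import urllib.request

import weaviate

# Hostnames/ports are placeholders; the :2112/metrics endpoint assumes
# PROMETHEUS_MONITORING_ENABLED=true is set on the pods.
client = weaviate.connect_to_custom(
    http_host="weaviate.example.internal",
    http_port=80,
    http_secure=False,
    grpc_host="weaviate-grpc.example.internal",
    grpc_port=50051,
    grpc_secure=False,
)

try:
    # per-node and per-shard view straight from the client API
    for node in client.cluster.nodes(output="verbose"):
        shard_count = len(node.shards or [])
        print(f"{node.name}: status={node.status}, shards={shard_count}")
finally:
    client.close()

# list the raw Prometheus metric names exposed by one pod
with urllib.request.urlopen("http://weaviate-0.example.internal:2112/metrics") as resp:
    body = resp.read().decode()
metric_names = sorted({line.split()[2] for line in body.splitlines()
                       if line.startswith("# HELP")})
print("\n".join(metric_names))
```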
Server Setup Information
- Weaviate Server Version: 1.25.0
- Deployment Method: deployed with Helm on Kubernetes (AWS EKS)
- Multi Node? Number of Running Nodes: 7 nodes (replication factor 3 for our collection; multiple shards)
- Client Language and Version: Python, client versions 3 and 4 (gRPC)