We’re experiencing intermittent timeout issues with Weaviate, occurring at least twice daily. During these events, adding documents via both the /v1/batch/objects and /v1/objects endpoints becomes impossible. Increasing the timeout duration does not resolve the issue; requests hang indefinitely.
Interestingly, during these periods, search functionality remains unaffected and continues to perform well, indicating that vector creation itself is operational.
We initially encountered this behavior on version v1.29, testing with both asynchronous and synchronous replication modes. After downgrading back to v1.28, the issue persisted unchanged. Restarting the instance temporarily resolves the issue.
Our suspicion is that this relates to replication mechanisms, potentially involving write locks on indexes. A separate cluster configuration with three shards and no replication does not exhibit this problem and operates normally.
Server Setup Information
Weaviate Server Version: both v1.28.11 and v1.29.0
Deployment Method: k8s
Multi Node? Number of Running Nodes: 6 (3 shards, replicated once)
Client Language and Version: PHP, using the REST API
Multitenancy?: nope
Any additional Information
The logs look normal, no errors whatsoever. We have tried to debug this, but without any success.
Can you use gRPC? You are ingesting with PHP, which means going over REST rather than gRPC. That is certainly not optimal and could be the bottleneck here. We have made a lot of improvements to gRPC, especially for ingestion.
Can you try pushing some dataset using, for example, Python and batching?
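Something along these lines would do as a minimal sketch, assuming the Python client v4; the hostnames, ports and the "Document" collection are placeholders for your setup:

```python
import weaviate

# Placeholder connection details; adjust to your cluster.
client = weaviate.connect_to_custom(
    http_host="weaviate.example.internal",
    http_port=8080,
    http_secure=False,
    grpc_host="weaviate.example.internal",
    grpc_port=50051,
    grpc_secure=False,
)

documents = client.collections.get("Document")

# Dynamic batching in the v4 client sends objects over gRPC and sizes batches automatically.
with documents.batch.dynamic() as batch:
    for i in range(10_000):
        batch.add_object(properties={"title": f"doc-{i}", "body": "lorem ipsum"})

# Check for per-object errors after the batch context closes.
if documents.batch.failed_objects:
    print(f"{len(documents.batch.failed_objects)} objects failed, first error:")
    print(documents.batch.failed_objects[0].message)

client.close()
```

If that ingests cleanly during one of the timeout windows, it would point at the REST path (or something in front of it) rather than the cluster itself.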
Do you have any readings on objects_durations_ms or batch_duration_ms? See the Monitoring docs: Monitoring | Weaviate
I’ve tried to set up the monitoring. It was kind of a hassle, since I needed to add all instances manually to the scrape config because we cannot use ServiceMonitors.
The example dashboards needed some adjustments to make sure we use the right data source. Unfortunately, your dashboards don’t have a data source selection dropdown.
I’m now seeing some data, but there are a LOT of n/a entries in the different lists, which makes me wonder if I configured them correctly.
We’ve been monitoring the instances closely over the past two weeks. Unfortunately, we haven’t observed any unusual metrics or signs pointing to the underlying issue. Specifically, during the moments before the instances stop accepting batch requests, all metrics appear completely normal. Even during these periods, nothing out of the ordinary stands out.
Do you have any additional recommendations on areas we could further investigate or specific metrics to focus on?
@joris Also, is your client connecting to Weaviate through a reverse proxy?
We have seen situations where the reverse proxy was the one timing out, which would also explain Weaviate showing no resource pressure while requests still time out.
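Something like the sketch below, run during the next incident, would help tell the two apart; the hostnames and the "Document" class are placeholders, and you would add auth headers if your cluster requires them:

```python
import time
import requests

# Send the same small write both through the reverse proxy and directly to the
# Weaviate service, with an explicit client-side timeout. If the direct call also
# hangs, the proxy is off the hook; if only the proxied call times out, the proxy
# is the bottleneck. URLs and the "Document" class are placeholders.
ENDPOINTS = {
    "via proxy": "https://weaviate.example.com/v1/objects",
    "direct service": "http://weaviate.weaviate.svc.cluster.local:8080/v1/objects",
}
payload = {"class": "Document", "properties": {"title": "timeout-probe"}}

for name, url in ENDPOINTS.items():
    started = time.monotonic()
    try:
        resp = requests.post(url, json=payload, timeout=30)
        print(f"{name}: HTTP {resp.status_code} in {time.monotonic() - started:.2f}s")
    except requests.RequestException as exc:
        print(f"{name}: failed after {time.monotonic() - started:.2f}s ({exc})")
```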
When I look at the metrics at that time, there does seem to be a small spike in avg(sum by() (rate(vector_index_queue_delete_count[1m0s]))) at that same moment.
There is also a peak in “get_true_positive” and, of course, in query latency, which makes sense (the 90th percentile is around 50s). I also see a small peak in “Active Tombstones in HNSW index”.
Not sure what to do here.
It does seem to be related to deletes; I see timeouts every time there are small peaks in deletes.
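Roughly how those series could be pulled out of Prometheus to line the delete spikes up with the timeout windows (just a sketch; the Prometheus URL, the incident timestamp and the tombstone metric name are placeholders/guesses on my side):

```python
import datetime as dt

import requests

# Placeholder Prometheus endpoint.
PROMETHEUS = "http://prometheus:9090"

def query_range(expr, start, end, step="30s"):
    """Fetch a range query from the Prometheus HTTP API."""
    resp = requests.get(
        f"{PROMETHEUS}/api/v1/query_range",
        params={"query": expr, "start": start.timestamp(), "end": end.timestamp(), "step": step},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["data"]["result"]

# Roughly one hour around one of the timeout windows (placeholder timestamp).
incident = dt.datetime(2025, 3, 10, 14, 0, tzinfo=dt.timezone.utc)
start, end = incident - dt.timedelta(minutes=30), incident + dt.timedelta(minutes=30)

expressions = [
    "avg(sum by() (rate(vector_index_queue_delete_count[1m])))",  # delete rate from the dashboard
    "sum(vector_index_tombstones)",  # my guess at the metric behind "Active Tombstones in HNSW index"
]

for expr in expressions:
    print(expr)
    for series in query_range(expr, start, end):
        for ts, value in series["values"]:
            print(" ", dt.datetime.fromtimestamp(float(ts), dt.timezone.utc), value)
```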