Write timeout in combination with replicas

Description

We’re experiencing intermittent timeout issues with Weaviate, occurring at least twice daily. During these events, adding documents via both /v1/batch/objects and /v1/objects endpoints becomes impossible. Increasing the timeout duration does not resolve the issue; requests hang indefinitely.

Interestingly, during these periods, search functionality remains unaffected and continues to perform well, indicating that vector creation itself is operational.

We initially encountered this behavior on version v1.29, testing with both asynchronous and synchronous replication modes. After downgrading back to v1.28, the issue persisted unchanged. Restarting the instance temporarily resolves the issue.

Our suspicion is that this relates to replication mechanisms, potentially involving write locks on indexes. A separate cluster configuration with three shards and no replication does not exhibit this problem and operates normally.

Server Setup Information

  • Weaviate Server Version: both v1.28.11 and v1.29.0
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 6 (3 shards, replicated once)
  • Client Language and Version: PHP (via the REST API)
  • Multitenancy?: nope

Any additional Information

The logs look normal, no errors whatsoever. We have tried to debug this, but without any success.

Anyone any ideas?

hi @joris !!

Welcome back :slight_smile:

Some questions:

Are you using ASYNC_INDEXING?

Can you use gRPC? You are ingesting using PHP, meaning you are using REST and not gRPC. This is certainly not optimal and can be the bottleneck here. We have made a lot of improvements in gRPC, especially for ingestion.

Can you try pushing some dataset using, for example, Python and batch?
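
For reference, a minimal sketch of a batched import with the v4 Python client (which talks gRPC for ingestion); the collection name, properties, and connection details are placeholders for your own setup:

    import weaviate

    # Sketch only: assumes Weaviate is reachable on the default local ports
    # (REST 8080, gRPC 50051); adjust the connection for your cluster.
    client = weaviate.connect_to_local()

    # "Document" and its properties are placeholders for your own schema.
    documents = [{"title": f"doc {i}", "body": "..."} for i in range(1000)]

    collection = client.collections.get("Document")
    with collection.batch.dynamic() as batch:
        for doc in documents:
            batch.add_object(properties=doc)

    # Inspect anything that failed to import once the batch has flushed.
    if collection.batch.failed_objects:
        print(f"{len(collection.batch.failed_objects)} objects failed")

    client.close()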

Do you have any readings on objects_durations_ms or batch_duration_ms? Monitoring | Weaviate
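
Once Prometheus is scraping Weaviate, something along these lines could pull a p95 reading for batch durations (a sketch: the Prometheus URL is a placeholder, and it assumes batch_durations_ms is exported as a histogram):

    import requests

    # Placeholder URL: point this at your own Prometheus instance.
    PROMETHEUS = "http://prometheus.monitoring.svc.cluster.local:9090"

    # Assumes batch_durations_ms is a histogram, so *_bucket series exist.
    query = (
        "histogram_quantile(0.95, "
        "sum by (le) (rate(batch_durations_ms_bucket[5m])))"
    )
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": query})
    for sample in resp.json()["data"]["result"]:
        print(sample["metric"], sample["value"])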

Let me know if that helps!

Hi @DudaNogueira ,

Thanks for the reply.

We are indeed using Async indexing. I’ll try to use gRPC the next time we encounter this issue.

We do not have the instance hooked up to a monitoring system. That’s a great suggestion and I’ll look into this.

To be continued!

Great!!! Let us know about your findings :slight_smile:

Thanks!

I’ve tried to set up the monitoring. It was kind of a hassle, since I needed to add all instances manually to the scrape config because we cannot use ServiceMonitors.

The example dashboards needed some adjustments to make sure we use the right data source. Unfortunately, your dashboards don’t have a data source selection dropdown.

I’m now seeing some data, but there is a LOT of “n/a” in the different lists, which makes me wonder if I configured them correctly.


    # Weaviate exposes Prometheus metrics on port 2112; this requires
    # PROMETHEUS_MONITORING_ENABLED=true on the Weaviate pods.
    - job_name: weaviate
      scrape_interval: 2s
      static_configs:
        - targets:
          - weaviate-0.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-1.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-2.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-3.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-4.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-5.weaviate-headless.weaviate-a.svc.cluster.local:2112

hi @Bjorn !!

Welcome to our community :hugs:

Indeed, our Grafana dashboards need some :heart:

We have that mapped out. We even have some new metrics being exposed.

By the way, there is an updated version of those dashboards here:

Thanks!

Hi @DudaNogueira,

We’ve been monitoring the instances closely over the past two weeks. Unfortunately, we haven’t observed any unusual metrics or signs pointing to the underlying issue. Specifically, during the moments before the instances stop accepting batch requests, all metrics appear completely normal. Even during these periods, nothing out of the ordinary stands out.

Do you have any additional recommendations on areas we could further investigate or specific metrics to focus on?

Thanks again for your support!

Hi @joris !!

Nothing in the logs either? That’s strange.

Just to confirm, you are now using batch/gRPC instead of sending directly through REST, right?

We have not yet tried gRPC instead of the REST API. Let me get back to you after we have tried gRPC.

@joris Also, is your client connecting to Weaviate through a reverse proxy?

We have seen situations where the reverse proxy may be timing out, which would also explain Weaviate not showing any resource pressure while requests still time out :thinking:

No, our clients are not behind a reverse proxy.

We actually use the internal service URL directly, i.e. “http://weaviate.[namespace]”, which is on the same cluster and nodes.
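
For when we do try gRPC: a rough sketch of how a client could reach that same in-cluster service over both HTTP and gRPC (assuming the v4 Python client; the gRPC host and port are assumptions, since the chart may expose gRPC through a separate service on 50051):

    import weaviate

    # Sketch only: hosts and ports are assumptions based on our setup above;
    # the gRPC endpoint may need its own Kubernetes service depending on the chart.
    client = weaviate.connect_to_custom(
        http_host="weaviate.weaviate-a.svc.cluster.local",
        http_port=80,
        http_secure=False,
        grpc_host="weaviate-grpc.weaviate-a.svc.cluster.local",
        grpc_port=50051,
        grpc_secure=False,
    )
    print(client.is_ready())
    client.close()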

I have complete metrics by now. What should I look out for? I’m not too sure yet.

We just got another timeout: cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see libcurl - Error Codes) for http://weaviate.weaviate-a/v1/graphql.

When I look at the metrics around that time, there does seem to be a small spike in avg(sum by() (rate(vector_index_queue_delete_count[1m0s]))) at that same moment.

There is also a peak in “get_true_positive” and, of course, in query latency, which makes sense (the 90th percentile is around 50s). I also see a small peak in “Active Tombstones in HNSW index”.

Not sure what to do here.

It does seem to be related to deletes; I see timeouts every time there are small peaks in deletes.

hi @Bjorn !

Do you see anything standing out in the logs when that timeout occurred?

The fact that it is exactly 30s suggests some default timeout value is being hit.

Are you doing the query directly using curl, or using our Python client to perform raw GraphQL queries?
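
For reference, a raw GraphQL query through the Python client could look roughly like this (a sketch assuming the v4 client’s graphql_raw_query helper; the collection and field names are placeholders):

    import weaviate

    client = weaviate.connect_to_local()  # or connect_to_custom(...) for the cluster

    # Placeholder query: "Document" and "title" stand in for your own schema.
    result = client.graphql_raw_query(
        """
        {
          Get {
            Document(limit: 1) {
              title
            }
          }
        }
        """
    )
    print(result.get)

    client.close()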

We use GitHub - timkley/weaviate-php: A PHP client to communicate with a Weaviate instance.

Also possibly relevant is my thread on Slack: Slack