Write timeout in combination with replicas

Description

We’re experiencing intermittent timeout issues with Weaviate, occurring at least twice daily. During these events, adding documents via both /v1/batch/objects and /v1/objects endpoints becomes impossible. Increasing the timeout duration does not resolve the issue; requests hang indefinitely.

Interestingly, during these periods, search functionality remains unaffected and continues to perform well, indicating that vector creation itself is operational.

We initially encountered this behavior on version v1.29, testing with both asynchronous and synchronous replication modes. After downgrading back to v1.28, the issue persisted unchanged. Restarting the instance temporarily resolves the issue.

Our suspicion is that this relates to replication mechanisms, potentially involving write locks on indexes. A separate cluster configuration with three shards and no replication does not exhibit this problem and operates normally.

Server Setup Information

  • Weaviate Server Version: both v1.28.11 and v1.29.0
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 6 (3 shards, replicated once)
  • Client Language and Version: PHP using API
  • Multitenancy?: nope

Any additional Information

The logs look normal, no errors whatsoever. We have tried to debug this but without any success.

Does anyone have any ideas?

hi @joris !!

Welcome back :slight_smile:

Some questions:

Are you using ASYNC_INDEXING?

Can you use gRPC? You are ingesting using PHP, meaning REST and not gRPC. This is certainly not optimal and can be the bottleneck here. We did a lot of improvements in gRPC, especially for ingestion.

Can you try pushing some dataset using, for example, Python and batch?
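
Something like this rough sketch with the Python v4 client, for example; the collection name "Document", the service hostnames and the ports are placeholders you would need to adjust to your cluster:

    import weaviate

    # Connect over HTTP + gRPC; hostnames and ports below are placeholders.
    client = weaviate.connect_to_custom(
        http_host="weaviate.weaviate-a",
        http_port=80,
        http_secure=False,
        grpc_host="weaviate-grpc.weaviate-a",  # assumed name of the gRPC service
        grpc_port=50051,
        grpc_secure=False,
    )

    try:
        documents = client.collections.get("Document")  # placeholder collection name
        # Dynamic batching sends the objects over gRPC and adapts the batch size automatically.
        with documents.batch.dynamic() as batch:
            for i in range(10_000):
                batch.add_object(properties={"title": f"doc {i}", "body": "..."})
        # Check for per-object errors once the batch has finished.
        if documents.batch.failed_objects:
            print(f"{len(documents.batch.failed_objects)} objects failed")
    finally:
        client.close()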

Do you have any readings on objects_durations_ms or batch_duration_ms? Monitoring | Weaviate

Let me know if that helps!

Hi @DudaNogueira ,

Thanks for the reply.

We are indeed using Async indexing. I’ll try to use gRPC the next time we encounter this issue.

We do not have the instance hooked up to a monitoring system. That’s a great suggestion and I’ll look into this.

To be continued!


Great!!! Let us know about your findings :slight_smile:

Thanks!

I’ve tried to set up the monitoring. It was kind of a hassle, since I needed to add all instances manually to the scrape config because we cannot use ServiceMonitors.

The example dashboards needed some adjustments to make sure we use the right data source. Unfortunately, your dashboards don’t have a data source selection dropdown.

I’m now seeing some data, but there is a LOT of n/a n/a in the different lists, which makes me wonder if I configured them correctly.


    - job_name: weaviate
      scrape_interval: 2s
      static_configs:
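        # One target per pod since ServiceMonitors are not available in this setup;
        # :2112 is Weaviate's default metrics port (requires PROMETHEUS_MONITORING_ENABLED=true on the pods).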
        - targets:
          - weaviate-0.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-1.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-2.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-3.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-4.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-5.weaviate-headless.weaviate-a.svc.cluster.local:2112

hi @Bjorn !!

Welcome to our community :hugs:

Indeed, our Grafana dashboards need some :heart:

We have that mapped out. We even have some new metrics being exposed.

By the way, there is an updated version of those dashboards here:

Thanks!

Hi @DudaNogueira,

We’ve been monitoring the instances closely over the past two weeks. Unfortunately, we haven’t observed any unusual metrics or signs pointing to the underlying issue. Specifically, during the moments before the instances stop accepting batch requests, all metrics appear completely normal. Even during these periods, nothing out of the ordinary stands out.

Do you have any additional recommendations on areas we could further investigate or specific metrics to focus on?

Thanks again for your support!

Hi @joris !!

Nothing in the logs either? That’s strange.

Just to confirm, you are now using batch/gRPC instead of directly sending through REST, right?

We have not yet tried gRPC instead of the REST API. Let me get back to you after we have tried gRPC.


@joris Also, is your client connecting to Weaviate through a reverse proxy?

We have seen situations where the reverse proxy was the one timing out, which would also explain Weaviate not showing any resource pressure while requests still time out :thinking:

No, our clients are not behind a reverse proxy.

We actually use the internal service URL directly, so “http://weaviate.[namespace]”, which is on the same cluster and nodes.

I have complete metrics by now. What should I look out for? I’m not too sure yet.

We just got another timeout (cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see libcurl - Error Codes) for http://weaviate.weaviate-a/v1/graphql)

When I look at the metrics at that time, there does seem to be a small spike in avg(sum by() (rate(vector_index_queue_delete_count[1m0s]))) at that same moment.

There is also a peak in “get_true_positive” and, of course, in query latency, which makes sense (the 90th percentile is around 50s). I also see a little peak in “Active Tombstones in HNSW index”.

Not sure what to do here.

It does seem to be related to deletes; I see some timeouts every time there are little peaks in deletes.

hi @Bjorn !

Do you see anything noteworthy in the logs from when that timeout occurred?

The fact that it is 30s suggests it is some default value.

Are you doing the query directly using cURL, or using our Python client to perform raw GraphQL queries?

We use GitHub - timkley/weaviate-php: A PHP client to communicate with a Weaviate instance.

Also possibly relevant is my thread on the slack: Slack

Hi!

Have you increased the timeout on the client?

We remembered that we had created that issue, and we increased the timeout there today. Funny how things go.

We also thought about how to minimize deletions by doing smarter updating of chunks: comparing chunks and deleting only the changed chunks while updating. This should also reduce the amount of deletions in the Weaviate instances.
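
Something like this rough sketch is what we have in mind (Python v4 client; the collection name "DocumentChunk" and the doc_id / chunk_hash properties are made up for illustration):

    import hashlib

    import weaviate
    from weaviate.classes.query import Filter

    def upsert_document_chunks(client: weaviate.WeaviateClient, doc_id: str, new_chunks: list[str]) -> None:
        """Delete only chunks whose content changed and insert only new ones."""
        collection = client.collections.get("DocumentChunk")  # placeholder collection name

        # Hash every new chunk so unchanged chunks can be recognized.
        new_hashes = {hashlib.sha256(c.encode()).hexdigest(): c for c in new_chunks}

        # Fetch the hashes of the chunks currently stored for this document
        # (assumes fewer than 10k chunks per document).
        existing = collection.query.fetch_objects(
            filters=Filter.by_property("doc_id").equal(doc_id),
            return_properties=["chunk_hash"],
            limit=10_000,
        )
        existing_hashes = {obj.properties["chunk_hash"] for obj in existing.objects}

        # Delete only the chunks that no longer exist in the new version of the document.
        stale = list(existing_hashes - new_hashes.keys())
        if stale:
            collection.data.delete_many(
                where=Filter.by_property("doc_id").equal(doc_id)
                & Filter.by_property("chunk_hash").contains_any(stale)
            )

        # Insert only the chunks that were not stored before.
        with collection.batch.dynamic() as batch:
            for chunk_hash, text in new_hashes.items():
                if chunk_hash not in existing_hashes:
                    batch.add_object(
                        properties={"doc_id": doc_id, "chunk_hash": chunk_hash, "text": text}
                    )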

I suspect two things. Storage is relatively slow (HA network storage), which doesn’t help. And we delete all chunks of a document while updating, which means peaks of deletes and inserts. We can reduce that.

We also lowered the amount of deletions per cycle to something very low to see if that helps.

We’ll get there, but I think there might also be room to make this process better in Weaviate. I’m not sure if you give time to other processes between deletions or if it is one big transaction (or something similar). Giving time to other processes while doing cleanup, by batching the deletions and doing other work first before starting the next batch in the cycle, could help. I must say I didn’t look in the source to know how the process works.

Also note that the mentioned PHP client uses GraphQL for querying (and REST for deleting), while our newer clients are now shifting towards gRPC.

This will bring much better performance in several aspects.

Having ASYNC_INDEXING enabled will also help, as the client can get the response faster and doesn’t need to wait for the indexing part of the ingestion.

When possible, it is better to update an object than to delete and re-insert it.
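
For example, with the Python v4 client a partial update keeps the existing object under its UUID instead of deleting and re-inserting it (the collection name and UUID are placeholders):

    import weaviate

    client = weaviate.connect_to_local()             # adjust connection details as needed
    documents = client.collections.get("Document")   # placeholder collection name

    # Partial update: only the listed properties change; the object is not deleted.
    documents.data.update(
        uuid="00000000-0000-0000-0000-000000000000",  # placeholder: UUID of the existing object
        properties={"body": "updated chunk text"},
    )

    client.close()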

Also feel free to join our weekly Community Office Hours:

Let me know if this helps!

Thanks!

We have async indexing enabled, yeah, but the reindex might run for a while and will overlap with a cleanup run.

We investigated gRPC, but the protocol is a bit of a hassle in PHP.

Regarding updating, I’ll reconsider that with my colleague and see if it is viable. I’m not sure if we can do that easily.

There are some env vars around TOMBSTONE cleanup:

You may tweak those values so your cleanup cycles don’t take over your CPU and affect other operations.
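
For reference, a rough sketch of what that could look like in the Helm values; the variable names are the ones I remember from the environment variables reference and the numbers are arbitrary examples, so double-check both against your version:

    env:
      # Bounds on how many tombstones are cleaned up per cycle.
      TOMBSTONE_DELETION_MIN_PER_CYCLE: "1000"
      TOMBSTONE_DELETION_MAX_PER_CYCLE: "10000000"
      # Cap on how many CPU cores the cleanup may use.
      TOMBSTONE_DELETION_CONCURRENCY: "2"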