Write timeout in combination with replicas

Description

We’re experiencing intermittent timeout issues with Weaviate, occurring at least twice daily. During these events, adding documents via both /v1/batch/objects and /v1/objects endpoints becomes impossible. Increasing the timeout duration does not resolve the issue; requests hang indefinitely.

Interestingly, during these periods, search functionality remains unaffected and continues to perform well, indicating that vector creation itself is operational.

We initially encountered this behavior on version v1.29, testing with both asynchronous and synchronous replication modes. After downgrading back to v1.28, the issue persisted unchanged. Restarting the instance temporarily resolves the issue.

Our suspicion is that this relates to replication mechanisms, potentially involving write locks on indexes. A separate cluster configuration with three shards and no replication does not exhibit this problem and operates normally.

Server Setup Information

  • Weaviate Server Version: both v1.28.11 and v1.29.0
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 6 (3 shards, replicated once)
  • Client Language and Version: PHP using API
  • Multitenancy?: nope

Any additional Information

The logs look normal, no errors whatsoever. We have tried to debug this but without any success.

Does anyone have any ideas?

hi @joris !!

Welcome back :slight_smile:

Some questions:

Are you using ASYNC_INDEXING?

Can you use gRPC? You are ingesting using PHP, meaning REST and not gRPC. This is certainly not optimal and can be the bottleneck here. We did a lot of improvements in gRPC, especially for ingestion.

Can you try pushing some dataset using, for example, Python and batch?
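
Something like this rough sketch with the Python v4 client, for example; the collection name "Document", the service hostnames and the ports are placeholders you would need to adjust to your cluster:

    import weaviate

    # Connect over HTTP + gRPC; hostnames and ports below are placeholders.
    client = weaviate.connect_to_custom(
        http_host="weaviate.weaviate-a",
        http_port=80,
        http_secure=False,
        grpc_host="weaviate-grpc.weaviate-a",  # assumed name of the gRPC service
        grpc_port=50051,
        grpc_secure=False,
    )

    try:
        documents = client.collections.get("Document")  # placeholder collection name
        # Dynamic batching sends the objects over gRPC and adapts the batch size automatically.
        with documents.batch.dynamic() as batch:
            for i in range(10_000):
                batch.add_object(properties={"title": f"doc {i}", "body": "..."})
        # Check for per-object errors once the batch has finished.
        if documents.batch.failed_objects:
            print(f"{len(documents.batch.failed_objects)} objects failed")
    finally:
        client.close()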

Do you have any readings on objects_durations_ms or batch_duration_ms? Monitoring | Weaviate

Let me know if that helps!

Hi @DudaNogueira ,

Thanks for the reply.

We are indeed using Async indexing. I’ll try to use gRPC the next time we encounter this issue.

We do not have the instance hooked up to a monitoring system. That’s a great suggestion and I’ll look into this.

To be continued!


Great!!! Let us know about your findings :slight_smile:

Thanks!

I’ve tried to set up the monitoring. It was kind of a hassle, since I needed to add all instances manually to the scrape config because we cannot use ServiceMonitors.

The example dashboards needed some adjustments to make sure we use the right data source. Unfortunately, your dashboards don’t have a data source selection dropdown.

I’m now seeing some data, but there is a LOT of n/a n/a in the different lists, which makes me wonder if I configured them correctly.


    - job_name: weaviate
      scrape_interval: 2s
      static_configs:
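        # One target per pod since ServiceMonitors are not available in this setup;
        # :2112 is Weaviate's default metrics port (requires PROMETHEUS_MONITORING_ENABLED=true on the pods).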
        - targets:
          - weaviate-0.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-1.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-2.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-3.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-4.weaviate-headless.weaviate-a.svc.cluster.local:2112
          - weaviate-5.weaviate-headless.weaviate-a.svc.cluster.local:2112

hi @Bjorn !!

Welcome to our community :hugs:

Indeed, our Grafana dashboards need some :heart:

We have that mapped out. We even have some new metrics being exposed.

By the way, there is an updated version of those dashboards here:

Thanks!

Hi @DudaNogueira,

We’ve been monitoring the instances closely over the past two weeks. Unfortunately, we haven’t observed any unusual metrics or signs pointing to the underlying issue. Specifically, during the moments before the instances stop accepting batch requests, all metrics appear completely normal. Even during these periods, nothing out of the ordinary stands out.

Do you have any additional recommendations on areas we could further investigate or specific metrics to focus on?

Thanks again for your support!

Hi @joris !!

Nothing in the logs either? That’s strange.

Just to confirm, you are now using batch/gRPC instead of directly sending through REST, right?

We have not yet tried gRPC instead of the REST API. Let me get back to you after we have tried gRPC.


@joris Also, is your client connecting to Weaviate through a reverse proxy?

We have seen situations where the reverse proxy was the one timing out, which would also explain Weaviate not showing any resource pressure while requests still time out :thinking:

No, our clients are not behind a reverse proxy.

We actually use the internal service URL directly, so “http://weaviate.[namespace]”, which is on the same cluster and nodes.

I have complete metrics by now. What should I look out for? I’m not too sure yet.

We just got another timeout (cURL error 28: Operation timed out after 30001 milliseconds with 0 bytes received (see libcurl - Error Codes) for http://weaviate.weaviate-a/v1/graphql)

When I look at the metrics at that time, there does seem to be a small spike in avg(sum by() (rate(vector_index_queue_delete_count[1m0s]))) at that same moment.

There is also a peak in “get_true_positive” and, of course, in query latency, which makes sense (the 90th percentile is around 50s). I also see a little peak in “Active Tombstones in HNSW index”.

Not sure what to do here.

It does seem to be related to deletes; I see some timeouts every time there are little peaks in deletes.

hi @Bjorn !

Do you see anything noteworthy in the logs from when that timeout occurred?

The fact that it is 30s suggests it is some default value.

Are you doing the query directly using cURL, or using our Python client to perform raw GraphQL queries?

We use GitHub - timkley/weaviate-php: A PHP client to communicate with a Weaviate instance.

Also possibly relevant is my thread on the slack: Slack

Hi!

Have you increased the timeout on the client?

We remembered that we had created that issue, and we increased the timeout there today. Funny how things go.

We also thought about how to minimize deletions by doing smarter updating of chunks: comparing chunks and deleting only the changed chunks while updating. This should also reduce the amount of deletions in the Weaviate instances.
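
Something like this rough sketch is what we have in mind (Python v4 client; the collection name "DocumentChunk" and the doc_id / chunk_hash properties are made up for illustration):

    import hashlib

    import weaviate
    from weaviate.classes.query import Filter

    def upsert_document_chunks(client: weaviate.WeaviateClient, doc_id: str, new_chunks: list[str]) -> None:
        """Delete only chunks whose content changed and insert only new ones."""
        collection = client.collections.get("DocumentChunk")  # placeholder collection name

        # Hash every new chunk so unchanged chunks can be recognized.
        new_hashes = {hashlib.sha256(c.encode()).hexdigest(): c for c in new_chunks}

        # Fetch the hashes of the chunks currently stored for this document
        # (assumes fewer than 10k chunks per document).
        existing = collection.query.fetch_objects(
            filters=Filter.by_property("doc_id").equal(doc_id),
            return_properties=["chunk_hash"],
            limit=10_000,
        )
        existing_hashes = {obj.properties["chunk_hash"] for obj in existing.objects}

        # Delete only the chunks that no longer exist in the new version of the document.
        stale = list(existing_hashes - new_hashes.keys())
        if stale:
            collection.data.delete_many(
                where=Filter.by_property("doc_id").equal(doc_id)
                & Filter.by_property("chunk_hash").contains_any(stale)
            )

        # Insert only the chunks that were not stored before.
        with collection.batch.dynamic() as batch:
            for chunk_hash, text in new_hashes.items():
                if chunk_hash not in existing_hashes:
                    batch.add_object(
                        properties={"doc_id": doc_id, "chunk_hash": chunk_hash, "text": text}
                    )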

I suspect two things. Storage is relatively slow (HA network storage), which doesn’t help. And we delete all chunks of a document while updating, which means peaks of deletes and inserts. We can reduce that.

We also lowered the amount of deletions per cycle to something very low to see if that helps.

We’ll get there, but I think there might also be room to make this process better in Weaviate. I’m not sure if you give time to other processes between deletions or if it is one big transaction (or something similar). Giving time to other processes while doing cleanup, by batching the deletions and doing other work first before starting the next batch in the cycle, could help. I must say I didn’t look in the source to know how the process works.

Also note that the mentioned PHP client uses GraphQL for querying (and REST for deleting), while our newer clients are now shifting towards gRPC.

This will bring much better performance in several aspects.

Having ASYNC_INDEXING enabled will also help, as the client can get the response faster and doesn’t need to wait for the indexing part of the ingestion.

When possible, it is better to update an object than to delete and re-insert it.
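
For example, with the Python v4 client a partial update keeps the existing object under its UUID instead of deleting and re-inserting it (the collection name and UUID are placeholders):

    import weaviate

    client = weaviate.connect_to_local()             # adjust connection details as needed
    documents = client.collections.get("Document")   # placeholder collection name

    # Partial update: only the listed properties change; the object is not deleted.
    documents.data.update(
        uuid="00000000-0000-0000-0000-000000000000",  # placeholder: UUID of the existing object
        properties={"body": "updated chunk text"},
    )

    client.close()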

Also feel free to join our weekly Community Office Hours:

Let me know if this helps!

Thanks!

We have async indexing enabled, yeah, but the reindex might run for a while and will overlap with a cleanup run.

We investigated gRPC, but the protocol is a bit of a hassle in PHP.

Regarding updating, I’ll reconsider that with my colleague and see if it is viable. I’m not sure if we can do that easily.

There are some env vars around TOMBSTONE cleanup:

You may tweak those values so your cleanup cycles don’t take over your CPU and affect other operations.
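
For reference, a rough sketch of what that could look like in the Helm values; the variable names are the ones I remember from the environment variables reference and the numbers are arbitrary examples, so double-check both against your version:

    env:
      # Bounds on how many tombstones are cleaned up per cycle.
      TOMBSTONE_DELETION_MIN_PER_CYCLE: "1000"
      TOMBSTONE_DELETION_MAX_PER_CYCLE: "10000000"
      # Cap on how many CPU cores the cleanup may use.
      TOMBSTONE_DELETION_CONCURRENCY: "2"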