[QUESTION] Async replication hashbeat fails with context deadline timeout

Hello! I’m currently working with Weaviate 1.26.14 (I’ve updated from 1.25.27) and have a bunch of problems:

Information

  • We have a 3-node cluster with 46 million objects on each node.
  • Each node is deployed on Kubernetes with a CPU request of 30 and limit of 40, and a RAM request and limit of 500GiB each.
  • Environment configs:
ASYNC_BRUTE_FORCE_SEARCH_LIMIT=1000
ASYNC_INDEXING=true
DISABLE_LAZY_LOAD_SHARDS=true
DISABLE_TELEMETRY=true
FORCE_FULL_REPLICAS_SEARCH=false
GOGC=85
GOMAXPROCS=39
HNSW_STARTUP_WAIT_FOR_VECTOR_CACHE=true
LIMIT_RESOURCES=true
PERSISTENCE_HNSW_MAX_LOG_SIZE=8GiB
TOMBSTONE_DELETION_CONCURRENCY=4
  • Schema config
{
  "class": "TEST",
  "invertedIndexConfig": {
    "bm25": {
      "b": 0.75,
      "k1": 1.2
    },
    "cleanupIntervalSeconds": 60,
    "indexTimestamps": true,
    "stopwords": {
      "additions": null,
      "preset": "en",
      "removals": null
    }
  },
  "multiTenancyConfig": {
    "autoTenantActivation": false,
    "autoTenantCreation": false,
    "enabled": false
  },
  "properties": [
// cannot show
  ],
  "replicationConfig": {
    "asyncEnabled": true,
    "deletionStrategy": "NoAutomatedResolution",
    "factor": 3
  },
  "shardingConfig": {
    "actualCount": 3,
    "actualVirtualCount": 384,
    "desiredCount": 3,
    "desiredVirtualCount": 384,
    "function": "murmur3",
    "key": "_id",
    "strategy": "hash",
    "virtualPerPhysical": 128
  },
  "vectorIndexConfig": {
    "bq": {
      "enabled": false
    },
    "cleanupIntervalSeconds": 300,
    "distance": "cosine",
    "dynamicEfFactor": 8,
    "dynamicEfMax": 500,
    "dynamicEfMin": 100,
    "ef": 640,
    "efConstruction": 640,
    "flatSearchCutoff": 40000,
    "maxConnections": 64,
    "pq": {
      "bitCompression": false,
      "centroids": 256,
      "enabled": false,
      "encoder": {
        "distribution": "log-normal",
        "type": "kmeans"
      },
      "segments": 0,
      "trainingLimit": 100000
    },
    "skip": false,
    "sq": {
      "enabled": false,
      "rescoreLimit": 20,
      "trainingLimit": 100000
    },
    "vectorCacheMaxObjects": 1000000000000
  }
}
  • Vector dimensions - 512
  1. Slow async replication?

I don’t know how fast async replication should be, but it seems really slow: within an hour it replicated about 50 objects (or maybe fewer, because another team may also be writing to the cluster). The delta was about 90 objects.

Then I decided this was because the difference was so small, so I took the following steps:

  1. deleted some objects
  2. downscaled the cluster to 2 nodes
  3. imported some data
  4. scaled the 3rd node back up

The delta became ~700k, but async replication was still really slow (the same 1-2 objects every minute or two).

Then we had some disk space problems, so we wiped one node and returned it to the cluster (effectively adding a brand-new node to the cluster).
The object counts were:

weaviate-0 - 0
weaviate-1 - 46900k
weaviate-2 - 46200k

Even after that, async replication is really slow: within 2 hours it replicated only 200k objects.

And since ‘adding the new’ node I have the next problem:

  2. Periodic timeouts

We use the Python weaviate-client 4.10.4.
Batch import used to work correctly, but after the steps I mentioned I started receiving timeout errors (interestingly, only after ~50k objects had been inserted):

{'message': 'Failed to send 1678 objects in a batch of 5000. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 1722 in a batch of 5000', 'errors': {'addr-0:7001: connect: Post "http://addr-0:7001/replicas/indices/TEST/shards/xxxx:commit?request_id=weaviate-2-64-195384f4b71-1fa": context deadline exceeded'}}
{'message': 'Failed to send 1722 objects in a batch of 5000. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
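To see which errors dominate among the failed objects, they can be grouped by message. This is only a minimal sketch: it assumes each entry in collection.batch.failed_objects exposes the error text as a `message` attribute, and the sample data below is a stand-in, not real output.

```python
from collections import Counter
from types import SimpleNamespace


def summarise_failures(failed_objects, max_len=120):
    """Count failed batch objects by (truncated) error message."""
    counts = Counter()
    for failed in failed_objects:
        # Assumption: `message` holds the server-side error text.
        message = str(getattr(failed, "message", failed))
        counts[message[:max_len]] += 1
    return counts.most_common()


# Stand-in data; in real use pass collection.batch.failed_objects instead.
sample = [
    SimpleNamespace(message="context deadline exceeded"),
    SimpleNamespace(message="context deadline exceeded"),
    SimpleNamespace(message="some other error"),
]
for message, count in summarise_failures(sample):
    print(f"{count:>6}  {message}")
```

If nearly all failures share the same "context deadline exceeded" text and the same peer address, that points at a single replica rather than at the client.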
  3. Strange logs

I also get really strange logs about async replication on weaviate-1 (the leader for now):

hashbeat iteration failed: collecting differences: "addr-0:7001": connect: Post "http://addr:7001/replicas/indices/TEST/shards/xxxx/objects/hashtree/0?schema_version=0": context deadline exceeded

I thought it was a network problem, but when I pinged the node from a different host I had no problems:

64 bytes from addr-0: seq=207 ttl=63 time=0.150 ms
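Note that ping only exercises ICMP, while the failing calls in the logs are HTTP requests to port 7001. A rough sketch of a check that is one step closer to the failing path is to time a plain TCP connect to that port. The helper name is hypothetical, and the host and port in the commented example are simply the ones that appear in the log lines.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Return seconds taken to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return None


# Example (host and port taken from the log lines in this post):
#   print(tcp_connect_time("addr-0", 7001))
```

A fast connect still does not prove that the node answers the request within the deadline. If the port is reachable but requests time out, the node is more likely busy than unreachable.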

So, what would you suggest to solve this?

Thank you in advance

Hello @d_khlebokazov, thanks for reaching out.
I encourage you to upgrade to the latest release, 1.29.0, to get much better results with async replication. Several settings can now be adjusted to optimize it for your use case (Replication | Weaviate), and performance improvements, among others, have landed in that release.

If upgrading is an impediment for you please let me know.

Thanks again,
Jeronimo

Hello @jeronimo_irazabal! Thank you for the quick response.

I will think about upgrading to 1.29, but right now I also cannot change my TEST schema. The Python library just raises a timeout error.

collection.config.update(replication_config=weaviate.classes.config.Reconfigure.replication(
        async_enabled=False))

Output -----------------
weaviate.exceptions.WeaviateTimeoutError: The request to Weaviate timed out while awaiting a response. Try adjusting the timeout config for your client. Details: Collection configuration may not have been updated.
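For reference, the timeout config that the error message mentions is set when the client is created. Below is a minimal sketch, assuming the v4 client's `AdditionalConfig` and `Timeout` classes from `weaviate.classes.init`; the values are illustrative and `connect_to_local()` is only a placeholder for whichever connect helper matches the deployment.

```python
# Timeouts in seconds; the values are illustrative, not recommendations.
TIMEOUTS = {"init": 30, "query": 60, "insert": 300}


def connect_with_longer_timeouts():
    """Open a client whose timeouts are raised above the defaults."""
    # Imported lazily so the sketch can be loaded without the package installed.
    import weaviate
    from weaviate.classes.init import AdditionalConfig, Timeout

    # Placeholder: use the connect_to_* helper that matches your deployment.
    return weaviate.connect_to_local(
        additional_config=AdditionalConfig(timeout=Timeout(**TIMEOUTS)),
    )
```

A longer client timeout only gives the server more time to answer. It would not by itself fix a schema update that never completes on the cluster side.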

This is kind of annoying because I wanted to try to turn off the async replication.

Also a small update about previous problems:

Even after 12 hours of waiting for async replication the situation looks like this:

weaviate-0 - 2373977
weaviate-1 - 46966225
weaviate-2 - 46793858

What should the async replication speed be?
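As a rough back-of-the-envelope estimate, the counts above can be turned into a rate and a time to catch up. The sketch assumes that weaviate-0 started from 0 objects 12 hours earlier, that the rate stays constant, and that weaviate-1 stops growing; none of these is guaranteed.

```python
def replication_eta(replicated, target, elapsed_hours):
    """Return (objects per second, hours remaining) at a constant rate."""
    rate_per_second = replicated / (elapsed_hours * 3600)
    remaining = target - replicated
    hours_left = remaining / (rate_per_second * 3600)
    return rate_per_second, hours_left


# Counts from this post: weaviate-0 vs weaviate-1 after 12 hours.
rate, hours_left = replication_eta(2_373_977, 46_966_225, 12)
print(f"{rate:.0f} objects/s, about {hours_left / 24:.1f} days to catch up")
# → 55 objects/s, about 9.4 days to catch up
```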

And I still have these logs:

hashbeat iteration failed: collecting differences: "addr-0:7001": connect: Post "http://addr-0:7001/replicas/indices/TEST/shards/xxxxx/objects/hashtree/0?schema_version=0": context deadline exceeded

It seems to be a problem with the nodes themselves, because the hosts can easily communicate with each other.

Maybe some Weaviate cluster or schema config must be changed to solve this?

It seems to be similar to Async_replication context deadline exceeded, unable to Activate Tenant

Was it possible to find a solution? @jeronimo_irazabal

@jeronimo_irazabal
New update:
I restarted the cluster, and async_enabled was set to false.

I changed it back to async_enabled: true and again got timeouts from the Python library:

httpcore.ReadTimeout
The above exception was the direct cause of the following exception:

httpx.ReadTimeout

The above exception was the direct cause of the following exception:
File "C:\Users\khlebokazov\Desktop\weaviate-tools\benchmark\schemas\base.py", line 155, in update_schema
    collection.config.update(
  File "C:\Users\khlebokazov\Desktop\weaviate-tools\venv\Lib\site-packages\weaviate\syncify.py", line 23, in sync_method
    return _EventLoopSingleton.get_instance().run_until_complete(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\khlebokazov\Desktop\weaviate-tools\venv\Lib\site-packages\weaviate\event_loop.py", line 42, in run_until_complete
    return fut.result()
           ^^^^^^^^^^^^
  File "C:\Users\khlebokazov\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "C:\Users\khlebokazov\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "C:\Users\khlebokazov\Desktop\weaviate-tools\venv\Lib\site-packages\weaviate\collections\config\config.py", line 161, in update
    await self._connection.put(
  File "C:\Users\khlebokazov\Desktop\weaviate-tools\venv\Lib\site-packages\weaviate\connect\v4.py", line 552, in put
    return await self.__send(
           ^^^^^^^^^^^^^^^^^^
  File "C:\Users\khlebokazov\Desktop\weaviate-tools\venv\Lib\site-packages\weaviate\connect\v4.py", line 487, in __send
    raise WeaviateTimeoutError(error_msg) from read_err
weaviate.exceptions.WeaviateTimeoutError: The request to Weaviate timed out while awaiting a response. Try adjusting the timeout config for your client. Details: Collection configuration may not have been updated.

The most interesting thing is that all GET requests work fine.

Another update:

We have a proxy service built with FastAPI that communicates with Weaviate, and now, because of async replication, it is impossible to create an object:

  • CREATE:
created_uuid = await collection.data.insert(
                properties=props,
                uuid=vector_id,
                vector=embedding,
            )
"Object was not added! Unexpected status code: 500, with response body: {'error': [{'message': 'cannot achieve consistency level \"QUORUM\": read repair error: conflict: object has been deleted on another replica'}]}.

But it works when I use a different uuid (so maybe this is an inconsistency error).

  • DELETE times out after 10s
  • GET works fine
  • fetch_objects works fine
  • near_vector became awfully slow: 800ms instead of 20ms

I also have this error in the logs:

error waiting for local schema to catch up to version 47: deadline exceeded for waiting for update: version got=46  want=47

and a lot of "wait for update version" messages, so it seems that the schema cannot be updated on weaviate-0 :(
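To get an overview of how often these messages occur and which peer they point at, the log lines can be scanned with a couple of regular expressions. This is a sketch only: the patterns are derived from the two log lines quoted in this thread and may not match other Weaviate versions or log formats.

```python
import re
from collections import Counter

SCHEMA_LAG = re.compile(r"version got=(\d+)\s+want=(\d+)")
HASHBEAT_FAIL = re.compile(
    r'hashbeat iteration failed: collecting differences: "([^"]+)"'
)


def triage(lines):
    """Return (largest schema version lag, Counter of failing hashbeat peers)."""
    max_lag = 0
    peers = Counter()
    for line in lines:
        if m := SCHEMA_LAG.search(line):
            got, want = int(m.group(1)), int(m.group(2))
            max_lag = max(max_lag, want - got)
        if m := HASHBEAT_FAIL.search(line):
            peers[m.group(1)] += 1
    return max_lag, peers


# The two log lines quoted in this thread.
sample = [
    "error waiting for local schema to catch up to version 47: deadline "
    "exceeded for waiting for update: version got=46  want=47",
    'hashbeat iteration failed: collecting differences: "addr-0:7001": '
    "connect: Post ...: context deadline exceeded",
]
print(triage(sample))
```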

Hello @d_khlebokazov, if your objective is to use async replication, then upgrading to 1.29 is highly recommended.

I’d only suggest using async replication in 1.26 if you really need it and cannot upgrade.

With regard to performance, in 1.29 it was possible to repair 1 million objects in less than 15 minutes (with a single propagation thread; this can be customized).

To automatically resolve the situation where an object was deleted on a subset of the nodes but not on all of them, you need to specify a deletion strategy:

Consistency | Weaviate.

This is not specific to async replication; it also applies to read repair (when nodes try to get in sync while resolving a query).
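As an illustration, the change amounts to one field in the class's replicationConfig. The sketch below only builds the updated schema dictionary; how it is sent to the server (Python client or REST) is left out. The helper name is hypothetical, and "DeleteOnConflict" is, to my knowledge, the value that enables automatic resolution, so please check the Consistency page for the values your version accepts.

```python
import copy

# Assumption: values accepted for replicationConfig.deletionStrategy.
# Verify against the documentation for your server version.
KNOWN_STRATEGIES = {"NoAutomatedResolution", "DeleteOnConflict"}


def with_deletion_strategy(class_schema, strategy="DeleteOnConflict"):
    """Return a copy of the class schema with the deletion strategy changed."""
    if strategy not in KNOWN_STRATEGIES:
        raise ValueError(f"unexpected deletion strategy: {strategy}")
    updated = copy.deepcopy(class_schema)
    updated.setdefault("replicationConfig", {})["deletionStrategy"] = strategy
    return updated


# replicationConfig as posted earlier in this thread.
current = {
    "class": "TEST",
    "replicationConfig": {
        "asyncEnabled": True,
        "deletionStrategy": "NoAutomatedResolution",
        "factor": 3,
    },
}
print(with_deletion_strategy(current)["replicationConfig"])
```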