Hello! I'm currently running Weaviate 1.26.14 (upgraded from 1.25.27) and have run into several problems:
Information
- We have a 3-node cluster with 46 million objects on each node.
- Each node is deployed on k8s with a CPU request of 30 and a CPU limit of 40, and 500GiB for both the RAM request and limit.
- Environment configs:
ASYNC_BRUTE_FORCE_SEARCH_LIMIT=1000
ASYNC_INDEXING=true
DISABLE_LAZY_LOAD_SHARDS=true
DISABLE_TELEMETRY=true
FORCE_FULL_REPLICAS_SEARCH=false
GOGC=85
GOMAXPROCS=39
HNSW_STARTUP_WAIT_FOR_VECTOR_CACHE=true
LIMIT_RESOURCES=true
PERSISTENCE_HNSW_MAX_LOG_SIZE=8GiB
TOMBSTONE_DELETION_CONCURRENCY=4
- Schema config
{
"class": "TEST",
"invertedIndexConfig": {
"bm25": {
"b": 0.75,
"k1": 1.2
},
"cleanupIntervalSeconds": 60,
"indexTimestamps": true,
"stopwords": {
"additions": null,
"preset": "en",
"removals": null
}
},
"multiTenancyConfig": {
"autoTenantActivation": false,
"autoTenantCreation": false,
"enabled": false
},
"properties": [
// cannot show
],
"replicationConfig": {
"asyncEnabled": true,
"deletionStrategy": "NoAutomatedResolution",
"factor": 3
},
"shardingConfig": {
"actualCount": 3,
"actualVirtualCount": 384,
"desiredCount": 3,
"desiredVirtualCount": 384,
"function": "murmur3",
"key": "_id",
"strategy": "hash",
"virtualPerPhysical": 128
},
"vectorIndexConfig": {
"bq": {
"enabled": false
},
"cleanupIntervalSeconds": 300,
"distance": "cosine",
"dynamicEfFactor": 8,
"dynamicEfMax": 500,
"dynamicEfMin": 100,
"ef": 640,
"efConstruction": 640,
"flatSearchCutoff": 40000,
"maxConnections": 64,
"pq": {
"bitCompression": false,
"centroids": 256,
"enabled": false,
"encoder": {
"distribution": "log-normal",
"type": "kmeans"
},
"segments": 0,
"trainingLimit": 100000
},
"skip": false,
"sq": {
"enabled": false,
"rescoreLimit": 20,
"trainingLimit": 100000
},
"vectorCacheMaxObjects": 1000000000000
}
}
- Vector dimensions: 512
- Slow async replication?
I don't know how fast async replication is supposed to be, but it seems really slow: within an hour it replicated about 50 objects (or possibly fewer, because another team may also be using the cluster). The delta was about 90 objects.
Then I decided this was because the difference was so small, and took the following steps:
- deleted some objects
- scaled the cluster down to 2 nodes
- imported some data
- scaled the 3rd node back up
The delta became ~700k, but async replication was still really slow (the same 1-2 objects every minute or two).
Then we had some problems with disk space, so we wiped one node and returned it to the cluster (effectively like adding a brand new node to the cluster).
The object counts looked like this:
weaviate-0 - 0
weaviate-1 - 46900k
weaviate-2 - 46200k
Even after that, async replication is still really slow: within 2 hours it replicated only 200k objects.
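To put that rate in perspective, here is a rough back-of-the-envelope sketch using only the figures quoted above. It assumes the node counts were taken at the start of the 2-hour window and that the rate stays constant, which may not hold in practice. At roughly 28 objects per second, the wiped node would need about 469 hours (roughly 19.5 days) to catch up.

```python
# Figures quoted in the post; everything else is derived arithmetic.
node_counts = {
    "weaviate-0": 0,
    "weaviate-1": 46_900_000,
    "weaviate-2": 46_200_000,
}
replicated_objects = 200_000  # objects replicated during the observed window
window_hours = 2

rate_per_sec = replicated_objects / (window_hours * 3600)


def eta_hours(delta: int, rate: float) -> float:
    """Hours needed to replicate `delta` objects at `rate` objects per second."""
    return delta / rate / 3600


# Assumption: the fullest node is the state every other node must reach.
target = max(node_counts.values())

print(f"observed rate: {rate_per_sec:.1f} objects/s")
for node, count in node_counts.items():
    delta = target - count
    hours = eta_hours(delta, rate_per_sec)
    print(f"{node}: delta={delta:,} objects, ETA ~{hours:,.0f} h ({hours / 24:,.1f} days)")
```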
Since re-adding that node as a "new" one, I also have the next problem:
- Periodic timeouts
We use the Python weaviate-client 4.10.4.
Batch import used to work correctly, but after the steps I mentioned I started receiving timeout errors (interestingly, only after ~50k objects have been inserted):
{'message': 'Failed to send 1678 objects in a batch of 5000. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
{'message': 'Failed to send 1722 in a batch of 5000', 'errors': {'addr-0:7001: connect: Post "http://addr-0:7001/replicas/indices/TEST/shards/xxxx:commit?request_id=weaviate-2-64-195384f4b71-1fa": context deadline exceeded'}}
{'message': 'Failed to send 1722 objects in a batch of 5000. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
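As a stopgap while the replica is slow to commit, the failed objects could be re-sent in smaller chunks with a backoff between attempts. The following is only a minimal sketch of that idea, not the client's own retry mechanism: `import_with_retry` and `send_chunk` are hypothetical names, and `send_chunk` stands for any function that sends a list of objects and returns the ones that failed (for example, a wrapper that runs a batch and then reads `collection.batch.failed_objects`). It also assumes every object carries its own UUID, so that a re-send overwrites the object rather than duplicating it.

```python
import time
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def import_with_retry(
    objects: Sequence[T],
    send_chunk: Callable[[List[T]], List[T]],  # hypothetical: returns the failed objects
    chunk_size: int = 1000,
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[T]:
    """Send objects in chunks and re-send failures with exponential backoff.

    Returns the objects that still failed after `max_attempts` passes.
    """
    pending = list(objects)
    for attempt in range(max_attempts):
        if not pending:
            break
        if attempt > 0:
            # Wait 1s, 2s, 4s, ... before each retry pass.
            sleep(base_delay * 2 ** (attempt - 1))
        failed: List[T] = []
        for start in range(0, len(pending), chunk_size):
            failed.extend(send_chunk(pending[start:start + chunk_size]))
        pending = failed
    return pending
```

Smaller chunks keep each replica commit short, which may help it finish before the deadline, but this only works around the symptom; it does not explain why the commit to `addr-0:7001` times out.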
- Strange logs
I also get really strange logs on weaviate-1 (the leader for now) about async replication:
hashbeat iteration failed: collecting differences: "addr-0:7001": connect: Post "http://addr:7001/replicas/indices/TEST/shards/xxxx/objects/hashtree/0?schema_version=0": context deadline exceeded
I thought it was a network problem, but when I pinged the node from a different host I had no problems:
64 bytes from addr-0: seq=207 ttl=63 time=0.150 ms
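Ping only exercises ICMP, while the failing requests are HTTP POSTs to port 7001, so a closer check might be to time a plain TCP connection to that port, ideally from inside another Weaviate pod. A minimal sketch follows; `tcp_connect_time` is a hypothetical helper, and the host and port are the ones from the error messages above.

```python
import socket
import time
from typing import Optional


def tcp_connect_time(host: str, port: int, timeout: float = 3.0) -> Optional[float]:
    """Return the seconds taken to open a TCP connection, or None on failure or timeout."""
    start = time.monotonic()
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections, timeouts, and DNS errors
        return None


# Example call (host and port taken from the logs above):
#   elapsed = tcp_connect_time("addr-0", 7001)
#   print("unreachable" if elapsed is None else f"connected in {elapsed * 1000:.1f} ms")
```

A fast connect here would only show that the port accepts connections; it would not prove that the replication endpoint itself answers within the deadline, so slow responses from a busy node would still look like the timeouts above.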
So, what would you suggest to solve this?
Thank you in advance