Description
We are experiencing performance issues with a multi-node Weaviate deployment running in Kubernetes. The problems we are facing:
- Memory usage. Memory climbs steadily as more data is added to the cluster. It also jumps every time we perform a backup and never returns to its previous level (see the metrics-scraping sketch after this list for one way we watch this).
- Timeouts. We are getting 500 and 502 errors. We expose Weaviate through an ingress and have verified the timeouts are not occurring at the LB level.
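A minimal sketch of how the heap growth can be sampled via the Prometheus endpoint configured in the values further below (port 2112). This assumes the standard Go runtime metrics are exposed; the pod IP is a placeholder:

```python
# Minimal sketch: scrape the Weaviate metrics endpoint (port 2112, as
# configured in the Helm values below) and print the Go heap gauges.
# Assumes the standard Go runtime collectors; the pod IP is a placeholder.
import requests

METRICS_URL = "http://10.10.17.12:2112/metrics"  # placeholder pod IP
WATCHED = ("go_memstats_heap_inuse_bytes", "go_memstats_heap_idle_bytes")

for line in requests.get(METRICS_URL, timeout=10).text.splitlines():
    if line.startswith(WATCHED):
        name, value = line.rsplit(" ", 1)
        print(f"{name}: {float(value) / 1024**3:.2f} GiB")
```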
We have tried the following:
- Updated environment variables to optimise memory usage.
- Increased the LB and Weaviate timeouts from 60s to 600s (a client-side equivalent is sketched just below this list).
- Upgraded Weaviate to version 1.26.x.
- Refactored collections to use a better replication factor/sharding configuration.
- Increased Kubernetes node capacity and pod resource requests.
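For completeness, this is roughly how the 600s server-side timeouts can be mirrored on the client side with the Python v4 client; the hostnames are placeholders and the API key is redacted as in the values below:

```python
# Rough sketch (Python client v4): raise client-side timeouts to match the
# 600s server-side read/write timeouts. Hostnames are placeholders.
import weaviate
from weaviate.auth import AuthApiKey
from weaviate.classes.init import AdditionalConfig, Timeout

client = weaviate.connect_to_custom(
    http_host="weaviate.example.internal",       # placeholder ingress host
    http_port=443,
    http_secure=True,
    grpc_host="weaviate-grpc.example.internal",  # placeholder
    grpc_port=50051,
    grpc_secure=True,
    auth_credentials=AuthApiKey("xxxxxxxxxxxxxxxxxx"),  # redacted key
    additional_config=AdditionalConfig(
        timeout=Timeout(init=30, query=600, insert=600)
    ),
)
client.close()
```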
The logs are not very helpful. Sometimes we see "context deadline exceeded" errors, but other times there is nothing in the logs at all until we restart the pods.
When we restart the pods, memory drops for a while and things start working again. However, we no longer believe memory usage is the only root cause: sometimes the pods have plenty of memory headroom and the application still malfunctions.
Below you can see the values we are using to deploy the Helm chart.
Server Setup Information
- Weaviate Server Version: 1.25.0
- Deployment Method: Kubernetes - Helm chart (17.0.0)
- Multi Node? Number of Running Nodes: 3
- Client Language and Version: Python3
- Multitenancy?: No
Any additional Information
Values used to deploy Helm chart:
USER-SUPPLIED VALUES:
annotations:
ad.datadoghq.com/weaviate.checks: |-
{
"weaviate": {
"instances": [
{
"openmetrics_endpoint": "http://%%host%%:2112/metrics"
}
]
}
}
args:
- --host
- 0.0.0.0
- --port
- "8080"
- --scheme
- http
- --config-file
- /weaviate-config/conf.yaml
- --read-timeout=600s
- --write-timeout=600s
authentication:
anonymous_access:
enabled: false
apikey:
allowed_keys:
- xxxxxxxxxxxxxxxxxx
enabled: true
users:
- api-key-user-admin
authorization:
admin_list:
enabled: true
users:
- api-key-user-admin
backups:
s3:
enabled: true
envconfig:
AWS_REGION: us-east-1
BACKUP_S3_BUCKET: weaviate-backup-platform-platform-prod-us-east-1
debug: false
env:
LIMIT_RESOURCES: true
#LOG_LEVEL: trace
PERSISTENCE_LSM_ACCESS_STRATEGY: pread
PROMETHEUS_MONITORING_ENABLED: true
PROMETHEUS_MONITORING_GROUP: false
QUERY_SLOW_LOG_ENABLED: true
modules:
generative-cohere:
enabled: true
generative-openai:
enabled: true
generative-palm:
enabled: true
qna-openai:
enabled: true
ref2vec-centroid:
enabled: true
reranker-cohere:
enabled: true
text2vec-cohere:
enabled: true
text2vec-huggingface:
enabled: true
text2vec-openai:
enabled: true
text2vec-palm:
enabled: true
query_defaults:
limit: 100
replicas: 3
resources:
requests:
cpu: 100m
memory: 150Gi
service:
name: weaviate
ports:
- name: https
port: 443
protocol: TCP
type: ClusterIP
serviceAccountName: weaviate-sa
storage:
size: 300Gi
storageClassName: weaviate-sc-platform
All collections were originally created with a replication factor of 3 and no sharding configuration. We are now refactoring all of them to optimise the setup: recreating the collections in a new cluster with the right configuration and re-importing the data.
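As an illustration of the kind of definition we are moving to (not our exact schema), here is a minimal Python v4 sketch with an explicit replication factor and shard count; the collection name and numbers are placeholders:

```python
# Minimal sketch (Python client v4): recreate a collection with an explicit
# replication factor and shard count instead of the old factor-3/no-sharding
# setup. The name and the numbers are illustrative only.
import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # connection details omitted here

client.collections.create(
    "ArticleCopy",  # placeholder collection name
    replication_config=Configure.replication(factor=2),
    sharding_config=Configure.sharding(desired_count=3),
)
client.close()
```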
In the image below you can see a backup being performed around 2am and memory increasing. It never goes down again until we restart the pods at around 13:00.
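For context, a backup like the 2am one can be triggered roughly as follows with the Python v4 client; the backup_id is illustrative:

```python
# Rough sketch of a backup trigger (Python client v4). The backup_id is
# illustrative; "s3" matches the backups.s3 module enabled in the values.
import weaviate

client = weaviate.connect_to_local()  # connection details omitted here
result = client.backup.create(
    backup_id="nightly-backup",  # placeholder id
    backend="s3",
    wait_for_completion=True,
)
print(result)
client.close()
```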
Log error examples:
Get "http://10.10.17.12:7001/replicas/indices/collection_name/shards/oc2DaAtccNfT/objects/613aa6d3-b328-4221-8754-ee5fa446c5be?schema_version=0": context canceled
Post "http://10.10.21.74:7001/replicas/indices/collection_name/shards/oc2DaAtccNfT:commit?request_id=weaviate-1-64-191e683965f-3ee": context deadline exceeded