Support needed for fixing Weaviate performance issues

Description

We are experiencing performance issues with a Weaviate multi-node deployment running in Kubernetes. Problems we are facing:

  • Memory usage. Memory keeps climbing as more data is added to the cluster. It also increases every time we perform a backup and never returns to the previous level.
  • Timeouts. We are getting 500 and 502 errors. We expose Weaviate through an ingress and have verified the timeouts are not occurring at the LB level.

We have tried the following:

  • Updated environment variables to optimise memory usage.
  • Updated the LB and Weaviate timeouts from 60 s to 600 s (a client-side timeout sketch follows this list).
  • Upgraded Weaviate to version 1.26.x.
  • Refactored collections to use a better replication factor/sharding configuration.
  • Increased Kubernetes node capacity and pod resource requests.
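
Independently of the server and LB settings, the Python client has its own timeouts that can be raised as well. A minimal sketch with the Python v4 client; the host names, ports, API key, and timeout values below are placeholders, not our real ones:

import weaviate
from weaviate.classes.init import AdditionalConfig, Auth, Timeout

# Placeholder endpoint and key; in practice this would be the ingress host and the real API key.
client = weaviate.connect_to_custom(
    http_host="weaviate.example.com",
    http_port=443,
    http_secure=True,
    grpc_host="weaviate-grpc.example.com",
    grpc_port=443,
    grpc_secure=True,
    auth_credentials=Auth.api_key("xxxxxxxxxxxxxxxxxx"),
    # Raise the client-side timeouts (in seconds) so long-running queries and
    # imports are not cut off before the server's 600 s read/write timeouts.
    additional_config=AdditionalConfig(timeout=Timeout(init=30, query=300, insert=600)),
)

try:
    print(client.is_ready())
finally:
    client.close()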

The logs are not very helpful. Sometimes we see "context deadline exceeded" errors, but sometimes there are no logs at all until we restart the pods.

When we restart the pods, memory drops for a while and things start working again. However, we no longer think memory usage is the only root cause, because sometimes the pods still have plenty of memory headroom and the application keeps malfunctioning.

Below you can see the values we are using to deploy the Helm chart.

Server Setup Information

  • Weaviate Server Version: 1.25.0
  • Deployment Method: Kubernetes - Helm chart (17.0.0)
  • Multi Node? Number of Running Nodes: 3
  • Client Language and Version: Python3
  • Multitenancy?: No

Any additional Information

Values used to deploy Helm chart:

USER-SUPPLIED VALUES:
annotations:
  ad.datadoghq.com/weaviate.checks: |-
    {
      "weaviate": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics"
          }
        ]
      }
    }
args:
- --host
- 0.0.0.0
- --port
- "8080"
- --scheme
- http
- --config-file
- /weaviate-config/conf.yaml
- --read-timeout=600s
- --write-timeout=600s
authentication:
  anonymous_access:
    enabled: false
  apikey:
    allowed_keys:
    - xxxxxxxxxxxxxxxxxx
    enabled: true
    users:
    - api-key-user-admin
authorization:
  admin_list:
    enabled: true
    users:
    - api-key-user-admin
backups:
  s3:
    enabled: true
    envconfig:
      AWS_REGION: us-east-1
      BACKUP_S3_BUCKET: weaviate-backup-platform-platform-prod-us-east-1
debug: false
env:
  LIMIT_RESOURCES: true
  #LOG_LEVEL: trace
  PERSISTENCE_LSM_ACCESS_STRATEGY: pread
  PROMETHEUS_MONITORING_ENABLED: true
  PROMETHEUS_MONITORING_GROUP: false
  QUERY_SLOW_LOG_ENABLED: true
modules:
  generative-cohere:
    enabled: true
  generative-openai:
    enabled: true
  generative-palm:
    enabled: true
  qna-openai:
    enabled: true
  ref2vec-centroid:
    enabled: true
  reranker-cohere:
    enabled: true
  text2vec-cohere:
    enabled: true
  text2vec-huggingface:
    enabled: true
  text2vec-openai:
    enabled: true
  text2vec-palm:
    enabled: true
query_defaults:
  limit: 100
replicas: 3
resources:
  requests:
    cpu: 100m
    memory: 150Gi
service:
  name: weaviate
  ports:
  - name: https
    port: 443
    protocol: TCP
  type: ClusterIP
serviceAccountName: weaviate-sa
storage:
  size: 300Gi
  storageClassName: weaviate-sc-platform

All collections were originally created with a replication factor of 3 and no sharding configuration. We are now refactoring all of them to optimise the setup: we are recreating the collections in a new cluster with the right configuration and re-importing the data.
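
As a rough sketch of the kind of recreation call this involves (Python v4 client; the collection name, replication factor, and shard count below are placeholders rather than our final values):

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # placeholder connection for the sketch

try:
    client.collections.create(
        "Articles",  # hypothetical collection name
        # Spread the data across shards instead of fully replicating
        # every object to all three nodes.
        replication_config=Configure.replication(factor=2),
        sharding_config=Configure.sharding(desired_count=3),
    )
finally:
    client.close()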

The image below shows a backup running at around 2 am and memory increasing as a result. It never goes down again until we restart the pods at around 13:00.

Log error examples:

connect: Get "http://10.10.17.12:7001/replicas/indices/collection_name/shards/oc2DaAtccNfT/objects/613aa6d3-b328-4221-8754-ee5fa446c5be?schema_version=0": context canceled
"http://10.10.21.74:7001/replicas/indices/collection_name/shards/oc2DaAtccNfT:commit?request_id=weaviate-1-64-191e683965f-3ee": context deadline exceeded 10.10.21.74:7001: connect: Post 

hi @Oscar_Dalmau_Roig !!

Welcome to our community :hugs:

Have you tried setting a limit on VectorCacheMaxObjects?

For more info on this:

Maybe it can hold down the memory during backups.
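
Since the collections are being recreated anyway, the limit can be set at creation time. A small sketch with the Python v4 client, where the collection name and the limit value are placeholders:

import weaviate
from weaviate.classes.config import Configure

client = weaviate.connect_to_local()  # placeholder connection for the sketch

try:
    client.collections.create(
        "Articles",  # hypothetical collection name
        vector_index_config=Configure.VectorIndex.hnsw(
            # Cap how many vectors the HNSW index keeps cached in memory;
            # vectors beyond this count are read from disk when needed.
            vector_cache_max_objects=100_000,
        ),
    )
finally:
    client.close()

vectorCacheMaxObjects is also documented as a mutable setting, so it can be adjusted later without recreating the collection.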