Support needed for fixing Weaviate performance issues

Description

We are experiencing performance issues with a Weaviate multi-node deployment running in Kubernetes. Problems we are facing:

  • Memory usage. Memory grows steadily as more data is added to the cluster. It also jumps every time we perform a backup and never returns to the previous level.
  • Timeouts. We are getting 500 and 502 errors. Weaviate is exposed through an ingress, and we have confirmed the timeouts are not occurring at the load-balancer level.

We have tried the following:

  • Updated environment variables to optimise memory usage.
  • Increased the LB and Weaviate timeouts from 60s to 600s.
  • Upgraded Weaviate to version 1.26.x.
  • Refactored collections to use a better replication factor/sharding configuration.
  • Increased Kubernetes node capacity and pod resource requests.

The logs are not very helpful. Sometimes we see "context deadline exceeded", but sometimes there are no logs at all until we restart the pods.

When we restart the pods, memory drops for a while and things start working again. However, we no longer think memory usage is the only root cause: sometimes the pods still have memory headroom and the application keeps malfunctioning anyway.

Below you can see the values we are using to deploy the Helm chart.

Server Setup Information

  • Weaviate Server Version: 1.25.0
  • Deployment Method: Kubernetes - Helm chart (17.0.0)
  • Multi Node? Number of Running Nodes: 3
  • Client Language and Version: Python3
  • Multitenancy?: No

Any additional Information

Values used to deploy Helm chart:

USER-SUPPLIED VALUES:
annotations:
  ad.datadoghq.com/weaviate.checks: |-
    {
      "weaviate": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics"
          }
        ]
      }
    }
args:
- --host
- 0.0.0.0
- --port
- "8080"
- --scheme
- http
- --config-file
- /weaviate-config/conf.yaml
- --read-timeout=600s
- --write-timeout=600s
authentication:
  anonymous_access:
    enabled: false
  apikey:
    allowed_keys:
    - xxxxxxxxxxxxxxxxxx
    enabled: true
    users:
    - api-key-user-admin
authorization:
  admin_list:
    enabled: true
    users:
    - api-key-user-admin
backups:
  s3:
    enabled: true
    envconfig:
      AWS_REGION: us-east-1
      BACKUP_S3_BUCKET: weaviate-backup-platform-platform-prod-us-east-1
debug: false
env:
  LIMIT_RESOURCES: true
  #LOG_LEVEL: trace
  PERSISTENCE_LSM_ACCESS_STRATEGY: pread
  PROMETHEUS_MONITORING_ENABLED: true
  PROMETHEUS_MONITORING_GROUP: false
  QUERY_SLOW_LOG_ENABLED: true
modules:
  generative-cohere:
    enabled: true
  generative-openai:
    enabled: true
  generative-palm:
    enabled: true
  qna-openai:
    enabled: true
  ref2vec-centroid:
    enabled: true
  reranker-cohere:
    enabled: true
  text2vec-cohere:
    enabled: true
  text2vec-huggingface:
    enabled: true
  text2vec-openai:
    enabled: true
  text2vec-palm:
    enabled: true
query_defaults:
  limit: 100
replicas: 3
resources:
  requests:
    cpu: 100m
    memory: 150Gi
service:
  name: weaviate
  ports:
  - name: https
    port: 443
    protocol: TCP
  type: ClusterIP
serviceAccountName: weaviate-sa
storage:
  size: 300Gi
  storageClassName: weaviate-sc-platform

All collections were originally created with replication factor 3 and no sharding. Now we are trying to refactor all of them to optimise the setup. We are recreating the collections in a new cluster with the right config, and importing data.

In the image below you can see a backup being performed around 2am, after which memory increases. It never goes down again until we restart the pods (13:00).

Log error examples:

Get "http://10.10.17.12:7001/replicas/indices/collection_name/shards/oc2DaAtccNfT/objects/613aa6d3-b328-4221-8754-ee5fa446c5be?schema_version=0": context canceled
Post "http://10.10.21.74:7001/replicas/indices/collection_name/shards/oc2DaAtccNfT:commit?request_id=weaviate-1-64-191e683965f-3ee": dial tcp 10.10.21.74:7001: connect: context deadline exceeded

hi @Oscar_Dalmau_Roig !!

Welcome to our community :hugs:

Have you tried defining a limit to VectorCacheMaxObjects?

Maybe it can hold down the memory during backups.
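As a rough sanity check on that suggestion: the HNSW vector cache holds float32 vectors, so its memory footprint is on the order of cached objects × dimensions × 4 bytes (graph overhead not counted). The object count and dimensionality below are illustrative assumptions, not values from this thread; the point is that a vectorCacheMaxObjects of 500000000000 is far above any realistic object count, so it effectively imposes no limit.

```python
# Back-of-envelope for the HNSW vector cache footprint. Assumes float32
# vectors (~dims * 4 bytes each); object count and dims are hypothetical.

def vector_cache_bytes(n_objects: int, dims: int) -> int:
    """Approximate bytes consumed by n_objects cached float32 vectors."""
    return n_objects * dims * 4

n = 100_000_000      # assumed: 100M objects actually in the collection
dims = 1536          # assumed: e.g. a common embedding dimensionality
print(f"~{vector_cache_bytes(n, dims) / 2**30:.0f} GiB")  # ~572 GiB
```

A cap only bites when it is set below the number of objects actually stored, so lowering it well under the real object count is what frees memory.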

Hey @DudaNogueira - thanks for your reply!

We are using this value (half the default value):
"vectorCacheMaxObjects": 500000000000

We have also updated our setup with the following configuration:

Server Setup Information

  • Weaviate Server Version: 1.26.4
  • Deployment Method: Kubernetes - Helm chart (17.2.1)
  • Multi Node? Number of Running Nodes: 6
  • Client Language and Version: Python3
  • Multitenancy?: No

Any additional Information

Values used to deploy Helm chart:

USER-SUPPLIED VALUES:
annotations:
  ad.datadoghq.com/weaviate.checks: |-
    {
      "weaviate": {
        "instances": [
          {
            "openmetrics_endpoint": "http://%%host%%:2112/metrics"
          }
        ]
      }
    }
args:
- --host
- 0.0.0.0
- --port
- "8080"
- --scheme
- http
- --config-file
- /weaviate-config/conf.yaml
- --read-timeout=600s
- --write-timeout=600s
authentication:
  anonymous_access:
    enabled: false
  apikey:
    allowed_keys:
    - xxxxxxxxxxxxxxxx
    enabled: true
    users:
    - api-key-user-admin
authorization:
  admin_list:
    enabled: true
    users:
    - api-key-user-admin
backups:
  s3:
    enabled: true
    envconfig:
      AWS_REGION: us-east-1
      BACKUP_S3_BUCKET: weaviate-backup-platform-platform-prod-us-east-1
debug: false
env:
  GOGC: 20
  GOMEMLIMIT: 50GiB
  PERSISTENCE_LSM_ACCESS_STRATEGY: pread
  PROMETHEUS_MONITORING_ENABLED: true
  PROMETHEUS_MONITORING_GROUP: false
  QUERY_SLOW_LOG_ENABLED: true
  RAFT_ENABLE_FQDN_RESOLVER: true
  RAFT_FQDN_RESOLVER_TLD: weaviate-headless.weaviate-platform.svc.cluster.local
image:
  tag: 1.26.4
modules:
  generative-cohere:
    enabled: true
  generative-openai:
    enabled: true
  generative-palm:
    enabled: true
  qna-openai:
    enabled: true
  ref2vec-centroid:
    enabled: true
  reranker-cohere:
    enabled: true
  text2vec-cohere:
    enabled: true
  text2vec-huggingface:
    enabled: true
  text2vec-openai:
    enabled: true
  text2vec-palm:
    enabled: true
query_defaults:
  limit: 100
replicas: 6
resources:
  requests:
    cpu: 12000m
    memory: 100Gi
service:
  name: weaviate
  ports:
  - name: https
    port: 443
    protocol: TCP
  type: ClusterIP
serviceAccountName: weaviate-sa
storage:
  size: 400Gi
  storageClassName: weaviate-sc-platform

Our collections have the following configuration:

  • Replication factor: 2
  • Sharding: 24
  • Vector compression enabled
  • Max number of cached objects: 500000000000
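For what it's worth, a quick arithmetic check of how that configuration spreads shard replicas across the cluster (24 shards × replication factor 2 over 6 nodes):

```python
# Shard-replica distribution for the collection configuration above.
shards = 24
replication_factor = 2
nodes = 6

total_shard_replicas = shards * replication_factor
per_node = total_shard_replicas / nodes
print(f"{total_shard_replicas} shard replicas, ~{per_node:.0f} per node")
```

Eight shard replicas per node is what each pod has to hold in memory and include in every backup, which is worth keeping in mind when sizing GOMEMLIMIT.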

With the changes described above, we have seen better results in terms of memory and performance. However, backups are still an issue: memory increases a lot while a backup runs and does not return to the starting point. Restarting the pods frees some memory, but that is not an acceptable solution.

Please let us know if you need any further information. Would appreciate some support here.

hi @Oscar_Dalmau_Roig !

GOMEMLIMIT is recommended to be around 10% to 20% below the total available memory, as per the docs.

Have you tried changing that?

I will need to ask internally about this.

Sorry for the delay, we were out last week at an off-site meeting.

Thanks!

hi @Oscar_Dalmau_Roig !!

Have you tried changing the cpuPercentage as stated here?

While we have seen some spikes during backups, our team was not able to reproduce this consistently.
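For reference, a hypothetical sketch of the request body for triggering a backup with a reduced CPU percentage via the REST API. Nothing here is sent anywhere; the backup id is made up, and the config field name should be verified against the backup docs for your server version before relying on it.

```python
# Hypothetical backup request body with a throttled CPU percentage.
# Field names are an assumption based on the backup documentation;
# verify against your Weaviate version. The id is illustrative.
import json

body = {
    "id": "nightly-2024-09-01",   # illustrative backup id
    "config": {
        "CPUPercentage": 40,      # lower values throttle backup work
    },
}
payload = json.dumps(body)
print(payload)
```

The idea is that throttling the backup's CPU budget spreads the work out over time, which may also smooth the memory spike observed around the 2am backups.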