Weaviate backup failing: cannot resolve hostname for "weaviate-3"

Description

We have a Weaviate cluster deployed via Helm on Azure Kubernetes Service (AKS), backed by Azure PVC. The cluster has successfully ingested a substantial number of documents.

However, when attempting to initiate a backup, the process fails with an error stating that the cluster cannot resolve the hostname for "weaviate-3".

We suspect this issue might be related to manually increasing the replica count from 3 to 4 at some point, although we’re not entirely certain. Currently, there are only 3 pods running in the cluster.
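
For reference, here is roughly how we trigger the failing operation (a minimal sketch using the Python client v4; the hostname, API key, and backup ID below are placeholders, not our actual values):

import weaviate

# Connect to the cluster; API-key auth is enabled in our deployment (all values are placeholders).
client = weaviate.connect_to_custom(
    http_host="weaviate.example.com", http_port=80, http_secure=False,
    grpc_host="weaviate.example.com", grpc_port=50051, grpc_secure=False,
    auth_credentials=weaviate.auth.AuthApiKey("ADMIN_API_KEY"),
)

# Trigger a backup to the Azure backend provided by the backup-azure module.
result = client.backup.create(
    backup_id="backup-2024-01-01",
    backend="azure",
    wait_for_completion=True,
)
print(result)

client.close()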


Server Setup Information

  • Cluster Type: AKS (Azure Kubernetes Service)

  • Deployment Method: Helm

  • Weaviate Version: 1.30.0

  • Modules Enabled: backup-azure

Environment Variables / Helm Values:

- name: AUTHENTICATION_APIKEY_ENABLED
  value: "true"
- name: AUTHENTICATION_APIKEY_USERS
  value: admin
- name: CLUSTER_DATA_BIND_PORT
  value: "7001"
- name: CLUSTER_GOSSIP_BIND_PORT
  value: "7000"
- name: GOGC
  value: "100"
- name: PROMETHEUS_MONITORING_ENABLED
  value: "false"
- name: PROMETHEUS_MONITORING_GROUP
  value: "false"
- name: QUERY_MAXIMUM_RESULTS
  value: "100000"
- name: RAFT_BOOTSTRAP_TIMEOUT
  value: "600"
- name: REINDEX_VECTOR_DIMENSIONS_AT_STARTUP
  value: "false"
- name: TRACK_VECTOR_DIMENSIONS
  value: "false"
- name: AUTHENTICATION_APIKEY_ALLOWED_KEYS
  valueFrom:
    secretKeyRef:
      name: weaviate-secret
      key: AUTHENTICATION_APIKEY_ALLOWED_KEYS
- name: RUNTIME_OVERRIDES_ENABLED
  value: "false"
- name: RUNTIME_OVERRIDES_PATH
  value: /config/overrides.yaml
- name: RUNTIME_OVERRIDES_LOAD_INTERVAL
  value: 2m
- name: CLUSTER_BASIC_AUTH_USERNAME
  valueFrom:
    secretKeyRef:
      name: weaviate-cluster-api-basic-auth
      key: username
- name: CLUSTER_BASIC_AUTH_PASSWORD
  valueFrom:
    secretKeyRef:
      name: weaviate-cluster-api-basic-auth
      key: password
- name: PERSISTENCE_DATA_PATH
  value: /var/lib/weaviate
- name: DEFAULT_VECTORIZER_MODULE
  value: none
- name: ENABLE_MODULES
  value: backup-azure
- name: RAFT_JOIN
  value: weaviate-0,weaviate-1,weaviate-2
- name: RAFT_BOOTSTRAP_EXPECT
  value: "3"
- name: BACKUP_AZURE_CONTAINER
  value: weaviate-backups
- name: AZURE_STORAGE_CONNECTION_STRING
  valueFrom:
    secretKeyRef:
      name: weaviate-secret
      key: AZURE_STORAGE_CONNECTION_STRING
- name: CLUSTER_JOIN
  value: weaviate-headless.weaviate.svc.cluster.local.


Cluster Details

  • Weaviate Version: 1.30.0

  • Number of Running Nodes: 3

  • Multitenancy: Enabled


Additional Information

  • Backups are configured using the backup-azure module

  • The suspected root cause is a mismatch between the Helm configuration (RAFT_BOOTSTRAP_EXPECT=3) and manual scaling (replicas=4).
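
One way to check this suspicion (a sketch with the Python client v4; connection details and the API key are placeholders) is to list the nodes the cluster still tracks. If a node name such as weaviate-3 shows up here while only 3 pods are running, that would explain the hostname resolution failure:

import weaviate

# Connect with the admin API key (placeholder values).
client = weaviate.connect_to_custom(
    http_host="weaviate.example.com", http_port=80, http_secure=False,
    grpc_host="weaviate.example.com", grpc_port=50051, grpc_secure=False,
    auth_credentials=weaviate.auth.AuthApiKey("ADMIN_API_KEY"),
)

# List the nodes (and their shards) that the cluster currently knows about.
for node in client.cluster.nodes(output="verbose"):
    print(node.name, node.status)

client.close()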

Hi!

Welcome to our community :hugs:

Was the backup created while the cluster had a replica factor of 4, and do you now want to restore it to a factor-3 cluster?

You should have exactly the same number of pods when backing up and when restoring.
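
For reference, the restore would then be triggered roughly like this (a minimal sketch with the Python client v4, run against a cluster with the same node count; the backup ID and connection details are placeholders):

import weaviate

# Connect to the target cluster, which must have the same number of nodes as the source.
client = weaviate.connect_to_custom(
    http_host="weaviate.example.com", http_port=80, http_secure=False,
    grpc_host="weaviate.example.com", grpc_port=50051, grpc_secure=False,
    auth_credentials=weaviate.auth.AuthApiKey("ADMIN_API_KEY"),
)

# Restore the previously created Azure backup by its ID.
result = client.backup.restore(
    backup_id="backup-2024-01-01",
    backend="azure",
    wait_for_completion=True,
)
print(result)

client.close()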

Let me know if this is the scenario.

Thanks!

Hi @DudaNogueira

Thanks for getting back to us. We identified that scaling down the cluster hosting our multi-tenancy-enabled collection caused this behavior. We were able to take the backup after bringing the cluster size back to 4. I assume this is related to Raft state information stored on the other nodes. (The 4th node was not holding any tenant shards before, though.)

Also, thank you for confirming that restoring requires the exact same number of nodes.

Hi @Tibin!

Glad it all worked out. Our team is working on a shard movement feature that will allow you to increase and tweak replication factors, drain nodes, etc.

For now, unless you are only using multi-tenant collections, the best way to grow your cluster is to migrate to a new one with the necessary resources.

Thanks!
