EKS Multi-Replica Raft Migration Failure in Weaviate 1.25.0

Description

Hello, I followed this doc to upgrade from Weaviate 1.22.6 to 1.25.0 on a 3-replica cluster, and the upgrade resulted in an incomplete Raft schema migration: 7 out of 19 classes became inaccessible and now fail with “shard not found” errors. The issue only occurs on multi-replica clusters; a single-replica staging cluster upgraded successfully. There were no errors in the pod logs during the upgrade; the pods started successfully with no crashes.
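For reference, this is roughly how I identify the affected classes with the Python v3 client (a sketch; the endpoint and API key below are placeholders for our setup). It fetches the schema and runs a meta-count aggregate against each class, printing any that come back with a GraphQL error such as “shard not found”:

import weaviate

# Placeholder endpoint and one of the allowed API keys
client = weaviate.Client(
    url="http://weaviate.internal.example.com",
    auth_client_secret=weaviate.AuthApiKey(api_key="<REDACTED_API_KEY_1>"),
)

schema = client.schema.get()
for cls in schema.get("classes", []):
    name = cls["class"]
    try:
        result = client.query.aggregate(name).with_meta_count().do()
    except Exception as exc:  # HTTP-level failures
        print(f"{name}: request failed: {exc}")
        continue
    if "errors" in result:
        # Broken classes come back with GraphQL errors mentioning "shard not found"
        print(f"{name}: {result['errors'][0].get('message')}")
    else:
        count = result["data"]["Aggregate"][name][0]["meta"]["count"]
        print(f"{name}: OK ({count} objects)")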

Server Setup Information

  • Weaviate Server Version: 1.22.6 (attempted upgrade to 1.25.0, rolled back)
  • Deployment Method: Helm chart on AWS EKS Fargate
  • Multi Node? Number of Running Nodes: Yes, 3 nodes (StatefulSet with 3 replicas)
  • Client Language and Version: Python client v3
  • Multitenancy?: Yes - 5 out of 19 classes use multi-tenancy with 23-39 tenants each

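For completeness, this is how I enumerate the multi-tenant classes and their tenant counts with the Python v3 client (a sketch; it assumes client 3.22 or newer for the tenants API, and the endpoint/key are placeholders):

import weaviate

client = weaviate.Client(
    url="http://weaviate.internal.example.com",  # placeholder endpoint
    auth_client_secret=weaviate.AuthApiKey(api_key="<REDACTED_API_KEY_1>"),
)

# Print tenant counts for every class that has multi-tenancy enabled
for cls in client.schema.get().get("classes", []):
    if cls.get("multiTenancyConfig", {}).get("enabled"):
        tenants = client.schema.get_class_tenants(cls["class"])
        print(f"{cls['class']}: {len(tenants)} tenants")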
Additional Information

Environment Details
Platform: AWS EKS Fargate
Storage: AWS EFS with persistent volumes
ReplicationFactor: 1 (no data redundancy across nodes)

image:
  registry: docker.io
  tag: 1.22.6
  repo: semitechnologies/weaviate
  pullPolicy: IfNotPresent
  pullSecrets: []

command: ["/bin/weaviate"]
args:
  - '--host'
  - '0.0.0.0'
  - '--port'
  - '8080'
  - '--scheme'
  - 'http'
  - '--config-file'
  - '/weaviate-config/conf.yaml'
  - '--read-timeout=120s'
  - '--write-timeout=120s'

initContainers:
  sysctlInitContainer:
    enabled: true
    sysctlVmMaxMapCount: 524288
    image:
      registry: docker.io
      repo: alpine
      tag: latest
      pullPolicy: IfNotPresent
  
  extraInitContainers: {}

# 3-replica cluster configuration
replicas: 3

# Resource configuration
resources:
  requests:
    cpu: '16000m'
    memory: '64Gi'
  limits:
    cpu: '16000m'
    memory: '80Gi'

securityContext: {}

serviceAccountName:

# Persistent storage using AWS EFS
storage:
  size: 100Gi
  storageClassName: "efs-sc"

# Service configuration (NodePort for internal ALB)
service:
  name: weaviate
  ports:
    - name: http
      protocol: TCP
      port: 80
  type: NodePort
  loadBalancerSourceRanges: []
  clusterIP:
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/cross-zone-load-balancing-enabled: "true"
    alb.ingress.kubernetes.io/scheme: internet-facing
    # Subnet IDs redacted
    service.beta.kubernetes.io/aws-load-balancer-subnets: <REDACTED>
    service.beta.kubernetes.io/aws-load-balancer-type: "ip"

# Probes configuration
startupProbe:
  enabled: false
  initialDelaySeconds: 300
  periodSeconds: 60
  failureThreshold: 50
  successThreshold: 1
  timeoutSeconds: 3

livenessProbe:
  initialDelaySeconds: 900
  periodSeconds: 10
  failureThreshold: 30
  successThreshold: 1
  timeoutSeconds: 3

readinessProbe:
  initialDelaySeconds: 120
  periodSeconds: 10
  failureThreshold: 10
  successThreshold: 1
  timeoutSeconds: 15

terminationGracePeriodSeconds: 150

# Weaviate Authentication Configuration
authentication:
  apikey:
    enabled: true
    allowed_keys:
      - '<REDACTED_API_KEY_1>'
      - '<REDACTED_API_KEY_2>'
    users:
      - admin@example.com
      - readonly@example.com
  anonymous_access:
    enabled: false
  oidc:
    enabled: false

# Authorization Configuration
authorization:
  admin_list:
    enabled: true
    users:
      - admin@example.com
    readonly_users:
      - readonly@example.com

query_defaults:
  limit: 100

debug: false

# Environment variables
env:
  CLUSTER_GOSSIP_BIND_PORT: 7000
  CLUSTER_DATA_BIND_PORT: 7001
  
  # Aggressive GC settings for memory management
  GOGC: 50
  LIMIT_RESOURCES: true

  # Prometheus metrics enabled
  PROMETHEUS_MONITORING_ENABLED: true

  # GOMEMLIMIT set to 60GB (64424509440 bytes)
  # Note: This is critical for preventing OOM kills
  GOMEMLIMIT: "64424509440"

  # Query limits
  QUERY_MAXIMUM_RESULTS: 15000

  # Vector dimension tracking disabled for performance
  TRACK_VECTOR_DIMENSIONS: false
  REINDEX_VECTOR_DIMENSIONS_AT_STARTUP: false

envSecrets: {}

# Backup providers (all disabled)
backups:
  filesystem:
    enabled: false
  s3:
    enabled: false
  gcs:
    enabled: false
  azure:
    enabled: false

# Modules configuration - all disabled (no vectorization modules)
modules:
  text2vec-contextionary:
    enabled: false
  text2vec-transformers:
    enabled: false
  text2vec-openai:
    enabled: false
  text2vec-huggingface:
    enabled: false
  text2vec-cohere:
    enabled: false
  text2vec-palm:
    enabled: false
  ref2vec-centroid:
    enabled: false
  multi2vec-clip:
    enabled: false
  qna-transformers:
    enabled: false
  qna-openai:
    enabled: false
  generative-openai:
    enabled: false
  generative-cohere:
    enabled: false
  generative-palm:
    enabled: false
  img2vec-neural:
    enabled: false
  reranker-cohere:
    enabled: false
  reranker-transformers:
    enabled: false
  text-spellcheck:
    enabled: false
  ner-transformers:
    enabled: false
  sum-transformers:
    enabled: false
  
  # No default vectorizer - using external embedding services
  default_vectorizer_module: none

custom_config_map:
  enabled: false
  name: 'custom-config'

annotations:

nodeSelector:

tolerations:

# Pod anti-affinity to spread replicas across nodes
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        podAffinityTerm:
          topologyKey: "kubernetes.io/hostname"
          labelSelector:
            matchExpressions:
              - key: "app"
                operator: In
                values:
                  - weaviate


Hi!

We do not recommend jumping versions.

So for this migration, you will need:
1.22.6 → restart → 1.22.latest → restart → 1.23.latest → restart → 1.24.latest → restart → 1.25.0
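Between hops, it is worth confirming that every node reports the expected version and that all 19 classes are still visible before moving on. A rough check with the Python v3 client (the endpoint and key below are placeholders for your setup):

import weaviate

client = weaviate.Client(
    url="http://weaviate.internal.example.com",  # placeholder endpoint
    auth_client_secret=weaviate.AuthApiKey(api_key="<YOUR_API_KEY>"),
)

# Server build version should match the hop you just upgraded to
print("version:", client.get_meta().get("version"))

# All three replicas should report HEALTHY on the same version
for node in client.cluster.get_nodes_status():
    print(node.get("name"), node.get("status"), node.get("version"))

# The schema should still contain all 19 classes
print("classes:", len(client.schema.get().get("classes", [])))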

Note that this is a significant migration; a lot has changed since 1.22.

Depending on your dataset size, it's probably easier to spin up a new cluster and migrate your data over using this migration guide: Migrate data | Weaviate Documentation
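If you go that route, the core of the migration with the Python v3 client is a cursor-based export from the old cluster and a batch import into the new one, roughly like below (the endpoints, class name, and property list are placeholders; the guide covers the full flow, including multi-tenant classes and authentication):

import weaviate

source = weaviate.Client(url="http://old-cluster.example.com")  # placeholder
target = weaviate.Client(url="http://new-cluster.example.com")  # placeholder

CLASS_NAME = "Article"        # placeholder class
PROPS = ["title", "body"]     # placeholder properties
BATCH = 100

target.batch.configure(batch_size=BATCH)

cursor = None
while True:
    # Cursor-based export: page through the class by object id
    query = (
        source.query.get(CLASS_NAME, PROPS)
        .with_additional(["id", "vector"])
        .with_limit(BATCH)
    )
    if cursor is not None:
        query = query.with_after(cursor)
    objects = query.do()["data"]["Get"][CLASS_NAME]
    if not objects:
        break
    cursor = objects[-1]["_additional"]["id"]

    # Re-import into the new cluster, keeping ids and vectors
    with target.batch as batch:
        for obj in objects:
            extra = obj.pop("_additional")
            batch.add_data_object(
                data_object=obj,
                class_name=CLASS_NAME,
                uuid=extra["id"],
                vector=extra.get("vector"),
            )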

Let me know if this helps!