[STORAGE][LARGER DATASETS] Can we use NFS for PERSISTENCE_DATA_PATH?

Hello folks,

We are trying to validate whether Weaviate is a good match for our use-case scenarios.

Our datasets are pretty big: they may range from 100 million to a couple of billion vectors, generated within a few hours, with higher dimensions (>768).

At this scale, local storage is not feasible for us.
Hence I have a couple of questions on which I seek help from the community.

  1. Can we use an S3 or MinIO type of object storage endpoint as the persistent storage directory with Weaviate (like Milvus does, for reference)?

(or)

  1. I read about EKS somewhere in the documentation. Could we instead use NFS-based options for persistent storage for the Kubernetes cluster deployments?

Which one is recommended for performance and data consistency?

Also, how does Weaviate behave if we use NFS as the PERSISTENCE_DATA_PATH when I deploy multiple replicas? How are the reads and writes load balanced?

Will all the pods write to the same NFS data path (e.g. my persistent data path is set to "/mnt/weavieatedb/data/"), or do we need to configure each pod to point to a separate data folder?

Pod0 ==> ("/mnt/weavieatedb/data_0/")
Pod1 ==> ("/mnt/weavieatedb/data_1/")
PodN ==> ("/mnt/weavieatedb/data_n/")

We have a huge NFS share (presented from a dedicated storage box), so performance and throughput should not be a concern.

  1. Can we use one PVC created out of the NFS share and reference it in values.yaml when deploying via the Helm chart?
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: wv-storage
  namespace: wvdb  # Match the namespace
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  resources:
    requests:
      storage: 500Gi
  volumeName: wv-nfs-pv

And my PersistentVolume is this:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: wv-nfs-pv
spec:
  capacity:
    storage: 500Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /wvdb  # Path on NFS server
    server: mystorageserver  # NFS server hostname or IP
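
Since the claim uses storageClassName: "" together with an explicit volumeName, it binds statically to that PV. A quick sanity check before installing the chart could look like this (the manifest file names below are just placeholders):

# apply the PV and PVC, then confirm the claim reports STATUS "Bound"
kubectl apply -f wv-nfs-pv.yaml
kubectl apply -f wv-storage-pvc.yaml
kubectl get pv wv-nfs-pv
kubectl get pvc wv-storage -n wvdb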

In my values.yaml:

storage:
  fullnameOverride: "wv-storage"
  size: 500Gi
  storageClassName: ""

When I deploy the Helm chart, I can only see one pod. Is Weaviate not a distributed architecture?

Every 2.0s: kubectl get pods -n wvdb        sn1-r6515-h01-05: Fri Oct 11 07:36:39 2024

NAME         READY   STATUS    RESTARTS   AGE
weaviate-0   1/1     Running   0          100s


Hi @Adi_Sra_Ga!

As long as Weaviate can write to the persistent path, it should work. I am not sure NFS will deliver the best performance.

For this amount of objects, I strongly suggest contacting our sales team so we can arrange a call with our team on how to better architect a solution for you.

Each replica should have its own PERSISTENCE_DATA_PATH.
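
With the official chart, each replica gets its own PVC from the StatefulSet's volume claim template, so the simplest way to keep the data on NFS while still giving every pod its own directory is an NFS-backed dynamic provisioner. A rough sketch of the storage section in values.yaml, assuming something like the kubernetes-sigs nfs-subdir-external-provisioner is installed (the class name "nfs-client" below is an assumption, substitute whatever your provisioner creates):

storage:
  # each Weaviate pod claims its own volume, which the provisioner maps to a
  # separate subdirectory on the NFS share ("nfs-client" is an assumed class name)
  size: 500Gi
  storageClassName: "nfs-client"

That avoids pointing all pods at one shared ReadWriteMany path: inside the container every pod still uses the same PERSISTENCE_DATA_PATH, but each one is backed by its own volume.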

You can use your own PVC. Check here for more info on that.

In order to run multiple nodes while deploying with our official Helm chart, you will need to change the replicas definition in your values.yaml (link here).
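
For example (a sketch; 3 is just an illustrative node count):

# values.yaml (sketch) - run a three-node Weaviate cluster
replicas: 3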

Let me know if this helps!

Thanks!

Hi @DudaNogueira ,

Thanks for the reply. Are S3 buckets or other object-based storage endpoints supported if I use them as the persistent data path? I can only see configurations for backup to S3.

Hi!

Those are only suggested for backup storage, not for the PERSISTENCE_DATA_PATH, as Weaviate needs quick read/write access to that path.
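
If you still want S3 or MinIO in the picture, the supported route is the backup-s3 module, which is configured through environment variables rather than as the data path. A rough sketch of the pod environment (the bucket and endpoint are placeholders; credentials normally come from a Kubernetes secret):

# pod environment (sketch) - S3/MinIO is used only for backups
ENABLE_MODULES: backup-s3
BACKUP_S3_BUCKET: weaviate-backups            # placeholder bucket name
BACKUP_S3_ENDPOINT: minio.example.local:9000  # placeholder MinIO/S3 endpoint
BACKUP_S3_USE_SSL: "false"
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are typically injected from a secret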