Storing data in Weaviate

Hello,
I have a question about storing data in Weaviate (running in Docker inside a virtual machine).
When I push data into Weaviate, it needs to connect to OpenAI. Does all the data stay local? We’re storing confidential data in the Docker container, and I need to be sure that it always stays local.

Thanks!

Hi, if you use OpenAI as your vectorizer, your data will be sent to them and vectorized, and the vector will be returned and inserted along with your object. No way around this.

If you need to never send data out, then you must vectorize with a self-hosted model. You can do it manually, meaning that you do not let Weaviate vectorize automatically and instead compute the vector yourself and add it to the object data, or you can use the text2vec-transformers module and package your model in a container that you then use as the default vectorizer module. Hope this helps a little.
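To make the manual route a bit more concrete, here is a minimal sketch of attaching locally computed vectors to records before they go anywhere near Weaviate. The names `toy_embed` and `attach_vectors` are hypothetical, and `toy_embed` is only a hash-based stand-in so the snippet runs without a model; it carries no semantic meaning. With spaCy, the embedding step would be something like `nlp(text).vector.tolist()`.

```python
import hashlib
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Hypothetical stand-in for a local model (NOT semantically meaningful).

    It only guarantees a deterministic, fixed-length vector computed locally.
    dim must be at most 32 because a SHA-256 digest has 32 bytes.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def attach_vectors(
    records: Sequence[Dict[str, str]],
    embed: Callable[[str], Vector],
    text_fields: Sequence[str],
) -> List[Dict[str, object]]:
    """Return copies of the records, each with a locally computed 'vector' key."""
    items: List[Dict[str, object]] = []
    for record in records:
        # Join only the fields that should influence the vector.
        text = " ".join(str(record.get(field, "")) for field in text_fields)
        item: Dict[str, object] = dict(record)  # copy, so the input is untouched
        item["vector"] = embed(text)
        items.append(item)
    return items


records = [
    {"title": "Local embeddings", "kicker": "No data leaves the VM", "author": "A. Writer"},
]
data = attach_vectors(records, toy_embed, text_fields=("title", "kicker"))
```

Each element of `data` is then a plain dict holding the properties plus a `vector` key, and nothing in this step makes a network call.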

Ciao from Roma !


Thanks!

Now I’m trying to incorporate a local vectorizer model, but I think I’m doing something wrong. I’m passing the “vector” parameter to the add_data_object function and using the en_core_web_lg model from spaCy to create the vector.

How can I make Weaviate not vectorize automatically?

This magic happens in three places :)

In the Weaviate docker-compose.yml file, you do not declare any modules:

services:
  weaviate:
    volumes:
      - weaviate_data:/var/lib/weaviate
    image: semitechnologies/weaviate:1.24.21
    ports:
      - 8077:8080
      - 50051:50051
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"

volumes:
  weaviate_data:

As you can see, there’s a DEFAULT_VECTORIZER_MODULE: "none".

The second is where you define your collection:

import weaviate.classes.config as wvcc

# client is a connected Weaviate v4 client; schema_name (the collection name)
# and isagog_stopwords (a list of extra stopwords) are defined elsewhere in the app.
client.collections.create(
    schema_name,
    description="A class to store articles with a semantic kicker and searchable author.",
    vectorizer_config=None,  # no automatic vectorization for this collection
    inverted_index_config=wvcc.Configure.inverted_index(
        index_property_length=True,
        stopwords_preset=None,
        stopwords_additions=isagog_stopwords,
    ),
    vector_index_config=wvcc.Configure.VectorIndex.hnsw(
        distance_metric=wvcc.VectorDistances.COSINE
    ),
    properties=[
        # default tokenization is tokenization=wvcc.Tokenization.WORD
        wvcc.Property(name="app_id", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # app generated publicationDay-slug
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="category", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="excerpt", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="kicker", data_type=wvcc.DataType.TEXT),  # to be vectorized
        wvcc.Property(name="locmentions", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="orgmentions", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="permentions", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="publicationDay", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # search/filter
        wvcc.Property(name="tag", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # search/filter
        wvcc.Property(name="title", data_type=wvcc.DataType.TEXT),  # to be vectorized
        wvcc.Property(name="topic", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # search/filter
    ],
)

And again you can see vectorizer_config=None.

Then, when you insert your object, you must insert both the data and its manually derived vector. Something along the following lines:

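try:
    with client.batch.dynamic() as batch:
        for item in data:
            vector = item.pop("vector")  # removes and returns the precomputed vector
            properties = item  # the remaining keys are the object's properties
            batch.add_object(properties=properties, collection=schema_name, vector=vector)
except Exception as exc:
    # Surface connection or import errors instead of failing silently
    print(f"Batch import failed: {exc}")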
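
One thing to keep in mind with this setup: with no vectorizer module enabled, Weaviate has nothing to embed a text query with, so a `near_text` search will not work and the query vector has to come from the same local model that produced the object vectors. A rough sketch of the query side, assuming the v4 Python client; `search_by_text` and `embed` are hypothetical names, and `collection` is whatever `client.collections.get(schema_name)` returns.

```python
from typing import Any, Callable, List


def search_by_text(
    collection: Any,
    embed: Callable[[str], List[float]],
    text: str,
    limit: int = 5,
) -> Any:
    """Embed the query text locally, then run a vector search in Weaviate."""
    query_vector = embed(text)  # computed locally; the text itself is never sent
    # near_vector searches with a caller-supplied vector instead of raw text.
    return collection.query.near_vector(near_vector=query_vector, limit=limit)
```

With a real collection this would be called as `response = search_by_text(collection, embed, "some query")`, and the hits read from `response.objects`.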

Hope this helps. Hasta la victoria!!! :)

Thanks!
It will help for sure!
