Storing the data to weaviate

garcia.e · April 8, 2024, 12:29pm

Hello,
I have a question about storing data in Weaviate (running in Docker inside a virtual machine).
When I push data into Weaviate, it needs to connect to OpenAI. Does all the data still stay local? We are storing confidential data in the Docker instance, and I need to be sure that it always stays local.

Thanks!

rjalex · April 8, 2024, 4:43pm

Hi, if you use OpenAI as your vectorizer your data will be sent to them, vectorized and the vector returned and inserted along with your object. No way around this.

If you need to never send data out, then you must vectorize with a self-hosted model. You can do this manually, meaning that you do not let Weaviate vectorize automatically and instead compute the vector yourself and add it to the object data, or you can use the text2vec-transformers module and package your model in a container that serves as the default vectorizer module. Hope this helps a little.
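
To make the manual route a bit more concrete, here is a minimal sketch of attaching a locally computed vector to each object before it is sent to Weaviate. The names `toy_embed` and `with_vector` are hypothetical, and the hashing embedder is only a stand-in so the example runs without extra packages. In practice you would swap in your own self-hosted model.

```python
import hashlib
import math
from typing import Callable

Vector = list[float]


def toy_embed(text: str, dim: int = 8) -> Vector:
    """Stand-in for a real local model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def with_vector(properties: dict, text_fields: list[str],
                embed: Callable[[str], Vector]) -> dict:
    """Return a copy of the properties with a locally computed 'vector' added."""
    text = " ".join(str(properties[f]) for f in text_fields if properties.get(f))
    return {**properties, "vector": embed(text)}


# Illustrative data only; nothing here leaves the local machine.
article = {"title": "Local embeddings", "kicker": "Nothing leaves the VM", "author": "garcia.e"}
item = with_vector(article, ["title", "kicker"], toy_embed)
```

Because the embedding function is passed in as a parameter, the rest of the pipeline does not change when you replace the toy embedder with a real model.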

Ciao from Roma !

garcia.e · April 11, 2024, 10:22am

Thanks!

Now I’m trying to incorporate a local vectorizer model, but I think I’m doing something wrong. I’m passing the “vector” parameter to the add_data_object function and using the en_core_web_lg model from spaCy to create the vector.

How can I stop Weaviate from vectorizing automatically?

rjalex · April 11, 2024, 5:14pm

This magic happens in three places

In the weaviate docker-compose.yml file you do not declare any modules:

services:
  weaviate:
    volumes:
      - weaviate_data:/var/lib/weaviate
    image: semitechnologies/weaviate:1.24.21
    ports:
      - 8077:8080
      - 50051:50051
    restart: on-failure:0
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"

volumes:
  weaviate_data:

as you can see there’s a DEFAULT_VECTORIZER_MODULE: "none"
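
If you want to double-check that no modules are active, a small sketch like the following may help. It assumes that the instance's `/v1/meta` endpoint reports enabled modules under a `modules` key and that the host port is 8077, as mapped in the compose file above. The helper names `enabled_modules` and `fetch_meta` and the sample response are hypothetical.

```python
import json
from urllib.request import urlopen


def enabled_modules(meta: dict) -> list[str]:
    """Return the sorted names of modules listed in a meta response."""
    return sorted((meta.get("modules") or {}).keys())


def fetch_meta(base_url: str = "http://localhost:8077") -> dict:
    """Fetch the meta document from a running instance (assumed endpoint: /v1/meta)."""
    with urlopen(f"{base_url}/v1/meta", timeout=5) as resp:
        return json.load(resp)


# Illustrative response shape, not captured from a real server.
sample_meta = {"hostname": "http://[::]:8080", "modules": {}, "version": "1.24.21"}
no_modules = enabled_modules(sample_meta)

# Against a running instance you would call:
#   print(enabled_modules(fetch_meta()))
```

An empty list here would suggest that no vectorizer module (such as an OpenAI one) is loaded, so Weaviate has nothing to send your data to.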

the second is where you define your collection:

import weaviate.classes.config as wvcc

# client, schema_name and isagog_stopwords are defined elsewhere in the application
client.collections.create(
    schema_name,
    description="A class to store articles with a semantic kicker and searchable author.",
    vectorizer_config=None,  # no automatic vectorization: vectors are supplied on insert
    inverted_index_config=wvcc.Configure.inverted_index(
        index_property_length=True,
        stopwords_preset=None,
        stopwords_additions=isagog_stopwords,
    ),
    vector_index_config=wvcc.Configure.VectorIndex.hnsw(
        distance_metric=wvcc.VectorDistances.COSINE
    ),
    properties=[
        # default tokenization is tokenization=wvcc.Tokenization.WORD
        wvcc.Property(name="app_id", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # app generated publicationDay-slug
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="category", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="excerpt", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="kicker", data_type=wvcc.DataType.TEXT),  # to be vectorized
        wvcc.Property(name="locmentions", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="orgmentions", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="permentions", data_type=wvcc.DataType.TEXT),  # search/filter
        wvcc.Property(name="publicationDay", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # search/filter
        wvcc.Property(name="tag", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # search/filter
        wvcc.Property(name="title", data_type=wvcc.DataType.TEXT),  # to be vectorized
        wvcc.Property(name="topic", data_type=wvcc.DataType.TEXT, tokenization=wvcc.Tokenization.FIELD),  # search/filter
    ],
)

and again you can see a vectorizer_config=None

Then when you insert your object you must insert both the data and its manually derived vector. Something along the following lines:

try:
    with client.batch.dynamic() as batch:
        for item in data:
            vector = item.pop("vector")  # This removes and returns the vector
            properties = item  # The rest of the data is now in 'properties'
            batch.add_object(properties=properties, collection=schema_name, vector=vector)
except Exception as e:
    # placeholder error handling; adapt to your application
    print(f"Batch import failed: {e}")
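
Before running a batch like this, it can help to check that every item really carries a vector and that all vectors have the same length. The following is a minimal sketch of such a check; `split_items` is a hypothetical helper name and the sample data is made up. Unlike `item.pop`, it does not modify the original items.

```python
from typing import Iterable, Iterator


def split_items(data: Iterable[dict],
                vector_key: str = "vector") -> Iterator[tuple[dict, list[float]]]:
    """Yield (properties, vector) pairs, checking that all vectors share one dimension."""
    expected_dim = None
    for index, item in enumerate(data):
        if vector_key not in item:
            raise ValueError(f"item {index} has no '{vector_key}' field")
        # float() also accepts numpy scalars, so array-like vectors become plain lists
        vector = [float(x) for x in item[vector_key]]
        if expected_dim is None:
            expected_dim = len(vector)
        elif len(vector) != expected_dim:
            raise ValueError(
                f"item {index} has dimension {len(vector)}, expected {expected_dim}"
            )
        properties = {k: v for k, v in item.items() if k != vector_key}
        yield properties, vector


# Illustrative data only.
data = [
    {"title": "First", "vector": [0.1, 0.2, 0.3]},
    {"title": "Second", "vector": [0.4, 0.5, 0.6]},
]
pairs = list(split_items(data))
```

Each `(properties, vector)` pair can then be passed to `batch.add_object(properties=..., collection=..., vector=...)` inside the batch loop.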

hope this helps. Hasta la victoria !!!

garcia.e · April 12, 2024, 10:08am

Thanks!
It will help for sure!
