BYOV for indexing and Vector Module for querying?

justin.godden · May 2, 2024, 2:13pm

Hi, is it possible to configure the vectorizer module so that it is only used at query time and not during indexing?

My use-case is to do bulk indexing using my own hardware - to save money, and because I have a specific chunking strategy - using an open source Hugging Face model (self-hosted),

and then at query time, just use the text2vec-huggingface module to call the Hugging Face Inference API (using the same embedding model) for me.

Are there any specific settings I need to configure to ensure this works?

If I do configure the collection to have a vectorizer module, but then provide the vectors at indexing time, will Weaviate automatically ignore the configured vectorizer?

I.e. by including the vector param below (where usually you’d leave that param out when you’ve configured a vectorizer):

batch.add_object(
    properties=movie_obj,
    uuid=generate_uuid5(movie["id"]),
    vector=vector,
)

Thanks!

DudaNogueira · May 2, 2024, 5:54pm

hi @justin.godden !

Welcome to our community

Your assumption is correct. When you provide your own vectors, even if you have the vectorizer configured, Weaviate will not vectorize that object for you.

The only requirement is that the vectors you provide have the same dimensionality and come from the same model as the ones generated at query time. Otherwise you will get errors (if the number of dimensions differs) or wrong results (same number of dimensions, but from a different model).
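As a minimal sketch of the dimensionality check described above, the locally computed vectors could be validated before import. The helper name find_dimension_mismatches and the value 384 are assumptions for illustration only; use the output size of your own model. Note that this catches only the "different dimensions" case. A different model with the same number of dimensions cannot be detected this way.

```python
from typing import Iterable, List, Sequence


def find_dimension_mismatches(
    vectors: Iterable[Sequence[float]], expected_dim: int
) -> List[int]:
    """Return the positions of vectors whose length differs from expected_dim."""
    return [i for i, vec in enumerate(vectors) if len(vec) != expected_dim]


# Hypothetical values: 384 is an assumed embedding size, not taken from the thread.
EXPECTED_DIM = 384
local_vectors = [[0.1] * 384, [0.2] * 383, [0.3] * 384]

bad_positions = find_dimension_mismatches(local_vectors, EXPECTED_DIM)
if bad_positions:
    print(f"Vectors with unexpected dimensions at positions: {bad_positions}")
```

Running a check like this before the bulk import surfaces a mismatch early, rather than as an error from Weaviate partway through the batch.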

Let me know if this helps

justin.godden · May 3, 2024, 8:34am

Hi Duda thanks for your explanation.

A follow-up question: rather than BYOV for indexing and the vectorizer module for querying (as described above), can I confirm that switching between BYOV and not providing vectors (relying on the vectorizer module) would work just as seamlessly at indexing time?

My use-case: initial bulk indexing (and vectorizing) done locally (e.g. 100k items indexed) providing the vectors myself. From then on, relying on the vectorizer module (with Hugging Face Endpoint URL) to vectorize each new item added to index.

I have a specific field that I would like to use for the vector, text_to_vectorize. Would I set vectorize_collection_name to False at the collection level and set vectorize_property_name=False and skip_vectorization=True for all other properties, and vectorize_property_name=False and skip_vectorization=False for the only property that I want vectorized? Does my setup look correct?

import weaviate.classes.config as wc

articles_schema = [
    wc.Property(
        name="other_field",
        data_type=wc.DataType.TEXT,
        index_filterable=False,
        index_searchable=True,
        vectorize_property_name=False,
        skip_vectorization=True,
    ),
    ..., # other properties
    ..., # all with vectorize_property_name=False
    ..., # and skip_vectorization=True,
    wc.Property(
        name="text_to_vectorize",
        data_type=wc.DataType.TEXT,
        index_filterable=False,
        index_searchable=False,
        vectorize_property_name=False,
        skip_vectorization=False,
    ),
]

client.collections.create(
    name="Article",
    properties=articles_schema,
    vectorizer_config=wc.Configure.Vectorizer.text2vec_huggingface(
        model=EMBEDDING_MODEL_NAME,
        vectorize_collection_name=False,
    ),
)

Thanks

sebawita · May 6, 2024, 12:41pm

Hi @justin.godden,

Your workflow should work, although I would recommend using the NamedVectors syntax to configure your vectorizers, as it provides a much cleaner way to define what should be used for vectorization.

Let me provide guidance step by step.

Create a simple collection

First, create a collection with a named vector and specify source_properties.

source_properties – is the list of properties that should be used for vectorization (when a vector is not provided). This syntax is a lot easier to follow than using skip_vectorization.

from weaviate.classes.config import Configure

client.collections.create(
    "Article",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="content_vector", 
            model=EMBEDDING_MODEL_NAME,
            source_properties=["title"] # the list of properties used for vectorization
        ),
    ],
)

Notes on NamedVectors.text2vec_huggingface

name – this is the name of your vector space. Since it looks like you will only work with one vector per object, the name doesn’t matter too much.
source_properties – this is the list of properties used for vectorization.

source_properties will only be used if you insert/update an object without providing a vector. So, for your initial import, this will get ignored. But it will be used when you add objects without vectors after.

Also, you don’t need to vectorize_property_name=False and vectorize_collection_name=False as these are set to false by default.

Create a simple collection with (optional) property schema

You can also provide the property schema with named vectors, but that won’t affect the source_properties defined in the named vectors.
Also, providing skip_vectorization in the property schema will be ignored, as the source_properties take precedence.

from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    "Article",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="content_vector", 
            model=EMBEDDING_MODEL_NAME,
            source_properties=["title", "body"] # the list of properties used for vectorization
        ),
    ],
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT),
        Property(
            name="author",
            data_type=DataType.TEXT,
            # this will get ignored, as source_properties already define
            # what should be used for vectorization
            skip_vectorization=False,
        ),
    ],
)

Initial data load

Then you can insert your data with your vectors – and since you will provide your vectors, the vectorizer will not be used.
Here is the example in the docs.

articles = client.collections.get("Article")

with articles.batch.dynamic() as batch:
    for item in your_data_list:
        batch.add_object(
            properties={ # pass the properties of your objects
                "title": item["title"],
                "body": item["body"],
                "author": item["author"],
            },
            vector={ # together with the vector
                "content_vector": item["vector"], # `content_vector` is the name of the vector space
            }
        )
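For the mixed workflow discussed in this thread (vectors supplied for the initial load, none supplied afterwards), here is a minimal sketch of a helper that includes the vector argument only when a precomputed vector exists. The helper name build_add_object_kwargs is hypothetical, and "content_vector" is assumed to match the named vector configured above.

```python
from typing import Any, Dict, Mapping, Optional, Sequence


def build_add_object_kwargs(
    properties: Mapping[str, Any],
    vector: Optional[Sequence[float]] = None,
    vector_name: str = "content_vector",
) -> Dict[str, Any]:
    """Build keyword arguments for batch.add_object.

    The "vector" key is added only when a precomputed vector is given.
    Without it, the configured vectorizer is expected to embed the object.
    """
    kwargs: Dict[str, Any] = {"properties": dict(properties)}
    if vector is not None:
        kwargs["vector"] = {vector_name: list(vector)}
    return kwargs


# Initial bulk load: the vector was computed locally (BYOV).
with_vector = build_add_object_kwargs({"title": "First"}, vector=[0.1, 0.2, 0.3])

# Later additions: no vector, so the vectorizer module would be used.
without_vector = build_add_object_kwargs({"title": "Second"})

print(sorted(with_vector))     # keys passed for the BYOV case
print(sorted(without_vector))  # keys passed when relying on the vectorizer
```

Inside the batch loop this would be used as batch.add_object(**build_add_object_kwargs(props, vec)), so the same import code covers both the initial load and later additions.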

Query

Then you can run a query on your collection, where Weaviate will generate a vector embedding from the provided query.

Note, the query is not affected by source_properties.

articles = client.collections.get("Article")
response = articles.query.near_text(
    query="a sweet German white wine",
    limit=2,
)

justin.godden · May 14, 2024, 11:04am

Hi @sebawita

Thanks for the comprehensive explanation.

That solution looks very clean.

Also, you don’t need to vectorize_property_name=False and vectorize_collection_name=False as these are set to false by default.

Note: both of the above look like they’re set to True by default in the Python client.

For vectorize_property_name: weaviate-python-client/weaviate/collections/classes/config.py at 78fa3d30ddc8f6bb12eb922f9f9bd3379b332f0e · weaviate/weaviate-python-client · GitHub

For vectorize_collection_name: weaviate-python-client/weaviate/collections/classes/config_named_vectors.py at 78fa3d30ddc8f6bb12eb922f9f9bd3379b332f0e · weaviate/weaviate-python-client · GitHub

If that is the case, should I add vectorize_collection_name to the vector configuration, and add vectorize_property_name to the specific property field that is being used for the vector (text_to_vectorize)?

Assuming I can ignore vectorize_property_name and skip_vectorization on all the other properties as I’m using source_properties.

import weaviate.classes.config as wc

articles_schema = [
    wc.Property(
        name="other_field",
        data_type=wc.DataType.TEXT,
        # not setting vectorize_property_name
        # or skip_vectorization
    ),
    ..., # other properties
    ..., # none with vectorize_property_name
    ..., # or skip_vectorization as both are ignored
    wc.Property(
        name="text_to_vectorize",
        data_type=wc.DataType.TEXT,
        vectorize_property_name=False, # ADDED THIS
        # skip_vectorization=False, # removed as using source_properties
    ),
]

from weaviate.classes.config import Configure

client.collections.create(
    "Article",
    properties=articles_schema,
    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="content_vector", 
            model=EMBEDDING_MODEL_NAME,
            source_properties=["text_to_vectorize"],
            vectorize_collection_name=False # ADDED THIS
        ),
    ],
)

Many thanks

sebawita · May 14, 2024, 2:02pm

Oh, good point. I am glad you double-checked the code.
Apologies for my mistake.

Yes, you can skip skip_vectorization - as in, you don’t need it.

Your configuration should do the trick
