Best practice to track which model generated an embedding

I am currently experimenting with Weaviate and am vectorizing my text with OpenAI's “text-embedding-ada-002” model. In the future I might change models, and of course I never want to query with ModelA against embeddings generated by ModelB.

If you have a database with embeddings from multiple models, what’s the most useful way of keeping track of which is which?

As a side beginner question: when I define my schema I only define my data fields and do not need to explicitly declare the fields for the embedding, right?

Would the same apply if I am saving N (where N>1) embeddings for N text fields?

Thanks a lot.

Hi @rjalex !

This information will be stored in the class definition.

So, for example, if you define a class using the Python v4 client:

import weaviate.classes as wvc
questions = client.collections.create(
    name="Question",  # collection name, matching the inspection call below
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),  # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    generative_config=wvc.Configure.Generative.openai(),  # Ensure the `generative-openai` module is used for generative queries
)

you can inspect that class, and discover the vectorizer:

In [11]: client.collections.get("Question").config.get().vectorizer_config
Out[11]: _VectorizerConfig(model={'baseURL': '', 'model': 'ada', 'modelVersion': '002', 'type': 'text'}, vectorize_collection_name=True)
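If you provide the vectors yourself, the class definition will not record the model for you, so one option is to keep that bookkeeping next to the collection name and refuse mismatched queries. This is a minimal sketch in plain Python, not a Weaviate API: `EmbeddingSpace`, `REGISTRY`, `assert_compatible`, and the collection name `Question_ada002` are all hypothetical, and 1536 is the output size of text-embedding-ada-002.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EmbeddingSpace:
    """Which model produced the vectors stored in a collection (hypothetical helper)."""
    collection: str
    model: str
    dimensions: int


# Hypothetical registry: one collection per embedding model.
REGISTRY = {
    "Question_ada002": EmbeddingSpace("Question_ada002", "text-embedding-ada-002", 1536),
}


def assert_compatible(collection: str, query_model: str, query_vector: list) -> None:
    """Raise ValueError if the query vector was not made by the collection's model."""
    space = REGISTRY[collection]
    if query_model != space.model:
        raise ValueError(
            f"{collection} holds {space.model} vectors, got a {query_model} query"
        )
    if len(query_vector) != space.dimensions:
        raise ValueError(
            f"expected {space.dimensions} dimensions, got {len(query_vector)}"
        )
```

Calling `assert_compatible(...)` right before every vector search keeps a ModelA query from ever reaching a ModelB collection.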

For your second question: when you define your properties, Weaviate will vectorize each of them by default, unless you explicitly set skip: True or the data type cannot be vectorized.

You can exclude a property from vectorization in the moduleConfig settings of that property.
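In the raw schema (the JSON that the REST API accepts), that per-property setting looks roughly like the dictionaries below. This is a sketch that assumes the text2vec-openai module; `property_definition` is a hypothetical helper and the property names are made up for illustration.

```python
def property_definition(name: str, skip: bool = False,
                        module: str = "text2vec-openai") -> dict:
    """Build one schema property; skip=True keeps it out of the vector."""
    return {
        "name": name,
        "dataType": ["text"],
        # Property-level module settings live under the vectorizer module's name.
        "moduleConfig": {module: {"skip": skip}},
    }


properties = [
    property_definition("question"),                   # vectorized (default)
    property_definition("internal_notes", skip=True),  # stored, not vectorized
]
```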

Let me know if that helps or if you need further assistance.

We are here to help!

Thank you so much. It is rare to find people so good at explaining complex things to a newbie :slight_smile: All clear now.

I currently have declared my vectorizer as None and am providing my own vectors.

Where in the documentation can I find the list of vectorizers supported internally by self-hosted Weaviate?

Ciao from Rome, Italy

Olá from Brazil!!

As you provide the vectors yourself, you can have the vectorizer as None.

However, every object vector must have the same dimension size. Weaviate will raise an error if you try adding vectors of different dimensions in the same class.

The “downside” of that (providing your own vectors without a vectorizer) is that you will not be able to use nearText, for example. Instead, you will vectorize your query yourself and use nearVector.
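With the Python v4 client, that flow might look like the sketch below. `search_with_own_vectors` and its `embed` argument are hypothetical: `embed` stands for whatever function you already use to turn text into a vector, and it must call the same model that produced the stored vectors. The only Weaviate call is `collection.query.near_vector(...)`; the length check mirrors the same-dimension rule mentioned above.

```python
from typing import Callable, Sequence


def search_with_own_vectors(
    collection,
    embed: Callable[[str], Sequence[float]],
    query_text: str,
    expected_dim: int,
    limit: int = 3,
):
    """Vectorize the query client-side, then run a nearVector search."""
    vector = list(embed(query_text))
    # Catch a query made with a different model before it reaches Weaviate.
    if len(vector) != expected_dim:
        raise ValueError(
            f"query vector has {len(vector)} dimensions, expected {expected_dim}"
        )
    return collection.query.near_vector(near_vector=vector, limit=limit)
```

Here `collection` would be the object returned by `client.collections.get(...)`, and `expected_dim` the dimension of the vectors you inserted.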

What you can do is adapt our transformers inference container (weaviate/t2v-transformers-models on GitHub, the repo for the container that holds the models for the text2vec-transformers module) so that it serves your own model.

This will allow you to vectorize your objects upon ingestion and to use nearText or hybrid search with a text query, instead of providing a vector for the query.

Let me know if this helps! If you need any assistance, we are here to help :slight_smile:

Thank you very much. I am ignorant, but I promise to learn fast. :smiley: