Best practice to track which model generated an embedding

rjalex · January 10, 2024, 1:35pm

I am currently experimenting with Weaviate and am vectorizing my text with OpenAI's “text-embedding-ada-002” model. In the future I might change models, and of course I never want to query using ModelA on embeddings generated by ModelB.

If you have a database with embeddings from multiple models, what’s the most useful way of keeping track of which is which?

As a side beginner question, when I define my schema I only define my data fields but do not need to explicitly declare the fields for the embedding, right?

Would the same apply if I am saving N (where N>1) embeddings for N text fields?

Thanks a lot.

DudaNogueira · January 11, 2024, 1:20am

Hi @rjalex !

This information will be stored in the class definition.

So, for example, if you define a class using the Python v4 client:

import weaviate.classes as wvc  # `client` is assumed to be an already connected v4 client

questions = client.collections.create(
    name="Question",
    # If set to "none" you must always provide vectors yourself. Could be any other "text2vec-*" also.
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),
    # Ensure the `generative-openai` module is used for generative queries
    generative_config=wvc.Configure.Generative.openai(),
)

you can inspect that class, and discover the vectorizer:

In [11]: client.collections.get("Question").config.get().vectorizer_config
Out[11]: _VectorizerConfig(model={'baseURL': 'https://api.openai.com', 'model': 'ada', 'modelVersion': '002', 'type': 'text'}, vectorize_collection_name=True)
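
To guard against querying with a different model than the one recorded on the collection, the settings shown in the output above can be compared with what the query side uses. This is only a minimal sketch in plain Python: `assert_same_model` and the `expected` dict are hypothetical names, not part of the Weaviate client, and the `stored` dict is simply copied from the `Out[11]` output.

```python
# Hypothetical helper, not part of the Weaviate client.
def assert_same_model(stored, expected, keys=("model", "modelVersion", "type")):
    """Raise ValueError if any of the compared settings differ."""
    mismatched = {
        key: (stored.get(key), expected.get(key))
        for key in keys
        if stored.get(key) != expected.get(key)
    }
    if mismatched:
        raise ValueError(f"embedding model mismatch: {mismatched}")


# Settings copied from the Out[11] output above.
stored = {
    "baseURL": "https://api.openai.com",
    "model": "ada",
    "modelVersion": "002",
    "type": "text",
}

# What the querying code believes it is using (assumed values for illustration).
expected = {"model": "ada", "modelVersion": "002", "type": "text"}

assert_same_model(stored, expected)  # passes silently when the settings match
```

Running a check like this once at startup, before any query is sent, would turn a silent model mix-up into an explicit error.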

For your second question: when you define your properties, Weaviate will vectorize them by default, unless you explicitly set skip: True or the data type cannot be vectorized.

You can exclude a property from vectorization by setting this in the moduleConfig of that property.
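
For illustration, this is roughly what such a property looks like in the raw schema JSON, written here as a Python dict. The property names `title` and `internal_id` and the helper `is_skipped` are made up for the example, and the module key `text2vec-openai` assumes that is the vectorizer configured on the class.

```python
import json

# Hypothetical properties; "text2vec-openai" assumes that module is the class vectorizer.
properties = [
    {"name": "title", "dataType": ["text"]},  # vectorized by default
    {
        "name": "internal_id",
        "dataType": ["text"],
        "moduleConfig": {"text2vec-openai": {"skip": True}},  # left out of the vector
    },
]


def is_skipped(prop, module="text2vec-openai"):
    """Return True when the property opts out of vectorization for the given module."""
    return bool(prop.get("moduleConfig", {}).get(module, {}).get("skip", False))


# Names of the properties that will not contribute to the object vector.
skipped = [p["name"] for p in properties if is_skipped(p)]
print(json.dumps(properties, indent=2))
```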

Let me know if that helps or if you need further assistance.

We are here to help!

rjalex · January 11, 2024, 7:26am

Thank you so much. It is rare to find people so good at explaining complex things to a newbie. All clear now.

I currently have declared my vectorizer as None and am providing my own vectors.

Where in the documentation can I find the list of vectorizers supported internally by self-hosted Weaviate?

Ciao from Rome, Italy

DudaNogueira · January 11, 2024, 12:52pm

Olá from Brazil!!

As you provide the vectors yourself, you can have the vectorizer as None.

However, every object vector must have the same dimension size. Weaviate will raise an error if you try adding vectors of different dimensions in the same class.
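
Since mixed dimensions are rejected at insert time, it can be convenient to check a batch before sending it. A minimal sketch in plain Python, where `check_dimensions` and the toy vectors are hypothetical and only stand in for real embeddings:

```python
def check_dimensions(vectors):
    """Return the shared dimension, or raise ValueError if the vectors disagree."""
    dims = {len(v) for v in vectors}
    if len(dims) != 1:
        raise ValueError(f"expected one vector dimension, found {sorted(dims)}")
    return dims.pop()


# Toy 3-dimensional vectors standing in for real embeddings.
batch = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
dimension = check_dimensions(batch)
```

A dimension check alone does not prove that two vectors came from the same model, because different models can share a dimension; it only catches the mismatches that Weaviate would reject anyway.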

The “downside” of providing your own vectors without a vectorizer is that you will not be able to use nearText, for example. Instead, you vectorize the query yourself and use nearVector.
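
The query side of that pattern can be sketched as follows. `search_by_text` and `embed` are hypothetical names: `embed` stands for whatever function produces your vectors and must use the same model as the stored objects, while `collection` is assumed to be a Python v4 collection object such as `client.collections.get("Question")`.

```python
def search_by_text(collection, embed, text, limit=3):
    """Embed the query text ourselves, then run a nearVector search with it."""
    # `embed` is a hypothetical callable returning a list of floats.
    query_vector = embed(text)
    # near_vector is the Python v4 counterpart of the nearVector operator.
    return collection.query.near_vector(near_vector=query_vector, limit=limit)
```

Keeping the embedding call and the search in one function makes it harder to accidentally query with a vector from a different model.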

What you can do is adapt our transformers container (weaviate/t2v-transformers-models on GitHub, the repo for the container that holds the models for the text2vec-transformers module) so that it serves your own model.

This will allow you to vectorize your objects upon ingestion, and to use nearText or hybrid search with a text query instead of a query vector.

Let me know if this helps! If you need any assistance, we are here to help

rjalex · January 11, 2024, 5:45pm

Thank you very much. I am ignorant, but I promise to learn fast.
