I am currently experimenting with Weaviate and vectorizing my text with OpenAI's “text-embedding-ada-002” model. In the future I might change models, and of course I never want to query using ModelA on embeddings generated by ModelB.
If you have a database with embeddings from multiple models, what’s the most useful way of keeping track of which is which?
As a beginner side question: when I define my schema, I only define my data fields and do not need to explicitly declare fields for the embedding, right?
Would the same apply if I am saving N (where N>1) embeddings for N text fields?
The vectorizer used for a class is stored in its class definition.
So, for example, if you define a class using the Python v4 client:
questions = client.collections.create(
    name="Question",
    # If set to "none", you must always provide vectors yourself. Could also be any other "text2vec-*" module.
    vectorizer_config=wvc.Configure.Vectorizer.text2vec_openai(),
    # Ensure the `generative-openai` module is used for generative queries.
    generative_config=wvc.Configure.Generative.openai(),
)
you can then inspect that class to see which vectorizer it uses.
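As a rough sketch of that inspection, assuming the v4 client exposes `collection.config.get()` and that the returned object has `name`, `vectorizer` and `vectorizer_config` attributes (please check this against your installed client version). The helper name `vectorizer_summary` and the stand-in objects are made up for illustration, so the snippet runs without a Weaviate instance:

```python
from types import SimpleNamespace


def vectorizer_summary(collection):
    """Return the collection name, vectorizer and vectorizer settings.

    Assumption: `collection.config.get()` returns an object with `name`,
    `vectorizer` and `vectorizer_config` attributes (v4 client shape).
    """
    config = collection.config.get()
    vectorizer = config.vectorizer
    return {
        "collection": config.name,
        # The client may return an enum here; fall back to the raw value.
        "vectorizer": getattr(vectorizer, "value", vectorizer),
        "vectorizer_config": config.vectorizer_config,
    }


# Hypothetical stand-in for a real collection object.
fake_config = SimpleNamespace(
    name="Question",
    vectorizer="text2vec-openai",
    vectorizer_config={"model": "text-embedding-ada-002"},
)
fake_collection = SimpleNamespace(config=SimpleNamespace(get=lambda: fake_config))

summary = vectorizer_summary(fake_collection)
print(summary)
```

With a real connection you would pass the `questions` collection instead of `fake_collection`.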
For your second question: when you define your properties, Weaviate vectorizes them by default, unless you explicitly set skip: True or the data type cannot be vectorized.
If you provide the vectors yourself, you can set the vectorizer to `none`.
However, every object vector must have the same dimension size. Weaviate will raise an error if you try adding vectors of different dimensions in the same class.
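If you bring your own vectors, a small client-side check before ingestion can surface a dimension mismatch early. This is only a sketch in plain Python; `check_same_dimensions` is a hypothetical helper, not part of the Weaviate client:

```python
def check_same_dimensions(vectors):
    """Return the shared dimension count, or raise ValueError on a mismatch."""
    dims = {len(vector) for vector in vectors}
    if not dims:
        raise ValueError("no vectors given")
    if len(dims) > 1:
        raise ValueError(f"mixed vector dimensions: {sorted(dims)}")
    return dims.pop()


shared_dim = check_same_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
print(shared_dim)  # 3
```

Note that two different models can produce vectors of the same size, so a check like this does not replace keeping a record of which model produced the stored vectors.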
The “downside” of that (providing your own vectors without a vectorizer) is that you will not be able to use nearText, for example. Instead, you vectorize your query yourself and use nearVector.
Configuring a vectorizer on the class, by contrast, lets Weaviate vectorize your objects on ingestion, and lets you run nearText or hybrid searches with a text query instead of a query vector.
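For the case where you supply your own vectors, here is a minimal sketch of the query side. It assumes the v4 client's `collection.query.near_vector(near_vector=..., limit=...)` call; `search_with_own_vectors`, `embed`, `stored_model` and `query_model` are hypothetical names, where `stored_model` stands for whatever record you keep of the model that produced the stored vectors. The stand-ins at the bottom let it run without Weaviate or an OpenAI key:

```python
from types import SimpleNamespace


def search_with_own_vectors(collection, embed, query_text,
                            stored_model, query_model, limit=3):
    """Embed the query ourselves, then run a near-vector search."""
    # Refuse to mix models: ModelA queries must not hit ModelB embeddings.
    if query_model != stored_model:
        raise ValueError(
            f"query model {query_model!r} does not match stored model {stored_model!r}"
        )
    query_vector = embed(query_text)
    return collection.query.near_vector(near_vector=query_vector, limit=limit)


# Hypothetical stand-ins for a real collection and a real embedding call.
calls = []


def fake_near_vector(near_vector, limit):
    calls.append((near_vector, limit))
    return "fake-response"


fake_collection = SimpleNamespace(query=SimpleNamespace(near_vector=fake_near_vector))
fake_embed = lambda text: [0.1, 0.2, 0.3]

result = search_with_own_vectors(
    fake_collection, fake_embed, "what is a vector database?",
    stored_model="text-embedding-ada-002",
    query_model="text-embedding-ada-002",
)
```

The model check is just one way to guard against querying ModelB embeddings with ModelA; it only works if you record the model name somewhere you can read it back.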
Let me know if this helps! If you need any assistance, we are here to help.