text-embedding-3-small and text-embedding-3-large are supposed to be superior to the current text-embedding-ada-002, which is the default for the text2vec-openai vectorizer.
What are the plans to incorporate them, and what will the process be for someone who wishes to move their existing cluster objects to one of the newer models?
Weaviate v1.23.6 has been released with support for OpenAI’s new V3 embeddings models.
To configure the new V3 models, you just need to set the model property to text-embedding-3-small or text-embedding-3-large.
You can also set the dimensions value, because V3 models let you choose the vector dimensionality. Here are the possible values for the dimensions setting:
text-embedding-3-small : [512 1536]
text-embedding-3-large : [256 1024 3072]
By default, text-embedding-3-small produces 1536-dimensional vectors and text-embedding-3-large produces 3072-dimensional vectors.
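The allowed values listed above can be checked before a schema is created. The following is a minimal sketch, not Weaviate's own validation: the helper name `build_openai_module_config` and the two lookup tables are hypothetical, with the values copied from this post.

```python
from typing import Optional

# Allowed and default dimensions, copied from the post above
# (assumption: these values are still current for your Weaviate version).
ALLOWED_DIMENSIONS = {
    "text-embedding-3-small": (512, 1536),
    "text-embedding-3-large": (256, 1024, 3072),
}
DEFAULT_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def build_openai_module_config(model: str, dimensions: Optional[int] = None) -> dict:
    """Return a moduleConfig fragment for text2vec-openai (hypothetical helper)."""
    if model not in ALLOWED_DIMENSIONS:
        raise ValueError(f"unknown V3 model: {model!r}")
    if dimensions is None:
        # Fall back to the model's default dimensionality.
        dimensions = DEFAULT_DIMENSIONS[model]
    if dimensions not in ALLOWED_DIMENSIONS[model]:
        raise ValueError(
            f"{dimensions} is not allowed for {model}; "
            f"choose one of {ALLOWED_DIMENSIONS[model]}"
        )
    return {"text2vec-openai": {"model": model, "dimensions": dimensions}}
```

Calling `build_openai_module_config("text-embedding-3-large", 1024)` returns a fragment that can be placed under `moduleConfig` in a class definition, while an unsupported value raises a `ValueError` before anything is sent to the server.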
I believe text-embedding-3-small is 1536 and text-embedding-3-large is 3072.
If I increase to 3072 dimension vectors, that would certainly double my cost, but here is the key question for me: Would it significantly increase the accuracy of my cosine similarity searches? That is the improvement I’d be looking for.
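On the cost side, a rough back-of-the-envelope estimate may help. The sketch below covers only raw vector storage, assuming each dimension is stored as a 4-byte float32 and ignoring index and object overhead; the function name and the one-million-object figure are illustrative assumptions. It says nothing about search accuracy, which would have to be measured on your own data.

```python
BYTES_PER_FLOAT32 = 4  # assumption: each vector component is stored as float32


def raw_vector_bytes(num_objects: int, dimensions: int) -> int:
    """Raw storage for the vectors alone, excluding index and object overhead."""
    return num_objects * dimensions * BYTES_PER_FLOAT32


if __name__ == "__main__":
    n = 1_000_000  # illustrative object count, not taken from the thread
    for dims in (1024, 1536, 3072):
        gib = raw_vector_bytes(n, dims) / 2**30
        print(f"{dims:>4} dims: {gib:.2f} GiB")
```

Under these assumptions, going from 1536 to 3072 dimensions doubles the raw vector footprint.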
We have recently updated our docs for the new OpenAI models:
Let me know if this helps.
"vectorizer": "text2vec-openai",
"moduleConfig": {
  "text2vec-openai": {
    // "model": "ada",
    // "modelVersion": "002", // Parameter only applicable for `ada` model family and older
    "model": "text-embedding-3-large",
    "dimensions": 3072, // Parameter only applicable for `v3` model family and newer
    "type": "text",
    "baseURL": "https://proxy.yourcompanydomain.com" // Optional. Can be overridden by one set in the HTTP header.
  }
},
So, basically, you create a new class (on a different cluster or even on the same one) with the proper configuration and insert your data into it without providing the vectors.
Note: You will need to change the script so it doesn’t bring your vectors to this new class. Be aware: this will vectorize your objects again.
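The "without providing the vectors" step can be sketched as below. This is an illustration, not the official migration script: it assumes each exported object is a plain dict with a `properties` key and optional `id` and `vector` keys, which may differ from what your export produces, and both function names are hypothetical.

```python
from typing import Iterable, Iterator


def strip_vector(obj: dict) -> dict:
    """Return a copy of an exported object without its stored vector.

    Assumes the hypothetical shape {"id": ..., "properties": {...}, "vector": [...]}.
    """
    cleaned = {"properties": dict(obj["properties"])}
    if "id" in obj:
        cleaned["id"] = obj["id"]  # keep the UUID so the object keeps its identity
    return cleaned


def prepare_for_reimport(objects: Iterable[dict]) -> Iterator[dict]:
    """Yield objects ready for insertion into the new class, vectors removed."""
    for obj in objects:
        yield strip_vector(obj)
```

Because no vector is supplied on insert, the new class's vectorizer embeds each object again with the configured model and dimensions.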
Here is a migration script that will guide you through that:
Again, we don’t have to necessarily migrate. We can simply create a new class in a new (or same) cluster and re-embed (re-upload) our content. Correct?
I am in the process of migrating an existing DB with old embeddings to a brand new DB with text-embedding-3-large embeddings. I used the cursor API to reindex every item as they’re transferred over. The migration process itself was successful and the relevant class config for the new DB is below:
However, I’m trying to query (hybrid query) this new DB (still on Python client v3) and I’m getting “dense search: search index ClassName: shard ShardName: vector search: vector lengths don’t match: 1024 vs 3072”
I’ll note that during migration my new DB was on 1.23.8, and immediately after migration my sanity check queries worked. Then I upgraded to 1.23.9 and my queries returned this error which persists even if I downgrade back to 1.23.8.
First off, thank you for the post. You just reminded me that I needed to enable multiTenancyConfig in my new schema!
I do not know the answer to your question, but I’m not sure if you are aware that the default dimensions for text-embedding-3-large is 3072. You can set it to lower dimensions, but I don’t know how you do that in Weaviate. So, I am also looking for the answer to your question!
Thanks for the comment. I’m aware that the default dimension is 3072. I chose 1024 explicitly as that fits our use case better. I’m wondering if in the course of adding support for these new models that some things were forgotten, e.g. the python SDK itself not being dimension size aware.
Hi @DudaNogueira thanks for attempting to recreate this.
My deployment environment is fly.io using weaviate’s published docker images and I’m unsure how to provide you with reproducible steps. That said, I’ve been able to reproduce the problem 3 separate times now. Let me try to reiterate the steps and perhaps it’ll help you narrow down what the problem might be or what else I can try.
My latest attempt consists of the following steps:
Create a weaviate Fly server instance using semitechnologies/weaviate:1.23.10 docker image. /var/lib/weaviate is mapped to a persistent storage volume.
Create a collection, let’s call it MyCollection. As mentioned before, it uses text2vec-openai vectorizer with model of text-embedding-3-large and dimensions of 1024, exactly as you’ve configured above.
Import some items into MyCollection from an existing Weaviate instance (without the vectors) and verify that the items were re-embedded with a vector size of 1024.
Back up the DB to S3.
=> At this point, I am able to query (in our case hybrid query) the database with no problems at all.
Restart the server. Verify the collection config is as configured.
=> At this point, our hybrid queries fail with the vector length mismatch error message, e.g. Query call with protocol GRPC search failed with message dense search: search index MyCollection: shard MyCollection_shard: vector search: vector lengths don't match: 1024 vs 3072
Delete the collection. Restore it from the S3 backup. Verify that the objects exist.
=> Same vector length mismatch errors when performing a hybrid query.
I’ve managed our weaviate infra on Fly for close to a year now and server restarts or docker image upgrades have never caused issues. Queries were done via client 4.4.4 (GRPC) and direct graphql curl commands.
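One way to narrow this down is to compare the length of the stored vectors with the configured `dimensions` after each step. The sketch below assumes the objects have already been fetched together with their vectors into dicts of the hypothetical shape `{"id": ..., "vector": [...]}`; how you fetch them depends on your client version, and the function name is made up for illustration.

```python
from collections import Counter
from typing import Iterable


def vector_length_report(objects: Iterable[dict], expected: int) -> dict:
    """Count stored vector lengths and list the ids that differ from `expected`."""
    lengths = Counter()
    mismatched = []
    for obj in objects:
        # A missing or empty vector is counted as length 0.
        length = len(obj.get("vector") or [])
        lengths[length] += 1
        if length != expected:
            mismatched.append(obj.get("id"))
    return {"lengths": dict(lengths), "mismatched_ids": mismatched}
```

If every stored vector turns out to have length 1024 and the query still fails, that would point toward the query-time vector rather than the stored objects, though that is an inference and not a confirmed diagnosis.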
import weaviate.classes as wvc

def setup_products_collection(self):
    if not self.client.collections.exists("Products"):
        self.client.collections.create(
            name="Products",
            vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
                model="text-embedding-3-large"
            ),  # All data inserted into this collection is vectorized with text-embedding-3-large
            generative_config=wvc.config.Configure.Generative.openai(),  # Docs mention this module is used for generative queries
            properties=[
                wvc.config.Property(
                    name="title",
                    data_type=wvc.config.DataType.TEXT,
                    vectorize_property_name=False,  # Don't vectorize the word 'title' itself
                    tokenization=wvc.config.Tokenization.WORD  # Keep only alpha-numeric characters, lowercase them, and split by whitespace. See https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization
                ),
                wvc.config.Property(
                    name="description",
                    data_type=wvc.config.DataType.TEXT,
                    vectorize_property_name=False,
                    skip_vectorization=False,  # False means this property's value IS vectorized
                    tokenization=wvc.config.Tokenization.WORD
                ),
                wvc.config.Property(
                    name="ingredients",
                    data_type=wvc.config.DataType.TEXT,
                    vectorize_property_name=True,  # Include the property name "ingredients" in the text to vectorize
                    tokenization=wvc.config.Tokenization.WORD
                ),
                wvc.config.Property(
                    name="nutrition_info",
                    data_type=wvc.config.DataType.TEXT,
                    vectorize_property_name=True,
                    tokenization=wvc.config.Tokenization.WORD
                ),
            ],
        )
        print("Products collection created.")
        # Export the config, set the dimensions, and recreate the collection from the dict
        config = self.client.collections.export_config("Products").to_dict()
        self.client.collections.delete("Products")
        # Must be an allowed value for text-embedding-3-large: 256, 1024, or 3072
        config['moduleConfig']["text2vec-openai"]["dimensions"] = 3072
        config['class'] = "Products"
        self.client.collections.create_from_dict(config)
        # Return a handle to the recreated collection
        return self.client.collections.get("Products")
    else:
        print("Products collection already exists.")
        return None
I could not find any documentation on the .export_config function. It could be helpful for schema backups, restores, migrations, etc.
Also, how exactly can one use the create schema from json? We would really need an automatic scripted setup method as we get ready for production.
client.schema.create('./schema/my_schema.json')
Can you recommend some resources on scripted setup methods please?
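A scripted setup along these lines could look like the sketch below. It assumes a schema file shaped like `{"classes": [...]}` and a weaviate-client v3 `weaviate.Client`, using its `schema.exists()` and `schema.create_class()` methods; the helper names `load_class_definitions` and `ensure_classes` are hypothetical. Treat it as a starting point under those assumptions rather than a recommended production setup.

```python
import json
from pathlib import Path
from typing import List


def load_class_definitions(schema_path: str) -> List[dict]:
    """Read class definitions from a JSON file shaped like {"classes": [...]}."""
    data = json.loads(Path(schema_path).read_text(encoding="utf-8"))
    return data["classes"]


def ensure_classes(client, schema_path: str) -> List[str]:
    """Create every class from the file that the server does not have yet.

    `client` is assumed to be a weaviate-client v3 `weaviate.Client`; only
    `client.schema.exists()` and `client.schema.create_class()` are used.
    Returns the names of the classes that were created.
    """
    created = []
    for class_def in load_class_definitions(schema_path):
        name = class_def["class"]
        if client.schema.exists(name):
            continue  # already present: leave it untouched
        client.schema.create_class(class_def)
        created.append(name)
    return created
```

Run at deploy time, for example as `ensure_classes(client, "./schema/my_schema.json")`, this can be repeated safely because existing classes are skipped. It does not update a class whose definition has changed in the file.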