New OpenAI Embedding Models

Today OpenAI announced two new embedding models: New embedding models and API updates

text-embedding-3-small and text-embedding-3-large are supposed to be superior to the current text-embedding-ada-002, which is the default for the text2vec-openai vectorizer.

What are the plans to incorporate them, and what will the process be for anyone who wants to change their existing cluster objects to one of the newer models?

We are working on adding support for them right now :) so they should be available later today.

Weaviate v1.23.6 has been released with support for OpenAI’s new V3 embeddings models.

To configure the new V3 models, you just need to set the model property to text-embedding-3-small or text-embedding-3-large.

You can additionally set a dimensions value, because with V3 models you can choose the vector dimensionality. Here are the possible values for the dimensions setting:

  • text-embedding-3-small : [512 1536]
  • text-embedding-3-large : [256 1024 3072]

By default, the text-embedding-3-small model produces 1536-dimension vectors and text-embedding-3-large produces 3072-dimension vectors.
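As a quick sanity check before creating a class, the values listed above can be encoded in a small lookup. This is only a sketch: `ALLOWED_DIMENSIONS` and `check_dimensions` are hypothetical names, and the values are copied from this post, not read from Weaviate itself.

```python
# Allowed `dimensions` values per model, copied from the list in this post.
ALLOWED_DIMENSIONS = {
    "text-embedding-3-small": (512, 1536),
    "text-embedding-3-large": (256, 1024, 3072),
}


def check_dimensions(model: str, dimensions: int) -> int:
    """Return `dimensions` if it is listed for `model`, otherwise raise ValueError."""
    allowed = ALLOWED_DIMENSIONS.get(model)
    if allowed is None:
        raise ValueError(f"unknown model: {model!r}")
    if dimensions not in allowed:
        raise ValueError(f"{model} supports {allowed}, got {dimensions}")
    return dimensions
```

For example, `check_dimensions("text-embedding-3-large", 1024)` passes, while asking for 3072 on text-embedding-3-small raises a `ValueError`.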

I believe text-embedding-3-small is 1536 and text-embedding-3-large is 3072.

If I increase to 3072 dimension vectors, that would certainly double my cost, but here is the key question for me: Would it significantly increase the accuracy of my cosine similarity searches? That is the improvement I’d be looking for.

Yeah, that’s also my question and I don’t have the answer to it, because I haven’t run such tests. You would have to try it for yourself.

As for migration, I need to know how to do this using the API with cURL.

As for the change to my schema, according to this: text2vec-openai | Weaviate - Vector Database

Adding the new models to my schema would look like this:

$schema = [
    "class" => "Solr",
    "description" => "A class representing a Solr index",
    "vectorizer" => "text2vec-openai",
    "moduleConfig" => [
        "text2vec-openai" => [
            "vectorizeClassName" => true,
            // "model" belongs inside the text2vec-openai block
            "model" => "text-embedding-3-large"
        ],
        "generative-openai" => [
            "model" => "gpt-4-turbo-preview"
        ]
    ],
    "properties" => [

But, I don’t see any documentation on how to set the vector dimensions to 3072.

Hi!

We have recently updated our docs for the new OpenAI models:

Let me know if this helps.

"vectorizer": "text2vec-openai",
      "moduleConfig": {
        "text2vec-openai": {
          // "model": "ada",
          // "modelVersion": "002",  // Parameter only applicable for `ada` model family and older
          "model": "text-embedding-3-large",
          "dimensions": 3072,  // Parameter only applicable for `v3` model family and newer
          "type": "text",
          "baseURL": "https://proxy.yourcompanydomain.com"  // Optional. Can be overridden by one set in the HTTP header.
        }
      },
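For anyone scripting this, here is a minimal sketch that builds the same settings as a Python dict and prints the JSON body. The helper name `openai_v3_class` is made up for illustration, and the class name "Solr" is borrowed from the earlier post. The optional baseURL is left out.

```python
import json


def openai_v3_class(name: str, model: str, dimensions: int) -> dict:
    """Build a class definition carrying the text2vec-openai settings shown above."""
    return {
        "class": name,
        "vectorizer": "text2vec-openai",
        "moduleConfig": {
            "text2vec-openai": {
                "model": model,
                "dimensions": dimensions,  # only applies to the v3 model family
                "type": "text",
            }
        },
    }


definition = openai_v3_class("Solr", "text-embedding-3-large", 3072)
print(json.dumps(definition, indent=2))
```

The printed JSON is the kind of body you would send when creating a new class. It cannot be applied to an existing class.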

Trying to figure out how to make this update on existing Class schemas (in Cloud)…

I’ve tried POSTs and PUTs to /schema/{Class Name} and /schema/{Class Name}/properties.
Also tried passing entire collection schema.

PUTs throw error “moduleconfig is immutable”. POSTs throw error “POST is not allowed” and “property must contain name”.

Can you clarify exactly how to change ada to 3-small/large on existing collection please?

Hi! You will need to reindex your data.

So, basically: create a new class (on a different cluster or even on the same one) with the proper configuration, and insert your data into it without providing the vectors.

Note: You will need to change the script so it doesn’t bring your vectors to this new class. Be aware: this will vectorize your objects again.

Here we have a migration script that will guide you through that:

Let me know if that helps :)

Again, we don’t have to necessarily migrate. We can simply create a new class in a new (or same) cluster and re-embed (re-upload) our content. Correct?

That’s right.

If you have your data elsewhere and can upload it to the new collection, it should work fine.
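The "without providing the vectors" part can be sketched like this. It assumes the exported objects are plain dicts with a top-level "vector" key, which may not match your export format. `without_vectors` is a hypothetical helper, not part of any Weaviate client.

```python
def without_vectors(objects):
    """Yield shallow copies of exported objects with any stored "vector" removed.

    Dropping the vector means the target class has to vectorize each
    object again with its own configured model.
    """
    for obj in objects:
        yield {key: value for key, value in obj.items() if key != "vector"}


exported = [
    {"id": "a", "properties": {"title": "first"}, "vector": [0.1, 0.2, 0.3]},
    {"id": "b", "properties": {"title": "second"}},
]
for obj in without_vectors(exported):
    print(obj)
```

The input objects are not modified, so the old vectors remain available if you need to compare results later.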

I am in the process of migrating an existing DB with old embeddings to a brand new DB with text-embedding-3-large embeddings. I used the cursor API to reindex every item as they’re transferred over. The migration process itself was successful and the relevant class config for the new DB is below:

            'class': 'ClassName',
            'moduleConfig': {
                'text2vec-openai': {
                    'baseURL': 'https://api.openai.com',
                    'dimensions': 1024,
                    'model': 'text-embedding-3-large',
                    'type': 'text',
                    'vectorizeClassName': False
                }
            },
            'multiTenancyConfig': {
                'enabled': True
            },

However, I’m trying to query (hybrid query) this new DB (still on Python client v3) and I’m getting “dense search: search index ClassName: shard ShardName: vector search: vector lengths don’t match: 1024 vs 3072”
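One way to narrow down an error like this is to fetch a sample of objects with their vectors and compare the stored lengths against the configured dimensions. The sketch below assumes each fetched object is a dict with "id" and "vector" keys. `mismatched_vectors` is a hypothetical helper.

```python
def mismatched_vectors(objects, expected_dimensions):
    """Return (id, length) pairs for stored vectors whose length is not the expected one."""
    mismatches = []
    for obj in objects:
        vector = obj.get("vector")
        # Objects fetched without a vector are skipped rather than reported.
        if vector is not None and len(vector) != expected_dimensions:
            mismatches.append((obj["id"], len(vector)))
    return mismatches


sample = [
    {"id": "a", "vector": [0.0] * 1024},
    {"id": "b", "vector": [0.0] * 3072},
]
print(mismatched_vectors(sample, 1024))
```

If every stored vector has the expected length, the mismatch is more likely on the query side, i.e. the vector generated for the search text.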

I’ll note that during migration my new DB was on 1.23.8, and immediately after migration my sanity check queries worked. Then I upgraded to 1.23.9 and my queries returned this error which persists even if I downgrade back to 1.23.8.

Can anyone help?

First off, thank you for the post. You just reminded me that I needed to enable multiTenancyConfig in my new schema!

I do not know the answer to your question, but I’m not sure if you are aware that the default dimensions for text-embedding-3-large is 3072. You can set it to lower dimensions, but I don’t know how you do that in Weaviate. So, I am also looking for the answer to your question!

Thanks for the comment. I’m aware that the default dimension is 3072. I chose 1024 explicitly as that fits our use case better. I’m wondering if in the course of adding support for these new models that some things were forgotten, e.g. the python SDK itself not being dimension size aware.

Have you tried bypassing the sdk and sending your query directly to the API using curl? Queries in detail | Weaviate - Vector Database

Or, through WCS? Just to see if you get the same errors?

Great suggestion. I was able to replicate the exact same error message through graphql curl directly.

curl -X POST -H 'Content-Type: application/json' -d '{"query": "{ Get { ClassName ( tenant: \"MyTenant\", limit: 5, nearText: { concepts: [\"text\"] } ) { attribute1 attribute2 } } }"}' http://server/v1/graphql
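For repeated checks, the same request body can be built in Python so the quoting does not have to be escaped by hand. This is a sketch only: `near_text_payload` is a hypothetical helper and the class, tenant, and attribute names are the placeholders from the curl command.

```python
import json


def near_text_payload(class_name, tenant, concepts, properties, limit=5):
    """Return the JSON body for a GraphQL Get query with a nearText filter."""
    query = "{ Get { %s ( tenant: %s, limit: %d, nearText: { concepts: %s } ) { %s } } }" % (
        class_name,
        json.dumps(tenant),    # quoted string literal
        limit,
        json.dumps(concepts),  # a JSON list of strings is also a valid GraphQL list
        " ".join(properties),
    )
    return json.dumps({"query": query})


print(near_text_payload("ClassName", "MyTenant", ["text"], ["attribute1", "attribute2"]))
```

The printed string can be passed to curl with -d, or sent with any HTTP client to the /v1/graphql endpoint.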

Seems to suggest a server-side query issue…

Yes, I absolutely agree. Either there is some other config for lowering the dimensions on text-embedding-3-large, or there is a bug in moduleConfig.

Needs to be addressed ASAP!!

@DudaNogueira

Hi there!

I was not able to reproduce this =\

Unfortunately the dimensions parameter is not yet supported for this vectorizer in the client, but it works on the server. I created a class with this:

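import weaviate
from weaviate import classes as wvc

# Assumes a local Weaviate instance; adjust the connection to your setup.
client = weaviate.connect_to_local()

# Create a collection with the default settings for text-embedding-3-large.
client.collections.delete("OpenAiLarge")
col = client.collections.create(
    name="OpenAiLarge",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
        model="text-embedding-3-large"
    )
)

# Export its config, set `dimensions`, and create a second collection from the dict.
config = client.collections.export_config("OpenAiLarge").to_dict()
config['moduleConfig']["text2vec-openai"]["dimensions"] = 1024
config['class'] = "OpenAiLarge1024"
client.collections.create_from_dict(config)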

So basically, create the collection from a dict that does specify the dimensions.

Now all objects added to OpenAiLarge1024 have 1024 dimensions, and nearText is working as expected.

Can you create a reproducible notebook?

Edit: this is using the latest 1.23.9 and client 4.4.4.

Hi @DudaNogueira thanks for attempting to recreate this.

My deployment environment is fly.io using weaviate’s published docker images and I’m unsure how to provide you with reproducible steps. That said, I’ve been able to reproduce the problem 3 separate times now. Let me try to reiterate the steps and perhaps it’ll help you narrow down what the problem might be or what else I can try.

My latest attempt consists of the following steps:

  1. Create a weaviate Fly server instance using semitechnologies/weaviate:1.23.10 docker image. /var/lib/weaviate is mapped to a persistent storage volume.
  2. Create a collection, let’s call it MyCollection. As mentioned before, it uses text2vec-openai vectorizer with model of text-embedding-3-large and dimensions of 1024, exactly as you’ve configured above.
  3. Import some items into MyCollection from an existing Weaviate instance (without the vectors) and verified that the items were re-embedded with vector size being 1024.
  4. Backed up the db to S3

=> At this point, I am able to query (in our case hybrid query) the database with no problems at all.

  5. Restart the server. Verify the collection config is as configured.

=> At this point, our hybrid queries fail with the vector length mismatch error message, e.g. Query call with protocol GRPC search failed with message dense search: search index MyCollection: shard MyCollection_shard: vector search: vector lengths don't match: 1024 vs 3072

  6. Deleted the collection. Restored it from S3 backup. Verified objects exist.

=> Same vector length mismatch errors when performing a hybrid query.

I’ve managed our weaviate infra on Fly for close to a year now and server restarts or docker image upgrades have never caused issues. Queries were done via client 4.4.4 (GRPC) and direct graphql curl commands.

Let me know your thoughts, thanks!

Worked for me, but this seems a bit too verbose:

    def setup_products_collection(self):
        if not self.client.collections.exists("Products"):
            # Create the collection with the client helpers first; `dimensions`
            # is patched into the exported config further down.
            self.client.collections.create(
                name="Products",
                vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(
                    model="text-embedding-3-large"
                ),  # All data inserted into this collection is vectorized with text-embedding-3-large
                generative_config=wvc.config.Configure.Generative.openai(),  # Configures the generative-openai module for generative queries
                properties=[
                    wvc.config.Property(
                        name="title",
                        data_type=wvc.config.DataType.TEXT,
                        vectorize_property_name=False,  # Don't vectorize the word 'title' itself
                        tokenization=wvc.config.Tokenization.WORD  # Keep only alpha-numeric characters, lowercase them, and split by whitespace. check https://weaviate.io/developers/weaviate/config-refs/schema#property-tokenization
                    ),
                    wvc.config.Property(
                        name="description",
                        data_type=wvc.config.DataType.TEXT,
                        vectorize_property_name=False,
                        skip_vectorization=False,  # False means this property's value IS vectorized
                        tokenization=wvc.config.Tokenization.WORD
                    ),
                    wvc.config.Property(
                        name="ingredients",
                        data_type=wvc.config.DataType.TEXT,
                        vectorize_property_name=True,  # Include the property name in the text that gets vectorized
                        tokenization=wvc.config.Tokenization.WORD
                    ),
                    wvc.config.Property(
                        name="nutrition_info",
                        data_type=wvc.config.DataType.TEXT,
                        vectorize_property_name=True,
                        tokenization=wvc.config.Tokenization.WORD
                    ),
                ],
            )
            print("Products collection created.")
            # Export the config, drop the collection, and re-create it with `dimensions` set.
            config = self.client.collections.export_config("Products").to_dict()
            self.client.collections.delete("Products")
            config['moduleConfig']["text2vec-openai"]["dimensions"] = 3072  # was 3047; 3072 is the listed value for text-embedding-3-large
            config['class'] = "Products"
            # Return the re-created collection, not the handle of the deleted one.
            return self.client.collections.create_from_dict(config)
        else:
            print("Products collection already exists.")
            return None

I could not find any documentation on the .export_config function. This could be helpful for schema backups, restores, migrations, etc.

Also, how exactly can one use the create schema from json? We would really need an automatic scripted setup method as we get ready for production.

client.schema.create('./schema/my_schema.json')
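One possible shape for a scripted setup is sketched below. It assumes the schema file looks like {"classes": [...]}; `load_schema` and `classes_to_create` are hypothetical helpers, and how you list the existing class names depends on your client version.

```python
import json
from pathlib import Path


def load_schema(path):
    """Read a schema file of the form {"classes": [...]} and return the class list."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    return data["classes"]


def classes_to_create(schema_classes, existing_names):
    """Return the class definitions whose names are not already present."""
    existing = set(existing_names)
    return [definition for definition in schema_classes if definition["class"] not in existing]
```

Each remaining definition could then be handed to the client, for example `client.collections.create_from_dict(definition)` in the v4 client. Filtering first makes the script safe to re-run.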

Can you recommend some resources on scripted setup methods please?

Many thanks