Benchmarking two vectorizers - best pattern?

I need to compare the precision and recall of two different vectorizers on a test corpus of 150,000 titles and kickers (in Italian).

One of them is OpenAI's “text-embedding-ada-002”. So far I have declared a schema for it with vectorizer = none, though I could restart Weaviate with the OpenAI module enabled and let the vectorization happen in the DB instead of in my application.

The second vectorizer is SentenceTransformer(“nickprock/sentence-bert-base-italian-uncased”), whose vectors I compute in my application (I still don’t understand whether I can configure this external model as a module).

Until now I have filled the WpAdmin schema with the data and the OpenAI vectors, taken measurements, then wiped Weaviate clean, restarted, and repeated the process with the other vectorizer.
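For reference, a minimal sketch of how such per-query measurements could be computed, independent of which vectorizer produced the results. The function name `precision_recall_at_k` and the example IDs are hypothetical, and it assumes each query comes with a known set of relevant IDs and that the retrieved list contains no duplicates.

```python
from typing import Iterable, Sequence, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Iterable[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single query.

    retrieved: result IDs in ranked order, assumed to be unique.
    relevant:  IDs judged relevant for the query (ground truth).
    """
    top_k = list(retrieved[:k])
    relevant_set = set(relevant)
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    # Guard against empty result lists and empty ground truth.
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical IDs, only to show the call shape.
    p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=4)
    print(f"precision@4={p:.2f} recall@4={r:.2f}")
```

Averaging these values over the same set of queries for each vectorizer gives numbers that can be compared directly.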

From what I have read in this forum, would it be a better idea to declare two distinct schemas?

This is my current definition of the schema I am custom vectorizing:


import weaviate


def weaviate_create_wp_schema(weclient: weaviate.client.Client, classname: str) -> None:
    """define the data schema (class) as a JSON object to be stored in Weaviate"""
    schema = {
        "classes": [
            {
                "class": classname,  # first letter capitalized regardless of how it is here
                "description": "Contains the WP article text data along with embeddings",
                "vectorizer": "none",  # we will provide the embedding, not let Weaviate do it
                "properties": [
                    {
                        "name": "slug",
                        "dataType": ["text"],
                        "vectorize": False,  # This field will NOT be vectorized
                    },
                    {
                        "name": "kicker",
                        "dataType": ["text"],
                    },
                ],
            }
        ]
    }
    if not check_class_exists(weclient, classname):
        weclient.schema.create(schema)

Thanks for any suggestion or example.

Hi!

Yes! For now you will need to define two schemas.
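A minimal sketch of what that could look like, assuming both sets of embeddings are computed in the application (so both classes use vectorizer "none"). The class names and the helper names `build_class_definition` and `create_benchmark_classes` are hypothetical; `schema.get()` and `schema.create_class()` are the calls from the v3 Python client, which the question's code appears to use.

```python
from typing import Any, Dict, List, Sequence

# Hypothetical class names, one per embedding model under test.
CLASS_NAMES = ("WpArticleAda002", "WpArticleItalianSbert")


def build_class_definition(classname: str) -> Dict[str, Any]:
    """Return one class definition that expects externally supplied vectors."""
    return {
        "class": classname,
        "description": "Contains the WP article text data along with embeddings",
        "vectorizer": "none",  # embeddings are provided by the application
        "properties": [
            {"name": "slug", "dataType": ["text"]},
            {"name": "kicker", "dataType": ["text"]},
        ],
    }


def create_benchmark_classes(
    weclient: Any, classnames: Sequence[str] = CLASS_NAMES
) -> List[str]:
    """Create each class that does not exist yet and return the names created.

    weclient is assumed to be a weaviate.Client (v3 Python client).
    """
    existing = {c["class"] for c in weclient.schema.get().get("classes", [])}
    created = []
    for name in classnames:
        if name not in existing:
            weclient.schema.create_class(build_class_definition(name))
            created.append(name)
    return created
```

With both classes in place, the same objects can be imported into each class with the matching model's vectors, and the same queries run against both without wiping the instance in between.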

AFAIK, version 1.24 (per the current roadmap) should let you have multiple vectors per object. That would make it easy to compare different models using the very same object.

Regarding the use of your custom model as a vectorizer module, one thing you could do is adapt this container to use your own model:

Let me know if this helps!


Thanks, I will try it as soon as I can.