Benchmarking two vectorizers - best pattern?

rjalex · January 12, 2024, 12:18pm

I need to compare the precision and recall of two different vectorizers on a test corpus of 150,000 titles and kickers (in Italian).

One of them is OpenAI's “text-embedding-ada-002”. So far I have declared a schema for it with vectorizer = none, though I could restart Weaviate with the OpenAI module enabled and let the vectorization happen in the DB instead of in my application.

The second vectorizer is SentenceTransformer(“nickprock/sentence-bert-base-italian-uncased”), whose vectors I compute and add in my application (I still don't understand whether I can configure this external model as a module).
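For readers following along, supplying vectors from the application could look roughly like the sketch below. It assumes the v3 Python client, whose batch object exposes `add_data_object(data_object, class_name, vector=...)`. The helper name `add_with_vectors`, the class name `WpAdminSbert`, and the choice to embed only the `kicker` field are hypothetical, not taken from my actual code.

```python
from typing import Any, Callable, Dict, Iterable


def add_with_vectors(
    batch: Any,
    classname: str,
    rows: Iterable[Dict[str, str]],
    encode: Callable[[str], Any],
) -> int:
    """Embed each row in the application and queue it with its vector.

    `batch` is expected to behave like the v3 client's `client.batch`;
    `encode` is any function mapping a string to a vector
    (for example `SentenceTransformer(...).encode`).
    """
    count = 0
    for row in rows:
        # Assumption: only the kicker text is embedded.
        vector = encode(row["kicker"])
        batch.add_data_object(data_object=row, class_name=classname, vector=vector)
        count += 1
    return count


# Hypothetical usage (needs a running Weaviate and the model downloaded):
# model = SentenceTransformer("nickprock/sentence-bert-base-italian-uncased")
# with weclient.batch as batch:
#     add_with_vectors(batch, "WpAdminSbert", rows, model.encode)
```

Passing `encode` as a parameter means the same loop can be reused for the OpenAI embeddings by swapping in a different function.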

Until now I have filled the WpAdmin schema with data and the OpenAI vectors, taken measurements, then wiped Weaviate clean, restarted, and repeated with the other vectorizer.
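To make "measurements" concrete: per query, precision and recall over the top k results could be computed along these lines. This is only a sketch; the function name `precision_recall_at_k` and the idea of identifying documents by string IDs (slugs, say) are assumptions for illustration, and it presumes a set of known relevant documents exists for each query.

```python
from typing import Iterable, Sequence, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Iterable[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single query.

    retrieved: document IDs in ranked order, best match first.
    relevant:  IDs judged relevant for this query.
    """
    relevant_set = set(relevant)
    top_k = list(retrieved[:k])
    hits = sum(1 for doc_id in top_k if doc_id in relevant_set)
    # Guard against empty inputs instead of dividing by zero.
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

Averaging these pairs over the same query set for each vectorizer gives numbers that can be compared directly.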

From what I have read in this forum, maybe a better idea would be to declare two distinct schemas?

This is my current definition of the schema for which I supply my own vectors:


import weaviate


def weaviate_create_wp_schema(weclient: weaviate.client.Client, classname: str) -> None:
    """define the data schema (class) as a JSON object to be stored in Weaviate"""
    schema = {
        "classes": [
            {
                "class": classname,  # first letter capitalized regardless of how it is here
                "description": "Contains the WP article text data along with embeddings",
                "vectorizer": "none",  # we will provide the embedding, not let Weaviate do it
                "properties": [
                    {
                        "name": "slug",
                        "dataType": ["text"],
                        "vectorize": False,  # This field will NOT be vectorized
                    },
                    {
                        "name": "kicker",
                        "dataType": ["text"],
                    },
                ],
            }
        ]
    }
    # check_class_exists is a helper defined elsewhere in the application (not shown)
    if not check_class_exists(weclient, classname):
        weclient.schema.create(schema)

Thanks for any suggestion or example.

DudaNogueira · January 15, 2024, 6:42pm

Hi!

Yes! For now you will need to define two schemas.

Hopefully, AFAIK, in 1.24 (current roadmap) you will be able to have multiple vectors per object. This will make it easy to compare different models using the very same object.
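In the meantime, a minimal sketch of the two-class approach, in the same v3 Python client style as the snippet above. The helper names and the class names `WpAdminOpenai` and `WpAdminSbert` are made up for illustration. Both classes use `"vectorizer": "none"` here, on the assumption that the application supplies the vectors for both models.

```python
from typing import Any, Dict, List


def build_class(classname: str, vectorizer: str = "none") -> Dict[str, Any]:
    """Build one class definition; the properties mirror the original schema."""
    return {
        "class": classname,
        "description": "Contains the WP article text data along with embeddings",
        "vectorizer": vectorizer,
        "properties": [
            {"name": "slug", "dataType": ["text"]},
            {"name": "kicker", "dataType": ["text"]},
        ],
    }


def build_benchmark_schema() -> Dict[str, List[Dict[str, Any]]]:
    """One class per embedding model, so both can live side by side."""
    return {
        "classes": [
            build_class("WpAdminOpenai"),  # hypothetical name, ada-002 vectors
            build_class("WpAdminSbert"),   # hypothetical name, SBERT vectors
        ]
    }


# Hypothetical usage against a running instance:
# weclient.schema.create(build_benchmark_schema())
```

With both classes populated from the same corpus, the same queries can be run against each one without wiping and restarting Weaviate between runs.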

Regarding the use of your custom model as a vectorizer module, one thing you could do is adapt this container to use your own model:

Let me know if this helps.

rjalex · January 16, 2024, 12:33pm

Thanks, will try as soon as I can.
