I need to compare the precision and recall of two different vectorizers on a test corpus of 150,000 titles and kickers (in Italian).
One of them is OpenAI’s “text-embedding-ada-002”. So far I have declared a schema for it with vectorizer = none, though I guess I could restart Weaviate with the OpenAI module enabled and let the vectorization happen in the DB instead of in my application.
The second vectorizer is SentenceTransformer(“nickprock/sentence-bert-base-italian-uncased”), whose vectors I add from my application (I still don’t understand whether I can configure this external model as a module).
Until now I have filled the WpAdmin schema with the data and the OpenAI vectors, made my measurements, then wiped Weaviate clean, restarted, and repeated the process with the other vectorizer.
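For reference, the per-query measurement can be sketched roughly as follows. This is only an illustration: the function name `precision_recall_at_k`, the use of slugs as document IDs, and the assumption that each query has a known set of relevant slugs (with no duplicates in the result list) are hypothetical and not taken from my actual code.

```python
from typing import Sequence, Set, Tuple


def precision_recall_at_k(
    retrieved: Sequence[str], relevant: Set[str], k: int
) -> Tuple[float, float]:
    """Return (precision@k, recall@k) for a single query.

    retrieved: ranked IDs returned by the vector search (best first, no duplicates)
    relevant:  IDs judged relevant for this query
    """
    top_k = list(retrieved[:k])
    # Count how many of the top-k results are in the relevant set
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Toy example: 2 of the top 3 results are relevant, out of 3 relevant documents
p, r = precision_recall_at_k(["a", "b", "c", "d"], {"a", "c", "e"}, k=3)
print(f"precision@3={p:.3f} recall@3={r:.3f}")
```

Averaging these two numbers over all test queries, once per vectorizer, gives the figures I want to compare.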
From what I have read in this forum, maybe a better idea would be to declare two distinct schemas, one per vectorizer?
This is my current definition of the schema for which I supply my own vectors:
```python
import weaviate


def weaviate_create_wp_schema(weclient: weaviate.client.Client, classname: str) -> None:
    """define the data schema (class) as a JSON object to be stored in Weaviate"""
    schema = {
        "classes": [
            {
                "class": classname,  # first letter capitalized regardless of how it is here
                "description": "Contains the WP article text data along with embeddings",
                "vectorizer": "none",  # we will provide the embedding, not let Weaviate do it
                "properties": [
                    {
                        "name": "slug",
                        "dataType": ["text"],
                        "vectorize": False,  # This field will NOT be vectorized
                    },
                    {
                        "name": "kicker",
                        "dataType": ["text"],
                    },
                ],
            }
        ]
    }
    # check_class_exists is my own helper (not shown here)
    if not check_class_exists(weclient, classname):
        weclient.schema.create(schema)
```
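The two-class variant I have in mind would look roughly like the sketch below: the same properties declared twice, under two class names, so that each class only ever receives vectors from one model. The class names `WpAdminOpenAI` and `WpAdminSbert` and the helper names are hypothetical, and I am assuming (not yet verified) that the two classes can sit side by side with vectors of different dimensions.

```python
from typing import Any, Dict, List


def build_class_definition(classname: str, description: str) -> Dict[str, Any]:
    """Build one class definition; vectors are supplied by the application."""
    return {
        "class": classname,
        "description": description,
        "vectorizer": "none",  # embeddings come from my application, not from Weaviate
        "properties": [
            {"name": "slug", "dataType": ["text"]},
            {"name": "kicker", "dataType": ["text"]},
        ],
    }


def build_two_class_schema() -> Dict[str, List[Dict[str, Any]]]:
    """One class per embedding model (class names are hypothetical)."""
    return {
        "classes": [
            build_class_definition(
                "WpAdminOpenAI", "WP articles embedded with text-embedding-ada-002"
            ),
            build_class_definition(
                "WpAdminSbert",
                "WP articles embedded with nickprock/sentence-bert-base-italian-uncased",
            ),
        ]
    }


schema = build_two_class_schema()
print([c["class"] for c in schema["classes"]])
```

The resulting dictionary would be passed to `weclient.schema.create(schema)` as in my function above, and each object would then be imported twice, once per class, with the vector from the matching model. That way both sets of measurements could run against the same Weaviate instance without wiping it in between.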
Thanks for any suggestion or example.