Searching two different classes with the same objects but different vectorizers

Hi, as the title says, I am comparing the OpenAI Ada embedding model and SBERT’s MiniLM model to see which one gives better search results and scores. I have created two classes with identical objects but different vectorizers: one with `text2vec-openai` and one with no vectorizer (I add the MiniLM vectors myself during object upload).

My assumption was that when I run a hybrid search for the same query against these two classes, I should get similar but not identical scores on the returned objects, given that different vectorizers are used. However, that is not the case: for a hybrid search on the same query against both classes, I get exactly the same scores. Is this intended behaviour, or am I doing something wrong?
My end goal is to evaluate which embedding model performs better.

Here are the schemas and the import code:

class_obj_openai = {
    'class': 'openai',
    'properties': [
        {
            'name': 'title',
            'dataType': ['text']
        },
        {
            'name': 'source',
            'dataType': ['text']
        },
        {
            'name': 'content',
            'dataType': ['text']
        },
    ],
    'vectorizer': 'text2vec-openai',
    'moduleConfig': {
        'text2vec-openai': {
            'vectorizeClassName': False,
            'model': 'ada',
            'modelVersion': '002',
            'type': 'text'
        }
    }
}


class_obj_minilm = {
    'class': 'minilm',
    'properties': [
        {
            'name': 'title',
            'dataType': ['text']
        },
        {
            'name': 'source',
            'dataType': ['text']
        },
        {
            'name': 'content',
            'dataType': ['text']
        },
    ],
    'vectorizer': 'none'
}

client.schema.create_class(class_obj_minilm)
client.schema.create_class(class_obj_openai)

# Import data into MiniLM class

with client.batch(batch_size=100, num_workers=10) as batch:
    # Batch import all entries into the MiniLM class,
    # passing the precomputed MiniLM vector with each object
    for i, d in enumerate(entries):
        print(f"importing entry: {i+1}")

        properties = {
            "title": d["metadata"]["title"],
            "source": d["metadata"]["source"],
            "content": d["page_content"],
        }

        batch.add_data_object(properties, "minilm", vector=d["vector"])

# Import data into OpenAI class

with client.batch(batch_size=100, num_workers=1) as batch:
    # Batch import all entries into the OpenAI class;
    # no vector is passed, so text2vec-openai vectorizes each object
    for i, d in enumerate(entries):
        print(f"importing entry: {i+1}")

        properties = {
            "title": d["metadata"]["title"],
            "source": d["metadata"]["source"],
            "content": d["page_content"],
        }

        batch.add_data_object(properties, "openai")
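
For reference, this is roughly how I run the hybrid search against both classes (the query text, alpha, and limit here are just placeholders; Weaviate capitalizes the class names on creation, hence "Openai" / "Minilm"). For the MiniLM class I pass the query vector myself, since that class has no vectorizer:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query = "some example query"

# OpenAI class: text2vec-openai vectorizes the query for me
openai_res = (
    client.query
    .get("Openai", ["title", "source", "content"])
    .with_hybrid(query=query, alpha=0.5)
    .with_additional(["score"])
    .with_limit(5)
    .do()
)

# MiniLM class: no vectorizer, so the query vector is supplied explicitly
minilm_res = (
    client.query
    .get("Minilm", ["title", "source", "content"])
    .with_hybrid(query=query, alpha=0.5, vector=model.encode(query).tolist())
    .with_additional(["score"])
    .with_limit(5)
    .do()
)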

Hi @alt-glitch - hybrid scores are currently based on the ranking of the hits, rather than on their raw scores.

So I think it makes sense that objects end up with the same score even though different vectorizers are used.

With 1.20, we are adding a new hybrid fusion algorithm that is based on the raw scores. (Keep an eye out for our release blog in the next day or two :wink: )
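
To illustrate why the scores can match exactly, here is a toy rank-based fusion (just the general idea, not our exact implementation): if the keyword and vector searches return the same ordering in both classes, the fused scores come out identical no matter what the raw vector distances were.

# Toy rank-based fusion, purely for illustration (not Weaviate's exact formula):
# only the positions of the hits matter, so two result lists with the same
# ordering always produce identical fused scores.
def ranked_fusion(bm25_ids, vector_ids, alpha=0.5, k=60):
    scores = {}
    for rank, obj_id in enumerate(bm25_ids):
        scores[obj_id] = scores.get(obj_id, 0.0) + (1 - alpha) / (k + rank)
    for rank, obj_id in enumerate(vector_ids):
        scores[obj_id] = scores.get(obj_id, 0.0) + alpha / (k + rank)
    return scores

# Same orderings from two different embedding models -> identical fused scores
print(ranked_fusion(["a", "b", "c"], ["b", "a", "c"]))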

Just a word of note though - similarity scores aren’t in themselves indicative of model performance. For that I would look into benchmarks, where you measure things like recall for a given dataset.

Take a look at this, for instance: MTEB: Massive Text Embedding Benchmark
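
If you want a quick check against your own data, something along these lines works. This is only a rough sketch, assuming you have queries labelled with their relevant source documents; the function name and parameters are illustrative:

# Rough recall@k sketch: labelled_queries is a list of (query, relevant_sources)
# pairs; query_vectors is only needed for the class without a vectorizer.
def recall_at_k(client, class_name, labelled_queries, k=5, query_vectors=None):
    hits = 0
    for i, (query, relevant_sources) in enumerate(labelled_queries):
        hybrid_kwargs = {"query": query, "alpha": 0.5}
        if query_vectors is not None:
            hybrid_kwargs["vector"] = query_vectors[i]
        res = (
            client.query
            .get(class_name, ["source"])
            .with_hybrid(**hybrid_kwargs)
            .with_limit(k)
            .do()
        )
        returned = {o["source"] for o in res["data"]["Get"][class_name]}
        hits += bool(returned & set(relevant_sources))
    return hits / len(labelled_queries)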


Ah, got it. But isn’t it odd that even the rankings of the results from the two different embedding models are exactly the same, or is this expected?

Moreover, thank you so much for the MTEB link.

I’ll start checking out 1.20 right away.