Distance metrics in vector search

dandv · May 17, 2023, 10:00pm

Hi everyone,

Today I’d like to share my colleague @erika-cardenas’s blog post on distance metrics.

Fundamentally, these metrics enable measuring the similarity or difference between vectors - essentially providing a measure of ‘distance’ in multi-dimensional space.

In the context of vector search, a distance metric is a function that calculates a distance value between two vectors, which affects the quality of classification, clustering, and especially semantic search. The choice of metric can greatly affect both the accuracy and the performance of the search algorithm, and the article explores several metrics, including cosine, dot product, Euclidean, Manhattan, and Hamming.
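To make the five metrics concrete, here is a minimal pure-Python sketch. The function names and the sample vectors are illustrative assumptions, not taken from the blog post, and a real vector database may report some of these with a different sign or scaling.

```python
import math

def dot(a, b):
    # Dot product: a similarity, so larger means closer.
    return sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for the same direction, 2 for opposite.
    return 1.0 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    # Straight-line (L2) distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute per-dimension differences (L1).
    return sum(abs(x - y) for x, y in zip(a, b))

def hamming_distance(a, b):
    # Number of dimensions in which the two vectors differ.
    return sum(1 for x, y in zip(a, b) if x != y)

# Illustrative vectors: b points in the same direction as a but is twice as long.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print("cosine   ", cosine_distance(a, b))
print("dot      ", dot(a, b))
print("euclidean", euclidean_distance(a, b))
print("manhattan", manhattan_distance(a, b))
print("hamming  ", hamming_distance(a, b))
```

With these two vectors the cosine distance is essentially zero because only direction matters, while Euclidean and Manhattan both report a clear gap because they also account for magnitude.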

What’s particularly intriguing is how different distance metrics can yield differing results, each more suited to particular types of data or use-cases. I would love to hear your thoughts and experiences on this, particularly any insights on how the choice of distance metric has impacted your work with vector search algorithms. Let’s discuss!

systemz · August 24, 2023, 9:50am

Hi, replying to this rather than creating a new topic, as it's related!

We have started using vector indexing to provide some specific search capabilities within our product. One of the key elements is the ability to semantically search across text and images. We have therefore opted to use multi2vec-clip for the vectorisation. We have not changed the default distance metric (which I believe is cosine).

I am fairly new to this, so I am still learning about the pros and cons of different approaches. What is clear is that search results are not quite what I expected them to be, especially in the image part. So here is the issue I have. Our test data set is quite small, and I am not sure if that is an issue. We have around 8-10 text entries and around the same number of images. I have tried two types of search: using nearText and passing a general phrase like “2 women by the sea”, as well as using nearObject and passing in the ID of a specific indexed entry.

nearText - for text I sort of see text matches that make sense near the top with distances of 0.00xx but also others with distance not much further away - how do we decide a specific cutoff threshold? The odd part is the images that come up are at a much greater distance even though the image is actually spot on regarding the search.

nearObject - again for text it seems to be reasonable, but the difference in distance between something that makes sense and something that does not is really tiny, so working out thresholds is again really tricky. For image matching we really don't see good results. Most of our images show people near the ocean, and only one shows a model in a dress against a plain background. But it seemed to match the ocean items regardless of the fact that the colours don't even match.

As I said, this is quite new to me and I am beginning to read up about distance metrics, but some guidance would be welcome.
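On the cutoff question, one rough heuristic is to cut the result list at the largest jump between consecutive distances rather than at a fixed value. The sketch below is only an illustration of that idea: the function name is hypothetical and the distances are made-up values, not results from this dataset.

```python
def cutoff_at_largest_gap(distances):
    """Keep the results that come before the largest jump in sorted distances."""
    ordered = sorted(distances)
    if len(ordered) < 2:
        return ordered
    # Gap between each result and the next one in the ranking.
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    # Split just after the result that precedes the biggest gap.
    split = gaps.index(max(gaps)) + 1
    return ordered[:split]

# Made-up distances for illustration: three close matches, then a jump.
hypothetical = [0.0021, 0.0034, 0.0040, 0.0310, 0.0325]
kept = cutoff_at_largest_gap(hypothetical)
print(kept)
```

With only a handful of objects the gaps can be noisy, so a heuristic like this is best checked by eye against the actual results before relying on it.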

DudaNogueira · September 12, 2023, 6:46pm

hi @systemz!

Were you able to answer this question? I believe that with a small dataset, your distance results will be closer together.

Let me know what you have found out.

Thanks!

Ashish_Kumar · September 30, 2024, 4:58pm

@DudaNogueira

from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.get("Taxonomy_msmarco")
response = jeopardy.query.near_vector(
    near_vector=query_vector,
    limit=5,
    distance=0.10,
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)

In this code snippet, what is the distance metric used (by default)?

DudaNogueira · October 1, 2024, 8:19am

hi @Ashish_Kumar !!

Welcome to our community

The distance you requested in the metadata is the distance calculated between your query vector and each stored object.

By default, when you create a collection, the distance metric used is cosine, as stated here:

Now, when you do a bm25 search, you get a score, instead of a distance.

The same goes for hybrid search, where a similarity search (near_vector or near_text) is performed alongside a bm25 search, and the two result sets are fused.
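To give a feel for what fusing means, here is a simplified sketch that normalises both result sets to the 0-1 range and blends them with a weight. It is an assumption-laden illustration, not Weaviate's actual implementation: the function names, the `alpha` weight, and the sample values are all hypothetical.

```python
def min_max(scores):
    # Rescale scores so the best is 1.0 and the worst is 0.0.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(vector_distances, bm25_scores, alpha=0.5):
    # A smaller distance is better, so negate distances before normalising.
    vector_scores = min_max({k: -d for k, d in vector_distances.items()})
    keyword_scores = min_max(bm25_scores)
    ids = set(vector_scores) | set(keyword_scores)
    # alpha weights the vector side, (1 - alpha) the keyword side.
    fused = {
        i: alpha * vector_scores.get(i, 0.0)
        + (1 - alpha) * keyword_scores.get(i, 0.0)
        for i in ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Made-up inputs: distances from a vector search, scores from a bm25 search.
vector_distances = {"a": 0.1, "b": 0.3, "c": 0.5}
bm25_scores = {"b": 4.0, "c": 2.0}

ranked = fuse(vector_distances, bm25_scores, alpha=0.5)
for object_id, score in ranked:
    print(object_id, round(score, 3))
```

Here "b" ends up first even though "a" has the smallest distance, because "b" also scores highest on the keyword side. This is why a hybrid result carries a fused score instead of a raw distance.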

Let me know if that helps!

Thanks!
