Distance metrics in vector search

Hi everyone,

Today I’d like to share my colleague @erika-cardenas’s blog post on distance metrics.

Fundamentally, these metrics enable measuring the similarity or difference between vectors - essentially providing a measure of ‘distance’ in multi-dimensional space.

In the context of vector search, a distance metric is a function that calculates a distance value between two vectors. The choice of metric affects the quality of classification, clustering, and especially semantic search results, as well as the performance of the search algorithm. The article explores a variety of distance metrics, including cosine, dot product, Euclidean, Manhattan, and Hamming.
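To make the five metrics named above concrete, here is a minimal pure-Python sketch of each one. The function names and the sample vectors are illustrative assumptions, not taken from the blog post, and a vector database will compute these far more efficiently internally.

```python
import math


def dot(a, b):
    """Dot product: larger values mean more similar (for comparable magnitudes)."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite directions."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences (L1 distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming_distance(a, b):
    """Number of dimensions in which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Illustrative vectors: b points in the same direction as a but is twice as long.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

print("cosine   ", cosine_distance(a, b))
print("dot      ", dot(a, b))
print("euclidean", euclidean_distance(a, b))
print("manhattan", manhattan_distance(a, b))
print("hamming  ", hamming_distance(a, b))
```

Because `b` is just a scaled copy of `a`, the cosine distance is effectively zero while the Euclidean and Manhattan distances are not, which is one way the choice of metric changes what counts as "close".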

What’s particularly intriguing is how different distance metrics can yield differing results, each more suited to particular types of data or use-cases. I would love to hear your thoughts and experiences on this, particularly any insights on how the choice of distance metric has impacted your work with vector search algorithms. Let’s discuss!

Hi, replying to this rather than creating a new topic, as it's related!

We have started using vector indexing to provide some specific search capabilities within our product. One of the key elements is the ability to semantically search across text and images. We have therefore opted to use multi2vec-clip for the vectorisation. We have not changed the default distance metric (which I believe is cosine).

I am fairly new to this, so I am still learning about the pros and cons of different approaches. What is clear is that the search results are not quite what I expected, especially for images. So here is the issue I have. Our test data set is quite small, and I am not sure whether that is a problem. We have around 8-10 text entries and around the same number of images. I have tried two types of search: nearText, passing a general phrase like “2 women by the sea”, and nearObject, passing in the ID of a specific indexed entry.

nearText - for text, I see matches that make sense near the top with distances of 0.00xx, but also others with distances not much further away. How do we decide on a specific cutoff threshold? The odd part is that the images that come up are at a much greater distance, even when the image is actually spot on for the search.

nearObject - again, for text it seems reasonable, but the difference in distance between something that makes sense and something that does not is really tiny, so working out thresholds is again tricky. For image matching we really don't see good results. Most of our images show people near the ocean, and only one shows a model in a dress against a plain background, but that one still matched the ocean items even though the colours don't match at all.

As I said, this is quite new to me and I am beginning to read up about distance metrics, but some guidance would be welcome.
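On the cutoff question above, one simple heuristic is to keep results up to the largest jump in the sorted distances instead of picking a fixed number. The sketch below is only an illustration of that idea: the function name `cut_at_largest_gap` and the distance values are made up, and it is not a description of how any particular database chooses its cutoff.

```python
def cut_at_largest_gap(distances):
    """Return how many results to keep: everything before the largest
    jump between consecutive distances (after sorting ascending)."""
    ordered = sorted(distances)
    if len(ordered) < 2:
        return len(ordered)
    # Gap between each result and the next one.
    gaps = [ordered[i + 1] - ordered[i] for i in range(len(ordered) - 1)]
    # Index of the biggest gap; results 0..index are kept.
    largest = max(range(len(gaps)), key=gaps.__getitem__)
    return largest + 1


# Made-up distances: three tight matches, then a clear jump.
distances = [0.0021, 0.0025, 0.0030, 0.0190, 0.0210]
keep = cut_at_largest_gap(distances)
print(f"keep the first {keep} results")
```

With only 8-10 objects per modality, the gaps can be noisy, so a heuristic like this is best checked against results you have judged by hand.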

hi @systemz!

Were you able to answer this question? I believe that with a small dataset, your distance results will be closer together :thinking:

Let me know what you have found out.

Thanks!

@DudaNogueira

from weaviate.classes.query import MetadataQuery

jeopardy = client.collections.get("Taxonomy_msmarco")
response = jeopardy.query.near_vector(
    near_vector=query_vector,
    limit=5,
    distance=0.10,  # maximum distance allowed for returned objects
    return_metadata=MetadataQuery(distance=True)
)

for o in response.objects:
    print(o.properties)
    print(o.metadata.distance)

In this code snippet, what is the distance metric used (by default)?

hi @Ashish_Kumar !!

Welcome to our community :hugs:

The distance returned, since you requested this metadata, is the distance calculated between your query vector and each stored object.

By default, when you create a collection, the distance metric used is cosine, as stated here:

Now, when you do a bm25 search, you get a score instead of a distance.

The same goes for hybrid, where a similarity search (near_vector or near_text) is performed alongside a bm25 search, and the two result sets are fused.
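To illustrate what fusing the two result sets can look like, here is a minimal sketch that min-max normalizes each set of scores and blends them with a weight. This is an assumption for illustration only: the names `min_max`, `fuse`, and `alpha`, and all the scores, are hypothetical, and the fusion actually applied by the database may differ.

```python
def min_max(scores):
    """Rescale a {id: score} mapping to the 0..1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}


def fuse(vector_scores, keyword_scores, alpha=0.5):
    """Blend normalized scores: alpha weights the vector side,
    (1 - alpha) the keyword side. Missing ids count as 0."""
    v = min_max(vector_scores)
    k = min_max(keyword_scores)
    ids = set(v) | set(k)
    fused = {i: alpha * v.get(i, 0.0) + (1 - alpha) * k.get(i, 0.0) for i in ids}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


# Made-up inputs: vector similarities (e.g. 1 - cosine distance) and bm25-style scores.
vector_scores = {"a": 0.9, "b": 0.5, "c": 0.1}
keyword_scores = {"b": 7.0, "c": 3.0, "d": 1.0}

ranked = fuse(vector_scores, keyword_scores, alpha=0.5)
for obj_id, score in ranked:
    print(obj_id, round(score, 3))
```

Note that the vector side is expressed as a similarity here, so that higher is better on both sides before blending.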

Let me know if that helps!

Thanks!