Today I’d like to share my colleague @erika-cardenas’s blog post on distance metrics.
Fundamentally, these metrics enable measuring the similarity or difference between vectors - essentially providing a measure of ‘distance’ in multi-dimensional space.
In the context of vector search, a distance metric is a function that calculates a distance value between two vectors, which affects the quality of classification, clustering, and especially semantic search. The choice of metric can greatly affect both the relevance of the results and the performance of the search algorithm. The article explores several distance metrics, including cosine, dot product, Euclidean, Manhattan, and Hamming.
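For anyone who wants to see the differences concretely, here is a minimal Python sketch of the five metrics in their textbook forms. The function names are my own for illustration and are not any library's API, and a given search engine may use variants (for example, a squared Euclidean distance, or a negated dot product so that smaller means closer).

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: depends only on the angle, not the magnitude."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / (norm_a * norm_b)


def dot_distance(a, b):
    """Negated dot product (assumed convention so that smaller = closer)."""
    return -dot(a, b)


def euclidean_distance(a, b):
    """Straight-line (L2) distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def hamming_distance(a, b):
    """Number of positions at which the two vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Two vectors pointing in the same direction but with different lengths.
a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]

for name, fn in [
    ("cosine", cosine_distance),
    ("dot", dot_distance),
    ("euclidean", euclidean_distance),
    ("manhattan", manhattan_distance),
    ("hamming", hamming_distance),
]:
    print(f"{name:>10}: {fn(a, b):.4f}")
```

The two example vectors point in the same direction, so their cosine distance is essentially zero, while the Euclidean, Manhattan, and Hamming distances are all clearly non-zero. That is one small illustration of how the same pair of vectors can look "identical" under one metric and "far apart" under another.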
What’s particularly intriguing is how different distance metrics can yield different results, with each better suited to particular types of data or use cases. I would love to hear your thoughts and experiences on this, particularly any insights into how the choice of distance metric has affected your work with vector search algorithms. Let’s discuss!
Hi, I'm replying to this thread rather than creating a new one, as it's related!
We have started using vector indexing to provide some specific search capabilities within our product. One of the key elements is the ability to semantically search across text and images, so we have opted to use multi2vec-clip for the vectorisation. We have not changed the default distance metric (which I believe is cosine).
I am fairly new to this, so I am still learning about the pros and cons of different approaches. What is clear is that the search results are not quite what I expected, especially for images. So here is the issue I have. Our test data set is quite small, and I am not sure whether that is a problem: we have around 8-10 text entries and around the same number of images. I have tried two types of search: nearText, passing a general phrase like “2 women by the sea”, and nearObject, passing in the ID of a specific indexed entry.
nearText - for text, I do see matches that make sense near the top, with distances of 0.00xx, but there are also others whose distance is not much greater. How do we decide on a specific cutoff threshold? The odd part is that the images that come up are at a much greater distance, even when the image is actually spot on for the search.
nearObject - again, for text it seems reasonable, but the difference in distance between something that makes sense and something that does not is really tiny, so working out thresholds is tricky here too. For image matching we really don't see good results. Most of our images show people near the ocean, and only one shows a model in a dress against a plain background, but that one still matched the ocean images even though the colours don't even match.
As I said, this is quite new to me and I am beginning to read up on distance metrics, but some guidance would be welcome.