Vector Distance not consistent with external cosine similarity measure (Sklearn)

In a nutshell, this is what I’ve done:

  1. Used this query to get articles that are close to an input prompt, with distance = 0.3:
  2. Saved the returned articles into a separate class.
  3. To validate, I took 100 random samples from the new class (included) and 100 articles that were not added to the new class (excluded)
  4. I used sklearn's cosine similarity implementation like so:
from sklearn.metrics.pairwise import cosine_similarity
# Calculate cosine similarity between each article vector and the prompt embedding
similarity_orig_to_prompt = cosine_similarity(list(dict_excluded_vectors.values()), [text_0_embed])
similarity_trimmed_to_prompt = cosine_similarity(list(dict_included_vectors.values()), [text_0_embed])
  5. Here are the average similarity scores:

Both groups pretty much had the same similarity score…
(0.7117798016779362, 0.7196412496249364)
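
For reference, a minimal sketch of how per-group summaries like these could be computed. It also reports the minimum and maximum, since a mean alone can hide how much the two groups overlap. The `summarize` helper and the sample scores are hypothetical and only for illustration; with the sklearn output above, the `(n, 1)` array would first need flattening, e.g. with `.ravel().tolist()`.

```python
from statistics import mean


def summarize(similarities):
    """Return (mean, min, max) for a flat list of cosine similarities."""
    return mean(similarities), min(similarities), max(similarities)


# Hypothetical scores, made up for illustration (not the real data).
included_scores = [0.74, 0.71, 0.78]
excluded_scores = [0.69, 0.72, 0.65]

print("included:", summarize(included_scores))
print("excluded:", summarize(excluded_scores))
```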

I am wondering why this is the case. It would be good to understand more about what is happening under the hood of with_near_vector. Is the distance metric used there comparable with sklearn's cosine similarity?

Thank you so much, this is becoming a blocker for me and may lead me to consider other options.

Hi @Richard_Y,

Sorry for the late response.
We missed your post.

The two scores seem pretty close: 0.7118 vs 0.7196.

Did you expect the results to be exactly the same?

Perhaps there are some differences in how SKLearn and Weaviate calculate cosine distance. :thinking:

Do you happen to have the two vectors that you were comparing?
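
In the meantime, here is a minimal sketch that may help with the comparison. It assumes the class is configured with the cosine distance metric, which Weaviate documents as 1 - cosine similarity, so a distance of 0.3 would correspond to a similarity of 0.7. The helper names and toy vectors are hypothetical; the idea is to check each vector against the threshold individually instead of comparing group averages.

```python
import math


def cosine_similarity_pair(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Cosine distance, assumed here to be 1 - cosine similarity."""
    return 1.0 - cosine_similarity_pair(a, b)


def within_threshold(vectors, prompt, max_distance=0.3):
    """For each vector, report whether it lies within max_distance of the prompt."""
    return [cosine_distance(v, prompt) <= max_distance for v in vectors]


# Toy vectors for illustration only (not real embeddings).
prompt = [1.0, 0.0]
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(within_threshold(vectors, prompt))
```

If many of the excluded articles come back as within the threshold, it would be worth checking whether the query was capped by a result limit before looking for a difference in the metric itself.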