In a nutshell, this is what I've done:
- Used a with_near_vector query to fetch articles that are close to an input prompt, with distance = 0.3. Roughly, the query looked like this (the class name, properties, and connection are placeholders for my actual setup):
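import weaviate

client = weaviate.Client("http://localhost:8080")  # placeholder connection

result = (
    client.query
    .get("Article", ["title", "content"])  # placeholder class/properties
    .with_near_vector({
        "vector": text_0_embed,  # embedding of the input prompt
        "distance": 0.3,         # max distance from the prompt
    })
    .with_additional(["distance"])
    .do()
)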
- Saved the returned articles into a separate class.
- To validate, I took 100 random samples from the new class (included) and 100 articles that were not added to the new class (excluded). Roughly, I built the two groups like this (class names and the fetch limit are placeholders):
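import random

# Fetch ids + vectors for a class (v3 client); "Article" / "ArticleTrimmed"
# stand in for my actual class names.
def get_vectors(class_name, limit=10000):
    resp = (
        client.query
        .get(class_name, ["title"])
        .with_additional(["id", "vector"])
        .with_limit(limit)
        .do()
    )
    objs = resp["data"]["Get"][class_name]
    return {o["_additional"]["id"]: o["_additional"]["vector"] for o in objs}

included_all = get_vectors("ArticleTrimmed")  # articles copied to the new class
excluded_all = {
    k: v for k, v in get_vectors("Article").items() if k not in included_all
}

# 100 random samples from each group
dict_included_vectors = dict(random.sample(list(included_all.items()), 100))
dict_excluded_vectors = dict(random.sample(list(excluded_all.items()), 100))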
- I used sklearn's cosine similarity implementation like so:
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity of each sampled article vector against the prompt
# embedding; each call returns an (n, 1) array, one row per article.
similarity_orig_to_prompt = cosine_similarity(list(dict_excluded_vectors.values()), [text_0_embed])      # excluded group
similarity_trimmed_to_prompt = cosine_similarity(list(dict_included_vectors.values()), [text_0_embed])   # included group
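The averages below are just the means of those arrays, along the lines of:

import numpy as np

avg_excluded = float(np.mean(similarity_orig_to_prompt))     # excluded group
avg_included = float(np.mean(similarity_trimmed_to_prompt))  # included group
print((avg_excluded, avg_included))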
- Here were the average similarity measures (excluded, included):
(0.7117798016779362, 0.7196412496249364)
Both groups had essentially the same average similarity…
I am wondering why this is the case. It would be good to understand more about what is happening under the hood of with_near_vector. Is the distance metric used there comparable with scikit-learn's cosine similarity?
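My working assumption is that with_near_vector uses cosine distance by default, i.e. distance = 1 - cosine similarity, so my distance = 0.3 cutoff should correspond to a similarity of at least 0.7. A quick sanity check of that assumed relationship:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

a = np.random.rand(1, 768)  # stand-in embeddings (dimension is arbitrary)
b = np.random.rand(1, 768)

sim = cosine_similarity(a, b)[0, 0]
distance_if_cosine = 1.0 - sim  # assumed Weaviate cosine distance
print(sim, distance_if_cosine)

If that assumption is wrong, it would explain why the numbers don't line up the way I expected.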
Thank you so much; this is becoming a blocker for me and may lead me to consider other options.