Hi there,
I am trying to understand how the nearText algorithm works. My expectation is that it uses cosine similarity (since cosine is the default metric) to compare two embeddings.
When I tried this, the value I computed myself did not match what Weaviate returned. How can I make the two values match?
Current testing:
Weaviate embedding model - text2vec-openai (hence default = text-embedding-ada-002)
Testing with cosine similarity:
- using text-embedding-ada-002 to embed the text (OpenAI API)
- computing cosine similarity (with the function below)
def _cosine_similarity(vec1: np.array, vec2: np.array):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
What could explain the discrepancy and, if possible, how can I use nearText so that its results match the output of the cosine similarity function?
Cheers!
Hi @Teng_Hoo !!
Here is how you can check the calculation. Note that Weaviate returns the cosine distance, which is 1 - cosine similarity:
import weaviate
from weaviate import classes as wvc
from weaviate.util import generate_uuid5
client = weaviate.connect_to_local()
client.collections.delete("Collection")
collection = client.collections.create(
    "Collection",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai()
)
collection.data.insert({"text": "Something about cat"}, uuid=generate_uuid5("cat"))
collection.data.insert({"text": "That house is beautiful"}, uuid=generate_uuid5("house"))
# now comparing text1 vs text2
results = collection.query.near_object(
    near_object=generate_uuid5("cat"),
    return_metadata=wvc.query.MetadataQuery(distance=True)
)
for obj in results.objects:
    print(obj.properties, obj.metadata.distance)
# output
# {'text': 'Something about cat'} 0.0
# {'text': 'That house is beautiful'} 0.1678454875946045
# now using your function
import numpy as np
results = collection.query.fetch_objects(include_vector=True)
vec1 = results.objects[0].vector.get("default")
vec2 = results.objects[1].vector.get("default")
def _cosine_similarity(vec1: np.array, vec2: np.array):
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
# Weaviate reports cosine distance, i.e. 1 - cosine similarity
print(1 - _cosine_similarity(vec1, vec2))
# output
# 0.16784559529191356
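If you want to check the distance/similarity relationship without a running Weaviate instance, here is a minimal standalone sketch. The helper names cosine_distance and distance_to_similarity are my own (hypothetical, not part of the Weaviate client), and the toy vectors are made up for illustration. The last lines compare the two values printed in this thread; they agree to about 1e-7, which I would guess comes from floating-point precision rather than a different metric.

```python
import numpy as np


def cosine_similarity(vec1, vec2) -> float:
    """Plain cosine similarity, same formula as _cosine_similarity above."""
    a = np.asarray(vec1, dtype=np.float64)
    b = np.asarray(vec2, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def cosine_distance(vec1, vec2) -> float:
    """Hypothetical helper: cosine distance = 1 - cosine similarity."""
    return 1.0 - cosine_similarity(vec1, vec2)


def distance_to_similarity(distance: float) -> float:
    """Hypothetical helper: convert a cosine distance back to a similarity."""
    return 1.0 - distance


# Toy vectors (illustrative only, not real embeddings)
a = [1.0, 0.0]
b = [1.0, 1.0]

same = cosine_distance(a, a)  # identical vectors give distance 0
diff = cosine_distance(a, b)  # vectors 45 degrees apart
print(same, diff)

# Values reported earlier in this thread
reported = 0.1678454875946045     # distance returned by near_object
recomputed = 0.16784559529191356  # 1 - cosine similarity via NumPy
print(np.isclose(reported, recomputed, atol=1e-6))
```

So to compare against your own function, either convert the returned distance with 1 - distance, or compute 1 - similarity on your side, and compare with a small tolerance instead of exact equality.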
Let me know if this helps!
Thanks!