Weaviate cosine similarity completely different from ScikitLearn with SentenceTransformer vectorizer

ziemowit-s · December 25, 2024, 12:06pm

Hi guys!

I noticed a strange but important issue (a bug?) with cosine similarity in Weaviate. I'm using an external SentenceTransformer model for vectorization, and when I compare Weaviate's results with the cosine similarity computed by ScikitLearn, the values are completely different.

Since I want to be precise, I will paste my code together with both texts where I noticed the difference (they are in Polish, but for those who don't know the language → the first one is the closest answer). Hopefully it will still be pleasant to read.

Sentence Transformer vectorizer:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("distiluse-base-multilingual-cased-v1")

My text query and 2 texts to compare:

query = "jakie są kary w kodeksie karnym?"

text1 = "kodeks karny. Start. Kary. 32. Katalog kar. Karami są:1)grzywna;2)ograniczenie wolności;3)pozbawienie wolności;4)(uchylony);5)dożywotnie pozbawienie wolności."

text2 = """kodeks karny skarbowy. Część ogólna. Przestępstwa skarbowe. 22. Kary oraz środki karne i zabezpieczające. § 1.Karami za przestępstwa skarbowe są:1)kara grzywny w stawkach dziennych;2)kara ograniczenia wolności;3)kara pozbawienia wolności.
§ 2.Środkami karnymi są:1)dobrowolne poddanie się odpowiedzialności;2)przepadek przedmiotów;3)ściągnięcie równowartości pieniężnej przepadku przedmiotów;4)przepadek korzyści majątkowej;4a)ściągnięcie równowartości pieniężnej przepadku korzyści majątkowej;5)zakaz prowadzenia określonej działalności gospodarczej, wykonywania określonego zawodu lub zajmowania określonego stanowiska;6)podanie wyroku do publicznej wiadomości;7)pozbawienie praw publicznych;8)środki związane z poddaniem sprawcy próbie:a)warunkowe umorzenie postępowania karnego,b)warunkowe zawieszenie wykonania kary,c)warunkowe zwolnienie.
§ 3.Środkami zabezpieczającymi są:1)elektroniczna kontrola miejsca pobytu;2)terapia;3)terapia uzależnień;4)pobyt w zakładzie psychiatrycznym;5)przepadek przedmiotów;6)zakazy wymienione w § 2 pkt 5."""

ScikitLearn cosine implementation:

from sklearn.metrics.pairwise import cosine_similarity

# Encode the query and both candidate texts
query_embedding = model.encode(query)
sentence1_embedding = model.encode(text1)
sentence2_embedding = model.encode(text2)

# Compute cosine similarity between the query and each text
similarity = cosine_similarity([query_embedding], [sentence1_embedding])
print("Text 1 Score:", similarity[0][0])
# score: 0.28450348362745925

similarity = cosine_similarity([query_embedding], [sentence2_embedding])
print("Text 2 Score:", similarity[0][0])
# score: 0.5143749261108472

ScikitLearn results are:

for the text 1: 0.28450348362745925
for the text 2: 0.5143749261108472

As you can see, the ScikitLearn cosine similarity correctly distinguishes the first text as more similar than the second one.
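For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, and it ranges from -1 to 1, with higher values meaning more similar. A minimal pure-Python sketch, where `cosine_sim` is a hypothetical helper name and the inputs are toy vectors rather than the real embeddings:

```python
import math

def cosine_sim(a, b):
    """Cosine similarity of two equal-length vectors (hypothetical helper, not a library API)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy vectors, not real embeddings
print(cosine_sim([1.0, 0.0], [1.0, 1.0]))  # about 0.7071 (45 degree angle)
print(cosine_sim([1.0, 2.0], [2.0, 4.0]))  # about 1.0 (same direction)
```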

Weaviate: collection creation

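from weaviate.classes.config import Configure, VectorDistances

# HNSW index using the cosine distance metric
vector_config = Configure.VectorIndex.hnsw(
    distance_metric=VectorDistances.COSINE
)

# No built-in vectorizer: vectors are supplied by the client
collection = client.collections.create(
    collection_name,
    vector_index_config=vector_config,
    vectorizer_config=Configure.Vectorizer.none(),
)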

Weaviate: batch adding

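from weaviate.util import generate_uuid5

# Assumption: docs is a list of dicts with a "content" key holding text1 and text2;
# the documents (not the query) are what gets encoded and stored.
embeddings = model.encode([row["content"] for row in docs]).tolist()
with collection.batch.dynamic() as batch:
    for vector, data_row in zip(embeddings, docs):
        obj_uuid = generate_uuid5(data_row)
        batch.add_object(
            properties=data_row,
            uuid=obj_uuid,
            vector=vector,
        )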

Weaviate: near vector search

from weaviate.classes.query import MetadataQuery

vector = vectorizer.model.encode(search_query).tolist()

results = obj.collection.query.near_vector(
    near_vector=vector,  # your query vector goes here
    limit=2,
    return_metadata=MetadataQuery(distance=True, certainty=True))

for rr in results.objects:
    print(rr.metadata.distance)
    print(rr.properties['content'])

Weaviate results are:

for the text 1: 0.7154964208602905
for the text 2: 0.4856252074241638

For testing, I added only those 2 documents to the database.

I also checked whether the stored vectors are the same, with:
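One thing worth ruling out when comparing these numbers: `metadata.distance` is a distance, not a similarity. For the cosine metric, Weaviate reports the distance as 1 - cosine similarity, so the two sets of results are only comparable after conversion. A small sketch using only the numbers quoted above (the variable names are illustrative):

```python
# Values copied from the results quoted in this post
sklearn_similarity = {"text1": 0.28450348362745925, "text2": 0.5143749261108472}
weaviate_distance = {"text1": 0.7154964208602905, "text2": 0.4856252074241638}

# cosine distance = 1 - cosine similarity, so similarity = 1 - distance
converted = {name: 1.0 - dist for name, dist in weaviate_distance.items()}

for name in sklearn_similarity:
    diff = abs(converted[name] - sklearn_similarity[name])
    print(name, "similarity from Weaviate distance:", converted[name], "diff:", diff)
```

With these values, the converted Weaviate numbers agree with the ScikitLearn scores to within about 1e-6, which is in line with float32 rounding. Keep in mind that a smaller distance means a closer match, while a larger similarity means a closer match.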

weaviate_vec = (obj.collection.query.fetch_object_by_id('id_of_first_object', include_vector=True).vector['default'])

np.sum(np.array(weaviate_vec) - np.array(sentence1_embedding))
# 0

and they are.
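A caveat about this kind of check: summing signed differences can come out as zero even when individual components differ, because positive and negative deviations cancel out. An element-wise comparison such as `np.allclose` is stricter. A sketch with toy arrays standing in for the stored and locally computed vectors:

```python
import numpy as np

# Toy stand-ins, not real embeddings
stored = np.array([0.1, 0.2, 0.3])  # e.g. the vector fetched from Weaviate
local = np.array([0.2, 0.1, 0.3])   # e.g. the locally computed embedding

# Signed differences cancel, so the sum is (close to) zero despite the mismatch
print(np.sum(stored - local))

# Element-wise comparison detects the mismatch
print(np.allclose(stored, local))  # False

# A float32 round trip of the same vector still passes with a small tolerance
print(np.allclose(stored, stored.astype(np.float32), atol=1e-6))  # True
```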

Does anyone have any idea what might be going on? I would really appreciate your help.

Magdalena_Sochacka · January 14, 2025, 2:01pm

Hey. Just a suggestion: you computed the embeddings for the sklearn comparison with SentenceTransformer. I'm not sure, but check whether the vectors in Weaviate were produced the sentence-transformers way or the basic Hugging Face way. I've just seen this on the custom model page, and maybe this is also the case here.
