I have a Competency collection with keyword and tags as properties. I added 2 objects that are unrelated to each other, then fetched them back from Weaviate along with their vectors and computed the cosine similarity. However, I'm getting a high similarity score even though the objects are unrelated. Below is my setup:
import os
import weaviate
import weaviate.classes.config as wc

headers = {"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")}

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WEAVIATE_API_KEY")),
    headers=headers,
)
# creating the Competency collection
client.collections.create(
    name="Competency",
    properties=[
        wc.Property(name="keyword", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
    ],
    # Define the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai(),
)
# adding 2 unrelated objects to the collection
competency = client.collections.get("Competency")
competency.data.insert(
    properties={"keyword": "sleeping", "tags": "skill"},
)
competency.data.insert(
    properties={"keyword": "python programming", "tags": "skill"},
)

# getting the objects back along with their vectors
vectors = []
for item in competency.iterator(include_vector=True):
    print(item.vector)
    vectors.append(item.vector)
# performing similarity between the 2 vectors
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    return dot(v1, v2) / (norm(v1) * norm(v2))
similarity = cosine_similarity(vectors[0]['default'], vectors[1]['default'])
print(similarity) #outputs 0.9069379570795515
The reason you get such a high similarity is that both keyword and tags are vectorized together, and since both objects share "tags": "skill", you end up with similar vectors.
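You can check this outside of Weaviate by embedding the texts directly with the OpenAI API. This is only a rough sketch of the idea (the exact string Weaviate builds depends on the schema settings below) and assumes the default text-embedding-ada-002 model plus the cosine_similarity helper from your post:
# Sketch: embed the texts directly with OpenAI (assumes OPENAI_API_KEY is set)
# to compare the bare keywords against keyword + the shared "skill" tag.
from openai import OpenAI

openai_client = OpenAI()

def embed(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts,
    )
    return [item.embedding for item in response.data]

a, b = embed(["sleeping", "python programming"])
print(cosine_similarity(a, b))  # keywords only

a, b = embed(["sleeping skill", "python programming skill"])
print(cosine_similarity(a, b))  # keywords with the shared tag appended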
Skip Vectorization
You can configure your collection to vectorize only specific properties by setting skip_vectorization on the properties you want to exclude, like this:
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(name="keyword", data_type=DataType.TEXT),
        Property(
            name="tags",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # <== don't vectorize this property
        ),
    ],
    # Define the vectorizer module
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
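Note that you'll likely need to drop and recreate the collection for this to take effect. A minimal sketch of re-running the test, assuming the same client and the cosine_similarity helper from your post:
# Sketch: recreate the collection with the new config, re-insert the
# two objects, and compare their vectors again.
client.collections.delete("Competency")  # removes the old objects and their vectors

# ... re-run the client.collections.create(...) call above, then:
competency = client.collections.get("Competency")
competency.data.insert(properties={"keyword": "sleeping", "tags": "skill"})
competency.data.insert(properties={"keyword": "python programming", "tags": "skill"})

vectors = [item.vector for item in competency.iterator(include_vector=True)]
print(cosine_similarity(vectors[0]["default"], vectors[1]["default"]))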
Named Vectors with Source Properties
Or you can use named vectors and source_properties (my preference), like this:
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(name="keyword", data_type=DataType.TEXT),
        Property(name="tags", data_type=DataType.TEXT),
    ],
    # Define the named vector and the properties it is built from
    vectorizer_config=[
        Configure.NamedVectors.text2vec_openai(
            name="default",
            source_properties=["keyword"],  # <== list the properties to be used for vectorization
        )
    ],
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
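With named vectors, item.vector is keyed by the vector name you configured, so reading them back looks almost the same as in your snippet. A sketch, assuming the "default" named vector defined above and your cosine_similarity helper:
# Sketch: fetch the objects and compare the keyword-only named vector.
competency = client.collections.get("Competency")

vectors = []
for item in competency.iterator(include_vector=True):
    vectors.append(item.vector["default"])  # built from "keyword" only

print(cosine_similarity(vectors[0], vectors[1]))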
Hi @sebawita, I tried even removing the tags property and creating the collection with only the keyword property. I'm still getting a very high cosine similarity.
Another thing you can try: by default, Weaviate includes the property name as part of the content to be vectorized. I'm not sure how much this affects the generated vector, but you can change it with vectorize_property_name on the property, like this:
client.collections.create(
    name="Competency",
    properties=[
        Property(
            name="keyword",
            data_type=DataType.TEXT,
            vectorize_property_name=False,  # <== don't vectorize the property name
        ),
        Property(
            name="tags",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # <== don't vectorize this property
        ),
    ],
    # Define the vectorizer module
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
If your similarity is still too high, this mostly comes down to the model used, as Weaviate (in this case) only asks OpenAI to generate the vectors, so you would get the same results even if you called the OpenAI API directly.
Also, in some cases models perform better when the input text is longer rather than one-word content, but I'm not sure how OpenAI's Ada model handles short vs. long input.
I removed the tags property entirely so keyword is the only property left. I also tried embedding directly with OpenAI instead of going through Weaviate, and the cosine similarity of those vectors does correctly reflect how related the keywords are. So I'm not sure why going through Weaviate returns such a high cosine similarity for unrelated keywords.