Cosine similarity between unrelated keywords returns a high score

Description

So I have a Competency collection that contains keyword and tags as properties. I then added two objects that are unrelated to each other, fetched them back from Weaviate along with their vectors, and computed the cosine similarity between them. However, I'm getting a high similarity score even though the two keywords are unrelated. Below is my setup:

import os

import weaviate
import weaviate.classes.config as wc

headers = {"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")}

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WEAVIATE_API_KEY")),
    headers=headers,
)

# creating the Competency collection
client.collections.create(
    name="Competency",
    properties=[
        wc.Property(
            name="keyword", data_type=wc.DataType.TEXT
        ),
        wc.Property(
            name="tags", data_type=wc.DataType.TEXT
        ),
    ],
    # Define the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai(),
)

# adding 2 unrelated objects to the collection
competency = client.collections.get("Competency")
competency.data.insert(
    properties={"keyword": "sleeping", "tags": "skill"},
)
competency.data.insert(
    properties={"keyword": "python programming", "tags": "skill"},
)

# getting the objects added
vectors = []
for item in competency.iterator(include_vector=True):
    print(item.vector)
    vectors.append(item.vector)

# performing similarity between the 2 vectors
from numpy import dot
from numpy.linalg import norm


def cosine_similarity(v1, v2):
    return dot(v1, v2) / (norm(v1) * norm(v2))


similarity = cosine_similarity(vectors[0]['default'], vectors[1]['default'])
print(similarity) #outputs 0.9069379570795515

Server Setup Information

  • Weaviate Server Version: 1.25.2
  • Deployment Method: Cloud provided by weaviate

Any additional Information

Hi @abdimussa,

The reason you get such a high similarity is that both keyword and tags get vectorised together, and since both objects share "tags": "skill", you end up with similar vectors.
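To make that concrete, here is a rough sketch of the kind of input string the vectorizer receives per object (an illustration only, not the exact Weaviate implementation; by default the property names and values are concatenated and lowercased):

def build_vectorizer_input(properties: dict) -> str:
    # illustration of the default text2vec behaviour: property names and
    # values are joined into a single lowercased string per object
    return " ".join(f"{name} {value}" for name, value in properties.items()).lower()

print(build_vectorizer_input({"keyword": "sleeping", "tags": "skill"}))
# -> keyword sleeping tags skill
print(build_vectorizer_input({"keyword": "python programming", "tags": "skill"}))
# -> keyword python programming tags skill

Both strings share "keyword" and "tags skill", so the two embeddings end up close together.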

Skip Vectorization

You can configure your collection to vectorize only specific properties by setting skip_vectorization on the properties you want to exclude, like this:

from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(
            name="keyword", data_type=DataType.TEXT
        ),
        Property(
            name="tags", data_type=DataType.TEXT, skip_vectorization=True # <== don't vectorize this property
        ),
    ],
    
    # Define the vectorizer module
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)

Named Vectors with Source Properties

Or you can use named vectors and source_properties (my preference), like this:

from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(
            name="keyword", data_type=DataType.TEXT
        ),
        Property(
            name="tags", data_type=DataType.TEXT
        ),
    ],
    
    # Define the vectorizer module
    vectorizer_config=[
        Configure.NamedVectors.text2vec_openai(
            name="default",
            source_properties=["keyword"], # <== list the properties to be used for vectorization
        )
    ],
    
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
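With that named-vector setup, the vector stored under the name "default" is built from keyword only. A quick sketch to verify, reusing the cosine_similarity helper and the two objects from your snippet:

competency = client.collections.get("Competency")

# fetch the named vector ("default"), which was built from "keyword" only
vectors = [item.vector["default"] for item in competency.iterator(include_vector=True)]

print(cosine_similarity(vectors[0], vectors[1]))  # should be noticeably lower now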

Hi @sebawita, I even tried removing the tags property and creating the collection with only the keyword property. I'm still getting a very high cosine similarity.

Another thing you can try: by default, Weaviate uses the property name as part of the content to be vectorised. I am not sure how much this affects the generated vector, but you can turn it off with vectorize_property_name on the property, like this:

from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(
            name="keyword", data_type=DataType.TEXT, vectorize_property_name=False # <== don't vectorize the property name
        ),
        Property(
            name="tags", data_type=DataType.TEXT, skip_vectorization=True # <== don't vectorize this property
        ),
    ],
    
    # Define the vectorizer module
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)

If the vectors are still too similar, this is mostly down to the model used, as Weaviate (in this case) only asks OpenAI to generate the vectors. So you would get the same results even if you called the OpenAI API directly.
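For comparison, here is a minimal sketch of calling the OpenAI embeddings API directly with the same two keywords (assuming the openai Python package and the text-embedding-ada-002 model that Weaviate uses by default):

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

# embed the same two keywords directly, bypassing Weaviate
response = openai_client.embeddings.create(
    model="text-embedding-ada-002",
    input=["sleeping", "python programming"],
)

v1 = response.data[0].embedding
v2 = response.data[1].embedding
print(cosine_similarity(v1, v2))  # reusing the helper from the original snippet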

Also, in some cases models perform better with longer input text than with one-word content, but I am not sure how OpenAI's Ada model handles short vs. long inputs.

I removed the tags property entirely, so keyword is the only property left. I also tried embedding directly with OpenAI instead of going through Weaviate, and the cosine similarity of the returned vectors correctly reflects how related the keywords are. So I'm not sure why going through Weaviate returns such a high cosine similarity for unrelated keywords.

hmmm… what model did you use when calling OpenAI directly?

I think Weaviate uses Ada-002 by default. You can change it to anything you need :wink:

Here you can see a list of the available settings for the OpenAI vectorizer
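For example, the embedding model is one of those settings; a sketch of passing a different model to the vectorizer config (parameter name as in the Python client, but double-check against the docs):

from weaviate.classes.config import Configure

vectorizer_config = Configure.Vectorizer.text2vec_openai(
    model="text-embedding-3-small",  # example: use a newer embedding model instead of the Ada default
)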

Also, did you try to add vectorize_property_name on the vectorized property?

Property(
    name="keyword",
    data_type=DataType.TEXT,
    vectorize_property_name=False # <== don't vectorize the property name
),

The issue was with the Ada vectorizer (the default). I changed it to text-embedding-3-small and the results have improved a lot. Thank you @sebawita!
