I have a Competency collection with keyword and tags as properties. I added 2 objects that are unrelated to each other, then fetched them back from Weaviate along with their vectors and computed the cosine similarity. However, I'm getting a high similarity score even though the objects are unrelated. Below is my setup:
import os
import weaviate
import weaviate.classes.config as wc

headers = {"X-OpenAI-Api-Key": os.getenv("OPENAI_API_KEY")}

client = weaviate.connect_to_wcs(
    cluster_url=os.getenv("WEAVIATE_URL"),
    auth_credentials=weaviate.auth.AuthApiKey(os.getenv("WEAVIATE_API_KEY")),
    headers=headers,
)
# creating the Competency collection
client.collections.create(
    name="Competency",
    properties=[
        wc.Property(name="keyword", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
    ],
    # Define the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=wc.Configure.Generative.openai(),
)
# adding 2 unrelated objects to the collection
competency = client.collections.get("Competency")
competency.data.insert(
    properties={"keyword": "sleeping", "tags": "skill"},
)
competency.data.insert(
    properties={"keyword": "python programming", "tags": "skill"},
)

# getting the objects back along with their vectors
vectors = []
for item in competency.iterator(include_vector=True):
    print(item.vector)
    vectors.append(item.vector)
# performing similarity between the 2 vectors
from numpy import dot
from numpy.linalg import norm

def cosine_similarity(v1, v2):
    return dot(v1, v2) / (norm(v1) * norm(v2))
similarity = cosine_similarity(vectors[0]['default'], vectors[1]['default'])
print(similarity) #outputs 0.9069379570795515
The reason you get such a high similarity is that both keyword and tags are vectorized together, and since both objects share "tags": "skill", you end up with similar vectors.
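You can check this outside of Weaviate by embedding the texts directly with the OpenAI API. This is only a rough sketch of the idea (the exact string Weaviate builds depends on the schema settings below) and assumes the default text-embedding-ada-002 model plus the cosine_similarity helper from your post:
# Sketch: embed the texts directly with OpenAI (assumes OPENAI_API_KEY is set)
# to compare the bare keywords against keyword + the shared "skill" tag.
from openai import OpenAI

openai_client = OpenAI()

def embed(texts):
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts,
    )
    return [item.embedding for item in response.data]

a, b = embed(["sleeping", "python programming"])
print(cosine_similarity(a, b))  # keywords only

a, b = embed(["sleeping skill", "python programming skill"])
print(cosine_similarity(a, b))  # keywords with the shared tag appended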
Skip Vectorization
You can configure your collection to vectorize only specific properties by setting skip_vectorization on the properties you want to exclude, like this:
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(name="keyword", data_type=DataType.TEXT),
        Property(
            name="tags",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # <== don't vectorize this property
        ),
    ],
    # Define the vectorizer module
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
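Note that you'll likely need to drop and recreate the collection for this to take effect. A minimal sketch of re-running the test, assuming the same client and the cosine_similarity helper from your post:
# Sketch: recreate the collection with the new config, re-insert the
# two objects, and compare their vectors again.
client.collections.delete("Competency")  # removes the old objects and their vectors

# ... re-run the client.collections.create(...) call above, then:
competency = client.collections.get("Competency")
competency.data.insert(properties={"keyword": "sleeping", "tags": "skill"})
competency.data.insert(properties={"keyword": "python programming", "tags": "skill"})

vectors = [item.vector for item in competency.iterator(include_vector=True)]
print(cosine_similarity(vectors[0]["default"], vectors[1]["default"]))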
Named Vectors with Source Properties
Or you can use named vectors and source_properties (my preference), like this:
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    name="Competency",
    properties=[
        Property(name="keyword", data_type=DataType.TEXT),
        Property(name="tags", data_type=DataType.TEXT),
    ],
    # Define the named vector and the properties it is built from
    vectorizer_config=[
        Configure.NamedVectors.text2vec_openai(
            name="default",
            source_properties=["keyword"],  # <== list the properties to be used for vectorization
        )
    ],
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
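With named vectors, item.vector is keyed by the vector name you configured, so reading them back looks almost the same as in your snippet. A sketch, assuming the "default" named vector defined above and your cosine_similarity helper:
# Sketch: fetch the objects and compare the keyword-only named vector.
competency = client.collections.get("Competency")

vectors = []
for item in competency.iterator(include_vector=True):
    vectors.append(item.vector["default"])  # built from "keyword" only

print(cosine_similarity(vectors[0], vectors[1]))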
Hi @sebawita, I tried even removing the tags property and creating the collection with only the keyword property. I'm still getting a very high cosine similarity.
Another thing you can try: by default, Weaviate includes the property name as part of the content to be vectorized. I'm not sure how much this affects the generated vector, but you can change it with vectorize_property_name on the property, like this:
client.collections.create(
    name="Competency",
    properties=[
        Property(
            name="keyword",
            data_type=DataType.TEXT,
            vectorize_property_name=False,  # <== don't vectorize the property name
        ),
        Property(
            name="tags",
            data_type=DataType.TEXT,
            skip_vectorization=True,  # <== don't vectorize this property
        ),
    ],
    # Define the vectorizer module
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    # Define the generative module
    generative_config=Configure.Generative.openai(),
)
If your similarity is still too high, this mostly comes down to the model used, as Weaviate (in this case) only asks OpenAI to generate the vectors, so you would get the same results even if you called the OpenAI API directly.
Also, in some cases models perform better when the input text is longer rather than one-word content, but I'm not sure how OpenAI's Ada model handles short vs. long input.
I removed the tags property entirely so keyword is the only property left. I also tried embedding directly with OpenAI instead of going through Weaviate, and the cosine similarity of those vectors does correctly reflect how related the keywords are. So I'm not sure why going through Weaviate returns such a high cosine similarity for unrelated keywords.