Hi Team, we observed that vector similarity search returns different results for the same query on different attempts. Is this expected behavior in Weaviate?
Below is a sample query we use. The embedding_id should be unique, as it was used to create the UUID when importing the data (generate_uuid5(embedding_id)).
The collection has ~17M embeddings. Out of 1000 results returned in each of two attempts, ~10-15 embedding_ids are different.
query = f"""
{{
  Get {{
    Data_discovery_test(
      limit: 1000
      nearVector: {{
        vector: [{query_vector}]
      }}
    ) {{
      embedding_id
      _additional {{ distance }}
    }}
  }}
}}
"""
response = weaviate_client.query.raw(query)
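One way to quantify the difference between two attempts is a sketch like the one below. It assumes the raw response is a dict shaped like {"data": {"Get": {"Data_discovery_test": [...]}}}. The helper names extract_ids and diff_attempts are made up for illustration, and the two sample responses are invented data, not real query output.

```python
def extract_ids(response, class_name="Data_discovery_test"):
    """Return the embedding_id values from a raw GraphQL response dict.

    Assumes the response is shaped like {"data": {"Get": {class_name: [...]}}}.
    """
    objects = response["data"]["Get"][class_name]
    return [obj["embedding_id"] for obj in objects]


def diff_attempts(ids_a, ids_b):
    """Summarize how two result lists differ, ignoring result order."""
    set_a, set_b = set(ids_a), set(ids_b)
    return {
        "only_in_first": sorted(set_a - set_b),
        "only_in_second": sorted(set_b - set_a),
        "overlap": len(set_a & set_b),
    }


# Invented responses standing in for two attempts of the same query.
first = {"data": {"Get": {"Data_discovery_test": [
    {"embedding_id": "a"}, {"embedding_id": "b"}, {"embedding_id": "c"}]}}}
second = {"data": {"Get": {"Data_discovery_test": [
    {"embedding_id": "a"}, {"embedding_id": "c"}, {"embedding_id": "d"}]}}}

summary = diff_attempts(extract_ids(first), extract_ids(second))
print(summary)
```

Comparing the distance values of the IDs that appear in only one attempt against the distance of the last shared result would show whether the differences sit at the tail of the top 1000 or are spread throughout.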
Note: the collection has a replication factor of 2.
Here is the corresponding replicationConfig/shardingConfig:
'replicationConfig': {'factor': 2},
'shardingConfig': {'virtualPerPhysical': 128,
'desiredCount': 2,
'actualCount': 2,
'desiredVirtualCount': 256,
'actualVirtualCount': 256,
'key': '_id',
'strategy': 'hash',
'function': 'murmur3'},
Another potential query issue I found (possibly related to the one above) is that a conditional filter on a unique ID returns multiple results, e.g.:
INPUT_EMBEDDING_ID = "***"
# f-string so the ID is interpolated; literal GraphQL braces are doubled
query = f"""
{{
  Get {{
    Data_discovery(
      where: {{
        path: ["embedding_id"]
        operator: Equal
        valueText: "{INPUT_EMBEDDING_ID}"
      }}
    ) {{
      embedding_id
      _additional {{ vector }}
    }}
  }}
}}
"""
response = weaviate_client.query.raw(query)
The response returns 5 different embeddings / embedding IDs for a single input embedding_id.
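To see which of the returned objects actually carry the requested embedding_id, a small check like the following could be run on the response of the filter query above. This is only a sketch: summarize_matches is a hypothetical helper, the response shape {"data": {"Get": {"Data_discovery": [...]}}} is assumed, and the sample IDs are invented.

```python
def summarize_matches(response, requested_id, class_name="Data_discovery"):
    """Count how many returned objects match the requested embedding_id exactly.

    Assumes the response is shaped like {"data": {"Get": {class_name: [...]}}}.
    """
    objects = response["data"]["Get"][class_name]
    returned = [obj["embedding_id"] for obj in objects]
    return {
        "total": len(returned),
        "exact_matches": returned.count(requested_id),
        "other_ids": sorted(set(returned) - {requested_id}),
    }


# Invented response standing in for the output of the filter query.
sample = {"data": {"Get": {"Data_discovery": [
    {"embedding_id": "doc-1"},
    {"embedding_id": "doc-1-copy"},
    {"embedding_id": "doc-2"},
]}}}

report = summarize_matches(sample, "doc-1")
print(report)
```

If other_ids is non-empty, the filter matched objects whose embedding_id is not identical to the input, which would be a different situation from the same ID being stored more than once.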
However, if I filter on the id of the object instead (the id is generated with generate_uuid5(embedding_id) during the data import phase), the query returns a single result:
from weaviate.util import generate_uuid5

INPUT_EMBEDDING_ID = "***"
# Same deterministic UUID that was assigned at import time
UUID = generate_uuid5(INPUT_EMBEDDING_ID)
query = f"""
{{
  Get {{
    Data_discovery(
      where: {{
        path: ["_id"]
        operator: Equal
        valueText: "{UUID}"
      }}
    ) {{
      embedding_id
      _additional {{ vector }}
    }}
  }}
}}
"""
response = weaviate_client.query.raw(query)
The data collection is the same as in the query above and contains ~17M embeddings.
Server Setup Information
- Weaviate Server Version: 1.24.3/helm chart 16.8.7
- Deployment Method: helm chart / K8s
- Multi Node? Number of Running Nodes: 2
- Client Language and Version: Python 3.10 / weaviate-client 4.5.1