Description
We have an index where searching with nearText via the GraphQL endpoint returns results with duplicate UUIDs (which according to the docs is not possible), but the distance / vector results are different. In theory this means a duplicate document was somehow indexed with the same UUID but different vector embeddings, but I can’t reproduce indexing a document with the same UUID locally (it fails, as documented).
For example, using the following GraphQL query directly (and client.query.get as well):
{
  Get {
    Prod_Slack_Am_4(nearText: {concepts: ["our query here"]}) {
      _additional {
        id
        vector
        distance
        lastUpdateTimeUnix
        creationTimeUnix
      }
    }
  }
}
We get these results back:
{
  "data": {
    "Get": {
      "Prod_Slack_Am_4": [
        {
          "_additional": {
            "distance": 0.22910476,
            "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
            "vector": [1, 2, 3], -- fake
            "creationTimeUnix": "1706179278587",
            "lastUpdateTimeUnix": "1709610511694"
          },
          ...data that doesn't matter
        },
        {
          "_additional": {
            "distance": 0.23009855,
            "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
            "vector": [4, 5, 6], -- fake
            "creationTimeUnix": "1706179278587",
            "lastUpdateTimeUnix": "1710128214984"
          },
          ...data that doesn't matter
        }
      ]
    }
  }
}
However, when using the REST API directly to fetch the document by UUID, only one result is returned (with the 2nd document data).
/v1/objects/Prod_Slack_Am_4/7aa1c132-fcee-42e4-9466-26151a845f4c?consistency_level=ALL&include=vector or client.data_object.get_by_id(…) (both methods tested)
{
  "class": "Prod_Slack_Am_4",
  "creationTimeUnix": 1706179278587,
  "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
  "lastUpdateTimeUnix": 1710128214984,
  "vector": [4, 5, 6], -- fake
  "properties": { ... }
}
The lastUpdateTimeUnix from the REST API matches the 2nd document in the GraphQL results, but the creationTimeUnix is exactly the same for both documents.
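For anyone who wants to run the same comparison, here is a minimal sketch of how the GraphQL hits can be grouped by UUID and checked against the REST object. The function names are hypothetical helpers, not part of the Weaviate client, and the code only assumes the response shapes shown above (note that GraphQL returns the timestamps as strings while REST returns integers).

```python
from collections import defaultdict


def group_hits_by_id(graphql_response, class_name):
    """Return {uuid: [_additional, ...]} for UUIDs that appear more than once."""
    hits = graphql_response["data"]["Get"][class_name]
    grouped = defaultdict(list)
    for hit in hits:
        extra = hit["_additional"]
        grouped[extra["id"]].append(extra)
    return {uuid: extras for uuid, extras in grouped.items() if len(extras) > 1}


def stale_hits(duplicates, rest_object):
    """Return duplicate hits whose lastUpdateTimeUnix differs from the REST object."""
    # GraphQL timestamps are strings, REST timestamps are ints: normalise both.
    current = int(rest_object["lastUpdateTimeUnix"])
    extras = duplicates.get(rest_object["id"], [])
    return [e for e in extras if int(e["lastUpdateTimeUnix"]) != current]
```

With the data above, the first hit (lastUpdateTimeUnix 1709610511694) would be reported as the stale one, since the REST object carries 1710128214984.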
When fetching all objects using:
response = client.data_object.get(class_name="Prod_Slack_Am_4", limit=10_000)
objects = response["objects"]
And counting the number of duplicate UUIDs, we get 0 out of the 1,287 items that are returned.
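The counting step was along these lines. This is a sketch rather than the exact script; `count_duplicate_ids` is a hypothetical helper that only assumes each returned object has a top-level `id` key, as in the REST response above.

```python
from collections import Counter


def count_duplicate_ids(objects):
    """Return {uuid: count} for every UUID that occurs more than once."""
    counts = Counter(obj["id"] for obj in objects)
    return {uuid: n for uuid, n in counts.items() if n > 1}
```

Calling `count_duplicate_ids(objects)` on the list fetched above gives an empty dict when every UUID is unique.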
Any thoughts here? I thought this was impossible. The only thing we can think of is that our cluster was recently upgraded around March 6/7, which is the day after the point where the lastUpdateTimeUnix values split on most of the duplicate documents, so maybe something strange happened during the upgrade, perhaps something to do with sharding (not sure).
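To line the timestamps up against the upgrade window, the millisecond values can be converted to UTC datetimes. A small sketch using the values from the results above; `ms_to_utc` is just an illustrative helper name.

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert a Unix timestamp in milliseconds (int or numeric string) to UTC."""
    return datetime.fromtimestamp(int(ms) / 1000, tz=timezone.utc)


# Timestamps taken from the duplicate GraphQL hits shown earlier.
for label, ms in [
    ("creationTimeUnix (both hits)", "1706179278587"),
    ("lastUpdateTimeUnix (1st hit)", "1709610511694"),
    ("lastUpdateTimeUnix (2nd hit)", "1710128214984"),
]:
    print(label, ms_to_utc(ms).isoformat())
```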
Server Setup Information
- Weaviate Server Version: 1.22.5
- Deployment Method: WCS
- Multi Node? Number of Running Nodes: 1
- Client Language and Version: 3.x, Python
Any additional Information
We created these documents using the Python client and batch, and then used the data_object.update method to make updates to certain fields using the UUID. Not sure if that helps debug this more.
At first I also thought maybe this was due to cluster/nodes and data consistency but we only have one node so there isn’t really a way for this to be an issue.
UPDATE:
What is even more strange is that I just updated the document with the UUID 7aa1c132-fcee-42e4-9466-26151a845f4c using the PATCH /objects/... method on an unrelated field (myField, for example), changing the value from “test2” back to the original value to “force” an internal re-index, and now the duplicate is gone from the GraphQL API. Sometimes the duplicates go away, but other times this just updates the 2nd (newer) document.
This makes me think the cluster upgrade did something strange.
UPDATE 2:
Filtering by UUID in the GraphQL endpoint also yields one result, so only when nearText is used do we get multiple results for the same UUID.
{
  Get {
    Prod_Slack_Am_4(where: {path: ["id"], operator: Equal, valueString: "7aa1c132-fcee-42e4-9466-26151a845f4c"}) {
      ...other fields
      _additional {
        id
        distance
        lastUpdateTimeUnix
        creationTimeUnix
      }
    }
  }
}
FINAL UPDATE:
I wrote a script to copy the schema exactly with a different name and copy each document over one-by-one as a test.
I tested the case where we copy the existing vectors over and a case where we don’t pass the vectors down so that Weaviate creates new ones (we use text2vec-openai), but all other properties are identical.
The new schema in both cases has no duplicates when searching via GraphQL for the exact same query as above. I’m kind of at a loss here about how this happened and what to do moving forward.
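Roughly, the copy step looked like the sketch below. It is not the exact script: it assumes the v3 Python client's `data_object.get` and `data_object.create`, assumes the target class was already created with the same schema, and reuses the original UUIDs (an assumption made here for illustration). `copy_objects` and its parameters are hypothetical names.

```python
def copy_objects(client, source_class, target_class, keep_vectors=True, limit=10_000):
    """Copy objects one by one from source_class into target_class.

    With keep_vectors=False no vector is passed, so the target class's
    vectorizer is left to generate new embeddings.
    """
    response = client.data_object.get(
        class_name=source_class,
        limit=limit,
        with_vector=keep_vectors,
    )
    copied = 0
    for obj in response["objects"]:
        client.data_object.create(
            data_object=obj["properties"],
            class_name=target_class,
            uuid=obj["id"],  # assumption: same UUIDs are reused in the copy
            vector=obj.get("vector") if keep_vectors else None,
        )
        copied += 1
    return copied
```

Running it once with `keep_vectors=True` and once with `keep_vectors=False` (into two differently named classes) covers the two cases described above.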
Thanks!