Duplicate UUIDs in GraphQL nearText with different vectors, one document in REST API


Description

We have an index where searching with nearText via the GraphQL endpoint returns results with duplicate UUIDs (which, according to the docs, should not be possible), but with different distance/vector values. In theory this means a document was somehow indexed twice with the same UUID but different vector embeddings, yet I can't reproduce indexing a document with a duplicate UUID locally (it fails, as documented).

For example, using the following GraphQL query directly (and client.query.get as well):

{
  Get {
    Prod_Slack_Am_4(nearText: {concepts: ["our query here"]}) {
      _additional {
        id
        vector
        distance
        lastUpdateTimeUnix
        creationTimeUnix
      }
    }
  }
}

We get these results back:

{
  "data": {
    "Get": {
      "Prod_Slack_Am_4": [
        {
          "_additional": {
            "distance": 0.22910476,
            "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
            "vector": [1, 2, 3] -- fake,
            "creationTimeUnix": "1706179278587",
            "lastUpdateTimeUnix": "1709610511694"
          },
         ...data that doesn't matter
        },
        {
          "_additional": {
            "distance": 0.23009855,
            "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
            "vector": [4, 5, 6] -- fake,
            "creationTimeUnix": "1706179278587",
           "lastUpdateTimeUnix": "1710128214984"
          },
         ...data that doesn't matter
        }      
      ]
    }
  }
}

However, when using the REST API directly to fetch the document by UUID, only one result is returned (with the 2nd document's data).

/v1/objects/Prod_Slack_Am_4/7aa1c132-fcee-42e4-9466-26151a845f4c?consistency_level=ALL&include=vector or client.data_object.get_by_id(…) (both methods tested)

{
    "class": "Prod_Slack_Am_4",
    "creationTimeUnix": 1706179278587,
    "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
    "lastUpdateTimeUnix": 1710128214984,
    "vector": [4, 5, 6], -- fake
    "properties": { ... }
}

The lastUpdateTimeUnix from the REST API matches the 2nd document in the GraphQL results, while the creationTimeUnix is identical for both documents.
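For reference, the REST lookup above can be sketched with a small helper that builds the /v1/objects URL (the base URL here is a placeholder, not our real cluster):

```python
def object_url(base_url, class_name, uuid,
               consistency_level="ALL", include="vector"):
    """Build the /v1/objects URL for fetching a single object by UUID."""
    return (
        f"{base_url}/v1/objects/{class_name}/{uuid}"
        f"?consistency_level={consistency_level}&include={include}"
    )

# Example (placeholder base URL):
print(object_url("https://example.weaviate.network",
                 "Prod_Slack_Am_4",
                 "7aa1c132-fcee-42e4-9466-26151a845f4c"))
```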

When fetching all objects using:

response = client.data_object.get(class_name="Prod_Slack_Am_4", limit=10_000)
objects = response["objects"]

And counting the number of duplicate UUIDs, we find 0 duplicates among the 1,287 items returned.
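The duplicate check was essentially the following (a minimal sketch; the UUIDs in the example are fabricated, and the objects are in the shape returned by client.data_object.get()):

```python
from collections import Counter

def find_duplicate_uuids(objects):
    """Return a mapping of UUID -> count for UUIDs appearing more than once."""
    counts = Counter(obj["id"] for obj in objects)
    return {uuid: n for uuid, n in counts.items() if n > 1}

# Fabricated objects in the shape of response["objects"]:
objects = [
    {"id": "7aa1c132-fcee-42e4-9466-26151a845f4c"},
    {"id": "00000000-0000-0000-0000-000000000001"},
]
print(find_duplicate_uuids(objects))  # {} -> no duplicates in the stored objects
```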

Any thoughts here? I thought this was impossible. The only thing we can think of is that our cluster was recently upgraded around March 6/7, which is the day after most of the lastUpdateTimeUnix values on the duplicate documents diverge, and that something strange happened during the upgrade, perhaps related to sharding (not sure).

Server Setup Information

  • Weaviate Server Version: 1.22.5
  • Deployment Method: WCS
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: 3.x, Python

Any additional Information

We created these documents using the Python client and batch and then used the data_object.update method to make updates to certain fields using the UUID. Not sure if that helps debug this more.

At first I also thought maybe this was due to cluster/nodes and data consistency but we only have one node so there isn’t really a way for this to be an issue.


UPDATE:
What is even stranger: I just updated the document with the UUID 7aa1c132-fcee-42e4-9466-26151a845f4c using the PATCH /objects/... method on an unrelated field (myField), for example changing the value from “test2” back to the original value to “force” an internal re-index, and now the duplicate is gone from the GraphQL API. Sometimes the duplicates go away, but other times this just updates the 2nd (newer) document.

This makes me think the cluster upgrade did something strange.

UPDATE 2:
Filtering by UUID in the GraphQL endpoint also yields one result so only when nearText is used do we get multiple results for the same UUID.

{
  Get {
    Prod_Slack_Am_4(where: {path: ["id"], operator: Equal, valueString: "7aa1c132-fcee-42e4-9466-26151a845f4c"}) {
      ...other fields
      _additional {
        id
        distance
        lastUpdateTimeUnix
        creationTimeUnix
      }
    }
  }
}

FINAL UPDATE:
I wrote a script to copy the schema exactly with a different name and copy each document over one-by-one as a test.

I tested the case where we copy the existing vectors over and a case where we don’t pass the vectors down so that Weaviate creates new ones (we use text2vec-openai) but all other properties are identical.

The new schema in both cases has no duplicates when searching via GraphQL for the exact same query as above. I’m kind of at a loss here about how this happened and what to do moving forward.
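Roughly, the per-object copy step looked like this (a hedged sketch of the mapping only; the actual script also handled batching and creating the target collection, and the example object below is fabricated):

```python
def as_insert_args(obj, copy_vector=True):
    """Map an object fetched via the REST API to the arguments used to
    re-create it in the new collection. With copy_vector=False the vector
    is omitted, so the vectorizer (text2vec-openai) generates a fresh one."""
    args = {
        "uuid": obj["id"],
        "data_object": obj["properties"],
    }
    if copy_vector and "vector" in obj:
        args["vector"] = obj["vector"]
    return args

# Fabricated example object:
src = {
    "id": "7aa1c132-fcee-42e4-9466-26151a845f4c",
    "properties": {"myField": "test2"},
    "vector": [4, 5, 6],
}
print(as_insert_args(src, copy_vector=False))
```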

Thanks!

Hi @emhagman !

Can you check if this is reproducible in latest Weaviate version and using the latest python code?

Also, can you consistently reproduce this?

Thanks!

Thanks for the response! I am working on reproducing it; it seems to potentially be related to auto schema and adding new vectorized fields to the schema, but it's hard to pin down.

We will be upgrading to 1.24.x very soon however this has happened on multiple prior versions of Weaviate for us, so in that sense it can be “consistently reproduced” but I don’t have a POC yet. I’ll keep you posted on the update separately.

Oh, thank you very much, @emhagman !

These hard-to-reproduce issues are really valuable to us. They usually only surface in very specific scenarios that may involve OOMs, restarts, etc.

A good first step is usually to reindex. Maybe something went wrong during the initial ingestion, so copying your data to a new cluster (or even a new collection on the same cluster) can often eliminate this kind of issue, as the expected outcome is restored.

Thanks!

Yeah, we did just that (new collection, same cluster) and the new index works just fine. We've had to do this a few times now, each time the index gets “corrupted”. We're on 1.24.x now and will keep an eye on things to see if it still occurs.

Hi @emhagman - is this on a multi-node cluster?

As far as I understand, a search in a multi-node, sharded cluster will search each shard & combine results. I wonder if there might be multiple copies of this object in different nodes somehow.

Wdyt @DudaNogueira ? Is that possible?

Thanks for the response @jphwang . That definitely makes sense in general but this issue was happening on a single node “cluster”. We were originally only on one node.

Just recently we upgraded to 3 but that was after we’ve already been dealing with this issue for a bit. I checked to see if somehow it had multiple shards for that one index and it did not (besides the default 128 virtual ones).

I haven’t taken down the single node setup yet so I can still debug things to try and reproduce it. If there are any calls I can make to the schema or debug info I can try to get to help out, let me know.