[Non deterministic vector search return]

Hi Team, we observed that vector similarity search returns different results for the same query across attempts. Is this expected behavior in Weaviate?

Below is a sample query we use. The embedding_id should be unique, since it was used to create the UUID when importing the data (generate_uuid5(embedding_id)).
The collection has ~17M embeddings; out of 1000 results returned in two attempts, ~10-15 embedding_ids differ.

query = f"""
{{
  Get {{
    Data_discovery_test (
      limit: 1000
      nearVector: {{
        vector: [{query_vector}]
      }}
    ) {{
      embedding_id
      _additional {{
        distance
      }}
    }}
  }}
}}
"""

response = weaviate_client.query.raw(query)
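
To put a number on the difference between two attempts, a small helper along these lines can be used. This is only a sketch: it assumes the raw response is a dict shaped like {"data": {"Get": {<collection>: [...]}}}, and the function names embedding_ids and diff_runs are made up for illustration.

```python
from typing import Any, Dict, Set


def embedding_ids(response: Dict[str, Any], collection: str) -> Set[str]:
    """Collect the embedding_id values from a raw GraphQL Get response."""
    data = response.get("data") or {}
    objects = (data.get("Get") or {}).get(collection) or []
    return {obj["embedding_id"] for obj in objects}


def diff_runs(
    first: Dict[str, Any], second: Dict[str, Any], collection: str
) -> Dict[str, Set[str]]:
    """Return the IDs that appear in only one of two runs of the same query."""
    a = embedding_ids(first, collection)
    b = embedding_ids(second, collection)
    return {"only_first": a - b, "only_second": b - a}
```

Calling diff_runs(response_a, response_b, "Data_discovery_test") on two responses to the same query returns the IDs unique to each run, which makes it easier to check whether the differing objects sit near the distance cutoff of the 1000th result.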

Note: the collection has a replication factor of 2.
Here is the corresponding replicationConfig/shardingConfig:

   'replicationConfig': {'factor': 2},
   'shardingConfig': {'virtualPerPhysical': 128,
    'desiredCount': 2,
    'actualCount': 2,
    'desiredVirtualCount': 256,
    'actualVirtualCount': 256,
    'key': '_id',
    'strategy': 'hash',
    'function': 'murmur3'},

Another potential query issue I found (possibly related to the one above) is that a conditional filter on a unique ID returns multiple results, e.g.:

INPUT_EMBEDDING_ID = "***"
# f-string so the ID is interpolated; literal GraphQL braces are doubled
query = f"""
{{
  Get {{
    Data_discovery(
      where: {{
        path: ["embedding_id"]
        operator: Equal
        valueText: "{INPUT_EMBEDDING_ID}"
      }}
    ) {{
      embedding_id
      _additional {{ vector }}
    }}
  }}
}}
"""

response = weaviate_client.query.raw(query)

The response contains 5 different embeddings / embedding IDs for a single input embedding_id.

But if I filter on the collection's id (generated with generate_uuid5(embedding_id) during data import), the result is unique:
from weaviate.util import generate_uuid5

INPUT_EMBEDDING_ID = "***"
UUID = generate_uuid5(INPUT_EMBEDDING_ID)
# f-string so the UUID is interpolated; literal GraphQL braces are doubled
query = f"""
{{
  Get {{
    Data_discovery(
      where: {{
        path: ["_id"]
        operator: Equal
        valueText: "{UUID}"
      }}
    ) {{
      embedding_id
      _additional {{ vector }}
    }}
  }}
}}
"""
response = weaviate_client.query.raw(query)

The collection is the same one as in the query above and contains ~17M embeddings.

Server Setup Information

  • Weaviate Server Version: 1.24.3/helm chart 16.8.7
  • Deployment Method: helm chart / K8s
  • Multi Node? Number of Running Nodes: 2
  • Client Language and Version: Python 3.10 / weaviate-client 4.5.1

Hello @DudaNogueira , hope this can get some attention from you.

Hi @fairymane! Sorry for the delay here :frowning:

I was out those days.

This seems to be a case where multiple objects share the same embedding_id property value, INPUT_EMBEDDING_ID.

You can check this by querying something like:

{Get {
          Data_discovery(
            where: {
              path: ["embedding_id"] 
              operator: Equal 
              valueText: {INPUT_EMBEDDING_ID}
              } 
            )
            {
              embedding_id 
              _additional {vector id}
            }
          }
        }

Note that I added the id to the fields returned by that query.

You should get N objects, each with a different id.

Let me know if this helps :slight_smile:

Hi @DudaNogueira, no problem, I hope you had a good time off!

I asked two questions here. The first is about the non-deterministic results for the same query: do you have any idea why this happens? Is it expected or unexpected behavior? (The Weaviate cluster has 3 nodes, and the collection's replication factor is 2.)

For the second question, following your suggestion, here is the data returned:

INPUT_EMBEDDING_ID = "2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1_top_1_width_938_height_618"

query = f"""
{{
  Get {{
    Data_discovery (
      where: {{
        path: ["embedding_id"],
        operator: Equal,
        valueText: "{INPUT_EMBEDDING_ID}"
      }}
    ) {{
      embedding_id
      _additional {{
        id
      }}
    }}
  }}
}}
"""
response = weaviate_client.query.raw(query)

Here is what it returns:

[{'_additional': {'id': '26624310-b690-57fc-a8f3-94d791333105'}, 'embedding_id': '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1877_top_1_width_938_height_618'},
{'_additional': {'id': '3d27d492-17fe-555e-b32d-a27c663b8f21'}, 'embedding_id': '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1_top_1_width_938_height_618'},
{'_additional': {'id': '4cef39ee-ca26-5764-abcb-e9ae6f193b0e'}, 'embedding_id': '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1_top_619_width_938_height_618'},
{'_additional': {'id': '58c2b127-d6a2-5369-a575-e7fc2b7194ed'}, 'embedding_id': '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1_top_1237_width_938_height_618'},
{'_additional': {'id': 'a9cd538c-922f-5529-9cfb-99ecdda84090'}, 'embedding_id': '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_939_top_1_width_938_height_618'}]

You can see 5 different embedding_ids above. Only the second item actually matches INPUT_EMBEDDING_ID; the rest are different embedding_ids, although they all share the prefix '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__*'.

The collection was imported with explicitly generated uuid:

...
            batch.add_data_object(
                data_object=pilotImage.get_properties(),
                class_name=PilotImage.name(collection_name),
                vector=pilotImage.embeddings,
                uuid=generate_uuid5(pilotImage.embedding_id),
            )
...

So in my case, if I filter on the collection's id (generated with generate_uuid5(embedding_id) during data import), the result is unique:

GEN_UUID = generate_uuid5(INPUT_EMBEDDING_ID)
print(GEN_UUID)

3d27d492-17fe-555e-b32d-a27c663b8f21

query_2 = f"""
        {{
            Get {{
                Data_discovery (
                     where: {{
                        path: ["_id"],
                        operator: Equal,
                        valueText: "{GEN_UUID}"
                      }}
      
                ) {{
                    embedding_id
                    _additional 
                    {{
                      id
                    }}
                }}
            }}
        }}
        """
response_2 = weaviate_client.query.raw(query_2)
print(response_2)

[{'_additional': {'id': '3d27d492-17fe-555e-b32d-a27c663b8f21'}, 'embedding_id': '2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1_top_1_width_938_height_618'}]

So it seems the inverted index on the embedding_id doesn’t perform the filtering properly.

When doing a filter, you need to be aware of the tokenization:

With the default word tokenization, 2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__left_1_top_1_width_938_height_618 is split into many small tokens, so an Equal filter can also match other values that contain the same tokens.
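
To illustrate why the five objects above came back, here is a rough sketch. It only approximates word tokenization (lowercase, split on any non-alphanumeric character) and assumes the Equal filter matches when every query token occurs in the stored value; it is not Weaviate's actual implementation. The helper names and the last, non-matching ID are hypothetical.

```python
import re
from typing import Set


def word_tokens(text: str) -> Set[str]:
    """Approximate `word` tokenization: lowercase, split on non-alphanumerics."""
    return {t for t in re.split(r"[^0-9a-z]+", text.lower()) if t}


def equal_matches(query: str, stored: str) -> bool:
    """Assumed filter semantics: every query token must occur in the stored value."""
    return word_tokens(query) <= word_tokens(stored)


PREFIX = "2024.02.09.00.38.40_KMHKC81EFR1000069__CAM_M_L1__1707443135521304__"
QUERY = PREFIX + "left_1_top_1_width_938_height_618"

# The five embedding_ids returned in the thread.
RETURNED = [
    PREFIX + "left_1877_top_1_width_938_height_618",
    PREFIX + "left_1_top_1_width_938_height_618",
    PREFIX + "left_1_top_619_width_938_height_618",
    PREFIX + "left_1_top_1237_width_938_height_618",
    PREFIX + "left_939_top_1_width_938_height_618",
]

# Hypothetical ID that lacks the standalone token "1", so it should not match.
NOT_RETURNED = PREFIX + "left_939_top_619_width_938_height_618"

matches = [s for s in RETURNED if equal_matches(QUERY, s)]
exact = [s for s in RETURNED if s == QUERY]  # what field tokenization would give
```

Under these assumptions all five returned IDs contain every token of the query (each has a standalone "1" somewhere), while an exact comparison keeps only the one intended object.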

If you don’t want that, you need to set the tokenization for that property to field.
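
As a sketch of what that could look like with the v3-style schema dict used elsewhere in this thread: only the embedding_id property is shown, and the rest of the class definition (other properties, replicationConfig, and so on) is left out because it is not known here. As far as I know, tokenization cannot be changed on an existing property, so this would likely mean re-creating the collection and re-importing the data.

```python
# Assumed, partial class definition: only the property relevant to the filter.
class_obj = {
    "class": "Data_discovery",
    "properties": [
        {
            "name": "embedding_id",
            "dataType": ["text"],
            # Index the whole value as a single token so Equal is an exact match.
            "tokenization": "field",
        }
    ],
}

# With a connected v3-style client this would be created via:
# weaviate_client.schema.create_class(class_obj)
```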

For the second question, when you search

where: {{
                        path: ["_id"],
                        operator: Equal,
                        valueText: "3d27d492-17fe-555e-b32d-a27c663b8f21"
                      }}

you should get only one result, with the id 3d27d492-17fe-555e-b32d-a27c663b8f21.

Are you getting more results? Sorry, I am a bit confused here :slight_smile:

It always helps to post a full working example, preferably a Python notebook, so I can easily follow along.

Thanks!