Text Array Filtering (Array intersection) - Exact element match issue

I am experiencing issues with array (think permissions-style) filtering in my hybrid search. I suspect the problem lies in my data type definition, tokenization settings, or the filter query itself. Here’s my scenario:

In my collection, I have a property that lists the roles with access to each data row.

Property: allowrole = ["Admin", "User Admin", "Reader", "Editor"]

Property(name="allowrole", data_type=DataType.TEXT_ARRAY, skip_vectorization=True, vectorize_property_name=False, tokenization=Tokenization.FIELD),

I have defined this property as DataType.TEXT_ARRAY and tried different tokenization settings (FIELD, LOWERCASE, none, etc).

Goal: Filter results where at least one role in the query array exactly matches one of the roles in allowrole.

Example Data:

  1. {name: "TestA", allowrole: ["Reader"]}
  2. {name: "TestB", allowrole: ["Admin"]}
  3. {name: "TestC", allowrole: ["User Admin"]}
  4. {name: "TestD", allowrole: ["Admin", "Editor"]}

Example query array (input):
["Admin", "Reader"]

Issue: When using ContainsAny filtering, a query with ["Admin", "Reader"] returns all records, including partial matches like “Admin” in “User Admin”. The expected result should be Data 1, 2, and 4 only.
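To pin down the expected behavior independently of Weaviate, here is a minimal plain-Python sketch of the intended semantics: a row matches when at least one query role equals a whole element of allowrole. The helper name matches_any_exact is hypothetical and only serves the illustration.

```python
# Example rows from the question.
rows = [
    {"name": "TestA", "allowrole": ["Reader"]},
    {"name": "TestB", "allowrole": ["Admin"]},
    {"name": "TestC", "allowrole": ["User Admin"]},
    {"name": "TestD", "allowrole": ["Admin", "Editor"]},
]


def matches_any_exact(allowrole, query_roles):
    """Return True if at least one query role equals a whole element of allowrole."""
    # Set intersection compares complete strings, so "Admin" never matches "User Admin".
    return not set(allowrole).isdisjoint(query_roles)


expected = [r["name"] for r in rows if matches_any_exact(r["allowrole"], ["Admin", "Reader"])]
print(expected)  # ['TestA', 'TestB', 'TestD']
```

TestC is excluded because "User Admin" is compared as one value, not as two separate words.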

Using an OR filter on allowrole also results in partial matches, which is not desired.

How can I achieve exact matching for array filtering like this?


Hi @jasonrwright !!

Welcome to our community :hugs:

I was not able to reproduce this.

Here is my code:

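import weaviate.classes as wvc

# Assumes `client` is an already-connected Weaviate client.
# Delete any earlier "Test" collection so a stale schema cannot skew the results.
client.collections.delete("Test")
collection = client.collections.create(
    "Test",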
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    properties=[
        wvc.config.Property(name="name", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(
            name="allowrole",
            data_type=wvc.config.DataType.TEXT_ARRAY,
            skip_vectorization=True,
            vectorize_property_name=False,
            tokenization=wvc.config.Tokenization.FIELD,
        ),
    ],
)

collection.data.insert({"name": "TestA", "allowrole": ["Reader"]})
collection.data.insert({"name": "TestB", "allowrole": ["Admin"]})
collection.data.insert({"name": "TestC", "allowrole": ["User Admin"]})
collection.data.insert({"name": "TestD", "allowrole": ["Admin", "Editor"]})

Now I query as you described:

results = collection.query.fetch_objects(
    #filters=wvc.query.Filter.by_property("allowrole").contains_any(["Reader"])
    #filters=wvc.query.Filter.by_property("allowrole").contains_any(["Admin"])
    filters=wvc.query.Filter.by_property("allowrole").contains_any(["Admin", "Reader"])
    #filters=wvc.query.Filter.by_property("allowrole").contains_any(["User Admin", "Reader"])
)
for r in results.objects:
    print(r.properties)

This is the output:

{'allowrole': ['Reader'], 'name': 'TestA'}
{'allowrole': ['Admin'], 'name': 'TestB'}
{'allowrole': ['Admin', 'Editor'], 'name': 'TestD'}

Please, let me know if this helps.

Thanks!

The great news is that it does indeed work, as your example shows! The “bad” news is that my refactoring, which led me to believe it did not, was flawed: I was still looking at a previous collection, created before I changed the schema type and tokenization settings and re-imported the data.

Thanks for the response, it was a great help. For future searchers on this topic: defining the property with data_type=DataType.TEXT_ARRAY and tokenization=Tokenization.FIELD, then filtering with contains_any, does the trick and appears to be quite efficient.
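For anyone wondering why the stale collection produced partial matches, here is a rough plain-Python model of the difference, assuming the old property used word-style tokenization (lowercase, split on non-alphanumeric characters) while the new one uses field tokenization (the whole trimmed value is a single token). This is a simplified sketch of the idea, not Weaviate's actual implementation, and the function names are hypothetical.

```python
import re


def tokenize(value, mode):
    """Simplified model of two tokenization modes (an assumption, not Weaviate code)."""
    if mode == "field":
        # The whole value, trimmed, is indexed as one token.
        return [value.strip()]
    if mode == "word":
        # Lowercase, then split on anything that is not a letter or digit.
        return [t for t in re.split(r"[^0-9a-zA-Z]+", value.lower()) if t]
    raise ValueError(f"unknown mode: {mode}")


def contains_any(allowrole, query_roles, mode):
    """True if any token of any query role is among the indexed tokens."""
    indexed = {tok for role in allowrole for tok in tokenize(role, mode)}
    wanted = {tok for role in query_roles for tok in tokenize(role, mode)}
    return bool(indexed & wanted)


print(contains_any(["User Admin"], ["Admin", "Reader"], "word"))   # True
print(contains_any(["User Admin"], ["Admin", "Reader"], "field"))  # False
```

Under the word-style model, "User Admin" is indexed as "user" and "admin", so a query for "Admin" hits it. Under the field model, only the full string "User Admin" is indexed, which gives the exact element match.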


Ohhhh! This happened to me already. More than once :see_no_evil:

When doing some experiments, I always delete the collection first. I have reused a Test collection too many times and faced this exact issue!
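A small helper along these lines makes the delete-first habit hard to forget. It is only a sketch: recreate_collection is a hypothetical name, and it assumes a client that exposes client.collections.delete and client.collections.create as used in the code earlier in this thread.

```python
def recreate_collection(client, name, **create_kwargs):
    """Delete the collection called `name`, then create it again from scratch."""
    # Remove the old collection first so no stale schema or data survives.
    client.collections.delete(name)
    # Create it again; keyword arguments are passed through unchanged.
    return client.collections.create(name, **create_kwargs)
```

Calling something like recreate_collection(client, "Test", properties=[...]) at the top of each experiment guarantees that the schema you query is the one you just defined. The data has to be re-imported afterwards.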

Glad it all worked out!

Whenever needed, we are here to help!

Thanks!