How do the ContainsAny and ContainsAll filters work?

If you have a list of phrases such as [“Dorthraki language”, “Ktulhu Monster lives in”] and you search text chunks with the ContainsAny operator, is it supposed to match a text chunk only if the full phrase appears in the text, or also if the text contains a partial match for a string in the list? (For example, if the text chunk doesn’t contain “Dorthraki language” but does contain the word “Dorthraki”, is it supposed to match?)

Thanks!

Hi @A_S! Welcome back :slight_smile:

The behavior depends on the tokenization configured for that specific property.

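By default, the tokenization is `word`. This means that the values in your query are split into the word tokens `dorthraki`, `language`, `ktulhu`, `monster` and `lives` (`in` is a stop word), and the filter should match on any of those tokens. So a text chunk that contains only the word “Dorthraki” should match as well.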
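To make the token split concrete, here is a minimal pure-Python sketch of how `word` tokenization could break up the two query values. It is a simplified model, not Weaviate's actual implementation: the `tokenize_word` helper and the short `STOP_WORDS` set are hypothetical and only there for illustration.

```python
import re

# Assumption: a tiny illustrative subset of an English stop-word list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize_word(text: str) -> list[str]:
    """Simplified `word` tokenization: lowercase, split on
    non-alphanumeric characters, then drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


query_values = ["Dorthraki language", "Ktulhu Monster lives in"]

# Pool the tokens of every value in the list.
query_tokens = [tok for value in query_values for tok in tokenize_word(value)]
print(query_tokens)
# ['dorthraki', 'language', 'ktulhu', 'monster', 'lives']
```

Under this model, the phrases themselves are never compared as a whole; only the individual tokens are.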

With that said, consider this code:

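import weaviate
import weaviate.classes as wvc

# Assumption: a local Weaviate instance; adjust the connection to your setup.
client = weaviate.connect_to_local()

# Start from a clean collection with one property per tokenization.
client.collections.delete("Test")
collection = client.collections.create(
    "Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    properties=[
        # word: the value is split into lowercased word tokens.
        wvc.config.Property(
            name="text_word", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.WORD,
        ),
        # field: the whole value is indexed as a single token.
        wvc.config.Property(
            name="text_field", data_type=wvc.config.DataType.TEXT, tokenization=wvc.config.Tokenization.FIELD,
        ),
    ],
)
collection.data.insert({"text_word": "Dorthraki language here", "text_field": "Dorthraki language"})
collection.data.insert({"text_word": "Ktulhu language Dorthraki", "text_field": "Ktulhu language Dorthraki"})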

Now, when I run a ContainsAny filter on text_field, which uses field tokenization, I find only one result, because the entire value has to match:

collection.aggregate.over_all(
    filters=wvc.query.Filter.by_property("text_field").contains_any(["Dorthraki language"])
)

AggregateReturn(properties={}, total_count=1)

If I run the same query on the text_word property instead, it finds both objects, since each of them contains the tokens dorthraki and language:

collection.aggregate.over_all(
    filters=wvc.query.Filter.by_property("text_word").contains_any(["Dorthraki language"])
)

AggregateReturn(properties={}, total_count=2)
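The two counts above can be reproduced without a running instance using a small pure-Python simulation, which also gives a rough idea of how ContainsAll would differ. This is only a sketch under stated assumptions: the helpers `word_tokens`, `field_tokens`, `contains_any`, `contains_all` and `count` are hypothetical, stop words are ignored, and I am assuming that the filter values are tokenized the same way as the property and that their tokens are pooled.

```python
import re


def word_tokens(text: str) -> set[str]:
    # Simplified `word` tokenization: lowercase, split on non-alphanumerics.
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def field_tokens(text: str) -> set[str]:
    # Simplified `field` tokenization: the whole trimmed value is one token.
    return {text.strip()}


def contains_any(stored: str, values: list[str], tokenize) -> bool:
    # Assumption: filter values are tokenized like the property and pooled.
    wanted = set().union(*(tokenize(v) for v in values))
    return bool(wanted & tokenize(stored))


def contains_all(stored: str, values: list[str], tokenize) -> bool:
    # Every pooled query token must be present in the stored value.
    wanted = set().union(*(tokenize(v) for v in values))
    return wanted <= tokenize(stored)


objects = [
    {"text_word": "Dorthraki language here", "text_field": "Dorthraki language"},
    {"text_word": "Ktulhu language Dorthraki", "text_field": "Ktulhu language Dorthraki"},
]


def count(prop: str, op, values: list[str], tokenize) -> int:
    # Number of objects whose property passes the filter.
    return sum(op(obj[prop], values, tokenize) for obj in objects)


field_any = count("text_field", contains_any, ["Dorthraki language"], field_tokens)
word_any = count("text_word", contains_any, ["Dorthraki language"], word_tokens)
word_all = count("text_word", contains_all, ["Dorthraki", "here"], word_tokens)
print(field_any, word_any, word_all)  # 1 2 1
```

In this model, ContainsAll with ["Dorthraki", "here"] on text_word matches only the first object, because the second one has no `here` token. Please verify the ContainsAll behavior against your own instance before relying on it.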

We have some extensive material on tokenization that is worth checking out.

Let me know if this helps :slight_smile:


Thank you! That explained a lot!
