Not Equal Filter with Word Tokenization with non-alphanumeric characters

Description

Is it possible to use the not-equal filter on properties with word tokenization when the values contain non-alphanumeric characters? Unfortunately, I have data like test.com/2 stored with word tokenization, which wasn’t intended.

For example, I’d want to exclude test.com/2 using a not-equal filter like the one below:

Filter.by_property("my_property").not_equal("test.com/2")

Does any workaround exist, such as querying against “test com 2”, updating the tokenization, or searching with a different tokenization? My main issue is that adding a new field, or replacing the property with one that has the proper tokenization, is a lengthy process in a production system.

Server Setup Information

  • Weaviate Server Version: 1.24.0
  • Deployment Method: Docker
  • Multi Node? Number of Running Nodes: No.
  • Client Language and Version: 4.6.5
  • Multitenancy?: Yes

Hi @dhanshew72!

Because the property has its tokenization set to word, the value test.com/2 is split on the non-alphanumeric characters and indexed as the three tokens test, com and 2.

This proves the point:
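To make the splitting concrete, here is a rough pure-Python approximation of the two tokenization modes involved. It is only a sketch: word_tokenize and field_tokenize are hypothetical helper names, not part of the Weaviate client, and the regex only handles ASCII letters and digits.

```python
import re


def word_tokenize(value: str) -> list[str]:
    """Approximate `word` tokenization: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.split(r"[^0-9A-Za-z]+", value) if tok]


def field_tokenize(value: str) -> list[str]:
    """Approximate `field` tokenization: the whole trimmed value is one token."""
    return [value.strip()]


print(word_tokenize("test.com/2"))   # ['test', 'com', '2']
print(field_tokenize("test.com/2"))  # ['test.com/2']
```

With word tokenization the original string no longer exists as a single token in the index, which is why filters end up matching test, com and 2 individually.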

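import weaviate
import weaviate.classes as wvc

# Assumes a local Weaviate instance on the default ports
client = weaviate.connect_to_local()

# Start from a clean collection; "text" is auto-created with the default (word) tokenization
client.collections.delete("Test")
collection = client.collections.create(
    name="Test",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
)

collection.data.insert_many([
    {"text": "test.com/2"},
    {"text": "test.com/3"},
    {"text": "test.com/4"},
])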

Now we query for each token separately:

results = collection.query.fetch_objects(
    filters=(
        wvc.query.Filter.by_property("text").equal("test") & 
        wvc.query.Filter.by_property("text").equal("com") & 
        wvc.query.Filter.by_property("text").equal("2")
    )
)
for i in results.objects:
    print("###")
    print(i.properties)

Results:

###
{'text': 'test.com/2'}

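Since you want to exclude exactly those objects, not_equal on a word-tokenized property will not help you, because the filter matches the individual tokens rather than the full string.

So you can try adding a new property with field tokenization, which indexes the whole value as a single token, and then copying the content into that property so you can filter on the exact value.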
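A minimal sketch of that backfill step, assuming the v4 Python client. The helper backfill_exact_property and the property name text_exact are made up for illustration; the helper only relies on collection.iterator() and collection.data.update(), so it can be tried against a stand-in object before touching production.

```python
from typing import Any


def backfill_exact_property(collection: Any, source: str, target: str) -> int:
    """Copy each object's `source` value into `target`; return how many were updated."""
    updated = 0
    for obj in collection.iterator():
        value = obj.properties.get(source)
        if value is None:
            continue  # nothing to copy for this object
        collection.data.update(uuid=obj.uuid, properties={target: value})
        updated += 1
    return updated


# Usage against a live server (not run here), assuming weaviate-client v4:
#   from weaviate.classes.config import Property, DataType, Tokenization
#   from weaviate.classes.query import Filter
#
#   collection.config.add_property(
#       Property(name="text_exact", data_type=DataType.TEXT,
#                tokenization=Tokenization.FIELD)
#   )
#   backfill_exact_property(collection, "text", "text_exact")
#   collection.query.fetch_objects(
#       filters=Filter.by_property("text_exact").not_equal("test.com/2")
#   )
```

With multi-tenancy enabled, the collection handle would presumably need to be scoped to each tenant first (for example via collection.with_tenant(...)) and the backfill repeated per tenant.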

Let me know if this helps 🙂


Interesting, I’ll make note of that. Thank you.
