Fuzzy matching algorithms

In my application I need to apply fuzzy search on a text field to find matches even if the string was mistyped. Imagine an application that needs to filter objects on a “surname” field: especially with foreign names, an exact match may not work. Cosine similarity on embeddings would not help in this case either.

I currently use the Python library RapidFuzz (RapidFuzz · PyPI), which is very good.

Is there any mechanism to apply an external function such as this to a text field in a collection, without having to iterate through the whole collection in application code?

Thanks

I don’t think we have anything directly like this, but the following could work:

  • use named vectors
  • define a named vector with the vectorizer text2vec-bigram that takes your surname field as input. This is more of a test vectorizer and is not documented, but I think it could work for your use case. You can check the code here: weaviate/modules/text2vec-bigram/bigram.go at main · weaviate/weaviate · GitHub
  • Those vectors should be very close when there is just a typo in the name and most of it is the same.
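
To illustrate why bigram-based vectors stay close when a name contains only a typo, here is a minimal sketch in plain Python. It is not the code of the text2vec-bigram module, and it does not use the Weaviate client; it only assumes the general idea of counting overlapping character bigrams and comparing the counts with cosine similarity. The function names (`bigram_vector`, `cosine_similarity`, `rank_surnames`) and the sample surnames are hypothetical.

```python
import math
from collections import Counter


def bigram_vector(text: str) -> Counter:
    """Count overlapping character bigrams of a lowercased, space-padded string."""
    # Padding with one space on each side lets the first and last letters
    # contribute their own bigrams (an assumption of this sketch).
    padded = f" {text.lower().strip()} "
    return Counter(padded[i:i + 2] for i in range(len(padded) - 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bigram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_surnames(query: str, surnames: list[str]) -> list[tuple[str, float]]:
    """Return (surname, similarity) pairs sorted from most to least similar."""
    query_vec = bigram_vector(query)
    scored = [(name, cosine_similarity(query_vec, bigram_vector(name)))
              for name in surnames]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical data: the query contains a one-letter typo.
    for name, score in rank_surnames("Kowalsky", ["Nguyen", "Kowalski", "Schmidt"]):
        print(f"{name}: {score:.3f}")
```

In this sketch a single wrong letter only changes the two bigrams that contain it, so most of the vector is unchanged and the similarity stays high, while unrelated names share few or no bigrams.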

Note that this module is probably not enabled in Weaviate Cloud.

Very interesting Dirk, thanks.

I will definitely experiment.

But my example was actually very simplistic. The real task is quite tough: searching through a long list of wordplays.

These wordplays are obtained through techniques such as splitting or fusing a legal word, making deliberate typos, etc. The various algorithms in the RapidFuzz library therefore give me more latitude.

Sadly, I don’t have a better idea.
The vectorizer I mentioned has a few options, so maybe play around and see if any of them works. Please let me know if something comes out of this!
