BM25 Search on Non-English Languages

Let’s say the language is Portuguese or French, and I want to get rid of accents and special characters. The usual approach is to normalise the text before indexing and querying, e.g.:
ã → a
ê → e
ç → c
etc.
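
For illustration, this kind of accent stripping is usually done with Unicode decomposition rather than a hand-written character map. Here is a minimal Python sketch, not tied to any particular search library:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters (NFD) and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("ação êxito ça"))  # -> "acao exito ca"
```

NFD decomposition handles ã/ê/ç and most other Latin diacritics in one step.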

The easiest way would be for each entry in the DB to have both a normalized version and the original version; I would then query against the normalized field and retrieve the original. But this is not memory-efficient (it basically duplicates all my fields).

I see that you have tokenization. Would it make sense for normalization to be part of tokenization? If so, should I develop a tokenizer for Portuguese/French? Could you point me to the code of the current tokenizers so that I can try to adapt them?

Hi @jpiabrantes!!

This is a really interesting question and a potential feature request.

We could probably add an option while tokenizing words so that each word is indexed in both forms: with and without accents.
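
As a rough illustration of that idea (a hypothetical sketch, not the library’s actual tokenizer code), the tokenizer could emit both the original and the accent-stripped form of each token, so either spelling matches at query time:

```python
import unicodedata

def strip_accents(token: str) -> str:
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize_with_accent_variants(text: str) -> list[str]:
    # Naive whitespace tokenizer; a real tokenizer would also handle
    # punctuation, stemming, etc.
    tokens = []
    for token in text.lower().split():
        tokens.append(token)
        stripped = strip_accents(token)
        if stripped != token:
            tokens.append(stripped)  # index both "código" and "codigo"
    return tokens

print(tokenize_with_accent_variants("Código Penal"))
# -> ['código', 'codigo', 'penal']
```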

I will bring this to our team and get back to you here.

Thank you!

Another idea for custom tokenizers would be to let users provide a map of synonyms.

For example, in the Law domain we want a bunch of abbreviations to match the same tokens as their full forms (see the sketch after this list):

cp → Código Penal
cpp → Código de Processo Penal
cpc → Código de Processo Civil
cc → Código Civil

etc.
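
A rough sketch of what such a user-supplied map could look like (hypothetical code; the `SYNONYMS` dict and function name are just placeholders): expansion could happen at tokenization time so the abbreviation and its full form end up under the same index terms.

```python
# User-supplied abbreviation map (Law-domain example from above).
SYNONYMS = {
    "cp": "código penal",
    "cpp": "código de processo penal",
    "cpc": "código de processo civil",
    "cc": "código civil",
}

def expand_synonyms(text: str) -> list[str]:
    # Naive whitespace tokenizer; replaces an abbreviation with the tokens
    # of its expansion so "CP" and "Código Penal" share the same index terms.
    tokens = []
    for token in text.lower().split():
        expansion = SYNONYMS.get(token, token)
        tokens.extend(expansion.split())
    return tokens

print(expand_synonyms("artigo 121 do CP"))
# -> ['artigo', '121', 'do', 'código', 'penal']
```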


Any news on this? Did you end up creating a GitHub issue?

Hi João! Not yet, but this is in our internal backlog already.

This will be part of some other planned changes to that code base, so on top of accent normalization it will also support mapping.

And for now, we do not have an ETA :frowning: