BM25 Search on Non-English Languages

Let’s say the language is Portuguese or French, and I want to get rid of accents and special characters. The usual approach is to normalise the text before indexing and querying, e.g.:
ã → a
ê → e
ç → c
etc.
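
For illustration, this kind of accent stripping is usually done with Unicode decomposition rather than a hand-written character map. Here is a minimal Python sketch, not tied to any particular search library:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove diacritics by decomposing characters (NFD) and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("ação êxito ça"))  # -> "acao exito ca"
```

NFD decomposition handles ã/ê/ç and most other Latin diacritics in one step.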

The easiest way would be for each entry in the DB to have both a normalized version and the original version; I would then query against the normalized field and retrieve the original. But this is not memory-efficient (it basically duplicates all my fields).

I see that you have tokenization. Would it make sense for normalization to be part of tokenization? If so, should I develop a tokenizer for Portuguese/French? Could you point me to the code of the current tokenizers so that I can try to adapt them?

Hi @jpiabrantes!!

This is a really interesting question and a potential feature request.

We could probably add an option while tokenizing words so that each word is indexed in both forms: with and without accents.
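
As a rough illustration of that idea (a hypothetical sketch, not the library’s actual tokenizer code), the tokenizer could emit both the original and the accent-stripped form of each token, so either spelling matches at query time:

```python
import unicodedata

def strip_accents(token: str) -> str:
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize_with_accent_variants(text: str) -> list[str]:
    # Naive whitespace tokenizer; a real tokenizer would also handle
    # punctuation, stemming, etc.
    tokens = []
    for token in text.lower().split():
        tokens.append(token)
        stripped = strip_accents(token)
        if stripped != token:
            tokens.append(stripped)  # index both "código" and "codigo"
    return tokens

print(tokenize_with_accent_variants("Código Penal"))
# -> ['código', 'codigo', 'penal']
```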

I will bring this to our team and get back to you here.

Thank you!

Another idea for custom tokenizers would be to let users provide a map of synonyms.

For example, in the Law domain we want a bunch of abbreviations to match the same tokens as their full forms (see the sketch after this list):

cp → Código Penal
cpp → Código de Processo Penal
cpc → Código de Processo Civil
cc → Código Civil

etc.
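
A rough sketch of what such a user-supplied map could look like (hypothetical code; the `SYNONYMS` dict and function name are just placeholders): expansion could happen at tokenization time so the abbreviation and its full form end up under the same index terms.

```python
# User-supplied abbreviation map (Law-domain example from above).
SYNONYMS = {
    "cp": "código penal",
    "cpp": "código de processo penal",
    "cpc": "código de processo civil",
    "cc": "código civil",
}

def expand_synonyms(text: str) -> list[str]:
    # Naive whitespace tokenizer; replaces an abbreviation with the tokens
    # of its expansion so "CP" and "Código Penal" share the same index terms.
    tokens = []
    for token in text.lower().split():
        expansion = SYNONYMS.get(token, token)
        tokens.extend(expansion.split())
    return tokens

print(expand_synonyms("artigo 121 do CP"))
# -> ['artigo', '121', 'do', 'código', 'penal']
```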


Any news on this? Did you end up creating a GitHub issue?

Hi João! Not yet, but this is in our internal backlog already.

This will be part of some other planned changes to that code base, so on top of accent normalization it will also support mapping.

And for now, we do not have an ETA :frowning: