Does BM25 need to reindex the whole corpus?

alexisperrier · March 25, 2024, 1:16pm

Since BM25 is based on TF-IDF, it relies on the frequency of terms in a document relative to their frequency in the whole corpus.
So each time I add a new document to the corpus, the relative weights of the terms should be recomputed.
But that does not seem to be the case when using Weaviate, and I doubt all the documents / records are parsed each time I add a new record.
What am I missing?
Is BM25 only taking into account the term frequency within the document, and not the whole corpus?

DudaNogueira · March 25, 2024, 7:35pm

Hi!

Each time you add a new object, it will index the keywords of that object only, taking into account the tokenization of that property.

So no need to reindex all the other objects.
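
To illustrate why that works: BM25 does use corpus-wide statistics (number of documents, document frequency per term, average document length), but these can be kept as simple counters that are updated on insert and only combined into a score at query time. Below is a minimal sketch of that idea in plain Python. It is not Weaviate's actual implementation; the class name `IncrementalBM25`, the whitespace tokenizer, and the `k1`/`b` values are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


class IncrementalBM25:
    """Toy BM25 index (hypothetical): adding a document only touches its own terms."""

    def __init__(self, k1=1.2, b=0.75):  # assumed, commonly used parameter values
        self.k1 = k1
        self.b = b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens
        self.total_len = 0                 # running sum of all document lengths

    def add(self, doc_id, text):
        # Assumes doc_id is new; updates and deletes are out of scope here.
        tokens = text.lower().split()      # naive whitespace tokenizer (assumption)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf
        self.doc_len[doc_id] = len(tokens)
        self.total_len += len(tokens)

    def idf(self, term):
        # Computed on demand from the current counters, never stored per document.
        n_docs = len(self.doc_len)
        df = len(self.postings.get(term, {}))
        return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))

    def search(self, query):
        if not self.doc_len:
            return []
        avg_len = self.total_len / len(self.doc_len)
        scores = defaultdict(float)
        for term in query.lower().split():
            idf = self.idf(term)
            for doc_id, tf in self.postings.get(term, {}).items():
                norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = IncrementalBM25()
index.add("a", "hybrid search with bm25")
idf_before = index.idf("bm25")
index.add("b", "vector search only")   # only the postings for "b" are written
idf_after = index.idf("bm25")          # yet the IDF of "bm25" has changed
```

In this sketch the second `add` never rewrites anything stored for document `"a"`, but the IDF of `bm25` still shifts because the document count is read fresh at query time. So the scores do depend on the whole corpus, without any reindexing.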

We have a blog post here:

It goes into more detail on that.

Let me know if that helps!

Thanks!

alexisperrier · March 26, 2024, 3:57pm

Thanks
That’s what I assumed it does, but I wanted to be sure.
