I’ve been using Weaviate for almost a year now. I’m using Hybrid Search, which combines BM25 with a Voyage embedding model.
I was recently looking at SPLADE and was wondering if there are any plans to integrate it into the keyword search, either as a replacement for BM25 or alongside it.
My dream scenario would be a hybrid search with dense vectors + SPLADE + BM25, but just dense vectors + SPLADE would also make me happy.
We haven’t prioritized SPLADE, or sparse models more generally, for the following reasons:
There is still not a huge variety of sparse models available, and none of the major embedding providers offer sparse models. Additionally, sparse models score much lower on retrieval benchmarks. For example, compare the BEIR scores of the SPLADEv3 model with the current MTEB Leaderboard.
A sparse model like SPLADEv3 can beat BM25 on its own, but there is a lack of research on what happens when it is combined with a good dense model in a hybrid setting.
Practically, we find that BM25 pairs very well with dense vector search: it handles out-of-distribution tokens, keywords, and identifiers well, while dense vector search handles semantic queries. Part of this is how BM25 adapts to collections with different document frequencies. Conversely, alternative solutions that use sparse indexes for BM25 have had problems supporting non-static document frequencies.
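To illustrate the document-frequency point, here is a minimal sketch of how a BM25-style IDF weight depends on the collection being searched. It assumes the common Lucene-style IDF formula and whitespace tokenization; the two toy collections and the helper names are hypothetical, and this is not necessarily the exact formula Weaviate uses.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))
    return df


def bm25_idf(term, docs):
    """Lucene-style BM25 IDF: ln(1 + (N - df + 0.5) / (df + 0.5))."""
    n = len(docs)
    df = document_frequencies(docs)[term.lower()]
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


# Hypothetical collections: "error" is in every support ticket but rare in the catalog.
support_tickets = ["error E1234 on login", "error on checkout", "error after update"]
product_catalog = ["blue widget", "red widget", "widget error codes", "green gadget"]

idf_in_tickets = bm25_idf("error", support_tickets)
idf_in_catalog = bm25_idf("error", product_catalog)
print(f"tickets: {idf_in_tickets:.3f}, catalog: {idf_in_catalog:.3f}")
```

The same term gets a low weight where it is common and a higher weight where it is rare, and the weight shifts as documents are added or removed. A sparse representation with weights fixed at indexing time cannot follow those shifts without being recomputed.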
Adding a sparse model to an existing dense + BM25 index will necessarily add latency and complexity.
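As a rough picture of where that extra work comes from, here is a generic reciprocal rank fusion sketch over three ranked result lists. It is only an illustration, not Weaviate's fusion implementation; the function name, the constant k=60, and the document IDs are assumptions made for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
dense_hits = ["d1", "d2", "d3"]
bm25_hits = ["d3", "d1", "d4"]
splade_hits = ["d1", "d3", "d5"]

fused = reciprocal_rank_fusion([dense_hits, bm25_hits, splade_hits])
print(fused)
```

Each additional retriever means one more query to run and one more list to fuse before results can be returned, on top of the extra index that has to be built and kept up to date.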