Keyword, vector and hybrid searching cause fewer rows to be retrieved

moaazzaki · February 7, 2024, 12:05pm

Description

I have a collection with multi-tenancy enabled. For a given tenant, if I use the following GraphQL query:

{
    Get {
        SomeCollection(
            tenant: "some-id"
            limit: 2000
        ){
            someField
        }
    }
}

I get 2000 rows returned (I know that this tenant has 10k+ rows). Now when I try this query with bm25 for example:

{
    Get {
        SomeCollection(
            tenant: "some-id"
            limit: 2000
            bm25:{
                 query: "Some query"
                 properties: ["fieldWithText"]
            }
        ){
            someField
        }
    }
}

I get ~1.3k instead, and the same applies no matter what limit I choose above 1.3k. Any idea what the issue could be here?

Server Setup Information

I tried the above on two different setups:

Setup 1

Weaviate Version: 1.23.2
Deployment Method: docker
Multi Node? Number of Running Nodes: 1
Used Client Language and Version: REST API

Setup 2

Weaviate Version: 1.23.7
Deployment Method: k8s
Multi Node? Number of Running Nodes: 3
Used Client Language and Version: REST API

DudaNogueira · February 9, 2024, 11:48pm

hi @moaazzaki !!

A BM25 search returns a ranked list of only those objects that contain at least one of the query terms, so it can return fewer objects than the limit.

Consider the code below:

import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

# No vectorizer is needed for a pure keyword (BM25) search.
collection = client.collections.create(
    name="MyCollection",
    vectorizer_config=None,
)

collection.data.insert({"text": "SOme text here"})
collection.data.insert({"text": "SOme other text here"})
collection.data.insert({"text": "Other stuff"})

# First query: a single search term.
response = collection.query.bm25(
    query="stuff",
    query_properties=["text"],
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=5,
)
print(len(response.objects))

# Second query: two search terms, same limit.
response = collection.query.bm25(
    query="other stuff",
    query_properties=["text"],
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=5,
)
print(len(response.objects))

client.close()

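The second query will yield only 2 objects, even though the limit is 5, because only two objects contain one of the query terms.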
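To illustrate the idea without a running Weaviate instance, here is a minimal pure-Python sketch. It assumes a very simplified tokenization (lowercase, split on whitespace); Weaviate's actual tokenization and scoring depend on the property configuration, and the helper name `keyword_candidates` is only made up for this example.

```python
def keyword_candidates(query: str, docs: list[str]) -> list[str]:
    """Return the docs that share at least one token with the query.

    Simplified assumption: tokens are lowercase, whitespace-separated words.
    """
    terms = set(query.lower().split())
    return [doc for doc in docs if terms & set(doc.lower().split())]


docs = ["SOme text here", "SOme other text here", "Other stuff"]
limit = 5

# The limit is only an upper bound: it cannot add objects that do not match.
hits = keyword_candidates("other stuff", docs)[:limit]
print(len(hits))  # 2
```

So with a keyword search, the number of returned objects is at most the number of objects that contain a query term, whatever the limit is.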

Let me know if that helps

moaazzaki · February 12, 2024, 8:13am

Hi @DudaNogueira, thanks for your response!

The problem is that this also happens with vector and hybrid search, and the same data sometimes works and sometimes doesn't for the same query. For a tenant with 10k objects, I get at most 6k with hybrid search, and if I delete the collection and insert the data again, the number can differ (e.g. it was 1.3k before). So I feel there is some inconsistency here.

DudaNogueira · February 12, 2024, 10:29pm

Hi!

I believe this is a case for a higher efConstruction. From here:

HNSW is an approximate nearest neighbour algorithm. This means there is always a chance a vector will not be found in a search even though it would be found via brute force (using where filter).

The solution to this problem is to re-index your data with higher efConstruction parameter (for example efConstruction=512) that way your newly built graph will have higher number of node candidates that will be considered during vector search.
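Since you are using the REST API, here is a rough sketch of what the collection definition could look like when you re-create it with a higher efConstruction. The collection name is taken from your query, the value 512 is just the example from the quote, and this only builds the JSON payload (it assumes you then POST it to the /v1/schema endpoint and re-import the data yourself).

```python
import json

# Sketch of a class definition for the schema endpoint.
# Assumptions: collection name from the question, HNSW index,
# efConstruction=512 as in the quoted example.
class_definition = {
    "class": "SomeCollection",
    "multiTenancyConfig": {"enabled": True},
    "vectorIndexType": "hnsw",
    "vectorIndexConfig": {
        "efConstruction": 512,
    },
}

payload = json.dumps(class_definition, indent=2)
print(payload)
```

After creating the collection with this definition, the tenants and objects need to be inserted again so the graph is built with the new setting.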

moaazzaki · February 13, 2024, 9:42am

Hi!

Does this also apply to a flat index setup?
