Keyword, vector and hybrid searches return fewer rows than expected

Description

I have a collection with multi-tenancy enabled. For a given tenant, if I use the following GraphQL query:

{
    Get {
        SomeCollection(
            tenant: "some-id"
            limit: 2000
        ){
            someField
        }
    }
}

I get 2000 rows returned (I know that this tenant has 10k+ rows). Now when I try this query with bm25 for example:

{
    Get {
        SomeCollection(
            tenant: "some-id"
            limit: 2000
            bm25:{
                 query: "Some query"
                 properties: ["fieldWithText"]
            }
        ){
            someField
        }
    }
}

I get ~1.3k instead, and the same applies no matter what limit I choose above 1.3k. Any idea what the issue could be here?

Server Setup Information

I tried the above on two different setups:

Setup 1

  • Weaviate Version: 1.23.2
  • Deployment Method: docker
  • Multi Node? Number of Running Nodes: 1
  • Used Client Language and Version: REST API

Setup 2

  • Weaviate Version: 1.23.7
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 3
  • Used Client Language and Version: REST API

hi @moaazzaki !!

A BM25 search returns a ranked list of only those objects that contain at least one of the query terms, so it can return fewer objects than the limit.

Consider the code below:

import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

collection = client.collections.create(
    name="MyCollection",
    vectorizer_config=None
)

collection.data.insert({"text": "SOme text here"})
collection.data.insert({"text": "SOme other text here"})
collection.data.insert({"text": "Other stuff"})

# Single-term query: only objects containing "stuff" can match
collection.query.bm25(
    query="stuff",
    query_properties=["text"],
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=5,
)

# Two-term query: objects containing "other" or "stuff" can match
query = collection.query.bm25(
    query="other stuff",
    query_properties=["text"],
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=5,
)
print(len(query.objects))

client.close()

It will yield only 2 objects even though the limit is 5, because only two of the three objects contain "other" or "stuff".
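The matching rule behind that count can be sketched in plain Python, without a running Weaviate instance. This is only an illustration: it assumes lowercased, whitespace-split words and does not compute BM25 scores. The function name `matching_docs` is made up for this example.

```python
def matching_docs(docs, query):
    """Return the docs that share at least one lowercased word with the query.

    Illustration only: this mimics the "must contain a query term" condition,
    not the BM25 ranking itself.
    """
    query_terms = set(query.lower().split())
    return [d for d in docs if query_terms & set(d.lower().split())]


docs = ["SOme text here", "SOme other text here", "Other stuff"]

# Two docs contain "other" or "stuff"; one doc contains "stuff" alone.
other_stuff_hits = matching_docs(docs, "other stuff")
stuff_hits = matching_docs(docs, "stuff")
print(len(other_stuff_hits), len(stuff_hits))
```

Raising the limit cannot add results here, because the third object never contains a query term.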

Let me know if that helps :slight_smile:

Hi @DudaNogueira, thanks for your response!

The problem is that this also happens with vector and hybrid search, and the same query on the same data sometimes works and sometimes does not. For a tenant with 10k objects, I get at most 6k with hybrid search. If I delete the collection and insert the data again, that maximum can change (e.g. it was 1.3k before), so the results seem inconsistent.

Hi!

I believe this is a case that calls for a higher efConstruction. From here:

HNSW is an approximate nearest neighbour algorithm. This means there is always a chance a vector will not be found in a search even though it would be found via brute force (using where filter).

The solution to this problem is to re-index your data with higher efConstruction parameter (for example efConstruction=512) that way your newly built graph will have higher number of node candidates that will be considered during vector search.
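Since the setups above use the REST API, here is a minimal sketch of a class definition that sets a higher efConstruction. It only builds the JSON body. The helper name `build_class_schema` is hypothetical, the class name is taken from the question, and the body is assumed to be sent to the schema endpoint when the collection is (re)created, before the data is imported again.

```python
import json


def build_class_schema(class_name, ef_construction=512):
    """Build a class definition with an HNSW index and a custom efConstruction.

    Hypothetical helper for illustration; the value 512 follows the
    suggestion quoted above.
    """
    return {
        "class": class_name,
        "vectorIndexType": "hnsw",
        "vectorIndexConfig": {"efConstruction": ef_construction},
        "multiTenancyConfig": {"enabled": True},
    }


schema = build_class_schema("SomeCollection")
# Assumed usage: send this JSON as the body when creating the class,
# then re-insert the tenant's objects so the graph is rebuilt.
print(json.dumps(schema, indent=2))
```

The existing collection would need to be deleted and re-imported for the new value to take effect, since the advice is to re-index with the higher setting.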

Hi!

Does this also apply to a flat index setup?