Hi @torbenw!
The pre-filtering will affect the overall performance.
Let’s say you have 20 million objects spread across 10 different company_id values, each with 2 million objects.
If you search through all 20 million objects and filter only by company_id=1, it has to exclude the other 18 million objects from the allow list. That happens for keyword search, for vector search, and for hybrid search, which runs both.
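To put rough numbers on that, here is a small pure-Python sketch of what a pre-filter has to do in a shared collection. It is only an illustration, not how Weaviate implements it internally: the `objects` list, the `build_allow_list` helper, and the scaled-down counts (2,000 per company instead of 2 million) are all assumptions made for the demo.

```python
NUM_COMPANIES = 10
OBJECTS_PER_COMPANY = 2_000  # scaled down from 2 million to keep the demo fast

# One shared collection: every object carries a company_id property.
objects = [
    {"id": i, "company_id": (i % NUM_COMPANIES) + 1}
    for i in range(NUM_COMPANIES * OBJECTS_PER_COMPANY)
]


def build_allow_list(objs, company_id):
    """Return the ids of all objects that pass the company_id filter."""
    return {o["id"] for o in objs if o["company_id"] == company_id}


allow_list = build_allow_list(objects, company_id=1)
excluded = len(objects) - len(allow_list)

# 9 out of 10 objects are examined only to be thrown away.
print(f"total={len(objects)} allowed={len(allow_list)} excluded={excluded}")
```

With a separate collection or tenant per company, the search would start from the 2 million relevant objects and there would be nothing to exclude.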
On the vector indexing side, it means the HNSW graph is built using objects from different company_id values. Removing the objects for company_id=1 then means that every object with company_id != 1 that is connected to a company_id=1 object needs its connections recalculated, which requires more CPU power.
When you have a separate collection/tenant, this is not necessary: you can just drop the entire collection/tenant.
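A toy model can make the difference visible. The sketch below is not real HNSW: it just wires up random neighbor links (an assumption for illustration) and counts how many objects from other companies hold a link to a company 1 object, once for a shared graph and once for per-tenant graphs. The names `nodes_to_repair`, `shared_edges`, and `tenant_edges` are hypothetical.

```python
import random

random.seed(0)  # fixed seed so the demo is repeatable

NUM_COMPANIES = 10
OBJECTS_PER_COMPANY = 50
NEIGHBORS = 4  # outgoing links per node in this toy graph

nodes = list(range(NUM_COMPANIES * OBJECTS_PER_COMPANY))
company_of = {n: (n % NUM_COMPANIES) + 1 for n in nodes}


def nodes_to_repair(edges, removed_company):
    """Nodes of other companies that link to a node of the removed company."""
    return {
        n
        for n, neighbors in edges.items()
        if company_of[n] != removed_company
        and any(company_of[m] == removed_company for m in neighbors)
    }


# Shared collection: links may point to objects of any company.
shared_edges = {
    n: set(random.sample([m for m in nodes if m != n], NEIGHBORS))
    for n in nodes
}

# One graph per tenant: links only point to objects of the same company.
tenant_edges = {
    n: set(
        random.sample(
            [m for m in nodes if m != n and company_of[m] == company_of[n]],
            NEIGHBORS,
        )
    )
    for n in nodes
}

shared_repairs = nodes_to_repair(shared_edges, removed_company=1)
tenant_repairs = nodes_to_repair(tenant_edges, removed_company=1)

print("shared graph, nodes needing repair:", len(shared_repairs))
print("per-tenant graphs, nodes needing repair:", len(tenant_repairs))
```

In the per-tenant case the count is zero by construction, which is the point: dropping a tenant never touches another tenant's graph.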
Does that mean that when I have different knowledge for different use cases I should always have a single collection for that use case so the data of the other use cases does not negatively impact the search result quality?
Exactly: either a separate collection, or a single collection with multi-tenancy.
Considering our company_id example: if you need to run a query across multiple companies, having all companies in the same collection makes sense.
Alternatively, you can have multiple collections/tenants, run one query per collection/tenant, and then merge the results.
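The query-per-tenant-then-merge approach could look roughly like the sketch below. It is client-agnostic on purpose: `query_fn` is a hypothetical callable standing in for whatever search call you use, and `fake_query` with its hard-coded data exists only so the example runs. It also assumes the scores are comparable across tenants, which may not hold for every search type, so treat the merge step as a starting point.

```python
import heapq
from typing import Callable, List, Tuple

Result = Tuple[float, str]  # (score, object id); higher score is better


def search_many(
    tenants: List[str],
    query_fn: Callable[[str, int], List[Result]],
    limit: int,
) -> List[Result]:
    """Run one query per tenant and keep the best `limit` results overall."""
    merged: List[Result] = []
    for tenant in tenants:
        # Ask each tenant for `limit` results, since any of them
        # could end up in the global top `limit`.
        merged.extend(query_fn(tenant, limit))
    return heapq.nlargest(limit, merged, key=lambda r: r[0])


# Stand-in data and query function, only for demonstration.
FAKE_INDEX = {
    "company_1": [(0.91, "c1-doc-a"), (0.40, "c1-doc-b")],
    "company_2": [(0.85, "c2-doc-a"), (0.77, "c2-doc-b")],
}


def fake_query(tenant: str, limit: int) -> List[Result]:
    return FAKE_INDEX[tenant][:limit]


top = search_many(["company_1", "company_2"], fake_query, limit=3)
print(top)
# [(0.91, 'c1-doc-a'), (0.85, 'c2-doc-a'), (0.77, 'c2-doc-b')]
```

The trade-off is one round trip per tenant, so this works best when a query only ever spans a handful of companies.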
Now, if the use case is restricted to a single company_id, meaning you do not need cross-company search, and the properties of the collections are the same, multi-tenancy is highly recommended, as it gives you isolation for each company.
Let me know if this clarifies things!