Multi-tenancy doesn't help in our scenario when the number of collections reaches 1,000

Description

We’re building a multi-tenancy system. Each tenant will manage their own collections with different properties.
As the number of tenants grows, the total number of collections will soon reach 1,000.
As far as I know, Weaviate's multi-tenancy configuration is based on a single collection shared by different tenants, which is not how our system works.
How can I fix this? By increasing the MAXIMUM_ALLOWED_COLLECTIONS_COUNT limit? But that's not recommended.

Server Setup Information

  • Weaviate Server Version: 1.30
  • Deployment Method: Kubernetes
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: python
  • Multitenancy?: False

Any additional Information

Good morning @Charlie_Chen and welcome to the community — it’s great to have you here! We’re excited to help however we can.

You can absolutely create as many tenants as you want. The only real limit is the resources available to your database — Weaviate can handle millions of tenants.

Now about collections: it’s generally recommended not to go over 1,000 of them. But tenants are different — they live within a collection. So if you’re using the same schema across multiple users or use cases, it’s much better to create one global collection and use tenants inside it. Each tenant is isolated in its own shard and pretty much behaves like its own collection — but it’s faster, more efficient, and easier to manage.

For example, say you’re building a chatbot app. Each of your users gets their own chatbot. Instead of creating a separate collection for every single user (which could really hurt performance), you’d just create one chatbot collection and make each user a tenant. Since all chatbots likely use the same schema, this setup works perfectly — you can scale to millions of users, and each tenant stays separate.
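
A minimal sketch of that setup with the v4 Python client (assuming a local instance; the collection and tenant names are placeholders):

import weaviate
from weaviate.classes.config import Configure
from weaviate.classes.tenants import Tenant

client = weaviate.connect_to_local()

# One shared collection, with multi-tenancy enabled
chatbots = client.collections.create(
    name="Chatbot",
    multi_tenancy_config=Configure.multi_tenancy(enabled=True),
)

# Each user becomes a tenant inside that collection
chatbots.tenants.create([Tenant(name="user_123"), Tenant(name="user_456")])

# Work with one tenant's data in isolation
user_bot = chatbots.with_tenant("user_123")
user_bot.data.insert({"message": "Hello!"})

client.close()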

The main takeaway: avoid creating too many collections; focus on using tenants inside a shared collection when the schema is the same.

If I understood you well, you have a lot of users, and within each user “collection” there would be tenants. If that is the case, I would look at the schema plan again from a different perspective. I am not really sure what your use case is exactly.

One last note — if you’re planning to run in production at some point or test on high scale, make sure your setup includes multiple nodes and doesn’t rely on just one. Also, Weaviate is now at version 1.32.1 — you can check out the latest release notes here:

Hope this clears things up!

Best,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)

Thanks for your response.

In my case — a low-code platform — each tenant can create their own collections, each with a different schema. That means tenants may define completely different sets of properties for their collections.

The only constraint I can enforce is the maximum number of collections each tenant can create — for example, 100 per tenant. But as you can imagine, with just 10 tenants, this could easily scale up to 1,000 collections, which is not sustainable.

I’ve considered two potential workarounds, but both have clear downsides:

  1. Single giant collection with a metadata text property:
    I could serialize tenant-defined properties into a JSON string and store them in a single metadata text field (a rough sketch follows after this list). But this approach severely limits filtering capabilities — I won’t be able to query by individual properties.

  2. Single giant collection with a metadata object property:
    I could store tenant-specific properties as nested fields inside a single metadata object. However, as far as I know, Weaviate currently doesn’t support filtering on nested object properties.
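
For illustration, option 1 would look roughly like this (the field names are just examples):

import json

# Option 1 in practice: tenant-defined fields collapsed into one text property
custom_props = {"priority": "high", "region": "EMEA"}   # illustrative tenant-defined fields

obj = {
    "title": "Some document",
    "tenant_id": "tenant_123",
    "metadata_json": json.dumps(custom_props),  # opaque blob: no per-field filtering
}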

Do you have any recommendations or best practices for handling this kind of multi-tenant, dynamic-schema scenario?

Hey - the big problem with many collections is that the GraphQL schema needs to be rebuilt every time you add/remove a collection or restart Weaviate, and the more collections you have, the longer it takes.

I think the only way around this is to disable GraphQL using the `DISABLE_GRAPHQL` env var (this needs a restart). If you’re using our Python/TS clients everything will continue to work, but our older Java/Go clients do not support gRPC yet. There is already a beta for Java with gRPC support out, but it does not support all features yet: Release 6.0.0-beta3 - Custom TrustStore, Fat JARs, Metadata Fields · weaviate/java-client · GitHub

Thanks. Are there any drawbacks?

hi @Charlie_Chen !!

DISABLE_GRAPHQL will disable all GraphQL endpoints while keeping REST and gRPC operations.

This means that any GraphQL queries you run will fail. gRPC and REST calls will work as expected, so using any of our clients or connecting directly is still available.
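
For instance, with the v4 Python client, queries go over gRPC, so something like this keeps working with GraphQL disabled (collection name and query are placeholders, assuming a local instance):

import weaviate

client = weaviate.connect_to_local()           # REST + gRPC, no GraphQL involved

articles = client.collections.get("Article")   # hypothetical collection name
results = articles.query.bm25(query="multi tenancy", limit=3)  # keyword search via gRPC

for obj in results.objects:
    print(obj.properties)

client.close()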

Hey @Charlie_Chen

This might be one of those cases where enabling DISABLE_GRAPHQL could save you a lot of headaches, at least in the short term.

Quick fix to try

DISABLE_GRAPHQL=true

Possible benefits:

  • Skips GraphQL schema rebuilds → much faster startup

  • Handles hundreds or thousands of collections better

  • Works fine with Python/TS clients

  • Trade-off: you lose GraphQL queries and GraphQL-based admin tools

If you just need things running smoothly right now, this could be the fastest way forward.

Longer-term architecture suggestion

Instead of keeping 1000+ collections, it might help to group them in a way that reduces the total number while still keeping tenant data separate. Two patterns that work well:

1. Schema-Version Based Collections

data_type = "documents"         # illustrative values
schema_version = 2
collection_name = f"{data_type}_v{schema_version}"   # e.g. "documents_v2"
metadata = {
    "tenant_id": "tenant_123",
    "schema_version": schema_version,
    "custom_properties": {...}  # tenant-defined fields go here
}

2. Hybrid Grouping
Group by what the data is, not who owns it:

  • user_profiles (with tenant_id property)

  • documents (with document_type property)

  • analytics_events

This could reduce the count from 1000 → ~20, improve resource usage, and make cross-tenant analytics possible.
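
As a rough illustration of this pattern with the v4 Python client, a keyword search scoped to one tenant via a tenant_id filter (collection and property names are just examples):

import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()

documents = client.collections.get("documents")  # shared collection, hypothetical

results = documents.query.bm25(
    query="quarterly report",
    filters=Filter.by_property("tenant_id").equal("tenant_123"),  # scope to one tenant
    limit=5,
)

for obj in results.objects:
    print(obj.properties)

client.close()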

Handling different schemas per tenant

A schema registry pattern might help:

# Registry mapping a schema id to the tenant-defined property types
schema_registry = {
    "tenant_123_users_v2": {
        "name": "TEXT",
        "custom_field_1": "INT"
    }
}

# Each stored object records which schema it was written with
document = {
    "tenant_id": "tenant_123",
    "schema_id": "tenant_123_users_v2",
    "core_properties": {...},      # shared, filterable fields
    "extended_properties": {...}   # tenant-specific fields as a JSON blob
}

Extended properties could still be queried with Weaviate’s filters.

My Suggestion

  1. Immediate → Try DISABLE_GRAPHQL

  2. Next step → Consolidate collections & add a schema registry

  3. Later → Add cross-tenant analytics, auto schema versioning, monitoring

This approach might keep tenant flexibility while making the whole setup much easier to maintain and scale.

hi @Chaitanya_Kulthe and @Charlie_Chen !!

There are some other issues to consider when keeping multiple customers’ data in the same collection instead of separating it by collections or, preferably, by multi-tenancy.

Those are the cases where you have a property like customer_id that is used to filter down to the slice of data you want.

Because of how HNSW indexes are built, each new object from any customer added to this single, “for all” collection will use other customers’ data to build the index, influence searches, etc. While this works, it will definitely not scale.

Also, dropping a customer from that big single collection can be quite costly, and impact both performance and accuracy.

Let me know if this helps!

Thanks for the detailed breakdown @DudaNogueira

That’s a really important point about HNSW indexing that I hadn’t fully considered. You’re absolutely right - when all customer data lives in a single collection, the vector index gets built using all objects regardless of tenant, which means each customer’s nearest neighbor searches could be influenced by completely unrelated tenant data, even after filtering by customer_id.

This creates two major scaling problems:

  1. Search quality degradation - nearest neighbors might come from other tenants before filtering, leading to less relevant results

  2. Index contamination - customers with very different data distributions could negatively impact each other’s search accuracy

And you’re spot on about the deletion cost - removing a large customer’s data from a shared collection isn’t just a simple delete operation. It potentially requires index rebalancing or rebuilding, which could cause performance issues and downtime for all other tenants sharing that collection.

Given these HNSW-specific constraints, I can see why proper multi-tenancy or dedicated collections are much safer for true tenant isolation, especially at scale. The search quality and operational risks of shared collections are definitely more significant than I initially outlined.


Thanks a lot for all the insights shared here!

I now understand that using a single large collection with just a tenant_id filter isn’t safe at scale because of HNSW index pollution, so the two real options are multi-tenancy (shared schema) or per-tenant collections. Since our tenants have very different schemas, we’ve been leaning toward per-tenant collections — but this could mean 1000+ collections, which worries me in terms of limits and operations.

So, currently, the best option for our case is DISABLE_GRAPHQL, am I right?


This will probably surface when you reach 1,000+ collections. And it is important to note that 1,000 is just a base number. It may start at 2,000 or more. It will depend on how many properties you have, as one of the issues is the GraphQL schema build.

And bear in mind that this affects the GraphQL stack. So if you are making GraphQL calls and disable GraphQL, this will affect you.

Our clients have been moving away from GraphQL calls to gRPC, so if you are using solely the client, it shouldn’t affect you.

A second challenge that surfaces with a large number of collections is collection loading at startup. We don’t want our pods taking hours to become ready, right?

For that, we have a lazy loading system in place that can mitigate it.

And another feature that helps with this challenge is HNSW snapshots, introduced in the recent 1.31 version.

On top of those configurations, you will need to group tenants that share the same properties/schema into a multi-tenancy collection, and give each tenant that has a unique schema/properties its own collection.
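
Purely as an illustration of that grouping logic (the fingerprinting and naming scheme below are assumptions for the sketch, not a Weaviate feature):

import hashlib
import json

def schema_fingerprint(properties: dict) -> str:
    # Stable hash of a tenant's property definitions
    canonical = json.dumps(properties, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

shared_collections: dict[str, str] = {}  # fingerprint -> multi-tenant collection name

def collection_for_tenant(properties: dict) -> str:
    fp = schema_fingerprint(properties)
    if fp not in shared_collections:
        # First tenant with this schema: register a shared MT collection for it
        shared_collections[fp] = f"Shared_{fp}"
    return shared_collections[fp]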

Let me know if that helps!

Thanks!

It does help, thanks very much.

I have a question regarding “search quality degradation” when using a filter like “customer_id” and doing a hybrid search / vector search / keyword search. According to the docs, Weaviate uses “pre-filtering”. I assumed this means that other data in the collection which does not match this filter does not have any effect on the search results? Was I wrong about that? Does that mean that when I have different knowledge for different use cases I should always have a single collection for that use case so the data of the other use cases does not negatively impact the search result quality?

hi @torbenw !!

The pre-filtering will affect the overall performance.

Let’s say you have 20 million objects: 10 different company_id values, each with 2 million objects.

If you are searching through all 20 million objects and filtering only by company_id=1, it will unnecessarily need to remove the other 18 million objects from the allowed list, both during keyword and vector search, and for both parts of a hybrid search.

On the vector indexing side, it means that objects from other company_id values are used when calculating the HNSW construction. Now, removing the objects with company_id=1 means that all objects with company_id != 1 that are connected to an object with company_id=1 will need to be recalculated, requiring more CPU power.

When you have a separate collection/tenant, this is not necessary: you can just drop the entire collection/tenant.

Does that mean that when I have different knowledge for different use cases I should always have a single collection for that use case so the data of the other use cases does not negatively impact the search result quality?

This: a single collection per use case, or a collection with multi-tenancy.

Considering our company_id example: if you need to perform a query across multiple companies, having all companies in the same collection makes sense.

Or, you can have multiple collections/tenants, perform multiple queries, and then merge the results of each query.
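
A minimal sketch of that multi-query merge with the v4 Python client (assuming a vectorizer is configured; the collection names and the distance-based merge are just examples):

import weaviate
from weaviate.classes.query import MetadataQuery

def search_across(client, collection_names, query, limit=5):
    # Query each company's collection separately, then merge by vector distance
    merged = []
    for name in collection_names:
        coll = client.collections.get(name)
        res = coll.query.near_text(
            query=query,
            limit=limit,
            return_metadata=MetadataQuery(distance=True),
        )
        merged.extend(res.objects)
    merged.sort(key=lambda o: o.metadata.distance)  # lower distance = more similar
    return merged[:limit]

client = weaviate.connect_to_local()
top = search_across(client, ["Company_1_docs", "Company_2_docs"], "pricing policy")
client.close()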

Now, if the use case is restricted to each company_id, meaning you do not need cross-company search, and if the properties of the collections are the same, multi-tenancy is highly recommended, as you will have isolation for each company.

Let me know if this clarifies :slight_smile:

Thank you @DudaNogueira for your answer!

I’m still not quite sure I understand the problem. You are mentioning performance issues, but @Chaitanya_Kulthe was mentioning “less relevant results” and “negative impact on search accuracy”?

This creates two major scaling problems:

  1. Search quality degradation - nearest neighbors might come from other tenants before filtering, leading to less relevant results

  2. Index contamination - customers with very different data distributions could negatively impact each other’s search accuracy

In our current concept we’re not actually using a property “company_id” to filter. We allow our customers to create “data sets” (mostly used for different use cases). So we have a multi-tenancy collection with our customers as tenants and further filtering by “data_set_id”. I think this is still very similar to the “company_id” idea, but it reduces the overall (tenant) collection size by a lot. If a tenant has a lot of data sets (e.g. 10) and a lot of documents in those data sets (e.g. 200 large PDF files) and we then search with a “data_set_id”, it would have to remove something like 90% of the objects from that allowed list. I’m still not sure if that would have a relevant impact on the search performance and search result quality. Are we talking about single-digit millisecond delays for the filtering or are we in the triple digits? And would the search results for something filtered like “data_set_id=1” be less accurate than if the data set had its own collection or tenant? I could imagine using the data set as the tenant, which would remove the need for filtering.

Hi!

One thing to note is scale. 200 large PDFs (2k objects?) is, at the end of the day, not a lot of objects to impact either filtering performance or similarity recall. That only starts to matter at something around 1 million objects and more.

But again, some benchmarking needs to be done with real data to understand this impact. It may be in milliseconds or seconds. :grimacing:

The “less relevant results” part is that, because your vector space is “polluted” with data unrelated to your query, it will eventually miss some related objects due to those pollutants. On the performance side, it will take more time to add objects to and remove them from the index.

The idea is: instead of going over an entire library to find a magazine about gardening, you can just go to the magazine section.

And whenever you are adding a new magazine to that magazine section, you do not need to check the entire library to annotate books that are related to your new magazine.

And whenever you are removing a magazine from that section, you also do not need to remove the annotations from all the books you found while adding.

And most importantly: if you are looking for magazines about gardening, you can search only in the magazine section, rather than starting your search among books/manuals/anime to eventually find a magazine about gardening.

Let me know if this clarifies :slight_smile:

Also, we host weekly events, one of them being our Office Hours. Feel encouraged and invited to join us!

Thanks!


Hey @torbenw

From what I’ve seen, pre-filtering keeps unwanted objects out of the final results, but they still influence the index structure under the hood.

  • During HNSW construction, vectors from all tenants can link together.

  • At query time, Weaviate explores those cross-tenant links before discarding mismatched items, which means extra hops and a bit more latency.

  • When a large tenant is deleted, every neighbor it touched needs to be re-wired, which can slow things down across the board.

A few rules of thumb that usually help:

  1. Same schema + no cross-tenant queries => use multi-tenancy (isolated shards in one collection).

  2. Different schemas or >1M vectors per tenant => better to give each tenant its own collection.

  3. Hitting ~1,000 collections? Disable GraphQL and query over gRPC/REST to avoid schema-rebuild delays.

Below ~1M total objects the overhead is usually negligible; beyond that, isolation tends to improve both speed and recall.

Hope this clears up the filter vs. index nuance


Thanks a lot @DudaNogueira and @Chaitanya_Kulthe!

During HNSW construction, vectors from all tenants can link together.

This is only for the case of using a “customer_id”, correct? If the tenants are tenants in a multi-tenancy collection this does not happen, because the tenants are completely separate, right?

I think in our use case using the “data set” as tenants would be the cleanest solution. That way the content of a data set does not influence the accuracy of searching in another data set.

Exactly.

collection without MT => one HNSW graph that contains objects for all tenants

collection with MT => one HNSW graph per tenant
