What is the issue with using different classes for vector search scoping?

Hi, I’m referring to this blog: Multi-Tenancy Vector Search with millions of tenants | Weaviate - vector database

The blog says that it becomes very difficult to run more than 5~10,000 tenants if you differentiate tenants with classes. This sounds to me like it is difficult to have more than that number of classes in Weaviate. But as far as I could find, Weaviate doesn’t have an explicit limit other than vector dimension, as long as hardware resources permit. Can somebody elaborate?

Actually, I am thinking about how to organize the data of different service users so that vector search is isolated per user. (Please note that by “service user” I do NOT mean a Weaviate tenant; I mean a user of an application that stores user data in Weaviate.) I thought we could assign a different class to each service user, but then I ran into that blog post.

Hi @roengram !

I believe the main concern here is scalability and data management. If you go with the one-class-per-client approach (not recommended) and end up with thousands of clients/classes, for example, your shards will not be properly separated per client, so a heavy operation on one client’s data (a big deletion or a big data import) can directly affect other clients.

On the other hand, separating the clients per tenant allows Weaviate to handle those operations better. You also get the benefit of being able to set a tenant to a cold status, which saves resources for the disabled tenants.

Let me know if that helps :slight_smile:

Hi, thanks for your reply! It definitely helps!

So it seems that a shard can contain multiple classes’ objects. I thought a shard contains only the objects from a single class. And from your reply and this document, I understand that a shard contains only the objects owned by a single tenant. Please correct me if I’m wrong.

The document also says a node can support 50,000+ active shards, so 1M active tenants with 20 nodes. I guess the assumption is one shard per tenant. And the number 50,000+ seems to come from the number of available file descriptors in a node. Is this correct? And if so, would it be correct to think that the number of file descriptors is the actual limiting factor for the number of tenants and active shards?
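
To make the arithmetic explicit, here is a minimal sketch of how I read those numbers. It assumes exactly one shard per active tenant and uses the 50,000 figure as a flat per-node capacity; the function name `nodes_needed` and its parameters are hypothetical helpers of mine, not anything from Weaviate.

```python
# Figure quoted from the document: a node can support 50,000+ active shards.
SHARDS_PER_NODE = 50_000


def nodes_needed(active_tenants: int,
                 shards_per_node: int = SHARDS_PER_NODE,
                 shards_per_tenant: int = 1) -> int:
    """Return the minimum node count for the given number of active tenants.

    Assumption (mine, not confirmed): each active tenant maps to
    `shards_per_tenant` shards, one by default.
    """
    total_shards = active_tenants * shards_per_tenant
    # Integer ceiling division, so a partially filled node still counts.
    return -(-total_shards // shards_per_node)


print(nodes_needed(1_000_000))  # 20
```

If the one-shard-per-tenant assumption is wrong, only `shards_per_tenant` needs to change for the estimate to follow.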