Skewed dataset with multi-tenant collection

This is our first time working with Weaviate, and we’re hosting a cluster ourselves. We’re thinking of having a multi-tenant collection where each customer is a separate tenant. We have ~20k customers, but the dataset is very skewed: some customers have over 270 GB of data while others have only a few MB.

I was wondering: in a multi-tenant setup, can a single tenant have many shards hosted on different nodes? That would let us distribute the load for large tenants over multiple nodes, both for data size growth and for CPU resources when querying. Is this possible, and how should we set up our cluster to guarantee it?

Hi @dmwaura ! Welcome to our community :hugs:

By default, the ShardingConfig > desiredCount of a new class is the number of nodes in your cluster.

This means that your shards will be spread across the cluster, which accomplishes what you are looking for.

Let me know if this helps :slight_smile:

Thanks!

Thanks! I just learned that a single tenant cannot consist of many shards, so with a skewed dataset some shards will simply end up larger than others.

Some other things I’ve noticed though:

  1. It doesn’t look like I can pre-create shards for a multi-tenant class unfortunately. When I tried that, I got the error below from the create request:

{"error":[{"message":"cannot have both shardingConfig and multiTenancyConfig"}]}

  2. In a cluster with 3 nodes, creating tenants incrementally results in all created shards appearing on the first node while the other nodes remain empty. I’d have expected the shards to be evenly distributed across the cluster.
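On the first point, here is a minimal sketch of the request bodies I ended up with once `shardingConfig` was dropped. It only builds and prints the JSON. My understanding is that the class body goes to `POST /v1/schema` and the tenant list to `POST /v1/schema/{className}/tenants`, but treat those paths as an assumption and check them against the docs for your version. The helper names (`multi_tenant_class_payload`, `tenants_payload`, `check_no_conflict`) are made up for this example.

```python
import json


def multi_tenant_class_payload(class_name: str) -> dict:
    """Build a class definition that enables multi-tenancy.

    There is deliberately no "shardingConfig" key: the server rejected
    a create request that carried both settings.
    """
    return {
        "class": class_name,
        "multiTenancyConfig": {"enabled": True},
    }


def tenants_payload(tenant_names: list[str]) -> list[dict]:
    """Build the body for adding tenants (one shard is created per tenant)."""
    return [{"name": name} for name in tenant_names]


def check_no_conflict(payload: dict) -> None:
    """Hypothetical client-side guard mirroring the server error above."""
    if "shardingConfig" in payload and "multiTenancyConfig" in payload:
        raise ValueError("cannot have both shardingConfig and multiTenancyConfig")


if __name__ == "__main__":
    body = multi_tenant_class_payload("ClipEmbeddingsv2")
    check_no_conflict(body)
    print(json.dumps(body, indent=2))
    # Tenant names are the shard names reported by /v1/nodes.
    print(json.dumps(tenants_payload(["00000000-0000-4000-8000-000000001077"])))
```

The guard is only there to fail early on the client. The server remains the source of truth for what a class definition may contain.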

See response below from the /v1/nodes endpoint after creating all my testing tenants and loading them with data:

curl 'http://weaviate.camera.svc.cluster.local/v1/nodes'

{
  "nodes": [
    {
      "batchStats": {
        "queueLength": 0,
        "ratePerSecond": 0
      },
      "gitHash": "f381d44",
      "name": "weaviate-0",
      "shards": [
        {
          "class": "ClipEmbeddingsv2",
          "name": "00000000-0000-4000-8000-000000001077",
          "objectCount": 138,
          "vectorIndexingStatus": "READY",
          "vectorQueueLength": 0
        },
        
         .....  // Many entries here

        {
          "class": "ClipEmbeddingsv2",
          "name": "00000000-0000-4000-8000-000000002565",
          "objectCount": 215,
          "vectorIndexingStatus": "READY",
          "vectorQueueLength": 0
        }
      ],
      "stats": {
        "objectCount": 2656810,
        "shardCount": 4980
      },
      "status": "HEALTHY",
      "version": "1.22.5"
    },
    {
      "batchStats": {
        "queueLength": 0,
        "ratePerSecond": 0
      },
      "gitHash": "f381d44",
      "name": "weaviate-1",
      "shards": null,
      "stats": {
        "objectCount": 0,
        "shardCount": 0
      },
      "status": "HEALTHY",
      "version": "1.22.5"
    },
    {
      "batchStats": {
        "queueLength": 0,
        "ratePerSecond": 0
      },
      "gitHash": "f381d44",
      "name": "weaviate-2",
      "shards": null,
      "stats": {
        "objectCount": 0,
        "shardCount": 0
      },
      "status": "HEALTHY",
      "version": "1.22.5"
    }
  ]
}
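
To make the imbalance easier to see than by reading the raw JSON, here is a small sketch that reduces a `/v1/nodes` response to per-node shard and object counts. It assumes the response has the shape shown above (a `nodes` list where each entry has `name`, `stats`, and a `shards` field that may be `null`). The function names and the trimmed sample are my own, with the numbers copied from the output above.

```python
import json


def shard_distribution(nodes_response: dict) -> dict:
    """Map each node name to its shard and object counts."""
    summary = {}
    for node in nodes_response.get("nodes", []):
        stats = node.get("stats") or {}
        shards = node.get("shards") or []  # "shards" is null on empty nodes
        summary[node["name"]] = {
            # Prefer the reported stats; fall back to counting listed shards.
            "shardCount": stats.get("shardCount", len(shards)),
            "objectCount": stats.get("objectCount", 0),
        }
    return summary


def empty_nodes(summary: dict) -> list[str]:
    """Return the names of nodes that hold no shards at all."""
    return [name for name, s in summary.items() if s["shardCount"] == 0]


# Trimmed copy of the response above (shard lists omitted).
SAMPLE = json.loads("""
{
  "nodes": [
    {"name": "weaviate-0", "shards": [],
     "stats": {"objectCount": 2656810, "shardCount": 4980}},
    {"name": "weaviate-1", "shards": null,
     "stats": {"objectCount": 0, "shardCount": 0}},
    {"name": "weaviate-2", "shards": null,
     "stats": {"objectCount": 0, "shardCount": 0}}
  ]
}
""")

if __name__ == "__main__":
    summary = shard_distribution(SAMPLE)
    for name, counts in summary.items():
        print(name, counts["shardCount"], counts["objectCount"])
    print("nodes without shards:", empty_nodes(summary))
```

With the sample above, `weaviate-0` reports 4980 shards and 2656810 objects, and both `weaviate-1` and `weaviate-2` are listed as having no shards.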