Database Partitioning Performance Comparison

I am evaluating two database partitioning strategies for a dataset containing one million objects with identical schema structure, and I would like to understand the performance implications of each approach.

Scenario 1 - Horizontal Sharding:
The dataset is distributed across 4 database shards, with each shard containing approximately 250,000 objects.

Scenario 2 - Multi-tenant Architecture:
The same one million objects are organized into 4 tenant partitions within the database infrastructure.

Configuration Details:

  • Both scenarios implement a replication factor of 1

  • Dataset size and schema remain consistent across both approaches

Question:
Which partitioning strategy would deliver superior performance for:

  1. Write operations (INSERT/UPDATE/DELETE)

  2. Read query execution

I would appreciate insights into the performance characteristics, potential bottlenecks, and scalability considerations for each approach.

Hey @Saketh,

Both are reasonable. The key difference is this: if your schema is identical across use cases, meaning you would end up creating collections that have the same properties and config and are separated only by purpose (like serving different users, or different knowledge pages in RAG), then tenants are the better choice. Each tenant resides on its own shard.

That way you avoid building the same schema multiple times as many separate collections, which at large scale can take time and delay the DB pods becoming ready to serve.

If you’re not in that scenario, then sharding will indeed help with distributing the data. Either way, performance is strong when replication is involved, e.g., 3 pods serving instead of just one.

With sharding, when a request comes in (like a vector search), the coordinator node sends the request to the nodes that hold shards of that collection. Each node returns the portion it has, and the coordinator merges those partial results into the final result (with replication, a shard can be served by one of its replicas). That’s why I say: if you don’t have fault tolerance, don’t shard, because every query then depends on each node that holds a shard being available. In short: sharding helps split very large datasets that are too big for a single node. All in all, I would go with multi-tenancy (MT) on a single node.
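To make the scatter-gather step above more concrete, here is a minimal sketch in plain Python. It is not Weaviate's actual implementation: the names `search_shard` and `scatter_gather`, the in-memory dictionaries standing in for shards, and the precomputed distances are all assumptions made for illustration only.

```python
import heapq
from typing import Dict, List, Tuple

# (distance, object_id); a smaller distance means a closer match.
Hit = Tuple[float, str]


def search_shard(shard: Dict[str, float], k: int) -> List[Hit]:
    """Hypothetical stand-in for a node-local search: return the k closest hits in one shard."""
    return heapq.nsmallest(k, ((dist, oid) for oid, dist in shard.items()))


def scatter_gather(shards: List[Dict[str, float]], k: int) -> List[Hit]:
    """Hypothetical coordinator: query every shard, then merge the partial results."""
    # Scatter: each shard answers with its own local top-k.
    partials = [search_shard(shard, k) for shard in shards]
    # Gather: the global top-k is the k smallest hits across all partial results.
    return heapq.nsmallest(k, (hit for part in partials for hit in part))


# Four toy shards mapping object id -> distance to the query vector (made-up values).
shards = [
    {"a": 0.30, "b": 0.10},
    {"c": 0.05, "d": 0.40},
    {"e": 0.20},
    {"f": 0.15, "g": 0.50},
]

top3 = scatter_gather(shards, 3)
print(top3)
```

In this sketch the coordinator has to hear back from all four shards before it can answer, which is the extra coordination cost that a single-shard (or single-tenant) query does not pay.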

If you are not going to ingest a lot of data, then it will work just fine, but keep in mind that at large scale a single node is going to struggle regardless of the sharding strategy.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)