Best practice of representing sources for RAG applications


We are currently ingesting multiple documents from 3-4 sources into a single Class. We would like users of our RAG application to choose which of the sources to use for grounding answers.

Our proposed solution would be to use filters at query time to get relevant chunks. We worry that having a single Class, with nightly delta loads to update the chunks, might be too unstable. (We’ve had issues with delete operations corrupting a Class.)

Is there a best practice for this particular setup?

Thank you in advance!

Server Setup Information

  • Weaviate Server Version: 1.25.0
  • Deployment Method: k8s
  • Multi Node? Number of Running Nodes: 2
  • Client Language and Version: Python 4.5.2
  • Multitenancy?: Not currently

hi @bahtman !! Welcome to our community :hugs:

Could you elaborate more on the issues you had with nightly delta loads and how you are doing them? Also, was it on the latest 1.25 or an older version?

If you have a high number of update/delete operations, a good tip is to tune the TOMBSTONE env vars. Also, monitor the related metrics.

When you delete an object, Weaviate will not delete it right away, as those operations are costly. It will mark it as deleted and process it later.
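A conceptual sketch of that deferred-delete behavior (this is an illustration of the tombstone idea only, not Weaviate’s actual internals):

```python
# Illustrative model of tombstone-based deletes (NOT Weaviate internals):
# a delete only marks the object; a later cleanup cycle removes it for real.
class TombstoneStore:
    def __init__(self):
        self.objects = {}
        self.tombstones = set()

    def put(self, key, value):
        self.objects[key] = value
        self.tombstones.discard(key)

    def delete(self, key):
        # Cheap: mark as deleted instead of rewriting the index right away.
        if key in self.objects:
            self.tombstones.add(key)

    def get(self, key):
        # Tombstoned objects are invisible even though still physically stored.
        return None if key in self.tombstones else self.objects.get(key)

    def cleanup(self, max_per_cycle=None):
        # Costly: physically remove tombstoned entries, optionally capped
        # per cycle (cf. the TOMBSTONE_DELETION_MAX_PER_CYCLE env var).
        batch = sorted(self.tombstones)[:max_per_cycle]
        for key in batch:
            del self.objects[key]
            self.tombstones.discard(key)
        return len(batch)
```

The point is that the cost of a delete is paid later, in bulk, during cleanup, which is why heavy delete workloads need the cleanup cycle tuned.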

More info on this here: Weaviate, a vector database with ANN Index and CRUD support | Weaviate - Vector Database

Regarding your question, you could treat each document as a tenant, or as a separate collection; however, it wouldn’t be possible to let the user select more than one of those documents in a single query.

With that said, having a property to filter on document is usually the best approach.
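A minimal sketch of what that looks like with the v4 Python client (the collection name `DocumentChunk`, the `source` property, the allowed-sources list, and the helper are illustrative assumptions, not a fixed schema):

```python
# Hypothetical set of ingestion sources the user may choose from.
ALLOWED_SOURCES = ["confluence", "sharepoint", "wiki"]

def normalize_sources(user_choice, allowed=ALLOWED_SOURCES):
    """Keep only known sources; fall back to all sources if none are valid."""
    picked = [s for s in user_choice if s in allowed]
    return picked or list(allowed)

def search_chunks(client, query_text, user_sources):
    """Query one collection, grounded only on the user-selected sources."""
    from weaviate.classes.query import Filter  # v4 Python client

    chunks = client.collections.get("DocumentChunk")  # placeholder name
    sources = normalize_sources(user_sources)
    # contains_any lets the user pick any subset of sources in one query,
    # which separate collections or tenants would not allow.
    return chunks.query.near_text(
        query=query_text,
        filters=Filter.by_property("source").contains_any(sources),
        limit=5,
    )

# Usage (assumes a running Weaviate instance):
#   import weaviate
#   client = weaviate.connect_to_local()
#   response = search_chunks(client, "how do I reset a password?", ["wiki"])
```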

As a best practice, I believe you could check out how we do it in Verba: GitHub - weaviate/Verba: Retrieval Augmented Generation (RAG) chatbot powered by Weaviate.

Let me know if this helps!


Hi @DudaNogueira

I am a colleague of @bahtman.

We are currently using version 1.25.4 and have since had a “corruption” issue again. Our delta load runs every hour and does a “delete_many” operation from the Python library (v4.5.2). Seemingly at random, the collection becomes corrupted and will return:

“Query call with protocol GRPC search failed with message explorer: get class: vector search: object vector search at index xyz: shard xyz_6vkltMpybcdF: vector search: entrypoint was deleted in the object store, it has been flagged for cleanup and should be fixed in the next cleanup cycle”

If I check the logs in k8s I can also find this error:

{"action":"hybrid","error":"explorer: get class: vector search: object vector search at index xyz: remote shard 6vkltMpybcdF: status code: 500, error: shard xyz_6vkltMpybcdF: vector search: entrypoint was deleted in the object store, it has been flagged for cleanup and should be fixed in the next cleanup cycle\n: context deadline exceeded","level":"error","msg":"denseSearch failed","time":"2024-07-05T11:13:40Z"}

This happened on Friday, and if I try to query the collection now it will still throw the same error.
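For reference, the hourly delta load looks roughly like this (a sketch; the collection and property names are placeholders, and it assumes the v4 Python client):

```python
# Sketch of an hourly delta load that replaces one source's chunks.
# "DocumentChunk" and the "source" property are placeholder names.

def stale_chunk_ids(existing_ids, incoming_ids):
    """A finer-grained alternative: only the IDs missing from the new extract."""
    return sorted(set(existing_ids) - set(incoming_ids))

def replace_source_chunks(client, source_name, new_chunks):
    """Delete every chunk from one source, then insert the fresh extract."""
    from weaviate.classes.query import Filter  # v4 Python client

    collection = client.collections.get("DocumentChunk")
    # This is the delete_many call that precedes the corruption we see.
    collection.data.delete_many(
        where=Filter.by_property("source").equal(source_name)
    )
    with collection.batch.dynamic() as batch:
        for chunk in new_chunks:
            batch.add_object(properties={"source": source_name, **chunk})
```

Deleting only the stale IDs (rather than the whole source) would shrink the number of tombstones each cycle creates, if that matters here.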

What do you propose here for tombstone-related settings? We are okay with running cleanup often if it means we can avoid this “corruption” issue :slight_smile:

Best regards,

hi @ottman !

Have you tried defining a value for TOMBSTONE_DELETION_MAX_PER_CYCLE?

This can help. Also, I suggest upgrading to the latest 1.25.x, as those issues have surfaced recently and there are patches in the latest version, IIRC.

Hi @DudaNogueira, thanks for the quick reply.

I just checked again and we are actually using version 1.25.6, where this problem occurs. What kind of value would you recommend for TOMBSTONE_DELETION_MAX_PER_CYCLE? It’s a fairly small cluster (for now), so we only have shards of ~10k objects on each node.


In that case, this shouldn’t interfere.

As per the docs:

Maximum number of tombstones to delete per cleanup cycle. Set this to limit cleanup cycles, as they are resource-intensive. As an example, set a maximum of 10000000 (10M) for a cluster with 300 million-object shards. (Default: none)

10k objects isn’t enough to consume significant resources or degrade performance.
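Put as rough arithmetic, extrapolating the documented example (a 10M cap per cycle for shards holding 300M objects, i.e. roughly objects divided by 30; this ratio is an extrapolation, not an official recommendation):

```python
# Rough proportional sizing for TOMBSTONE_DELETION_MAX_PER_CYCLE,
# extrapolated from the documented example (10M cap for 300M-object shards).
# The "// 30" ratio is an assumption for illustration, not official guidance.
def suggested_max_per_cycle(objects_per_shard: int) -> int:
    return max(1, objects_per_shard // 30)

print(suggested_max_per_cycle(300_000_000))  # the doc's example scale
print(suggested_max_per_cycle(10_000))       # this thread's cluster size
```

At ~10k objects per shard the implied cap is tiny, which is why the setting barely matters at this scale.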

That sounds good, @DudaNogueira. Do you have any other idea why we end up with the corrupted collection/shard, then? The only fix is essentially to delete the collection and recreate it, but eventually it happens again.

We have just released a patch (1.25.7), and in previous versions there were some fixes around that.

Are you running 1.25.0?

Do you see the same results on latest 1.25.7?

That sounds great, @DudaNogueira. We are at 1.25.6 currently (Helm chart version 17.1.0). I would like to upgrade to 1.25.7, but there is no Helm chart with that version yet.


You can define a specific Weaviate version in your values.yaml here

Then change the values.yaml and upgrade the Helm deployment.
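For example, a values.yaml excerpt along these lines (the key name follows the official Weaviate Helm chart; double-check it against your chart version):

```yaml
# values.yaml excerpt: pin the Weaviate image independently of the chart
# release. Key name per the official weaviate Helm chart; verify for your
# chart version.
image:
  tag: 1.25.7
```

Then run `helm upgrade` with `-f values.yaml` against your release.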

Let me know if that helps.


I tried updating to 1.25.7; I will return next week with whether it helped or not. Thank you for the help so far!
