We are currently ingesting multiple documents from 3-4 sources into a single Class. We would like users of our RAG application to choose which of the sources to use for grounding answers.
Our proposed solution is to use filters at query time to retrieve the relevant chunks. However, we worry that a single Class with nightly delta loads to update the chunks might be too unstable. (We've had issues with delete operations corrupting a Class.)
Is there a best practice for this particular setup?
Regarding your question: you could treat each document as a tenant, or as a separate collection. However, it then wouldn't be possible to let the user select more than one of those documents in a single query.
With that said, having a property that identifies the source document and filtering on it at query time is usually the best approach.
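For example, with the v4 Python client you can expose the user's source selection as a `contains_any` filter at query time. A minimal sketch, assuming a collection named `Chunk` with `source` and `text` properties (names are illustrative):

```python
import weaviate
from weaviate.classes.query import Filter

client = weaviate.connect_to_local()

chunks = client.collections.get("Chunk")  # illustrative collection name

# The user can select any subset of sources in the UI.
selected_sources = ["source_a", "source_c"]

# Hybrid search restricted to the selected sources only.
response = chunks.query.hybrid(
    query="user question here",
    filters=Filter.by_property("source").contains_any(selected_sources),
    limit=5,
)

for obj in response.objects:
    print(obj.properties["source"], obj.properties["text"])

client.close()
```

This also keeps multi-source selection simple: `contains_any` matches objects whose `source` is any of the selected values, which tenants or separate collections wouldn't give you in a single query.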
We are currently using version 1.25.4 and have since had a "corruption" issue again. Our delta load runs every hour and performs a "delete_many" operation via the Python library (v4.5.2). Seemingly at random, the collection becomes corrupted and returns:
```
Query call with protocol GRPC search failed with message explorer: get class: vector search: object vector search at index xyz: shard xyz_6vkltMpybcdF: vector search: entrypoint was deleted in the object store, it has been flagged for cleanup and should be fixed in the next cleanup cycle
```
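For reference, the delete step of the delta load is roughly this (a simplified sketch; the `doc_id` property name is illustrative):

```python
from weaviate.classes.query import Filter

chunks = client.collections.get("Chunk")  # illustrative collection name

# Hourly delta load: drop the stale chunks of a changed document
# before re-inserting the fresh ones.
chunks.data.delete_many(
    where=Filter.by_property("doc_id").equal("document-123"),
)
```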
If I check the logs in k8s, I can also find this error:

```json
{"action":"hybrid","error":"explorer: get class: vector search: object vector search at index xyz: remote shard 6vkltMpybcdF: status code: 500, error: shard xyz_6vkltMpybcdF: vector search: entrypoint was deleted in the object store, it has been flagged for cleanup and should be fixed in the next cleanup cycle\n: context deadline exceeded","level":"error","msg":"denseSearch failed","time":"2024-07-05T11:13:40Z"}
```
This happened on Friday, and if I try to query the collection now, it still throws the same error.
What tombstone settings do you propose here? We are okay with cleaning up often if it means we can avoid this "corruption" issue.
I just checked again, and we are actually on version 1.25.6 where this problem occurs. What value would you recommend for TOMBSTONE_DELETION_MAX_PER_CYCLE? It's a fairly small cluster (for now), so we only have ~10k objects per shard on each node.
Maximum number of tombstones to delete per cleanup cycle. Set this to limit cleanup cycles, as they are resource-intensive. As an example, set a maximum of 10000000 (10M) for a cluster with 300 million-object shards. (Default: none)
10k objects isn't that many, so the cleanup cycle shouldn't consume enough resources to degrade performance.
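If you do want to experiment with it, it's an environment variable on the Weaviate pods; with the Helm chart you would set it under `env` in your values, for example (the value here is only an illustration, not a recommendation for your cluster):

```yaml
# values.yaml (Weaviate Helm chart)
env:
  # Cap how many tombstones a single cleanup cycle may delete.
  TOMBSTONE_DELETION_MAX_PER_CYCLE: "10000"
```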
That sounds good @DudaNogueira. Do you have any other ideas about why we end up with the corrupted collection/shard, then? The only fix is essentially to delete the collection and recreate it, but eventually it happens again.
That sounds great @DudaNogueira. We are currently at 1.25.6 (Helm chart version 17.1.0). I would like to upgrade to 1.25.7, but there is no Helm chart with that version yet.
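In the meantime I'll probably just pin the image tag on the existing chart, assuming `image.tag` is still exposed in chart 17.1.0, something like:

```shell
helm upgrade weaviate weaviate/weaviate \
  --namespace weaviate \
  --reuse-values \
  --set image.tag="1.25.7"
```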