Update existing chunks in a document with more than QUERY_MAXIMUM_RESULTS entries

Description

We have a setup with multiple Documents, each of which is chunked into Chunks. For some of these documents, an automated service updates the Document daily. To update a document correctly, we:

  1. Get all UUIDs of Chunks belonging to that specific Document
  2. Generate deterministic uuid5 UUIDs for all new chunks (see the sketch after this list)
  3. Figure out which chunks to delete and which chunks to add
  4. Add only the new chunks
  5. Delete the chunks that are no longer relevant
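
For reference, a minimal sketch of steps 2 and 3 with the Python v3 client; `doc_id`, the chunk texts, and the way the UUID seed is built (`doc_id` + chunk text) are placeholders for whatever convention you use:

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")  # adjust to your instance

doc_id = "doc-123"                                  # placeholder document identifier
chunk_texts = ["chunk one", "chunk two"]            # placeholder chunk contents

# Step 2: deterministic uuid5 - the same (doc_id, text) pair always yields the same UUID.
new_uuids = {generate_uuid5(f"{doc_id}:{text}"): text for text in chunk_texts}

# Step 3: diff against the UUIDs fetched in step 1 (placeholder set here).
existing_uuids: set = set()
to_add = set(new_uuids) - existing_uuids      # step 4: only these get uploaded
to_delete = existing_uuids - set(new_uuids)   # step 5: only these get removed
```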

This allows us to:

  • have a fallback if any of the steps fail
  • not re-upload Chunks unnecessarily
  • save some cost & bandwidth

However, step 1 is giving us some challenges, because to achieve it we need to query all existing chunks. The 'normal' Get with offset doesn't work above QUERY_MAXIMUM_RESULTS, so the only other option we've seen so far is the Cursor API, which requires us to dump our entire Weaviate instance; that can't be the suggested way to achieve this.
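
(For anyone reading along: this is roughly what the cursor-based approach looks like with the Python v3 client. It walks the entire class, ID by ID, and `with_after` cannot be combined with a `where` filter, which is exactly the limitation described above. Class and property names are ours; treat it as a sketch.)

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

def iterate_all_chunk_ids(batch_size: int = 200):
    """Yield the UUID of every WeaviateChunk in the instance via the cursor API."""
    cursor = None
    while True:
        query = (
            client.query.get("WeaviateChunk", ["text"])  # "text" is a placeholder property
            .with_additional(["id"])
            .with_limit(batch_size)
        )
        if cursor is not None:
            query = query.with_after(cursor)  # no with_where() possible alongside the cursor
        objects = query.do()["data"]["Get"]["WeaviateChunk"]
        if not objects:
            break
        for obj in objects:
            yield obj["_additional"]["id"]
        cursor = objects[-1]["_additional"]["id"]
```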

So I'm wondering how we're supposed to solve this problem. We can't find anything in the documentation so far, and we're slightly scared of the implications of increasing QUERY_MAXIMUM_RESULTS.

Server Setup Information

  • Weaviate Server Version: 1.24.6
  • Deployment Method: Docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python v3
  • Multitenancy?: Nope

Any additional Information

Not really

Hello!

You could:

  • have a second class documents which references all chunks that belong to a given document
  • add another property document to your Chunk class, which (for example) contains the ID of the document the chunk belongs to (make sure that you don't include that field in your vectorization). When you want to get the chunks belonging to a document, simply use a GET + filter for that ID (sketched below)
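
A sketch of the second option with the Python v3 client, assuming a plain text property called `document` on the Chunk class that holds the parent document's ID (names are illustrative):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

where_filter = {
    "path": ["document"],          # plain (non-vectorized) property holding the document ID
    "operator": "Equal",
    "valueText": "doc-123",        # placeholder document ID
}

result = (
    client.query.get("Chunk", ["text"])   # "text" is a placeholder property
    .with_additional(["id"])
    .with_where(where_filter)
    .with_limit(100)
    .do()
)
chunk_ids = [obj["_additional"]["id"] for obj in result["data"]["Get"]["Chunk"]]
```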

Hi @Dirk, thanks, but either I don't understand your proposed solution completely, or it unfortunately won't solve our issue.

We currently already have the following two schemas:

  1. WeaviateDocument, which has information about the document
  2. WeaviateChunk, which has the chunk's text, as well as an inDocument cross-reference to a WeaviateDocument.

When we want to update, we follow the steps outlined in my initial message.
(Step 1 already contains a filter like here).

However, if we start loading the existing chunks batch by batch, we get an error instead of more results as soon as limit + offset > QUERY_MAXIMUM_RESULTS. That's where we actually run into trouble, and I don't completely understand how your proposal would help us circumvent this. Could you please elaborate?
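
(For context, the paging for step 1 looks roughly like this; the class names and the exact filter path are specific to our schema, so treat it as a sketch. Weaviate rejects the request as soon as limit + offset exceeds QUERY_MAXIMUM_RESULTS.)

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
document_uuid = "00000000-0000-0000-0000-000000000000"   # placeholder

where_filter = {
    "path": ["inDocument", "WeaviateDocument", "id"],     # filter via the cross-reference
    "operator": "Equal",
    "valueText": document_uuid,
}

chunk_ids, offset, limit = [], 0, 100
while True:
    result = (
        client.query.get("WeaviateChunk", ["text"])       # "text" is a placeholder property
        .with_additional(["id"])
        .with_where(where_filter)
        .with_limit(limit)
        .with_offset(offset)   # errors once limit + offset > QUERY_MAXIMUM_RESULTS
        .do()
    )
    objects = result["data"]["Get"]["WeaviateChunk"]
    if not objects:
        break
    chunk_ids += [obj["_additional"]["id"] for obj in objects]
    offset += limit
```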

Ahh, I assumed that you were iterating through all of your Chunks and that the class as a whole was approaching QUERY_MAXIMUM_RESULTS.

So just to be sure: you have documents that have more chunks than QUERY_MAXIMUM_RESULTS?

Yes, we have quite a few documents that have (significantly) more chunks than QUERY_MAXIMUM_RESULTS

ok, I see!

What I would try, though I'm not 100% sure whether it works or how it scales:

  • add a reference from documents => chunks
  • request a document including the reference to get all chunk IDs belonging to that document (for Python v3 you'd need to parse the beacon URI that is returned; see the sketch after this list)
  • now you can query the chunks one by one and update/delete/add new ones
  • afterwards you need to update the references (e.g. delete the ones you don't need anymore)
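
A sketch of that flow with the Python v3 client, assuming the reference property on the document class is called `hasChunks` (that name, and the UUIDs, are placeholders):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
doc_uuid = "00000000-0000-0000-0000-000000000000"   # the document's UUID (placeholder)

doc = client.data_object.get_by_id(doc_uuid, class_name="WeaviateDocument")

# Each reference comes back as a beacon such as
# weaviate://localhost/WeaviateChunk/<chunk-uuid>; the UUID is the last path segment.
chunk_ids = [
    ref["beacon"].rsplit("/", 1)[-1]
    for ref in doc["properties"].get("hasChunks", [])
]

# Last bullet: drop a reference that is no longer needed.
if chunk_ids:
    client.data_object.reference.delete(
        doc_uuid, "hasChunks", chunk_ids[0],
        from_class_name="WeaviateDocument",
        to_class_name="WeaviateChunk",
    )
```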

The other possibility would be to increase QUERY_MAXIMUM_RESULTS. It is there to limit the memory usage of Weaviate: e.g. if you do a Get query with offset=QUERY_MAXIMUM_RESULTS-100 and limit=100, Weaviate loads QUERY_MAXIMUM_RESULTS results into memory just to return the last 100 objects (unless you use the cursor API, which does not work with filters). Depending on how beefy your server is, it might be fine to increase the value. I would run an expensive test query, see how much the memory spikes, and then decide whether you can increase the value.
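
(If you go that route: QUERY_MAXIMUM_RESULTS is an environment variable on the Weaviate container, so with Docker Compose it would look roughly like the snippet below; the value is just an example, and the default is 10,000.)

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.24.6
    environment:
      QUERY_MAXIMUM_RESULTS: 100000   # example value; raise gradually and watch memory usage
```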

@DudaNogueira do you have another idea?

Hi @Dirk,

Thanks for the proposal. It seems like something that could work, but I would much prefer a solution that doesn't involve maintaining both a link from Chunk -> Document and a link from Document -> Chunk, as that seems like a source of a lot of unexpected behaviours.

Have you since thought about other potential solutions? Or would there be anyone else we could ask?

Hi @afstkla !

If you have a fixed id for each chunk, you could also use deterministic ids together with batch imports.

Now, when the batch finds an existing UUID it will update the object, and when it is not found it will insert it.
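
A minimal sketch of that batch-upsert pattern with the Python v3 client (property names and the UUID seed are illustrative):

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")

doc_id = "doc-123"                         # placeholder document identifier
chunks = ["chunk one", "chunk two"]        # placeholder chunk texts

client.batch.configure(batch_size=100)
with client.batch as batch:
    for text in chunks:
        batch.add_data_object(
            data_object={"text": text},
            class_name="WeaviateChunk",
            uuid=generate_uuid5(f"{doc_id}:{text}"),  # same input -> same UUID -> upsert
        )
```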

Do you believe this could help somehow?

Hi @DudaNogueira, thanks for the suggestion.

Yes, we already do that, but the challenge is removing old chunks.

Every time we update, some chunks will be out of date, and we have no way of finding them. So if we go your route, we will unfortunately keep those outdated chunks (which will contain wrong information).

So I don't think we can do that, unfortunately.

Or am I missing something?

Hum. I see.

Yeah, not sure this can be done differently from the suggested options :thinking:

Sad, this feels like a use case other users might run into… If only the cursor API allowed filtering, everything would be golden.

For now (and for other people reading this thread), we'll probably go for the following (a rough sketch in code follows the list):

  1. Grab the updated document & chunk it as normal & calculate deterministic UUIDs
  2. Create a completely new Weaviate Document
  3. Batch by batch: 1) get chunks from the old Document, 2) upload all chunks whose UUIDs haven't changed to the new Document, 3) remove the batch from the old Document
  4. Upload the remaining (aka new) Chunks to the new Document
  5. Remove the old Document
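
A rough sketch of how we imagine implementing this with the Python v3 client. It re-points unchanged chunks at the new Document via their inDocument reference instead of literally re-uploading them (our interpretation of step 3.2), and all names, UUIDs, and the filter path are placeholders:

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")

OLD_DOC_UUID = "00000000-0000-0000-0000-000000000000"   # placeholder
NEW_DOC_UUID = generate_uuid5("doc-123:v2")              # placeholder (step 2)
new_chunk_uuids: set = set()   # step 1: deterministic UUIDs of the freshly chunked document

BATCH = 100
while True:
    # Because every processed batch is removed from the old Document, the offset
    # can stay at 0 and never comes near QUERY_MAXIMUM_RESULTS.
    result = (
        client.query.get("WeaviateChunk", ["text"])
        .with_additional(["id"])
        .with_where({
            "path": ["inDocument", "WeaviateDocument", "id"],
            "operator": "Equal",
            "valueText": OLD_DOC_UUID,
        })
        .with_limit(BATCH)
        .do()
    )
    objects = result["data"]["Get"]["WeaviateChunk"]
    if not objects:
        break
    for obj in objects:
        chunk_id = obj["_additional"]["id"]
        if chunk_id in new_chunk_uuids:
            # Step 3.2: keep the chunk, but make it belong to the new Document.
            client.data_object.reference.update(
                chunk_id, "inDocument", [NEW_DOC_UUID],
                from_class_name="WeaviateChunk",
            )
        else:
            # Step 3.3: the chunk no longer exists in the updated document, so remove it.
            client.data_object.delete(chunk_id, class_name="WeaviateChunk")

# Step 4: batch-upload the genuinely new chunks (deterministic UUIDs, referencing NEW_DOC_UUID).
# Step 5: finally remove the old Document object itself.
client.data_object.delete(OLD_DOC_UUID, class_name="WeaviateDocument")
```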

That's so far the best solution with the least impact on continuity we've been able to come up with (though it does require us to revamp a lot of what we've built, which is quite unfortunate).

Do you know if anything to solve issues like this is planned for any upcoming releases?