Update existing chunks in a document with more than QUERY_MAXIMUM_RESULTS entries

Description

We have a setup with multiple Documents, each of which is chunked into Chunks. For some of these documents, an automated service updates the Document daily. To update a document correctly, we:

  1. Get all UUIDs of Chunks belonging to that specific Document
  2. Generate deterministic uuid5 UUIDs for all new chunks (see the sketch after this list)
  3. Figure out which chunks to delete and which chunks to add
  4. Add only the new chunks
  5. Delete the chunks that are no longer relevant
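
For reference, a minimal sketch of steps 2 and 3 with the Python v3 client; `doc_id`, the chunk texts, and the way the UUID seed is built (`doc_id` + chunk text) are placeholders for whatever convention you use:

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")  # adjust to your instance

doc_id = "doc-123"                                  # placeholder document identifier
chunk_texts = ["chunk one", "chunk two"]            # placeholder chunk contents

# Step 2: deterministic uuid5 - the same (doc_id, text) pair always yields the same UUID.
new_uuids = {generate_uuid5(f"{doc_id}:{text}"): text for text in chunk_texts}

# Step 3: diff against the UUIDs fetched in step 1 (placeholder set here).
existing_uuids: set = set()
to_add = set(new_uuids) - existing_uuids      # step 4: only these get uploaded
to_delete = existing_uuids - set(new_uuids)   # step 5: only these get removed
```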

This allows us to:

  • have a fallback if any of the steps fail
  • not re-upload Chunks unnecessarily
  • save some cost & bandwidth

However, step 1 is giving us some challenges, because to achieve it we need to query all existing chunks. The 'normal' Get with offset doesn't work above QUERY_MAXIMUM_RESULTS, so the only other option we've seen so far is the Cursor API, which requires us to dump our entire Weaviate instance; that can't be the suggested way to achieve this.
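
(For anyone reading along: this is roughly what the cursor-based approach looks like with the Python v3 client. It walks the entire class, ID by ID, and `with_after` cannot be combined with a `where` filter, which is exactly the limitation described above. Class and property names are ours; treat it as a sketch.)

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

def iterate_all_chunk_ids(batch_size: int = 200):
    """Yield the UUID of every WeaviateChunk in the instance via the cursor API."""
    cursor = None
    while True:
        query = (
            client.query.get("WeaviateChunk", ["text"])  # "text" is a placeholder property
            .with_additional(["id"])
            .with_limit(batch_size)
        )
        if cursor is not None:
            query = query.with_after(cursor)  # no with_where() possible alongside the cursor
        objects = query.do()["data"]["Get"]["WeaviateChunk"]
        if not objects:
            break
        for obj in objects:
            yield obj["_additional"]["id"]
        cursor = objects[-1]["_additional"]["id"]
```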

So I'm wondering how we're supposed to solve this problem. We can't find anything in the documentation so far, and we're slightly scared of the implications of increasing QUERY_MAXIMUM_RESULTS.

Server Setup Information

  • Weaviate Server Version: 1.24.6
  • Deployment Method: Docker
  • Multi Node? Number of Running Nodes: 1
  • Client Language and Version: Python v3
  • Multitenancy?: Nope

Any additional Information

Not really

Hello!

You could:

  • have a second class documents which references all chunks that belong to a given document
  • add another property document to your Chunk class, which (for example) contains the ID of the document the chunk belongs to (make sure that you don't include that field in your vectorization). When you want to get the chunks belonging to a document, simply use a GET + filter for that ID (sketched below)
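
A sketch of the second option with the Python v3 client, assuming a plain text property called `document` on the Chunk class that holds the parent document's ID (names are illustrative):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

where_filter = {
    "path": ["document"],          # plain (non-vectorized) property holding the document ID
    "operator": "Equal",
    "valueText": "doc-123",        # placeholder document ID
}

result = (
    client.query.get("Chunk", ["text"])   # "text" is a placeholder property
    .with_additional(["id"])
    .with_where(where_filter)
    .with_limit(100)
    .do()
)
chunk_ids = [obj["_additional"]["id"] for obj in result["data"]["Get"]["Chunk"]]
```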

Hi @Dirk, thanks, but either I don't understand your proposed solution completely, or it unfortunately won't solve our issue.

We currently already have the following two schemas:

  1. WeaviateDocument, which has information about the document
  2. WeaviateChunk, which has the chunk's text, as well as an inDocument cross-reference to a WeaviateDocument.

When we want to update, we follow the steps outlined in my initial message.
(Step 1 already contains a filter like here).

However, if we start loading the existing chunks batch by batch, we get an error instead of more results as soon as limit + offset > QUERY_MAXIMUM_RESULTS. That's where we actually run into trouble, and I don't completely understand how your proposal would help us circumvent this. Could you please elaborate?
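
(For context, the paging for step 1 looks roughly like this; the class names and the exact filter path are specific to our schema, so treat it as a sketch. Weaviate rejects the request as soon as limit + offset exceeds QUERY_MAXIMUM_RESULTS.)

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
document_uuid = "00000000-0000-0000-0000-000000000000"   # placeholder

where_filter = {
    "path": ["inDocument", "WeaviateDocument", "id"],     # filter via the cross-reference
    "operator": "Equal",
    "valueText": document_uuid,
}

chunk_ids, offset, limit = [], 0, 100
while True:
    result = (
        client.query.get("WeaviateChunk", ["text"])       # "text" is a placeholder property
        .with_additional(["id"])
        .with_where(where_filter)
        .with_limit(limit)
        .with_offset(offset)   # errors once limit + offset > QUERY_MAXIMUM_RESULTS
        .do()
    )
    objects = result["data"]["Get"]["WeaviateChunk"]
    if not objects:
        break
    chunk_ids += [obj["_additional"]["id"] for obj in objects]
    offset += limit
```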

Ahh, I assumed that you were iterating through all of your Chunks and that the class as a whole was approaching QUERY_MAXIMUM_RESULTS.

So just to be sure: you have documents that have more chunks than QUERY_MAXIMUM_RESULTS?

Yes, we have quite a few documents that have (significantly) more chunks than QUERY_MAXIMUM_RESULTS

ok, I see!

What I would try, though I'm not 100% sure whether it works or how it scales:

  • add a reference from documents => chunks
  • request a document including the reference to get all chunk IDs belonging to that document (for Python v3 you'd need to parse the beacon URI that is returned; see the sketch after this list)
  • now you can query the chunks one by one and update/delete/add new ones
  • afterwards you need to update the references (e.g. delete the ones you don't need anymore)
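
A sketch of that flow with the Python v3 client, assuming the reference property on the document class is called `hasChunks` (that name, and the UUIDs, are placeholders):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
doc_uuid = "00000000-0000-0000-0000-000000000000"   # the document's UUID (placeholder)

doc = client.data_object.get_by_id(doc_uuid, class_name="WeaviateDocument")

# Each reference comes back as a beacon such as
# weaviate://localhost/WeaviateChunk/<chunk-uuid>; the UUID is the last path segment.
chunk_ids = [
    ref["beacon"].rsplit("/", 1)[-1]
    for ref in doc["properties"].get("hasChunks", [])
]

# Last bullet: drop a reference that is no longer needed.
if chunk_ids:
    client.data_object.reference.delete(
        doc_uuid, "hasChunks", chunk_ids[0],
        from_class_name="WeaviateDocument",
        to_class_name="WeaviateChunk",
    )
```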

The other possibility would be to increase QUERY_MAXIMUM_RESULTS. It is there to limit the memory usage of Weaviate: e.g. if you do a Get query with offset=QUERY_MAXIMUM_RESULTS-100 and limit=100, Weaviate loads QUERY_MAXIMUM_RESULTS results into memory just to return the last 100 objects (unless you use the cursor API, which does not work with filters). Depending on how beefy your server is, it might be fine to increase the value. I would run an expensive test query, see how much the memory spikes, and then decide whether you can increase the value.
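
(If you go that route: QUERY_MAXIMUM_RESULTS is an environment variable on the Weaviate container, so with Docker Compose it would look roughly like the snippet below; the value is just an example, and the default is 10,000.)

```yaml
services:
  weaviate:
    image: semitechnologies/weaviate:1.24.6
    environment:
      QUERY_MAXIMUM_RESULTS: 100000   # example value; raise gradually and watch memory usage
```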

@DudaNogueira do you have another idea?

Hi @Dirk,

Thanks for the proposal. It seems like something that could work, but I would much prefer a solution that doesn't involve maintaining both a link from Chunk -> Document and a link from Document -> Chunk, as that seems like a source of a lot of unexpected behaviours.

Have you since thought about other potential solutions? Or would there be anyone else we could ask?

Hi @afstkla !

If you have a fixed id for each chunk, you could also use deterministic ids together with batch imports.

Now, when the batch finds an existing UUID it will update the object, and when it is not found it will insert it.
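
A minimal sketch of that batch-upsert pattern with the Python v3 client (property names and the UUID seed are illustrative):

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")

doc_id = "doc-123"                         # placeholder document identifier
chunks = ["chunk one", "chunk two"]        # placeholder chunk texts

client.batch.configure(batch_size=100)
with client.batch as batch:
    for text in chunks:
        batch.add_data_object(
            data_object={"text": text},
            class_name="WeaviateChunk",
            uuid=generate_uuid5(f"{doc_id}:{text}"),  # same input -> same UUID -> upsert
        )
```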

Do you believe this could help somehow?

Hi @DudaNogueira, thanks for the suggestion.

Yes, we already do that, but the challenge is removing old chunks.

Every time we update, some chunks will be out of date, and we have no way of finding them. So if we go your route, we will unfortunately keep those outdated chunks (which will contain wrong information).

So I don't think we can do that, unfortunately.

Or am I missing something?

Hum. I see.

Yeah, not sure this can be done differently from the suggested options :thinking:

Sad, this feels like a use case other users might run into… If only the cursor API allowed filtering, everything would be golden.

For now (and for other people reading this thread), we'll probably go for the following (a rough sketch in code follows the list):

  1. Grab the updated document & chunk it as normal & calculate deterministic UUIDs
  2. Create a completely new Weaviate Document
  3. Batch by batch: 1) get chunks from the old Document, 2) upload all chunks whose UUIDs haven't changed to the new Document, 3) remove the batch from the old Document
  4. Upload the remaining (aka new) Chunks to the new Document
  5. Remove the old Document
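
A rough sketch of how we imagine implementing this with the Python v3 client. It re-points unchanged chunks at the new Document via their inDocument reference instead of literally re-uploading them (our interpretation of step 3.2), and all names, UUIDs, and the filter path are placeholders:

```python
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.Client("http://localhost:8080")

OLD_DOC_UUID = "00000000-0000-0000-0000-000000000000"   # placeholder
NEW_DOC_UUID = generate_uuid5("doc-123:v2")              # placeholder (step 2)
new_chunk_uuids: set = set()   # step 1: deterministic UUIDs of the freshly chunked document

BATCH = 100
while True:
    # Because every processed batch is removed from the old Document, the offset
    # can stay at 0 and never comes near QUERY_MAXIMUM_RESULTS.
    result = (
        client.query.get("WeaviateChunk", ["text"])
        .with_additional(["id"])
        .with_where({
            "path": ["inDocument", "WeaviateDocument", "id"],
            "operator": "Equal",
            "valueText": OLD_DOC_UUID,
        })
        .with_limit(BATCH)
        .do()
    )
    objects = result["data"]["Get"]["WeaviateChunk"]
    if not objects:
        break
    for obj in objects:
        chunk_id = obj["_additional"]["id"]
        if chunk_id in new_chunk_uuids:
            # Step 3.2: keep the chunk, but make it belong to the new Document.
            client.data_object.reference.update(
                chunk_id, "inDocument", [NEW_DOC_UUID],
                from_class_name="WeaviateChunk",
            )
        else:
            # Step 3.3: the chunk no longer exists in the updated document, so remove it.
            client.data_object.delete(chunk_id, class_name="WeaviateChunk")

# Step 4: batch-upload the genuinely new chunks (deterministic UUIDs, referencing NEW_DOC_UUID).
# Step 5: finally remove the old Document object itself.
client.data_object.delete(OLD_DOC_UUID, class_name="WeaviateDocument")
```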

That's so far the best solution with the least impact on continuity we've been able to come up with (though it does require us to revamp a lot of what we've built, which is quite unfortunate).

Do you know if anything to solve issues like this is planned for any upcoming releases?