We have a setup with multiple Documents, each of which is chunked into Chunks. For some of these documents, an automated service updates the Document daily. To correctly update a document, we:
Get all UUIDs of Chunks belonging to that specific Document
Generate deterministic uuid5 identifiers to calculate the UUIDs for all new chunks
Figure out which chunks to delete and which chunks to add (sketched below)
Add only the new chunks
Delete the chunks that are no longer relevant
This allows us to:
have a fallback if any of the steps fails
not reupload unnecessary Chunks
save some cost & bandwidth
However, step 1 is giving us some challenges: to achieve it, we need to query all existing chunks. The "normal" Get with offset doesn't work above QUERY_MAXIMUM_RESULTS, so the only other option we've seen so far has been to use the Cursor API, which requires us to dump our entire Weaviate instance, and that can't be the suggested way to achieve this.
So I'm wondering how we're supposed to solve this problem. We can't find anything in the documentation so far, and we're slightly scared of the implications of increasing QUERY_MAXIMUM_RESULTS.
Add another property, document, to your Chunk class, which (for example) contains the ID of the document the chunk belongs to (make sure that you don't include that field in your vectorization). When you want to get the chunks belonging to a document, simply use a GET + filter on that ID.
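A minimal sketch of that lookup with the Python client (v3); the class name WeaviateChunk, the property name document, and the document identifier are assumptions based on this thread, not confirmed code:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumed local instance

result = (
    client.query.get("WeaviateChunk", ["text"])
    .with_where({
        "path": ["document"],           # the extra, non-vectorized ID property
        "operator": "Equal",
        "valueText": "my-document-id",  # hypothetical document identifier
    })
    .with_additional(["id"])
    .with_limit(100)
    .do()
)
chunk_ids = [c["_additional"]["id"] for c in result["data"]["Get"]["WeaviateChunk"]]
```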
Hi @Dirk, thanks, but either I don't understand your proposed solution completely, or it will unfortunately not solve our issue.
We already have the following two schemas:
WeaviateDocument, which has information about the document
WeaviateChunk, which has the chunk's text and an inDocument cross-reference to a WeaviateDocument.
When we want to update, we follow the steps outlined in my initial message.
(Step 1 already contains a filter like here).
However, if we start loading the existing chunks batch by batch, then as soon as limit + offset > QUERY_MAXIMUM_RESULTS we get an error instead of more results. That's where we actually run into the issue, and I don't completely understand how your proposal would help us circumvent it. Could you please elaborate?
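For context, this is roughly what the batch-by-batch fetch looks like (Python v3 client, reusing the client from the snippet above); the filter path reflects our schema, the batch size is arbitrary, and document_uuid stands for the UUID of the WeaviateDocument being updated:

```python
BATCH = 500
offset = 0
chunk_ids = []
while True:
    page = (
        client.query.get("WeaviateChunk", ["text"])
        .with_where({
            "path": ["inDocument", "WeaviateDocument", "id"],
            "operator": "Equal",
            "valueText": document_uuid,
        })
        .with_additional(["id"])
        .with_limit(BATCH)
        .with_offset(offset)
        .do()
    )
    chunks = page["data"]["Get"]["WeaviateChunk"]
    if not chunks:
        break
    chunk_ids.extend(c["_additional"]["id"] for c in chunks)
    # This is where it breaks: once offset + BATCH exceeds QUERY_MAXIMUM_RESULTS
    # (default 10,000), Weaviate returns an error instead of the next page.
    offset += BATCH
```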
What I would try, though I'm not 100% sure whether it works or how it scales:
add a reference from documents => chunks
request a document including the reference to get all chunk IDs belonging to that document (for Python v3 you'd need to parse the URI that is returned; see the sketch below)
now you can query the chunks one by one and update/delete/add new ones
afterwards you need to update the references (e.g. delete the ones you don't need anymore)
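A minimal sketch of that reference-based lookup in the Python v3 client, assuming the Document → Chunk reference property is called hasChunks (that name is not from this thread):

```python
# Fetch the document object directly; cross-references come back as beacon URIs.
doc = client.data_object.get_by_id(document_uuid, class_name="WeaviateDocument")

chunk_ids = [
    ref["beacon"].rsplit("/", 1)[-1]   # beacon looks like weaviate://localhost/WeaviateChunk/<uuid>
    for ref in doc["properties"].get("hasChunks", [])
]
```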
Another possibility would be to increase QUERY_MAXIMUM_RESULTS. It is there to limit the memory usage of Weaviate: e.g. if you do a Get query with offset=QUERY_MAXIMUM_RESULTS-100 and limit=100, it loads QUERY_MAXIMUM_RESULTS results into memory just to return the last 100 objects (unless you use the cursor API, which does not work with filters). Depending on how beefy your server is, it might be OK to increase the value. I would run an expensive test query, see how much the memory spikes, and then decide whether you can increase the value.
Thanks for the proposal. It does seem like something that could work, but I would strongly prefer a solution that doesn't involve maintaining both a link from Chunk → Document and a link from Document → Chunk, as that seems like a source of a lot of unexpected behaviour.
Have you since thought about other potential solutions? Or would there be anyone else we could ask?
Yes, we already do that, but the challenge is removing old chunks.
Every time we update, some chunks will be out of date, and we have no way of finding these. So if we go your route, then unfortunately we will keep those outdated chunks (which will contain wrong information).
Sad, this feels like a use case other users might run into… If only the cursor API allowed filtering, everything would be golden.
For now (and for other people reading this thread), we'll probably go for:
Grab the updated document & chunk it as normal & calculate deterministic UUIDs
Create a completely new Weaviate Document
Batch by batch: 1) get chunks from the old Document, 2) upload all chunks whose UUIDs haven't changed to the new Document, 3) remove the batch from the old Document
Upload the remaining (aka new) Chunks to the new Document
Remove the old Document
That's so far the best solution with the least impact on continuity we've been able to come up with, though it does require us to revamp a lot of the stuff we've built, which is quite unfortunate. A rough sketch of the batch-by-batch part follows below.
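Sketch of step 3 with the Python v3 client; class and property names follow this thread, while the content-based UUID mapping, batch size, and beacon details are simplifications rather than our exact code:

```python
BATCH = 200

def migrate_chunks(client, old_doc_uuid: str, new_doc_uuid: str, new_chunks: dict):
    """new_chunks maps deterministic chunk UUID -> chunk text for the fresh version."""
    while True:
        # No offset needed: every chunk in a batch is either moved to the new
        # document or deleted, so the old document's result set keeps shrinking
        # and we never get near QUERY_MAXIMUM_RESULTS.
        page = (
            client.query.get("WeaviateChunk", ["text"])
            .with_where({
                "path": ["inDocument", "WeaviateDocument", "id"],
                "operator": "Equal",
                "valueText": old_doc_uuid,
            })
            .with_additional(["id"])
            .with_limit(BATCH)
            .do()
        )
        chunks = page["data"]["Get"]["WeaviateChunk"]
        if not chunks:
            break

        stale = []
        with client.batch as batch:
            for c in chunks:
                uid = c["_additional"]["id"]
                if uid in new_chunks:
                    # Unchanged chunk: re-import it under the same UUID, now pointing
                    # at the new document (batch imports overwrite existing UUIDs).
                    batch.add_data_object(
                        {
                            "text": new_chunks[uid],
                            "inDocument": [
                                {"beacon": f"weaviate://localhost/WeaviateDocument/{new_doc_uuid}"}
                            ],
                        },
                        "WeaviateChunk",
                        uuid=uid,
                    )
                else:
                    stale.append(uid)
        for uid in stale:
            client.data_object.delete(uid, class_name="WeaviateChunk")
```

After this loop, the genuinely new chunks (UUIDs not present on the old Document) are uploaded against the new Document, and the old Document object is deleted, as in steps 4 and 5 above.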
Do you know if anything to solve issues like this is planned for any upcoming releases?