We periodically update our vector store, and we chunk the data (~2000 characters per chunk) to keep each Weaviate document small. Because of this, we can’t simply “upsert” by UUID: each Weaviate document is only one chunk of a larger parent document. For example, take a Google Doc that is 10,000 words long. We might chunk that into 20 Weaviate documents with a “parent_uuid” that matches across all of them. If the content changes somewhere in the middle of the document, every chunk boundary after the edit shifts, so the new chunks no longer line up with the old ones and we can’t tell which existing UUIDs to update.
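To make the boundary problem concrete, here is a minimal sketch in plain Python. It assumes naive fixed-size character chunking, which is a simplified stand-in for whatever chunker is actually in use; `chunk_text` and the sample text are made up for illustration.

```python
def chunk_text(text: str, size: int = 2000) -> list[str]:
    """Split text into fixed-size character chunks (the last one may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical sample document, roughly 14,000 characters long.
original = "".join(f"sentence {i}. " for i in range(1000))

# Simulate a small edit in the middle of the document.
edited = original[:5000] + "A newly inserted sentence. " + original[5000:]

old_chunks = chunk_text(original)
new_chunks = chunk_text(edited)

# Indices of chunks whose content is identical before and after the edit.
unchanged = [i for i, (a, b) in enumerate(zip(old_chunks, new_chunks)) if a == b]
print(unchanged)  # only the chunks that end before the edit point survive
```

A 27-character insertion leaves only the chunks before the edit untouched; everything from the edited chunk onward has different content, so there is no stable per-chunk ID to upsert against.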
Our solution was to use a “parent_uuid” that references the document and is available as metadata on each Weaviate document. Our “upsert” becomes a deletion based on “parent_uuid” and then new inserts.
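Roughly, the flow looks like the sketch below. `InMemoryStore` is a hypothetical stand-in for a Weaviate collection so the example runs without a server, and the method names `delete_by_parent` and `insert_many` are illustrative, not the client's API; with a real client these steps would be a filtered batch delete followed by a batch insert. The `chunk_index` and `content` properties are also assumptions.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for a Weaviate collection (not the real client)."""
    objects: dict = field(default_factory=dict)

    def delete_by_parent(self, parent_uuid: str) -> int:
        # Remove every stored chunk whose parent_uuid matches; return the count.
        doomed = [k for k, v in self.objects.items() if v["parent_uuid"] == parent_uuid]
        for key in doomed:
            del self.objects[key]
        return len(doomed)

    def insert_many(self, rows: list) -> None:
        # Each chunk gets a fresh random UUID, since chunk identity is not stable.
        for row in rows:
            self.objects[str(uuid.uuid4())] = row


def chunk_text(text: str, size: int = 2000) -> list:
    """Split text into fixed-size character chunks (the last one may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def replace_document(store: InMemoryStore, parent_uuid: str, text: str, size: int = 2000):
    """'Upsert' a parent document: delete all of its old chunks, then insert new ones."""
    deleted = store.delete_by_parent(parent_uuid)
    chunks = chunk_text(text, size)
    store.insert_many(
        [{"parent_uuid": parent_uuid, "chunk_index": i, "content": c}
         for i, c in enumerate(chunks)]
    )
    return deleted, len(chunks)


store = InMemoryStore()
first = replace_document(store, "doc-1", "a" * 4500)   # initial load: 3 chunks
second = replace_document(store, "doc-1", "b" * 2500)  # update: old chunks replaced by 2 new ones
print(first, second, len(store.objects))
```

Note that the delete and the inserts are separate steps here, so a reader could briefly see the parent document with no chunks at all between them.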
The issue we’re running into is that deleting by “parent_uuid” via the bulk API is extremely slow, even though only about 20 Weaviate documents match. We do have an index on “parent_uuid”.
Has anyone run into issues like this, or does anyone have a better solution for “upserting” chunked documents in Weaviate? Should we skip chunking and just insert the entire document, no matter its size? Is this what relationships (cross-references) in Weaviate are for? I feel like we are missing something obvious here.