Hi Weaviate Team,
A common use case for our team is inserting and updating many documents (>1K) simultaneously. I wanted to know if batch upsert functionality is offered on Weaviate or if another efficient method exists. Thank you,
hi @JK_Rider !!
You can do upserts using batch.
For that, you need to use deterministic IDs. You can generate those UUIDs from one of your own IDs with generate_uuid5.
Here is a simple example:
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.connect_to_local()

# start from a clean collection
client.collections.delete("Test")
collection = client.collections.create(name="Test")

objects = [
    {"reference_id": 1, "content": "this is a first content"},
    {"reference_id": 2, "content": "this is a second content"},
]

# deterministic UUIDs derived from reference_id make re-inserts act as upserts
with collection.batch.dynamic() as batch:
    for data_row in objects:
        batch.add_object(
            properties=data_row,
            uuid=generate_uuid5(data_row.get("reference_id")),
        )

for o in collection.query.fetch_objects().objects:
    print(o.properties)
This will output:
{'reference_id': 2.0, 'content': 'this is a second content'}
{'reference_id': 1.0, 'content': 'this is a first content'}
Now we upsert:
objects = [
    {"reference_id": 1, "content": "this is NEW a first content"},
    {"reference_id": 3, "content": "this is a third content"},
]

# same deterministic UUIDs: reference_id 1 is updated in place, 3 is newly inserted
with collection.batch.dynamic() as batch:
    for data_row in objects:
        batch.add_object(
            properties=data_row,
            uuid=generate_uuid5(data_row.get("reference_id")),
        )

for o in collection.query.fetch_objects().objects:
    print(o.properties)
This will output:
{'content': 'this is a second content', 'reference_id': 2.0}
{'reference_id': 3.0, 'content': 'this is a third content'}
{'reference_id': 1.0, 'content': 'this is NEW a first content'}
Let me know if this helps!
Thanks!
Awesome, @DudaNogueira, I wasn't sure if that automatically updated or if it would error. Thanks for the confirmation.
@DudaNogueira using reference_id to upsert seems to break down with adaptive chunking (e.g., via semchunk). Simply using a reference_id with such chunking would not guarantee that the correct chunk would be updated, since the number and content of chunks could change between each chunking operation.
Maybe using hashlib (or similar) on the whole document is needed, and then delete + recreate all chunks of the doc if the whole-doc hash has changed?
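Something like this minimal sketch (just illustrating the idea; the hashing scheme and where the hash gets stored are up for grabs):

import hashlib

def whole_doc_hash(doc_text: str) -> str:
    # a stable fingerprint of the full document text;
    # any edit anywhere in the doc changes this value
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()

# on re-ingest: if whole_doc_hash(new_text) differs from the stored hash,
# delete all of the doc's chunks and re-chunk + re-insert from scratch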
hi @Nick_Youngblut !!
Welcome to our community
That's a common solution. Here is a code sample for that:
from weaviate.util import generate_uuid5

properties = {
    "text": "this is a text",
    "category": "cat1",
}

# the UUID is derived from all property values, so any change
# to the properties produces a different UUID
uuid = generate_uuid5(properties)
print(uuid)
So, in case the property values have changed, you can delete the old object and insert the updated one accordingly.
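For example, a minimal sketch of that delete-then-insert flow (assuming you kept the old object's UUID around as old_uuid, and collection is the same collection handle as above):

new_properties = {"text": "this is an updated text", "category": "cat1"}
new_uuid = generate_uuid5(new_properties)

# the property-derived UUID changes whenever the properties change,
# so remove the stale object before inserting the new one
if new_uuid != old_uuid:
    collection.data.delete_by_id(old_uuid)
    collection.data.insert(properties=new_properties, uuid=new_uuid)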
Let me know if that helps!
Thanks!
Thanks @DudaNogueira !
I'm still not seeing how an upsert workflow can feasibly work with adaptive chunking of documents.
Since the number and content of the chunks can change, how could one upsert certain chunks? Must one label all chunks from a document with the same "parent ID" and then delete all of those chunks? How else would one guarantee that all of the doc's chunks have been properly upserted (or deleted) as the chunks change in subsequent rounds of chunking?
Any time
For chunking, there isn't a one-size-fits-all recipe.
But definitely: if one part of a document changed and your chunking strategy is not static, you need to delete the entire document and reindex it, or try updating only the affected chunks.
And depending on your use case it may be ok to have some overlap.
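As a rough sketch of the delete-and-reindex route (assuming each chunk carries a parent_id property naming its source document; chunk_document is a stand-in for whatever chunker you use):

from weaviate.classes.query import Filter
from weaviate.util import generate_uuid5

def reindex_document(collection, doc_id: str, doc_text: str):
    # drop every chunk that belongs to this document
    collection.data.delete_many(
        where=Filter.by_property("parent_id").equal(doc_id)
    )
    # re-chunk and re-insert under the same parent_id
    with collection.batch.dynamic() as batch:
        for i, chunk in enumerate(chunk_document(doc_text)):
            properties = {"parent_id": doc_id, "chunk_index": i, "content": chunk}
            batch.add_object(
                properties=properties,
                uuid=generate_uuid5(f"{doc_id}-{i}"),
            )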
Anyway, have you seen our academy course on that subject?
Thanks!