Batch Upsert functionality

Hi Weaviate Team,

A common use case for our team is inserting and updating many documents (>1K) simultaneously. I wanted to know if batch upsert functionality is offered on Weaviate or if another efficient method exists. Thank you,

hi @JK_Rider !!

You can do upserts using batch.

For that, you need to use deterministic IDs. You can generate those UUIDs from one of your own IDs with generate_uuid5.

Here is a simple example:

import weaviate
from weaviate.util import generate_uuid5

client = weaviate.connect_to_local()

client.collections.delete("Test")
collection = client.collections.create(name="Test")

objects = [
    {"reference_id": 1, "content": "this is a first content"},
    {"reference_id": 2, "content": "this is a second content"}
]

with collection.batch.dynamic() as batch:
    for data_row in objects:
        batch.add_object(
            properties=data_row,
            uuid=generate_uuid5(data_row.get("reference_id"))
        )

for o in collection.query.fetch_objects().objects:
    print(o.properties)

This will output (note that auto-schema inferred reference_id as a number, so it prints as a float):

{'reference_id': 2.0, 'content': 'this is a second content'}
{'reference_id': 1.0, 'content': 'this is a first content'}

Now we upsert: the object with reference_id 1 gets new content, and reference_id 3 is a brand-new object.

objects = [
    {"reference_id": 1, "content": "this is NEW a first content"},
    {"reference_id": 3, "content": "this is a third content"}
]

with collection.batch.dynamic() as batch:
    for data_row in objects:
        batch.add_object(
            properties=data_row,
            uuid=generate_uuid5(data_row.get("reference_id"))
        )

for o in collection.query.fetch_objects().objects:
    print(o.properties)

This will output:

{'content': 'this is a second content', 'reference_id': 2.0}
{'reference_id': 3.0, 'content': 'this is a third content'}
{'reference_id': 1.0, 'content': 'this is NEW a first content'}

Let me know if this helps!

Thanks!

Awesome, @DudaNogueira, I wasn't sure if that would automatically update or error out. Thanks for the confirmation. :slight_smile:


@DudaNogueira using reference_id to upsert seems to break down with adaptive chunking (e.g., via semchunk).

Simply using a reference_id with such chunking would not guarantee that the correct chunk would be updated, since the number and content of chunks could change between each chunking operation.

Maybe using hashlib (or similar) on the whole document is needed, and then delete + recreate all chunks of the doc if the whole-doc hash has changed?
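A minimal sketch of that whole-doc-hash idea (the store and helper names here are hypothetical):

```python
import hashlib

def doc_hash(text: str) -> str:
    # Hash the full document text; if this digest changes, every chunk
    # derived from the document is potentially stale.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hashes recorded the last time each document was indexed (hypothetical store).
stored_hashes = {"doc-42": doc_hash("original document text")}

def needs_reindex(doc_id: str, text: str) -> bool:
    # Re-chunk, delete the old chunks, and re-insert only when the
    # whole document changed (or was never seen before).
    return stored_hashes.get(doc_id) != doc_hash(text)
```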

hi @Nick_Youngblut !!

Welcome to our community :hugs:

That's a common solution. Here is a code sample for that:

from weaviate.util import generate_uuid5

# The UUID is derived from the property values themselves, so the
# same content always produces the same ID.
properties = {
    "text": "this is a text",
    "category": "cat1"
}
uuid = generate_uuid5(properties)
print(uuid)

So when the property values change, the generated UUID changes too: you can delete the old object and insert the updated one accordingly.
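To illustrate with a stdlib-only sketch (uuid.uuid5 over the object's string form, which is roughly what generate_uuid5 does under the hood):

```python
import uuid

def content_uuid(props: dict) -> str:
    # Rough stdlib analog of weaviate.util.generate_uuid5: a uuid5 over
    # the object's string form, so identical content always yields an
    # identical ID, and changed content yields a different one.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, str(props)))

old_id = content_uuid({"text": "this is a text", "category": "cat1"})
new_id = content_uuid({"text": "this is an edited text", "category": "cat1"})
# old_id != new_id, so you would delete the object stored at old_id
# and insert the edited object at new_id.
```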

Let me know if that helps!

Thanks!

Thanks @DudaNogueira !

I'm still not seeing how an upsert workflow can feasibly work with adaptive chunking of documents.

Since the number and content of the chunks can change, how could one upsert specific chunks? Must one label all chunks from a document with the same "parent ID" and then delete all of those chunks? How else would one guarantee that all of the doc's chunks have been properly upserted (or deleted) as the chunks change in subsequent rounds of chunking?

Any time :slight_smile:

For chunking, there isn't a one-size-fits-all recipe.

But definitely: if part of a document has changed and your chunking strategy is not static, you need to either delete all of that document's chunks and reindex it, or try updating only the affected chunks.

And depending on your use case it may be ok to have some overlap.
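As a hedged sketch of that delete-then-reindex approach, assuming each chunk object stores a parent_id property (the property name and helpers below are illustrative, not a fixed Weaviate convention):

```python
import uuid

def chunk_uuid(parent_id: str, index: int) -> str:
    # Deterministic per-chunk ID from the parent doc ID plus the chunk's
    # position. Safe only because stale chunks are deleted first, since
    # re-chunking can change how many chunks a document produces.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, f"{parent_id}:{index}"))

def reindex_document(collection, parent_id: str, chunks: list[str]) -> None:
    # Delete every chunk tagged with this parent, then insert the new set.
    from weaviate.classes.query import Filter  # weaviate-client v4

    collection.data.delete_many(
        where=Filter.by_property("parent_id").equal(parent_id)
    )
    with collection.batch.dynamic() as batch:
        for i, chunk in enumerate(chunks):
            batch.add_object(
                properties={"parent_id": parent_id, "content": chunk},
                uuid=chunk_uuid(parent_id, i),
            )
```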

Anyway, have you seen our Weaviate Academy course on that subject?

Thanks!