Hi Weaviate Team,
A common use case for our team is inserting and updating many documents (>1K) simultaneously. I wanted to know if batch upsert functionality is offered on Weaviate or if another efficient method exists. Thank you,
hi @JK_Rider !!
You can do upserts using batch.
For that, you need to use deterministic IDs. You can generate those UUIDs from one of your own IDs with generate_uuid5.
Here is a simple example:
import weaviate
from weaviate.util import generate_uuid5

client = weaviate.connect_to_local()

# start from a clean collection
client.collections.delete("Test")
collection = client.collections.create(name="Test")

objects = [
    {"reference_id": 1, "content": "this is a first content"},
    {"reference_id": 2, "content": "this is a second content"},
]

# deterministic UUIDs derived from reference_id make re-inserts act as upserts
with collection.batch.dynamic() as batch:
    for data_row in objects:
        batch.add_object(
            properties=data_row,
            uuid=generate_uuid5(data_row.get("reference_id")),
        )

for o in collection.query.fetch_objects().objects:
    print(o.properties)
This will output:
{'reference_id': 2.0, 'content': 'this is a second content'}
{'reference_id': 1.0, 'content': 'this is a first content'}
Now we upsert:
objects = [
    {"reference_id": 1, "content": "this is NEW a first content"},
    {"reference_id": 3, "content": "this is a third content"},
]

# same deterministic UUIDs: reference_id 1 is updated in place, 3 is newly inserted
with collection.batch.dynamic() as batch:
    for data_row in objects:
        batch.add_object(
            properties=data_row,
            uuid=generate_uuid5(data_row.get("reference_id")),
        )

for o in collection.query.fetch_objects().objects:
    print(o.properties)
This will output:
{'content': 'this is a second content', 'reference_id': 2.0}
{'reference_id': 3.0, 'content': 'this is a third content'}
{'reference_id': 1.0, 'content': 'this is NEW a first content'}
Let me know if this helps!
Thanks!
Awesome, @DudaNogueira, I wasn't sure if that automatically updated or if it would error. Thanks for the confirmation.
@DudaNogueira using reference_id to upsert seems to break down with adaptive chunking (e.g., via semchunk). Simply using a reference_id with such chunking would not guarantee that the correct chunk would be updated, since the number and content of chunks could change between each chunking operation.
Maybe using hashlib (or similar) on the whole document is needed, and then delete + recreate all chunks of the doc if the whole-doc hash has changed?
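Something like this minimal sketch (just illustrating the idea; the hashing scheme and where the hash gets stored are up for grabs):

import hashlib

def whole_doc_hash(doc_text: str) -> str:
    # a stable fingerprint of the full document text;
    # any edit anywhere in the doc changes this value
    return hashlib.sha256(doc_text.encode("utf-8")).hexdigest()

# on re-ingest: if whole_doc_hash(new_text) differs from the stored hash,
# delete all of the doc's chunks and re-chunk + re-insert from scratch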
hi @Nick_Youngblut !!
Welcome to our community
That's a common solution. Here is a code sample for that:
from weaviate.util import generate_uuid5

properties = {
    "text": "this is a text",
    "category": "cat1",
}

# the UUID is derived from all property values, so any change
# to the properties produces a different UUID
uuid = generate_uuid5(properties)
print(uuid)
So, in case the property values have changed, you can delete the old object and insert the updated one accordingly.
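For example, a minimal sketch of that delete-then-insert flow (assuming you kept the old object's UUID around as old_uuid, and collection is the same collection handle as above):

new_properties = {"text": "this is an updated text", "category": "cat1"}
new_uuid = generate_uuid5(new_properties)

# the property-derived UUID changes whenever the properties change,
# so remove the stale object before inserting the new one
if new_uuid != old_uuid:
    collection.data.delete_by_id(old_uuid)
    collection.data.insert(properties=new_properties, uuid=new_uuid)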
Let me know if that helps!
Thanks!
Thanks @DudaNogueira !
I'm still not seeing how an upsert workflow can feasibly work with adaptive chunking of documents.
Since the number and content of the chunks can change, how could one upsert certain chunks? Must one label all chunks from a document with the same "parent ID" and then delete all of those chunks? How else would one guarantee that all of the doc's chunks have been properly upserted (or deleted) as the chunks change in subsequent rounds of chunking?
Any time
For chunking, there isn't a one-size-fits-all recipe.
But definitely: if one part of a document changed and your chunking strategy is not static, you need to delete the entire document and reindex it, or try updating only the affected chunks.
And depending on your use case it may be ok to have some overlap.
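As a rough sketch of the delete-and-reindex route (assuming each chunk carries a parent_id property naming its source document; chunk_document is a stand-in for whatever chunker you use):

from weaviate.classes.query import Filter
from weaviate.util import generate_uuid5

def reindex_document(collection, doc_id: str, doc_text: str):
    # drop every chunk that belongs to this document
    collection.data.delete_many(
        where=Filter.by_property("parent_id").equal(doc_id)
    )
    # re-chunk and re-insert under the same parent_id
    with collection.batch.dynamic() as batch:
        for i, chunk in enumerate(chunk_document(doc_text)):
            properties = {"parent_id": doc_id, "chunk_index": i, "content": chunk}
            batch.add_object(
                properties=properties,
                uuid=generate_uuid5(f"{doc_id}-{i}"),
            )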
Anyway, have you seen our academy course on that subject?
Thanks!