Description
I am using Weaviate together with LangChain as a vector store. When embedding new documents, I have to check whether they have already been added to Weaviate, to avoid embedding the same document multiple times.
For this I am using the client.collections.get("MyCollection").data.exists("ID") function for EVERY document. This is a bottleneck for me, because I have to send a HEAD request for each individual document.
There are already functions for inserting and deleting many objects, so an exists_many function that takes a list of IDs and returns a list of booleans, sending the HEAD requests in batches, would be awesome. Or am I missing a simpler way of checking many IDs to avoid embedding documents that are already added?
Hi @Habetuz!
Welcome to our community!
You can do an “upsert” by running a batch import with deterministic IDs, in cases where you can tie each object to a unique ID.
When you import objects using batch and provide the ID, Weaviate will update the object if it already exists, or create it otherwise:
from weaviate.util import generate_uuid5

# Assumes `client` is an already connected Weaviate client.
data_rows = [{"title": f"Object {i+1}"} for i in range(5)]
collection = client.collections.get("MyCollection")

with collection.batch.dynamic() as batch:
    for data_row in data_rows:
        # Generate a deterministic ID from the object's content
        obj_uuid = generate_uuid5(data_row)
        batch.add_object(
            properties=data_row,
            uuid=obj_uuid
        )
If, despite the approach above, you still want to look up multiple IDs at once, you can use a contains_any filter on the object ID, like so:
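from weaviate.classes.query import Filter

# ids: the list of UUIDs to look up
response = collection.query.fetch_objects(
    filters=Filter.by_id().contains_any(ids),
    limit=len(ids),  # a fixed limit such as 10 would cut off longer ID lists
)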
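To get the list of booleans asked for in the question, the filter query can be wrapped in a small helper. This is only a sketch, not part of the Weaviate client: exists_many, make_weaviate_fetcher, fetch_existing and chunk_size are hypothetical names, the chunk size of 100 is an arbitrary assumption, and it assumes the v4 Python client, where each returned object exposes its ID as obj.uuid.

```python
import uuid
from typing import Callable, Iterable, List, Sequence, Set


def exists_many(
    ids: Sequence[str],
    fetch_existing: Callable[[List[str]], Iterable[str]],
    chunk_size: int = 100,  # assumption: tune to your setup
) -> List[bool]:
    """Return one boolean per input ID, in input order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    # Normalise to canonical UUID strings so comparisons are reliable.
    wanted = [str(uuid.UUID(str(i))) for i in ids]
    found: Set[str] = set()
    # One query per chunk instead of one HEAD request per document.
    for start in range(0, len(wanted), chunk_size):
        chunk = wanted[start:start + chunk_size]
        found.update(str(x) for x in fetch_existing(chunk))
    return [i in found for i in wanted]


def make_weaviate_fetcher(collection) -> Callable[[List[str]], List[str]]:
    """Build a fetch_existing callable backed by a Weaviate v4 collection."""
    # Imported lazily so exists_many stays usable without weaviate installed.
    from weaviate.classes.query import Filter

    def fetch_existing(chunk: List[str]) -> List[str]:
        response = collection.query.fetch_objects(
            filters=Filter.by_id().contains_any(chunk),
            limit=len(chunk),  # at most one object per requested ID
        )
        return [str(obj.uuid) for obj in response.objects]

    return fetch_existing
```

With a connected client, it could be called as exists_many(ids, make_weaviate_fetcher(client.collections.get("MyCollection"))), and only the documents whose flag is False would then need to be embedded.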
Let me know if that helps!