Checking existence of many IDs

Description

I am using Weaviate together with Langchain as a vectorstore. When embedding new documents I have to check whether the documents have already been added to Weaviate, to avoid embedding the same document multiple times.

For this I am using the client.collections.get("MyCollection").data.exists("ID") function for EVERY document. This is a bottleneck for me, because I have to send a HEAD request for each individual document.

There are already functions for inserting and deleting many objects. An exists_many function that takes a list of IDs, sends the HEAD requests in batches, and returns a list of booleans would be awesome. Or am I missing a simpler way of checking many IDs to avoid embedding documents that are already added?

hi @Habetuz !!

Welcome to our community :hugs:

You can do an “upsert” by running a batch import with a deterministic ID, in cases where you can tie each object to a unique ID.

When you import objects using batch and provide the ID, it will update the object if it already exists, or create it otherwise:

from weaviate.util import generate_uuid5  # Generate a deterministic ID

data_rows = [{"title": f"Object {i+1}"} for i in range(5)]

collection = client.collections.get("MyCollection")

with collection.batch.dynamic() as batch:
    for data_row in data_rows:
        obj_uuid = generate_uuid5(data_row)
        batch.add_object(
            properties=data_row,
            uuid=obj_uuid
        )
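To illustrate why a deterministic ID makes a repeated import idempotent, here is a minimal sketch that uses only the standard library's uuid.uuid5 as a stand-in for generate_uuid5. The helper name doc_id, the JSON canonicalisation, and the namespace are assumptions for illustration, and the resulting IDs will not necessarily match what generate_uuid5 produces:

```python
import json
import uuid


def doc_id(properties: dict) -> uuid.UUID:
    """Hypothetical helper: derive a stable UUID from an object's properties."""
    # Sort the keys so that the same content always serialises the same way.
    canonical = json.dumps(properties, sort_keys=True)
    # uuid5 hashes (namespace, name), so equal input gives an equal UUID.
    return uuid.uuid5(uuid.NAMESPACE_DNS, canonical)


first = doc_id({"title": "Object 1", "source": "a.pdf"})
again = doc_id({"source": "a.pdf", "title": "Object 1"})  # same content, different key order
other = doc_id({"title": "Object 2", "source": "a.pdf"})  # different content

print(first == again)  # same content maps to the same ID
print(first == other)  # different content maps to a different ID
```

Because the same content always maps to the same ID, importing it a second time targets the existing object instead of creating a duplicate.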

If, despite the mentioned approach, you want to select multiple IDs at once, you can use contains_any against the object ID, like so:

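from weaviate.classes.query import Filter

# ids: the list of UUIDs you want to look up
response = collection.query.fetch_objects(
    filters=Filter.by_id().contains_any(ids),
    limit=len(ids)  # a fixed limit such as 10 would cut off the result for longer lists
)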
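Building on that filter, something like the requested exists_many could be sketched as below. This is not an existing client function: exists_many, chunked, weaviate_fetcher and the batch_size default are all hypothetical names and values chosen for illustration. The lookup is passed in as a callable so that the batching logic can be tried out without a running Weaviate instance:

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def chunked(ids: Sequence, size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of ids with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def exists_many(
    ids: Sequence,
    fetch_existing_ids: Callable[[Sequence], Iterable],
    batch_size: int = 100,
) -> List[bool]:
    """Return one boolean per ID, in input order, using one lookup per batch."""
    found = set()
    for chunk in chunked(ids, batch_size):
        # Normalise to strings so UUID objects and plain strings compare equal.
        found.update(str(u) for u in fetch_existing_ids(chunk))
    return [str(i) in found for i in ids]


def weaviate_fetcher(collection) -> Callable[[Sequence], Iterable]:
    """Hypothetical adapter: wraps the fetch_objects query shown above."""
    from weaviate.classes.query import Filter  # imported lazily, only needed here

    def fetch(chunk: Sequence) -> Iterable:
        response = collection.query.fetch_objects(
            filters=Filter.by_id().contains_any(list(chunk)),
            limit=len(chunk),
        )
        return [obj.uuid for obj in response.objects]

    return fetch


# Stand-in for a collection that already holds "a" and "c".
stored = {"a", "c"}
result = exists_many(["a", "b", "c"], lambda chunk: [i for i in chunk if i in stored], batch_size=2)
print(result)
```

With a real collection, the call would look like exists_many(ids, weaviate_fetcher(collection)), and you would then embed only the documents whose entry is False.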

Let me know if that helps!