How to skip re-calculating vector embeddings for existing records during batch import

Description

I’m using the Weaviate TypeScript client on Node.js 21.7.1.

I’m using the batch import API (batcher.withObject() and then batcher.do()) to loop through a JSON dataset and load every record into the vector database, which calculates embeddings using the OpenAI vectorizer.

New entries are regularly added to the dataset, while the old ones are kept. Given that, I don’t want to re-calculate the vector embeddings for records that I have already imported (a waste of time and money). So I used Weaviate’s generateUuid5 to generate a unique ID for each object, assuming that when the batch import tries to load an existing entry, it will skip it. However, that’s probably not the case.
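To illustrate the deterministic-ID idea, here is a minimal Python sketch using only the standard library. The `stable_id` helper, the example keys, and the choice of `uuid.NAMESPACE_DNS` as the namespace are assumptions made for this illustration; it is not Weaviate's generateUuid5, only the same general UUIDv5 technique: the same input always yields the same ID.

```python
import uuid


def stable_id(record_key: str) -> str:
    """Derive a deterministic UUIDv5 from a stable record key (hypothetical helper)."""
    # NAMESPACE_DNS is an arbitrary but fixed namespace chosen for this sketch.
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, record_key))


first = stable_id("blog-post-42")
second = stable_id("blog-post-42")
other = stable_id("blog-post-43")

print(first == second)  # True: same key, same ID on every run
print(first == other)   # False: different keys give different IDs
```

The key has to be something that stays stable across runs, such as a slug or a source URL. A value that shifts when the dataset is reordered would give existing records new IDs.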

I can tell because it doesn’t throw an exception, and when I look at the objects that get created, I can see that after each import run they have a new creationTimeUnix and lastUpdateTimeUnix:

 {
    class: 'BlogPosts',
    creationTimeUnix: 1710143025495,
    id: '8881dee0-9ad3-5940-96c9-b00ebcdf487d',
    lastUpdateTimeUnix: 1710143025495,
    properties: {

Any ideas on how to batch import while ignoring objects that already exist in the collection, to avoid re-calculating their embeddings?

Server Setup Information

I’m using the Weaviate cloud hosted service.

Hey @lirantal, you could try generating deterministic IDs; then your logic around duplicates should throw an error.

That’s what I’m doing… if you read my full post, I mention that I use Weaviate’s own generateUuid5 method to do that for the object that gets inserted to the batch.

Hi!

When you provide the uuid, the batch will upsert. If the fields are the same, it shouldn’t re-vectorize your objects.

Consider the following code (in Python :grimacing:):

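import weaviate
import weaviate.classes as wvc

# Connect to a locally running Weaviate instance
client = weaviate.connect_to_local()

# Start from a clean slate (this deletes ALL collections), then create a
# collection that is vectorized by OpenAI.
client.collections.delete_all()
collection = client.collections.create(
    name="MyCollection",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
)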

import requests
import json
from weaviate.util import generate_uuid5

fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
url = f"https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}"
resp = requests.get(url)
data = json.loads(resp.text)  # Load data

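# Re-running this import with unchanged properties and the same UUIDs
# should not re-vectorize the existing objects.
with collection.batch.dynamic() as batch:
    for i, d in enumerate(data):
        batch.add_object(
            properties={
                "answer": d["Answer"],
                "question": d["Question"],
                # The "change" suffix alters this field's value; changing a
                # field is what triggers re-vectorization on the next run.
                "category": d["Category"] + "change",
            },
            # Deterministic ID derived from the record's position in the list
            uuid=generate_uuid5(i),
        )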

This will only vectorize and set the creation_time and last_update_time once.

query = collection.query.fetch_objects(
    return_metadata=wvc.query.MetadataQuery(creation_time=True, last_update_time=True)
)
for o in query.objects:
    print(o.uuid, o.metadata.creation_time, o.metadata.last_update_time)

Here the first run took 0.9, and the subsequent ones took 0.1.

Now, if you change one of the fields, it will vectorize all objects again.

As you mentioned, sending objects that haven’t changed in the batch is not optimal. The best approach would be to identify the objects that have changed and send only those.

But this is not always possible.
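Where it is possible, one way to identify new or changed objects before building the batch is to compare a hash of each object's properties against a manifest saved from the previous import. The sketch below is only an illustration under that assumption: `content_hash`, `select_changed`, the manifest format, and the sample data are all hypothetical and not part of the Weaviate client.

```python
import hashlib
import json
from typing import Dict


def content_hash(properties: dict) -> str:
    """Hash an object's properties, independent of key order."""
    canonical = json.dumps(properties, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_changed(records: Dict[str, dict], manifest: Dict[str, str]) -> Dict[str, dict]:
    """Return only the records that are new or whose properties changed."""
    return {
        object_id: props
        for object_id, props in records.items()
        if manifest.get(object_id) != content_hash(props)
    }


# Hypothetical data: object ID -> properties
records = {
    "id-1": {"question": "Q1", "answer": "A1"},
    "id-2": {"question": "Q2", "answer": "A2"},
}
# Manifest saved after a previous import that only contained id-1
manifest = {"id-1": content_hash({"answer": "A1", "question": "Q1"})}

to_send = select_changed(records, manifest)
print(sorted(to_send))  # ['id-2']
```

Only the entries in `to_send` would then be added to the batch, and the manifest would be updated with their hashes after a successful import. This depends on being able to persist the manifest between runs.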

Let me know if this helps :slight_smile:

I can confirm on my end. I think I was checking the collection creation date/time and not the objects specifically. Thanks!