How to ignore re-calculation of vector embeddings in existing records during batch import

lirantal · March 11, 2024, 7:55am

Description

I’m using the Weaviate TypeScript client on Node.js 21.7.1.

I’m using the batch import logic (batcher.withObject() and then batcher.do()) to loop through a dataset stored as a JSON list and load every entry into the vector database, which calculates embeddings using the OpenAI vectorizer.

New entries keep getting added to the dataset, while the old ones are kept. Given that, I don’t want to re-calculate the vector embeddings for records that I already imported in the past (a waste of time and money). So I used Weaviate’s generateUuid5 to generate a unique ID for each object, assuming that when the batch import tries to load an existing entry, it will skip it. However, that’s probably not the case.

I can tell because it doesn’t throw an exception, and looking at the objects that get created, I can see that after each run of loading the data they have a new creationTimeUnix and lastUpdateTimeUnix:

 {
    class: 'BlogPosts',
    creationTimeUnix: 1710143025495,
    id: '8881dee0-9ad3-5940-96c9-b00ebcdf487d',
    lastUpdateTimeUnix: 1710143025495,
    properties: {

Any ideas on how I can run a batch import that ignores objects already in the collection, to avoid re-calculating their embeddings?

Server Setup Information

I’m using the Weaviate cloud hosted service.

Any additional Information

malgamves · March 11, 2024, 2:57pm

Hey @lirantal, you could try generating deterministic IDs; then your logic around duplicates should throw an error.

lirantal · March 11, 2024, 3:48pm

That’s what I’m doing… if you read my full post, I mention that I use Weaviate’s own generateUuid5 method to do that for the object that gets inserted to the batch.

DudaNogueira · March 11, 2024, 9:05pm

Hi!

When you provide the uuid, the batch will upsert. If the fields are the same, it shouldn’t re-vectorize your objects.
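To illustrate why the upsert depends on the ID being deterministic, here is a minimal sketch using only Python’s standard uuid module. It is a stand-in for the client’s generate_uuid5 helper, not its actual implementation, and the namespace name and example URLs are hypothetical. The point is that the same stable key (here, a post URL) always maps to the same UUID, so a re-import targets the existing object instead of creating a new one.

```python
import uuid

# Hypothetical namespace for this sketch; any fixed UUID works as long as it never changes.
BLOG_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "blog-posts.example")


def deterministic_id(stable_key: str) -> str:
    """Return the same UUID string every time for the same key."""
    return str(uuid.uuid5(BLOG_NAMESPACE, stable_key))


first_run = deterministic_id("https://example.com/posts/hello-world")
second_run = deterministic_id("https://example.com/posts/hello-world")
other_post = deterministic_id("https://example.com/posts/another-post")

print(first_run == second_run)  # True
print(first_run == other_post)  # False
```

Deriving the ID from a stable field such as a URL or slug, rather than from the position in the list, keeps the IDs valid even if the dataset gets reordered.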

Consider the following code (in Python):

import json

import requests
import weaviate
import weaviate.classes as wvc
from weaviate.util import generate_uuid5

client = weaviate.connect_to_local()

# One-time setup: this wipes ALL collections, so run it only once for this experiment.
client.collections.delete_all()
collection = client.collections.create(
    name="MyCollection",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
)

# Download the sample dataset.
fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
url = f"https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}"
resp = requests.get(url)
data = json.loads(resp.text)  # Load data

# Re-run this batch block to observe the upsert behavior.
with collection.batch.dynamic() as batch:
    for i, d in enumerate(data):
        batch.add_object(
            properties={
                "answer": d["Answer"],
                "question": d["Question"],
                # Edit this suffix between runs to simulate a changed field.
                "category": d["Category"] + "change",
            },
            # Deterministic UUID: the same index maps to the same ID on every run.
            uuid=generate_uuid5(i),
        )

This will vectorize the objects and set creation_time and last_update_time only once. You can check the timestamps with:

query = collection.query.fetch_objects(
    return_metadata=wvc.query.MetadataQuery(creation_time=True, last_update_time=True)
)
for o in query.objects:
    print(o.uuid, o.metadata.creation_time, o.metadata.last_update_time)

Here, the first run took 0.9 and the subsequent ones took 0.1.

Now, if you change one of the fields, it will vectorize all objects again.

As you mentioned, sending objects that haven’t changed in the batch is not optimal. The best approach would be to identify the objects that have changed and send only those.

But this is not always possible.
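When it is possible, one way to approach it is to keep a fingerprint of each object from the previous run and compare against it before building the batch. The following is only a sketch under assumptions: the function names, the record keys, and the idea of a locally stored manifest of hashes are all hypothetical and not part of Weaviate.

```python
import hashlib
import json


def fingerprint(properties: dict) -> str:
    """Stable SHA-256 hash of an object's properties (key order does not matter)."""
    canonical = json.dumps(properties, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def select_changed(records: dict, seen: dict) -> dict:
    """Return only records that are new or whose fingerprint differs from the last run."""
    return {
        key: props
        for key, props in records.items()
        if seen.get(key) != fingerprint(props)
    }


# Fingerprints saved after a previous import (hypothetical local manifest, e.g. a JSON file).
previous = {"post-1": fingerprint({"title": "Hello", "body": "First post"})}

current = {
    "post-1": {"title": "Hello", "body": "First post"},  # unchanged since last run
    "post-2": {"title": "News", "body": "Second post"},  # new entry
}

to_import = select_changed(current, previous)
print(sorted(to_import))  # ['post-2']
```

Only the records in to_import would then be added to the batch, and the manifest would be updated with their fingerprints once the import succeeds.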

Let me know if this helps.

lirantal · March 12, 2024, 6:41am

I can confirm on my end. I think I was checking the collection creation date/time and not the objects specifically. Thanks!
