try:
    guid = record_dict.get("GUID", "")
    logger.info(f"GUID {guid}")
    # Deterministic UUID: the same GUID always maps to the same UUID
    obj_uuid = uuid.uuid5(namespace_uuid, guid)
    logger.info(f"UUID {obj_uuid}")
    collection = self.client.collections.get(self.collection_name)
    exists = collection.data.exists(obj_uuid)
    logger.info(f"Existence check for {obj_uuid}: {exists}")
    collection.data.insert(
        properties={
            "GUID": [guid],
            "a": "",
            "b": "",
            "c": "",
            "d": "",
        },
        uuid=obj_uuid,
    )
    records_processed += 1
    logger.info(f"Inserted record (GUID: {guid}, UUID: {obj_uuid})")
except weaviate.exceptions.UnexpectedStatusCodeError as e:
    if e.status_code == 409:
        skip_reason = "Duplicate GUID found (race condition)"
    else:
        skip_reason = f"Weaviate error: {str(e)}"
I am trying to implement the code above.
I want to insert new records into an existing collection and avoid duplicates when a record is already present in a collection that I created earlier.
When I run the application the first time, it does not detect the duplicate, but if I run it again with the same GUID, it detects it as a duplicate GUID.
Does this only check for duplicates among the records that I am newly inserting, and not against older data that is already in the collection?
Or am I missing something?
Note: I am using weaviate lib version 0.1.2, weaviate client version 4.11.1, and Weaviate server version 1.27.8.
Please clarify.
hi @Rohini_vaidya !!
The recommended way to avoid this situation is to use a deterministic ID: instead of letting Weaviate create a random UUID, you provide a UUID derived from a unique ID for your objects.
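To make "deterministic" concrete, here is a minimal sketch using only Python's standard `uuid` module, the same `uuid.uuid5(namespace, name)` call that appears in the original question. The `NAMESPACE` value and the `deterministic_uuid` helper are hypothetical names chosen for illustration; any fixed namespace works as long as it never changes between runs.

```python
import uuid

# Hypothetical namespace for this sketch; it must stay the same across runs.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid/records")


def deterministic_uuid(guid: str) -> uuid.UUID:
    """Return the same UUID every time for the same GUID string."""
    return uuid.uuid5(NAMESPACE, guid)


first = deterministic_uuid("123")
second = deterministic_uuid("123")
other = deterministic_uuid("456")

print(first == second)  # same input, same UUID
print(first == other)   # different input, different UUID
```

Because the UUID depends only on the namespace and the GUID, a record that was inserted in an earlier run gets the same UUID in a later run, which is what makes the duplicate detectable.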
When you use collection.data.insert and an object with the UUID you are inserting already exists, Weaviate will not update the object; it will raise an error about the existing object.
On the other hand, if you use batch import and provide a UUID, Weaviate will create the object, or update the existing one with the given data.
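For single inserts, one way to avoid the error is to check for the UUID first and skip the insert when it is already there. The sketch below relies only on `collection.data.exists` and `collection.data.insert`, the two calls used in the question; the `insert_if_absent` helper is a hypothetical name, and the `collection` argument is assumed to be a collection handle obtained from a connected client.

```python
import uuid
from typing import Any, Mapping


def insert_if_absent(collection: Any, properties: Mapping[str, Any], obj_uuid: uuid.UUID) -> bool:
    """Insert the object unless its UUID is already in the collection.

    Returns True when an insert was performed, False when it was skipped.
    """
    if collection.data.exists(obj_uuid):
        # An object with this UUID is already stored; treat it as a duplicate.
        return False
    collection.data.insert(properties=dict(properties), uuid=obj_uuid)
    return True
```

The check and the insert are two separate requests, so another writer could still add the same UUID in between. Keeping an exception handler around the insert, as in the original snippet, covers that case.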
In our recipes repo, we have a nice recipe on how to implement a nice batch with retry logic:
Let me know if this helps!
Thanks!
Thank you @DudaNogueira
Can I use the batch import method in Weaviate to add data to a collection while ensuring duplicate records are identified using UUID?
Currently, I am using the data.insert method, which correctly detects duplicates based on the generated UUID. However, when inserting a large number of records, this approach becomes slow.
Would the batch import method help improve performance while still preventing duplicate entries?
Hi, I have a question about generating UUIDs for a text_array.
I have a list of GUIDs: ["123", "456"]
It should generate a different deterministic UUID for each GUID; that is, GUID 123 gets uuid1 and GUID 456 gets uuid2.
Currently it combines both GUIDs and generates a UUID for ["123456"].
Please let me know an efficient solution for this.
hi @Rohini_vaidya !!
data.insert will fail when the UUID already exists. So if you have unique content to pass to generate_uuid5, you will always generate the same UUID.
Here is an example:
from weaviate.util import generate_uuid5
print(["123", "456"], generate_uuid5(["123", "456"]))
print(["456", "123"], generate_uuid5(["456", "123"]))
print("123456", generate_uuid5("123456"))
print(["123456"], generate_uuid5(["123456"]))
print("456123", generate_uuid5("456123"))
This will print:
['123', '456'] 9f5bdeb4-dc32-5f94-9689-177bf744c134
['456', '123'] d98d6139-cead-5428-8897-6cf46e496aef
123456 a52b2702-9bcf-5701-852a-2f4edc640fe1
['123456'] 14f87c42-4614-504a-88a2-6a10ff4fa6e7
456123 3b8fcdc1-2b41-514a-9a5c-2562dd5813ae
Thank you @DudaNogueira
So, to get a deterministic UUID for each element in a list, I need to access each element separately and generate a UUID for it individually.
For example, to get UUIDs for "123" and "456":
guids = ["123", "456"]
for g in guids:
    print(generate_uuid5(g))
Let me know if I am wrong here.
Thank you for your quick response.
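The per-element approach described in this post can be sketched as a small mapping from each GUID to its own UUID. To keep the example runnable without a Weaviate installation, it uses the standard library's `uuid.uuid5` with `uuid.NAMESPACE_DNS` as a stand-in for `generate_uuid5`; the `uuid_per_guid` helper is a hypothetical name, and the values it produces are not claimed to match the ones `generate_uuid5` returns.

```python
import uuid
from typing import Dict, Iterable


def uuid_per_guid(guids: Iterable[str]) -> Dict[str, str]:
    """Map each GUID string to its own deterministic UUID string."""
    # Stand-in for generate_uuid5: uuid5 over a fixed namespace.
    return {g: str(uuid.uuid5(uuid.NAMESPACE_DNS, g)) for g in guids}


mapping = uuid_per_guid(["123", "456"])
for guid, obj_uuid in mapping.items():
    print(guid, obj_uuid)
```

Each GUID gets one UUID, and calling the helper again with the same GUID returns the same value.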
Hi!
With that code you will end up with two UUIDs.
This is what you are looking for: generating one single UUID based on a list of different IDs, making sure to sort the list first:
from weaviate.util import generate_uuid5
my_ids = ["123", "456"]
my_ids_different_order = ["456", "123"]
print(generate_uuid5(sorted(my_ids)))
print(generate_uuid5(sorted(my_ids_different_order)))
Both lists will end up generating the same UUID:
9f5bdeb4-dc32-5f94-9689-177bf744c134
Let me know if this helps!
Thanks!
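The sort-before-hashing idea can also be sketched with the standard library alone. This is a variant, not the same computation as passing a sorted list to `generate_uuid5`, so the UUIDs it produces will not match the one printed above. The `uuid_for_id_set` helper is a hypothetical name, and the unit-separator character is assumed never to occur inside an ID.

```python
import uuid
from typing import Iterable


def uuid_for_id_set(ids: Iterable[str]) -> str:
    """Return one UUID for a group of IDs, regardless of their order."""
    # Sort first so that ["123", "456"] and ["456", "123"] give the same key.
    # Assumption: "\x1f" (unit separator) never appears inside an ID.
    key = "\x1f".join(sorted(ids))
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, key))


print(uuid_for_id_set(["123", "456"]) == uuid_for_id_set(["456", "123"]))
print(uuid_for_id_set(["123", "456"]) == uuid_for_id_set(["123456"]))
```

Joining with a separator keeps the group ["123", "456"] distinct from the single ID "123456", which plain concatenation would not.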
Hi @Rohini_vaidya,
This is an excellent post.
In my workshops, this is one of the topics I teach.
So, like Duda suggested, using generate_uuid5 you can generate a deterministic UUID, and if you run a batch insert, Weaviate will not add a second object for a UUID that already exists.
Here is an example from my workshop notebook:
The first time I run this, I get 100 unique objects; if I run it again, no new objects get added.
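from tqdm import tqdm
from weaviate.util import generate_uuid5

# data_2k (the source rows) and client (a connected Weaviate client)
# are assumed to be defined earlier in the notebook.
sample_100 = data_2k[0:100]
wiki = client.collections.get("Wiki")

with wiki.batch.fixed_size(batch_size=20, concurrent_requests=2) as batch:
    for item in tqdm(sample_100):
        # The same wiki_id always yields the same UUID, so re-runs add no new objects
        object_uuid = generate_uuid5(item["wiki_id"])
        batch.add_object(
            item,
            uuid=object_uuid,
        )

print(f"Wiki count: {len(wiki)}")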
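If the input itself may contain repeated IDs, it can help to drop them before batching so that each UUID is sent only once. The following is a rough sketch under that assumption: `prepare_objects` is a hypothetical helper, `uuid.uuid5` with `uuid.NAMESPACE_DNS` stands in for `generate_uuid5`, and the sample rows are made up for illustration.

```python
import uuid
from typing import Any, Dict, Iterable, List, Tuple


def prepare_objects(records: Iterable[Dict[str, Any]], key: str) -> List[Tuple[str, Dict[str, Any]]]:
    """Pair each record with a deterministic UUID, keeping the first record per UUID."""
    seen = set()
    prepared = []
    for record in records:
        # Stand-in for generate_uuid5(record[key])
        obj_uuid = str(uuid.uuid5(uuid.NAMESPACE_DNS, str(record[key])))
        if obj_uuid in seen:
            continue  # same key already queued; skip the repeat
        seen.add(obj_uuid)
        prepared.append((obj_uuid, record))
    return prepared


rows = [{"wiki_id": "a"}, {"wiki_id": "b"}, {"wiki_id": "a"}]
prepared = prepare_objects(rows, key="wiki_id")
print(len(prepared))  # 2
```

Each resulting pair could then be passed to `batch.add_object(properties=record, uuid=obj_uuid)` inside the batch context manager.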
Hi @Rohini_vaidya
This is a great explanation. Using generate_uuid5 for batch inserts is an efficient approach to avoid duplicate objects in Weaviate. The example from your workshop makes it clear how Weaviate automatically handles existing UUIDs, ensuring only unique objects are added. Thanks for sharing this insightful method.
Thank you @sebawita
I have implemented this using the insert method, but that makes inserting my data quite slow.
I will definitely try with batch import.