Best practice where multiple records have exactly the same vector - deduplicate or keep separate?

fastdatascience · October 9, 2025, 12:59pm

Best strategy for deduplicating vectors

I am building a database of costs of medical procedures.
Sometimes there are multiple items with the same text string, e.g. “blood draw”, and therefore the same embedding.

For example, I might have an entry for “blood draw” in ten separate locations.

I am considering two architectures:

1. Cost data points are grouped together by vector in Weaviate. Information about the different costs for different locations would be stored inside some kind of structured field or JSON. So “blood draw” cost data in CA, FL and NY would all be in the same record in Weaviate.
2. Every cost data point is a separate item in Weaviate, meaning that there is a separate record for “blood draw” in CA, “blood draw” in FL, etc.
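
To make the two options concrete, here is a rough sketch of the record shapes in plain Python. The field names (`procedure`, `location`, `cost_usd`, `costs`) and the prices are made up for illustration and are not an actual Weaviate schema.

```python
# Hypothetical cost data points; field names and prices are placeholders.
raw_points = [
    {"procedure": "blood draw", "location": "CA", "cost_usd": 25.0},
    {"procedure": "blood draw", "location": "FL", "cost_usd": 18.0},
    {"procedure": "blood draw", "location": "NY", "cost_usd": 30.0},
]


def group_by_procedure(points):
    """Approach 1: one record per distinct text, with costs nested per location."""
    grouped = {}
    for point in points:
        record = grouped.setdefault(
            point["procedure"],
            {"procedure": point["procedure"], "costs": []},
        )
        record["costs"].append(
            {"location": point["location"], "cost_usd": point["cost_usd"]}
        )
    return list(grouped.values())


# Approach 1: a single record (and a single vector) for "blood draw".
approach_1 = group_by_procedure(raw_points)

# Approach 2: one record per data point, so three records sharing one embedding.
approach_2 = raw_points
```

With approach 1 the number of vectors equals the number of distinct text strings; with approach 2 it equals the number of data points.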

The first approach sounds like it would keep the vector index smaller, but I don’t know whether it would also be faster.

However, deduplicating data in the way I described could get complex in terms of inserting new data, where sometimes a new record must be created and sometimes an existing record should be modified.

Can someone please clarify which of the two approaches is best practice in the case of multiple records with exactly the same value for the vector embedding?

Mohamed_Shahin · October 10, 2025, 8:49am

Hey @fastdatascience,

Generally speaking, and as a rule of thumb, the smaller the units, the more accurate the search will be. At the same time, more objects lead to longer import times and (since each vector also takes up some space) more storage.

Avoid storing many objects with identical vectors, especially with HNSW, to prevent performance and memory issues.

I would go with 1 rather than 2.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00 / +01:00)

fastdatascience · October 10, 2025, 9:05am

Thank you Mohamed, this is really helpful. I suspected as much. I will have to write some complex logic to handle updates but it will be doable. I will take approach #1 as you suggest.
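
For anyone reading later, the create-or-modify logic can be kept fairly small if the record ID is derived deterministically from the text. The sketch below is only an illustration: a plain dict stands in for the Weaviate collection, and `record_id` and `upsert_cost` are hypothetical helper names, not part of any client library. Against a real collection, the dict lookup and writes would be replaced by the client's fetch, insert and update calls.

```python
import uuid


def record_id(text):
    """Derive a stable ID from the normalized text, so equal strings map to one record."""
    normalized = " ".join(text.lower().split())
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, normalized))


def upsert_cost(store, text, location, cost_usd):
    """Create the record if it is new, otherwise add or overwrite one location's cost.

    `store` is a dict standing in for the database (an assumption of this sketch).
    """
    key = record_id(text)
    record = store.get(key)
    if record is None:
        store[key] = {"procedure": text, "costs": {location: cost_usd}}
        return "created"
    record["costs"][location] = cost_usd
    return "updated"


store = {}
first = upsert_cost(store, "blood draw", "CA", 25.0)
second = upsert_cost(store, "Blood  Draw", "FL", 18.0)  # same record after normalization
```

Because the ID depends only on the normalized text, the insert path never needs a vector search to decide whether a matching record already exists.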
