Best practice where multiple records have exactly the same vector - deduplicate or keep separate?

Best strategy for deduplicating vectors

I am building a database of costs of medical procedures.
Sometimes multiple items have the same text string, e.g. “blood draw”, and therefore the same embedding.

E.g. I might have an entry for “blood draw” in ten separate locations.

I am considering two architectures:

  1. Cost datapoints are grouped together by vectors in Weaviate. Information about the different costs for different locations would be stored inside some kind of structured field or JSON. So “blood draw” cost data in CA, FL and NY would all be in the same record in Weaviate.

  2. Every cost data point is a separate item in Weaviate, meaning that there is a separate record for “blood draw” in CA, “blood draw” in FL, etc.
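To make the difference concrete, here is a minimal sketch of the two data shapes in plain Python. The sample rows, cost values, and the field names `procedure`, `location`, `cost` and `costs` are made up for illustration and are not taken from a real schema.

```python
from collections import defaultdict

# Hypothetical flat cost data points: (procedure text, location, cost in USD).
rows = [
    ("blood draw", "CA", 25.0),
    ("blood draw", "FL", 18.0),
    ("blood draw", "NY", 30.0),
    ("chest x-ray", "CA", 120.0),
]

# Approach 2: one object (and therefore one vector) per cost data point.
separate = [{"procedure": p, "location": loc, "cost": c} for p, loc, c in rows]

# Approach 1: one object (and one vector) per distinct procedure text,
# with the per-location costs nested inside that object.
grouped_map = defaultdict(list)
for p, loc, c in rows:
    grouped_map[p].append({"location": loc, "cost": c})
grouped = [{"procedure": p, "costs": costs} for p, costs in grouped_map.items()]

# Number of objects (vectors) each approach would need to index.
print(len(separate), len(grouped))
```

With these four sample rows, approach 2 would index four vectors (three of them identical) while approach 1 would index two.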

The first approach sounds like it would keep the vector index smaller, but I don’t know whether it would also be faster.

However, deduplicating data in the way I described could get complex in terms of inserting new data, where sometimes a new record must be created and sometimes an existing record should be modified.
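The create-or-modify decision could be sketched roughly as follows. This is only an illustration under assumptions: a plain `dict` stands in for the Weaviate collection, the helper names `procedure_id` and `upsert_cost` are hypothetical, and the ID is derived deterministically from the exact text string so that identical strings always map to the same record.

```python
import uuid

# Arbitrary fixed namespace for deterministic IDs (an assumption of this sketch).
NAMESPACE = uuid.NAMESPACE_DNS


def procedure_id(text: str) -> str:
    """Derive a stable ID from the exact procedure text."""
    return str(uuid.uuid5(NAMESPACE, text))


def upsert_cost(store: dict, text: str, location: str, cost: float) -> str:
    """Create the record for `text` if it is missing, else update it in place."""
    obj_id = procedure_id(text)
    record = store.get(obj_id)
    if record is None:
        # First time this text is seen: a new record (and one new vector).
        store[obj_id] = {"procedure": text, "costs": {location: cost}}
        return "created"
    # Same text seen before: only the nested cost data changes.
    record["costs"][location] = cost
    return "updated"


store = {}  # stand-in for the real collection
first = upsert_cost(store, "blood draw", "CA", 25.0)
second = upsert_cost(store, "blood draw", "FL", 18.0)
print(first, second, len(store))
```

Because the ID depends only on the text, the insert path does not need a vector search to find out whether a record already exists. A lookup by ID is enough.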

Can someone please clarify which of the two approaches is best practice in the case of multiple records with exactly the same value for the vector embedding?

Hey @fastdatascience,

As a rule of thumb, the smaller the units, the more accurate the search will be… At the same time, more objects lead to longer import times and (since each vector also takes up some space) more storage.

Avoid storing many objects with identical vectors, especially with HNSW, to prevent performance and memory issues.

I would go with 1 rather than 2.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00 / +01:00)


Thank you Mohamed, this is really helpful. I suspected as much. I will have to write some complex logic to handle updates but it will be doable. I will take approach #1 as you suggest.
