Best strategy for deduplicating vectors
I am building a database of costs of medical procedures.
Sometimes multiple items share the same text string, e.g. “blood draw”, and therefore the same embedding.
E.g. I might have an entry for “blood draw” in ten separate locations.
I am considering two architectures:
- Cost datapoints are grouped together by vectors in Weaviate. Information about the different costs for different locations would be stored inside some kind of structured field or JSON. So “blood draw” cost data in CA, FL and NY would all be in the same record in Weaviate.
- Every cost data point is a separate item in Weaviate, meaning that there is a separate record for “blood draw” in CA, “blood draw” in FL, etc.
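To make the two options concrete, here is a rough sketch of how I picture the record shapes, written as plain Python dataclasses rather than an actual Weaviate schema. The class names, field names, and prices are all placeholders I made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GroupedRecord:
    """Option 1: one record per unique text, with costs nested by location."""
    text: str
    costs_by_location: dict = field(default_factory=dict)


@dataclass
class FlatRecord:
    """Option 2: one record per (text, location) pair."""
    text: str
    location: str
    cost: float


# Placeholder prices, not real data.
grouped = [GroupedRecord("blood draw", {"CA": 30.0, "FL": 25.0, "NY": 35.0})]

# The same information expressed as separate records.
flat = [
    FlatRecord(grouped[0].text, location, cost)
    for location, cost in grouped[0].costs_by_location.items()
]

# Option 1 embeds one string; option 2 embeds the same string three times.
unique_texts_grouped = {r.text for r in grouped}
unique_texts_flat = {r.text for r in flat}
```

With these shapes, option 1 has one vector for three cost entries, while option 2 has three records that all carry the identical vector.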
The first approach sounds like it would keep the vector index smaller, but I don’t know whether it would also be faster.
However, deduplicating data in the way I described could get complex in terms of inserting new data, where sometimes a new record must be created and sometimes an existing record should be modified.
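The create-or-modify logic I am worried about might look roughly like this. This is only a sketch under my own assumptions: a plain dict stands in for the Weaviate collection, the record ID is derived deterministically from the text with the standard library's `uuid.uuid5`, and `record_id`, `upsert_cost`, the namespace string, and the whitespace/case normalization are all hypothetical choices of mine, not anything from Weaviate.

```python
import uuid

# Hypothetical namespace so that the same text always maps to the same ID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def record_id(text: str) -> str:
    """Derive a stable ID from the text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return str(uuid.uuid5(NAMESPACE, normalized))


def upsert_cost(store: dict, text: str, location: str, cost: float) -> bool:
    """Add a cost to the grouped record for `text`.

    Returns True if a new record was created, False if an existing
    record was modified.
    """
    rid = record_id(text)
    created = rid not in store
    record = store.setdefault(rid, {"text": text, "costs": {}})
    record["costs"][location] = cost
    return created


store = {}  # stand-in for the vector database
upsert_cost(store, "blood draw", "CA", 30.0)   # creates a new record
upsert_cost(store, "Blood  draw", "FL", 25.0)  # modifies the same record
```

Because the ID depends only on the text, the insert path does not need a separate lookup by vector to decide between creating and modifying; it only needs to check whether that ID already exists.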
Can someone please clarify which of the two approaches is best practice in the case of multiple records with exactly the same value for the vector embedding?
