Context
I have a large base of articles (50M+) and I want to implement vector search on it.
Each Article has 1 title and 1 body.
Each Article has a TypeA (there are 5,000 types; each TypeA has 1 name and 1 description).
A given TypeA can be shared by many articles.
Each Article has a TypeB (there are 2,000 types; each TypeB has 1 name and 1 description).
A given TypeB can be shared by many articles.
… (many other types)
Each Article has a country (there are ~250 countries; each country has 1 name).
Problem
I would like to run a single query built only from the user's input.
Example:
“nanotechnologies with TypeA job”
The query might look like this:
query {
  Get {
    Article (
      nearText: {
        concepts: ["nanotechnologies with TypeA job"] # user input
      },
      limit: 10
    ) {
      title
      body
      _additional {
        id
        distance
      }
    }
  }
}
Expected result:
A list of articles that talk about nanotechnologies and have a cross-reference to a TypeA that is semantically similar to “job”.
Known solutions (I hope there are more effective ones)
Solution 1: Vectorize everything in Article
I know that one solution could be to vectorize all properties of Article, plus the properties of its related objects, directly in Article, as follows:
- title
- body
- typeA name
- typeA description
- typeB name
- typeB description
- country name
But this requires vectorizing a LOT of redundant tokens (each description can contain between 200 and 300 tokens, and vectorizing it 50M+ times instead of 5,000 times is really inefficient).
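To put a number on that redundancy, here is a rough back-of-the-envelope calculation. It uses only the figures quoted above (50M articles, 5,000 TypeA objects, 200 to 300 tokens per description) and looks at the TypeA description alone; the variable and function names are just for illustration.

```python
# Back-of-the-envelope token counts for the TypeA description only,
# using the figures from the post. Names are illustrative.
ARTICLES = 50_000_000      # "50M+" articles
TYPE_A_COUNT = 5_000       # distinct TypeA objects
DESC_TOKENS = (200, 300)   # tokens per description (low, high)


def description_tokens(copies: int) -> tuple[int, int]:
    """Tokens embedded if one description is vectorized `copies` times."""
    low, high = DESC_TOKENS
    return copies * low, copies * high


inlined = description_tokens(ARTICLES)       # description repeated in every Article
separate = description_tokens(TYPE_A_COUNT)  # description embedded once per TypeA
redundancy = ARTICLES // TYPE_A_COUNT

print(f"inlined:  {inlined[0]:,} to {inlined[1]:,} tokens")
print(f"separate: {separate[0]:,} to {separate[1]:,} tokens")
print(f"redundancy factor: {redundancy:,}x")
```

So for this one property, inlining means embedding roughly 10 to 15 billion tokens instead of 1 to 1.5 million, and the same factor applies again to every other type whose description is copied into Article.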
Solution 2: Use filters
Another solution is to search the types separately and use the result as a filter. In the UI this would be reflected as one text field to search for “nanotechnologies” and another text field to search TypeA for “job” (as in the example above). The output of the TypeA search is then used to filter the article search.
It works but it’s not the best possible user experience.
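For reference, a minimal sketch of how the second step of this approach could be assembled: a small Python helper that builds the GraphQL string combining nearText on Article with a where filter on the cross-reference. The reference property name `hasTypeA` and the TypeA names in the usage line are assumptions made up for the example, not taken from my schema, and `valueText` assumes a recent Weaviate version (older ones use `valueString`).

```python
import json

# ASSUMPTION: Article points to TypeA through a reference property named
# "hasTypeA"; replace it with the name used in your own schema.
TYPE_A_REF = "hasTypeA"


def type_a_filter(names: list[str]) -> str:
    """Return a GraphQL `where` value matching any of the given TypeA names."""
    if not names:
        raise ValueError("at least one TypeA name is required")
    clauses = [
        '{ path: ["%s", "TypeA", "name"], operator: Equal, valueText: %s }'
        % (TYPE_A_REF, json.dumps(name))
        for name in names
    ]
    if len(clauses) == 1:
        return clauses[0]
    # Several candidate types: accept an Article linked to any of them.
    return "{ operator: Or, operands: [%s] }" % ", ".join(clauses)


def build_article_query(concept: str, type_a_names: list[str], limit: int = 10) -> str:
    """Combine the free-text field (nearText) with the TypeA field (where)."""
    return (
        "{ Get { Article("
        f"nearText: {{ concepts: [{json.dumps(concept)}] }}, "
        f"where: {type_a_filter(type_a_names)}, "
        f"limit: {limit}"
        ") { title body _additional { id distance } } } }"
    )


# The two names are made up; in practice they would be the TypeA objects
# returned by the separate TypeA search for "job".
query = build_article_query("nanotechnologies", ["Job offer", "Career"])
print(query)
```

The resulting string can be sent to the GraphQL endpoint as-is. The user still has to fill in two fields, which is the UX problem mentioned above.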
Solution 3: Chain requests?
I haven’t tried it, but I think the following could work:
- search in typeA for “nanotechnologies with TypeA job” with distance: 0.2 and get results
- search in typeB for “nanotechnologies with TypeA job” with distance: 0.2 and get results
- search in typeX for “nanotechnologies with TypeA job” with distance: 0.2 and get results
- search in country for “nanotechnologies with TypeA job” with distance: 0.2 and get results
- apply article search filters (if any) and search for “nanotechnologies with TypeA job”
In my opinion this should be the most acceptable solution, but I’m afraid that chaining together too many requests might take a bit too long and produce a bad user experience.
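One thing that might limit the latency: the per-type searches do not depend on each other, so they could be sent concurrently, and the total wait would then be roughly the slowest type search plus the final article search rather than the sum of all of them. Below is a minimal sketch of that idea in Python. `search_type` is a hypothetical placeholder with canned results (not a Weaviate API), standing in for a real nearText query against each type class.

```python
from concurrent.futures import ThreadPoolExecutor

TYPE_CLASSES = ["TypeA", "TypeB", "Country"]  # plus the other type classes
MAX_DISTANCE = 0.2


def search_type(class_name: str, text: str, max_distance: float) -> list[str]:
    """HYPOTHETICAL placeholder: should run a nearText query on `class_name`
    and return the ids of the objects within `max_distance`.
    The canned data below only exists to make the sketch runnable."""
    fake_results = {"TypeA": ["typeA-id-1"], "TypeB": [], "Country": []}
    return fake_results.get(class_name, [])


def collect_type_matches(text: str) -> dict[str, list[str]]:
    """Run the per-type lookups concurrently and keep the non-empty ones."""
    with ThreadPoolExecutor(max_workers=len(TYPE_CLASSES)) as pool:
        results = pool.map(
            lambda cls: search_type(cls, text, MAX_DISTANCE), TYPE_CLASSES
        )
        matches = dict(zip(TYPE_CLASSES, results))
    # Types with no hit within the distance threshold add no filter.
    return {cls: ids for cls, ids in matches.items() if ids}


matches = collect_type_matches("nanotechnologies with TypeA job")
print(matches)
```

Each non-empty entry would then become one filter on the final Article search. Whether this is fast enough at 50M+ articles is exactly what I have not measured.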
Not a Solution: centroid
I know centroid (ref2vec-centroid) can be used to search with cross-reference properties, but it does not support nearText search. Any tips on how to use it? Any use cases?
Open to discussion
If anyone with Weaviate experience has an opinion on this use case, please feel free to reply in this topic.