[Use-case] Design recommendations in large classes

Context

I have a large base of articles (50M+) and I want to implement vector search on it.

Each Article has 1 title and 1 body.

Each Article has a TypeA (there are 5,000 TypeA values; each has 1 name and 1 description).
A TypeA can be shared by many articles.

Each Article has a TypeB (there are 2,000 TypeB values; each has 1 name and 1 description).
A TypeB can be shared by many articles.

… (many other types)

Each Article has a country (there are ~250 countries; each country has 1 name).

Problem

I would like to make a single query with only user input.

Example:

“nanotechnologies with TypeA job”

The query might look like this:

query {
    Get {
        Article (
            nearText: {
                concepts: ["nanotechnologies with TypeA job"] # user input
            },
            limit: 10
        ) {
            title
            body
            _additional {
                id
                distance
            }
        }
    }
}

Expected result:

A list of articles that talk about nanotechnologies and have a cross-reference to a TypeA that is similar to “job”.

Known solutions (but I hope there are more effective ones)

Solution 1: Vectorize everything in Article

I know that one solution could be to vectorize all properties from Article, and from its relations, directly in Article as follows:

  • title
  • body
  • typeA name
  • typeA description
  • typeB name
  • typeB description
  • country name

But this requires vectorizing a LOT of redundant tokens: each description contains between 200 and 300 tokens, and vectorizing it 50M+ times instead of 5,000 times is really inefficient.
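To put rough numbers on that redundancy, here is a back-of-the-envelope calculation in Python. It assumes each article inlines exactly one TypeA description and ignores TypeB, country, and the other types, so the real overhead would be higher. The constant and function names are illustrative only.

```python
# Figures taken from the post: 50M+ articles, 5,000 TypeA values,
# 200 to 300 tokens per description.
ARTICLES = 50_000_000
TYPE_A_VALUES = 5_000
DESCRIPTION_TOKENS = (200, 300)


def description_tokens(copies: int, tokens_per_description: int) -> int:
    """Total tokens vectorized when a description is embedded `copies` times."""
    return copies * tokens_per_description


for per_desc in DESCRIPTION_TOKENS:
    inlined = description_tokens(ARTICLES, per_desc)   # description repeated in every article
    once = description_tokens(TYPE_A_VALUES, per_desc)  # description vectorized once per TypeA
    print(f"{per_desc} tokens/description: {inlined:,} inlined vs {once:,} once "
          f"({inlined // once:,}x)")
```

Whatever the description length, the inlined approach processes 10,000 times more description tokens for TypeA alone, because that factor is simply the ratio of articles to TypeA values.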

Solution 2: Use filters

Another solution is to filter on the types separately. In the UI, this would be one text field to search for “nanotechnologies” and a second text field to search TypeA for “job” (as in the example above); the TypeA results are then used to filter the article search.

It works but it’s not the best possible user experience.

Solution 3: Chain requests?

I haven’t tried it but I think something that can be done is:

  1. search in typeA for “nanotechnologies with TypeA job” with distance: 0.2 and get results
  2. search in typeB for “nanotechnologies with TypeA job” with distance: 0.2 and get results
  3. search in typeX for “nanotechnologies with TypeA job” with distance: 0.2 and get results
  4. search in country for “nanotechnologies with TypeA job” with distance: 0.2 and get results
  5. apply article search filters (if any) and search for “nanotechnologies with TypeA job”

In my opinion this should be the most acceptable solution, but I’m afraid that chaining together too many requests might take a bit too long and produce a bad user experience.
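One way to limit the latency of chaining is to notice that steps 1 to 4 do not depend on each other, so they could be sent concurrently and only step 5 has to wait. The sketch below shows that shape in Python with asyncio. It does not call Weaviate: `near_text_ids` and `FAKE_INDEX` are hypothetical stand-ins for a real nearText request with a distance threshold, and the IDs and distances are invented for illustration.

```python
import asyncio

# Hypothetical in-memory results: (object id, distance to the query) per class.
# A real implementation would get these from the vector database.
FAKE_INDEX = {
    "TypeA": [("typeA-job", 0.12), ("typeA-finance", 0.45)],
    "TypeB": [("typeB-research", 0.31)],
    "Country": [("country-fr", 0.60)],
}


async def near_text_ids(class_name: str, query: str, max_distance: float) -> list[str]:
    """Stand-in for one nearText search: IDs in class_name within max_distance."""
    await asyncio.sleep(0)  # placeholder for the network round trip
    return [obj_id for obj_id, dist in FAKE_INDEX.get(class_name, [])
            if dist <= max_distance]


async def resolve_filters(query: str, classes: list[str],
                          max_distance: float = 0.2) -> dict[str, list[str]]:
    """Run the per-class lookups (steps 1-4) concurrently instead of in sequence."""
    results = await asyncio.gather(
        *(near_text_ids(c, query, max_distance) for c in classes)
    )
    # Keep only the classes that matched something; these become article filters.
    return {c: ids for c, ids in zip(classes, results) if ids}


filters = asyncio.run(
    resolve_filters("nanotechnologies with TypeA job", ["TypeA", "TypeB", "Country"])
)
print(filters)
```

With this arrangement the total wait is roughly the slowest lookup plus the final article search (step 5, which would apply `filters`), rather than the sum of all requests. Whether that is fast enough would still need to be measured on the real data.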

Not a Solution: centroid

I know centroid can be used to search with cross reference properties, but it does not support nearText search. Any tips on how to use it? Any use cases?

Open to discussion

If anyone with experience with Weaviate has an opinion on this use case, please feel free to respond on this topic.

Hi @vale

Sorry for the delay here.

I asked internally, as the question was a little abstract to understand.

Let me see if I got it right.

If you have 50M+ articles, and want to do a vector search on those metadata (Type, and country name, etc), you will need to vectorize them together.

You can skip the vectorization of that metadata, but then its meaning will be left out of the vector, and you probably don’t want that.
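For reference, skipping is a per-property setting in the class schema. Here is a minimal sketch of such a schema as a Python dict, assuming the text2vec-openai vectorizer module and hypothetical property names (`typeAName`, `typeADescription`); neither is confirmed in this thread, so adjust both to your setup.

```python
VECTORIZER = "text2vec-openai"  # assumption: any text2vec module is configured the same way

article_class = {
    "class": "Article",
    "vectorizer": VECTORIZER,
    "properties": [
        {"name": "title", "dataType": ["text"]},
        {"name": "body", "dataType": ["text"]},
        # Hypothetical metadata properties, stored on the object but kept out of the vector.
        {"name": "typeAName", "dataType": ["text"],
         "moduleConfig": {VECTORIZER: {"skip": True}}},
        {"name": "typeADescription", "dataType": ["text"],
         "moduleConfig": {VECTORIZER: {"skip": True}}},
    ],
}


def vectorized_properties(class_def: dict) -> list[str]:
    """Names of the properties that still contribute to the object vector."""
    module = class_def["vectorizer"]
    return [
        prop["name"]
        for prop in class_def["properties"]
        if not prop.get("moduleConfig", {}).get(module, {}).get("skip", False)
    ]


print(vectorized_properties(article_class))
```

With this configuration only the title and body feed the vector, which is exactly the trade-off described above: the metadata is cheaper to store, but a nearText query can no longer match on it.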

Let me know if this clarifies.

Thanks!

Hi @DudaNogueira,

I want to search within articles in a way that is more natural for the end user: instead of applying filters in the UI, the filters would be expressed in natural language in the user input.

For example:

“deal articles about technologies with company Acme with typeA xxxx, typeB yyyy, typeC zzzz”

In my opinion, there are 3 main solutions, as explained above, and only 2 of them fit my use case.

Solution 1

I know I can vectorize everything, but that will end up with many duplicate vectors, and I wanted to know whether better options are available.

In my opinion, this solution is the best for performance because only a single request is made (it may be costly, but I can optimize that).

Solution 3

Chain multiple searches together to dynamically filter articles, but this one is perhaps the most complicated and time-consuming.

What do you think of these 2 solutions? Do you know of any Weaviate users using Solution 3?

Thanks for your help :)

P.S. I’m thinking about a way to optimize repetitive metadata vectorization. Maybe I could encode each metadata value as a short code and then fine-tune a GPT model so that it learns each code corresponds to a long description?

That’s not something I’ll do for the moment but I’m just exploring the possibilities with Weaviate.