[Use-case] Design recommendations in large classes

vale · September 19, 2023, 10:54pm

Context

I have a large base of articles (50M+) and I want to implement vector search on it.

Each Article has 1 title and 1 body

Each Article has a TypeA (with 5000 types. Each TypeA has 1 name and 1 description)
But a TypeA can be found in many articles

Each Article has a TypeB (with 2000 types. Each TypeB has 1 name and 1 description)
But a TypeB can be found in many articles

… (many other types)

Each Article has a country (with ~250 countries. Each country has 1 name)

Problem

I would like to make a single query with only user input.

Example:

“nanotechnologies with TypeA job”

The query might look like this:

query {
    Get {
        Article (
            nearText: {
                concepts: ["nanotechnologies with TypeA job"] # user input
            },
            limit: 10
        ) {
            title
            body
            _additional {
                id
                distance
            }
        }
    }
}

Expected result:

A list of articles that talk about nanotechnologies and have a cross reference to TypeA with TypeA similar to “job”

Known solutions but hope there are more effective ones

Solution 1: Vectorize everything in Article

I know that one solution could be to vectorize all properties of Article, together with those of its relations, directly in Article, as follows:

title
body
typeA name
typeA description
typeB name
typeB description
country name

But this requires vectorizing a LOT of redundant tokens: each description could contain between 200 and 300 tokens, and vectorizing it 50M+ times instead of 5,000 times is really inefficient.
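To put a rough number on that redundancy, here is a back-of-the-envelope sketch. It only counts the TypeA description, uses the figures quoted above (50M articles, 5,000 TypeA values, 200 to 300 tokens per description), and the names in it are purely illustrative.

```python
# Rough estimate of description tokens sent to the vectorizer.
# All figures come from the post; only the TypeA description is counted.
ARTICLES = 50_000_000
TYPE_A_COUNT = 5_000
TOKENS_PER_DESCRIPTION = (200, 300)  # lower and upper bound


def description_tokens(copies: int, tokens_per_description: int) -> int:
    """Total tokens when one description is vectorized `copies` times."""
    return copies * tokens_per_description


for tokens in TOKENS_PER_DESCRIPTION:
    per_article = description_tokens(ARTICLES, tokens)   # Solution 1: once per article
    per_type = description_tokens(TYPE_A_COUNT, tokens)  # once per distinct TypeA
    print(f"{tokens} tokens/description: {per_article:,} vs {per_type:,} "
          f"({per_article // per_type:,}x)")
```

With these numbers, embedding the description inside every article costs 10 to 15 billion tokens for TypeA alone, about 10,000 times more than vectorizing each TypeA once.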

Solution 2: Use filters

Another solution is to use filters to handle the types separately. In the UI this would appear as one text field to search for “nanotechnologies” (as in the example above) and another text field to search for “job” (as in the example above) in TypeA; the output of that second search is then used to filter the article search.

It works but it’s not the best possible user experience.
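As an illustration of what the filtered query could look like, here is a minimal sketch that builds the GraphQL text in Python. It assumes that Article has a cross-reference property named `hasTypeA` pointing to a `TypeA` class (that property name is hypothetical, not something from my actual schema), and that the TypeA names to filter on come from the second text field.

```python
import json


def type_a_condition(name: str) -> str:
    """One `where` condition on the referenced TypeA name.

    `hasTypeA` is a hypothetical cross-reference property name.
    json.dumps gives a double-quoted, escaped string literal.
    """
    return ('{path: ["hasTypeA", "TypeA", "name"], operator: Equal, valueText: %s}'
            % json.dumps(name))


def build_article_query(concept: str, type_a_names: list, limit: int = 10) -> str:
    """Build a Get query: nearText on Article, restricted to the given TypeA names."""
    if not type_a_names:
        raise ValueError("at least one TypeA name is required")
    conditions = [type_a_condition(n) for n in type_a_names]
    if len(conditions) == 1:
        where = conditions[0]
    else:
        # Several candidate types: match any of them.
        where = "{operator: Or, operands: [%s]}" % ", ".join(conditions)
    return (
        "{ Get { Article("
        "nearText: {concepts: [%s]}, where: %s, limit: %d"
        ") { title body _additional { id distance } } } }"
        % (json.dumps(concept), where, limit)
    )


print(build_article_query("nanotechnologies", ["job"]))
```

The resulting string can be sent to the GraphQL endpoint like the query above; only the `where` argument is new.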

Solution 3: Chain requests?

I haven’t tried it but I think something that can be done is:

search in typeA for “nanotechnologies with TypeA job” with distance: 0.2 and get results
search in typeB for “nanotechnologies with TypeA job” with distance: 0.2 and get results
search in typeX for “nanotechnologies with TypeA job” with distance: 0.2 and get results
search in country for “nanotechnologies with TypeA job” with distance: 0.2 and get results
apply article search filters (if any) and search for “nanotechnologies with TypeA job”

In my opinion this should be the most acceptable solution, but I’m afraid that chaining together too many requests might take a bit too long and produce a bad user experience.
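Since the type lookups do not depend on each other, they could run in parallel, so the added latency would be roughly one lookup plus the article search rather than the sum of all requests. Below is a minimal sketch of that orchestration. The two search callables are hypothetical placeholders (here replaced by fakes); in practice they would wrap the actual Weaviate queries.

```python
from concurrent.futures import ThreadPoolExecutor


def chained_search(user_input, type_classes, search_type, search_articles,
                   max_distance=0.2):
    """Look up every type class in parallel, then run one filtered article search.

    search_type(class_name, text, max_distance) -> list of matching ids
    search_articles(text, filters) -> article results
    Both callables are hypothetical and supplied by the caller.
    """
    def lookup(class_name):
        return class_name, search_type(class_name, user_input, max_distance)

    with ThreadPoolExecutor(max_workers=max(1, len(type_classes))) as pool:
        matches = dict(pool.map(lookup, type_classes))

    # Only classes that matched something become article filters.
    filters = {cls: ids for cls, ids in matches.items() if ids}
    return search_articles(user_input, filters)


# Fake backends, for illustration only.
def fake_search_type(class_name, text, max_distance):
    data = {"TypeA": ["type-a-job"], "TypeB": [], "Country": []}
    return data[class_name]


def fake_search_articles(text, filters):
    return {"query": text, "filters": filters}


result = chained_search(
    "nanotechnologies with TypeA job",
    ["TypeA", "TypeB", "Country"],
    fake_search_type,
    fake_search_articles,
)
print(result)
```

With the fakes above, only TypeA returns a match, so the article search receives a single filter on TypeA.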

Not a Solution: centroid

I know centroid can be used to search with cross reference properties, but it does not support nearText search. Any tips on how to use it? Any use cases?

Open to discussion

If anyone with experience with Weaviate has an opinion on this use case, please feel free to respond on this topic.

DudaNogueira · September 28, 2023, 8:25pm

Hi @vale

Sorry for the delay here.

I asked internally, as the question was a little abstract to understand.

Let me see if I got it right.

If you have 50M+ articles and want to do a vector search on that metadata (type, country name, etc.), you will need to vectorize it together with the articles.

You can skip the vectorization of that metadata, but that will leave its meaning out of the vector, and you probably don’t want that.

Let me know if this clarifies.

Thanks!

vale · September 29, 2023, 8:02am

Hi @DudaNogueira,

I want to search within articles in a way that is more natural for the end user: instead of applying filters in the UI, the filters are expressed in natural language in the user input.

For example:

“deal articles about technologies with company Acme with typeA xxxx, typeB yyyy, typeC zzzz”

In my opinion there are 3 main solutions, as explained above, and only 2 of them fit my use case.

Solution 1

I know I can vectorize everything, but I will end up with a lot of duplicated vectorized content, and I wanted to know if better options are available.

In my opinion this solution is the best for performance because a single request is made (maybe costly, but I can optimize that).

Solution 3

Chain multiple searches together to dynamically filter articles, but this one is perhaps the most complicated and time-consuming.

What do you think of these 2 solutions? Do you know of any Weaviate users using solution 3?

Thanks for your help

P.S. I’m thinking about a way to optimize the repetitive metadata vectorization. Maybe I could encode each metadata value as a short code and then fine-tune a GPT model so that it knows each code corresponds to a long description?

That’s not something I’ll do for the moment; I’m just exploring the possibilities with Weaviate.
