Query across multiple classes

I have a use case where the data I need may be distributed across multiple classes, and I want to use the data from all of these classes in a RAG chain to generate answers for a query.

Assuming the LLM used in the RAG chain has a limit of 8k tokens, what are some good ways to get the top tokens combined from these classes (the top k tokens across all the classes, not from each class individually)?

Here is one method I found:

def fetch_and_combine_results(query: str, classes: list, per_class_limit: int = 4096) -> list:
    # Run the same hybrid query against every class and pool the hits.
    combined_results = []

    for class_name in classes:
        response = (
            client.query
            .get(class_name, ["link", "scraped_text"])
            .with_hybrid(query, alpha=0.8)
            .with_limit(per_class_limit)
            .with_additional(["distance", "id"])
            .do()
        )

        # Skip classes that returned no hits.
        if response and response['data']['Get'][class_name]:
            class_results = response['data']['Get'][class_name]
            combined_results.extend(class_results)

    return combined_results



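def sort_and_limit_results(results: list, token_limit: int = 4096) -> list:
    # Sort the pooled hits in place, closest (smallest distance) first.
    results.sort(key=lambda x: x['_additional']['distance'])

    limited_results = []
    total_tokens = 0

    for result in results:
        text = result['scraped_text']
        # Total_tokens is a token-counting helper defined elsewhere (not shown).
        tokens = Total_tokens(text)

        # Stop at the first hit that would exceed the token budget.
        if total_tokens + tokens > token_limit:
            break

        limited_results.append(result)
        total_tokens += tokens

    return limited_results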

Are there any better ways to do this?
Thank you for your help!!

hi @Ansh_Gaur !

Welcome to our community :hugs:

Unless you use cross references (and maybe ref2vec?) and some modelling, you will not be able to query the two collections at once and get a single score/distance.

You will, of course, be able to run two or more separate queries, as you are doing.

Now, you need to keep in mind that, when doing a hybrid search, the fusion algorithm will kick in.

So you are not re-sorting the results of the two queries by cosine vector distance, but by a score that is normalized within each query, relative to that query's first result (the score of the first object in each query is 1), as explained in that blog post.
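To make that concrete, here is a toy sketch in plain Python. It is not Weaviate's actual fusion code: `normalize_by_top` and the raw scores are made up for illustration, and the only assumption is the one stated above, that each query's scores end up scaled so its best hit gets 1.

```python
def normalize_by_top(raw_scores):
    """Scale one query's scores so that its best hit gets exactly 1.0.

    Hypothetical helper: a simplified stand-in for per-query score
    normalization, not Weaviate's real fusion implementation.
    """
    top = max(raw_scores)
    return [score / top for score in raw_scores]


# Made-up raw scores: class A matches the query well, class B barely.
class_a_raw = [0.90, 0.60, 0.30]
class_b_raw = [0.10, 0.05]

class_a_norm = normalize_by_top(class_a_raw)
class_b_norm = normalize_by_top(class_b_raw)

# Both best hits now score 1.0, so sorting the pooled list by score can
# no longer tell that class B's best hit was a much weaker match.
print(class_a_norm[0], class_b_norm[0])
```

In other words, a score from one class and a score from another are not measured on the same scale, so ordering the combined list by them can be misleading.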

Maybe sorting by the distance (or doing a nearText with autocut instead of hybrid) can get you a better, different sorting. Or not hehehe
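If you go the nearText route, merging by distance could look roughly like the sketch below. It assumes each class was queried separately with `distance` requested, that every class uses the same vectorizer and distance metric (otherwise the distances are not comparable), and that hits have the v3 response shape. `merge_by_distance`, the class names, and the sample hits are hypothetical.

```python
import heapq
from itertools import islice


def merge_by_distance(per_class_hits, limit):
    """Merge per-class hit lists into one list ordered by distance.

    `per_class_hits` maps a class name to that class's hits. Each list
    must already be sorted by ascending distance (as a vector search
    returns them) and carry hit["_additional"]["distance"].
    """
    merged = heapq.merge(
        *per_class_hits.values(),
        key=lambda hit: hit["_additional"]["distance"],
    )
    # heapq.merge is lazy, so only the first `limit` hits are consumed.
    return list(islice(merged, limit))


# Made-up hits standing in for response['data']['Get'] of two queries.
hits = {
    "Articles": [
        {"link": "a1", "_additional": {"distance": 0.12}},
        {"link": "a2", "_additional": {"distance": 0.40}},
    ],
    "Blogs": [
        {"link": "b1", "_additional": {"distance": 0.25}},
        {"link": "b2", "_additional": {"distance": 0.31}},
    ],
}

top_hits = merge_by_distance(hits, limit=3)
print([hit["link"] for hit in top_hits])
```

The merged list can then go through the same token-budget loop as in the question.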

So something to explore :slight_smile:

Also, I see you are using the v3 client. We strongly suggest using the Python v4 client.

With that said… :wink: there is a way of getting the two results with only one HTTP request, by using raw GraphQL queries :slight_smile:
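As a rough sketch of what such a raw query might look like, the helper below builds a single GraphQL `Get` with one hybrid search per class. The function name `build_multi_class_query` and the class names are hypothetical; the properties and `alpha` are taken from the code in the question. Note that the response still comes back as one list per class, so the merging step is still up to you.

```python
import json


def build_multi_class_query(query, classes, limit=10, alpha=0.8):
    """Build one GraphQL `Get` query with a hybrid search per class."""
    # json.dumps yields a double-quoted, escaped string literal.
    query_literal = json.dumps(query)
    blocks = []
    for class_name in classes:
        blocks.append(
            f"    {class_name}(\n"
            f"      hybrid: {{query: {query_literal}, alpha: {alpha}}}\n"
            f"      limit: {limit}\n"
            "    ) {\n"
            "      link\n"
            "      scraped_text\n"
            "      _additional { score id }\n"
            "    }"
        )
    return "{\n  Get {\n" + "\n".join(blocks) + "\n  }\n}"


graphql = build_multi_class_query(
    'what is "RAG"?', ["Articles", "Blogs"], limit=5
)
print(graphql)

# With the v3 client this string would be sent in a single request via
# client.query.raw(graphql); that call is not executed in this sketch.
```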

Check here for more on that:

Let me know if that helps :slight_smile: