How to get unique results based on references

Spun · March 8, 2024, 7:41pm

Description

Hi all! I have a question about how to build a search query correctly. I have two classes in my schema:

Metadata for the document:
Document {
docname: …,
author: …,
…
}
Chunks for the documents, each with a reference:
Chunk {
textBody: …,
refToDocument: reference to the Document object
}
So, searching through the chunks works well and I get good results, but the problem appears when I want results for 30 documents: because I’m searching in the chunks, the ‘limit’ field is useless here. The only way I see is to run the search with the maximum limit, then “page” the results manually and split them into blocks of 30 documents. That is an ugly and potentially resource-hungry solution. Does Weaviate offer another way to do this?

Server Setup Information

Weaviate Server Version: 1.24.1
Deployment Method: docker
Multi Node? Number of Running Nodes: 1
Client Language and Version: Python, 4.5.1

DudaNogueira · March 8, 2024, 8:06pm

Hi @Spun ! Welcome to our community!

Let me know if I understood it correctly.

Do you want to search the chunks of 30 specific documents that you have?

If that’s the case, you can search the chunks and filter using their cross-references.

You will get all the chunks, specifically from those 30 documents, that are the closest to your query.

Let me know if this helps

Spun · March 8, 2024, 9:22pm

No, I have a different scenario. For example, I have 100K documents, each with 10 chunks (so 1M chunks in total). I search for “animals” and get 100K results: 100K chunk texts (possibly fewer, if a chunk contains two or more different sentences about animals) that contain something interesting about animals. But a few chunks refer to doc1, one chunk refers to doc2, 10 chunks refer to doc3, and so on (I don’t know the exact number of documents; it could be 10K, 40K, or 60K). I need only doc1, …, doc30 from the results, not all 10K or more documents (so I need the chunks that refer to the first 30 unique Document objects only, and of course I can’t predict which documents those will be). After that I will need doc31, …, doc60. But in my case the ‘limit’ and ‘offset’ options work on chunks only, not on documents. That’s why I’m looking for a solution similar to what I could do in SQL databases.
As I wrote previously, I see only one solution: get ALL the results from the search (there could be hundreds of thousands) and parse them in Python (or any other client) to build a “cache” for “animals”, then work with that cache until the search request changes. To me that’s a bad solution :(
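
For illustration, the client-side “cache” idea can be sketched in plain Python. This is only a sketch under assumptions, not a Weaviate feature: the `hits` list of `(chunk_id, doc_id)` pairs stands in for ranked search results whose references have already been resolved, and `group_hits_by_document` and `page_of_documents` are hypothetical helper names.

```python
from collections import OrderedDict


def group_hits_by_document(hits):
    """Group ranked (chunk_id, doc_id) hits by document, keeping rank order.

    Documents appear in the order of their best-ranked chunk.
    """
    grouped = OrderedDict()
    for chunk_id, doc_id in hits:
        grouped.setdefault(doc_id, []).append(chunk_id)
    return grouped


def page_of_documents(grouped, page, page_size=30):
    """Return (doc_id, chunk_ids) pairs for a zero-based page number."""
    doc_ids = list(grouped)[page * page_size:(page + 1) * page_size]
    return [(doc_id, grouped[doc_id]) for doc_id in doc_ids]


# Hypothetical ranked hits: (chunk_id, doc_id), best match first.
hits = [("c1", "doc3"), ("c2", "doc1"), ("c3", "doc3"), ("c4", "doc2")]
cache = group_hits_by_document(hits)
first_page = page_of_documents(cache, page=0, page_size=2)
second_page = page_of_documents(cache, page=1, page_size=2)
```

The cache is built once per query and then sliced per page, so its cost is the one large fetch described above; it does not avoid pulling all matching chunks to the client.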

Spun · March 8, 2024, 9:33pm

Hm, after thinking a little more, I see another solution:

# pseudocode
limit = 30
offset = 0
docs = []  # unique documents found so far, in rank order
while len(docs) < 30:
    results = search_animals_in_chunks(limit, offset)  # Weaviate search inside
    if not results:
        break  # no more matching chunks
    add_unique_documents_from_chunks(results, docs)  # parse references here
    offset += limit  # advance by a whole page of chunks

What do you think about this solution?
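
A self-contained sketch of that loop, for illustration only. It assumes a hypothetical `search_page(limit, offset)` callable that returns ranked `(chunk_id, doc_id)` pairs; in practice it would wrap the actual Weaviate query and resolve the refToDocument reference. Here `fake_search` and `FAKE_HITS` are stand-ins so the example runs without a server.

```python
from itertools import islice


def iter_unique_documents(search_page, batch_size=100):
    """Yield document IDs in rank order, fetching chunk hits batch by batch.

    `search_page(limit, offset)` is a hypothetical callable returning a list
    of (chunk_id, doc_id) pairs; an empty list means no more results.
    """
    seen = set()
    offset = 0
    while True:
        batch = search_page(batch_size, offset)
        if not batch:
            return  # no more chunks match the query
        for _chunk_id, doc_id in batch:
            if doc_id not in seen:
                seen.add(doc_id)
                yield doc_id
        offset += len(batch)


# Stand-in for the real search: 10 chunks spread over 4 documents.
FAKE_HITS = [(f"c{i}", f"doc{i % 4}") for i in range(10)]


def fake_search(limit, offset):
    return FAKE_HITS[offset:offset + limit]


docs = iter_unique_documents(fake_search, batch_size=3)
first_two = list(islice(docs, 2))  # first page of documents
next_two = list(islice(docs, 2))   # next page, continues where it stopped
```

Because the generator keeps its `seen` set and offset between calls, asking for the next 30 documents only fetches as many extra chunk batches as needed. Offset-based paging is presumably still bounded by the server's QUERY_MAXIMUM_RESULTS setting, so very deep pages may not be reachable this way.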

Spun · March 9, 2024, 7:09am

So, I could also upgrade the previous code to use the after parameter instead of offset. It looks more suitable for my needs. But it seems it doesn’t work with query searches; I need to check.

Spun · March 9, 2024, 8:25am

Sadly, after and sort don’t work with query searches (bm25, context, hybrid).

Spun · March 9, 2024, 8:30am

Hi @DudaNogueira ! Seems like I’ve created the topic in the wrong forum section. Could you please move it to General?
Thanks )

