Return a "unique file" when searching large documents

I’m exploring Weaviate to provide semantic search over the content of attachments in an enterprise application.

I currently split each attachment file (usually a few dozen to a few hundred pages) into chunks. Each chunk record also carries meta info such as page number, file name, owner application and record id.

Currently, a query using “with_near_text” usually returns multiple chunks from the same file.

I’d like it to return one chunk per document (file name) or one chunk per unique application / record Id pair.

Is this possible?

After going through the documentation, my solution is to have two Classes:

  • DocumentChunk - properties: text, pageNumber, document (reference to Document class)
  • Document - properties: name, summary, ownerId, ownerTable, appName

Query:

{
  Get {
    DocumentChunk (
      nearText: {
        concepts: ["overpressure"]
      },
      groupBy: {
        path: ["document"],
        groups: 3,
        objectsPerGroup: 2
      }
    ) {
      text
      pageNumber
      document {
        ...on Document {
          name
        }
      }
    }
  }
}

Hi @Viet_Tran - yeah I think grouping by the parent document makes sense.

Did you know you can group by the cross-referenced property?

So, depending on what your x-ref is called, you can replace “document” here with the name of your cross-referenced property.
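
If grouping on the server side is not an option, a client-side fallback is to de-duplicate the returned chunks yourself. Below is a minimal sketch in plain Python, not a Weaviate API: it assumes the chunks have already been fetched and are ordered best match first, and the helper name `first_chunk_per_key`, the field names (`fileName`, `appName`, `recordId`) and the sample data are all hypothetical, loosely following the meta info described in the question.

```python
from typing import Any, Callable, Dict, Hashable, Iterable, List

Chunk = Dict[str, Any]


def first_chunk_per_key(
    chunks: Iterable[Chunk], key: Callable[[Chunk], Hashable]
) -> List[Chunk]:
    """Keep only the first chunk seen for each key, preserving input order."""
    seen = set()
    unique: List[Chunk] = []
    for chunk in chunks:
        k = key(chunk)
        if k not in seen:
            seen.add(k)
            unique.append(chunk)
    return unique


# Hypothetical search results, assumed to be ordered best match first.
results = [
    {"fileName": "a.pdf", "pageNumber": 12, "appName": "maintenance", "recordId": 7},
    {"fileName": "a.pdf", "pageNumber": 40, "appName": "maintenance", "recordId": 7},
    {"fileName": "b.pdf", "pageNumber": 3, "appName": "maintenance", "recordId": 7},
    {"fileName": "c.pdf", "pageNumber": 1, "appName": "inspection", "recordId": 2},
]

# One chunk per file name.
by_file = first_chunk_per_key(results, key=lambda c: c["fileName"])

# One chunk per unique (application, record id) pair.
by_record = first_chunk_per_key(
    results, key=lambda c: (c["appName"], c["recordId"])
)
```

Because de-duplication shrinks the result list, this approach generally needs a larger query limit than the number of unique documents you want back, which is one reason grouping in the query itself is usually preferable.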