Choosing a schema for chunked documents

Let’s say I would like to create a vector index of Books, each of which has multiple Chapters.

I would like to perform embedding similarity searches on entire books but also on individual chapters.

Would it be better to create a single Book class that has a property of type object that contains an array of Chapters?

Or would it make more sense to create both a Book and a Chapter class and have Books contain an array of refs to Chapters?

I am not sure what considerations I need to take into account.

Any help appreciated!

I believe you’d need to create a separate Chapter class so that you can do vector search on it. That is what I’m doing with a Document class and a DocumentParagraph class, and it works well. I search on DocumentParagraph and pull metadata by traversing the cross-reference to Document.

Hi @mjsteele,

In your case I would consider one of the following options

Option A - BookChunks

The simplest solution would be to introduce a bit of repetition.

So, you could divide your book into chunks of text (e.g. one chunk per paragraph, or one chunk every x characters).

Then you could create a single collection that would contain:

  • title
  • book_id
  • chapter_number
  • text_chunk

But only vectorize on text_chunk.

Then you could create a new object per chunk and add it to the BookChunks collection. That would later allow you to search across all chunks of text and bring back the relevant results. And since each BookChunk object would contain the title and the chapter info, you could get all the necessary info in one place :wink:
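To make the chunking step concrete, here is a minimal plain-Python sketch that turns a book into one object per chunk. It does not call Weaviate at all. The names `BookChunk`, `split_into_chunks`, `build_book_chunks` and the `max_chars` parameter are made up for illustration, and splitting on blank lines is just one assumed way to find paragraphs.

```python
from dataclasses import dataclass, asdict
from typing import Dict, List


@dataclass
class BookChunk:
    # Hypothetical shape mirroring the fields listed above.
    title: str
    book_id: str
    chapter_number: int
    text_chunk: str  # the only field meant to be vectorized


def split_into_chunks(text: str, max_chars: int = 500) -> List[str]:
    """Split on blank lines (paragraphs), then cap each piece at max_chars."""
    chunks: List[str] = []
    for paragraph in text.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty paragraphs
        for start in range(0, len(paragraph), max_chars):
            chunks.append(paragraph[start:start + max_chars])
    return chunks


def build_book_chunks(
    title: str, book_id: str, chapters: List[str], max_chars: int = 500
) -> List[Dict]:
    """Return one dict per chunk; `chapters` is a list of chapter texts in order."""
    objects: List[Dict] = []
    for chapter_number, chapter_text in enumerate(chapters, start=1):
        for chunk in split_into_chunks(chapter_text, max_chars):
            objects.append(asdict(BookChunk(title, book_id, chapter_number, chunk)))
    return objects
```

Each returned dict repeats the title and book_id, which is exactly the bit of repetition this option trades for simpler queries. You would then insert these dicts into the collection with whichever client you use.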

If you would like to avoid multiple results coming back per book/chapter, you could use group_by (see the docs for examples).

    .with_group_by(
        # property names as listed for the collection above
        ["book_id", "chapter_number"],
        groups=5,
        objects_per_group=10
    )
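If it helps to see what grouping does to a result list, here is a rough plain-Python illustration of the idea: keep at most `groups` groups and at most `objects_per_group` hits in each. This is only a sketch of the concept, not how Weaviate implements it. The function name `group_results` is hypothetical, and it assumes the hits arrive sorted best-first and are grouped on a single property.

```python
from typing import Dict, List


def group_results(
    results: List[Dict], key: str, groups: int, objects_per_group: int
) -> Dict[str, List[Dict]]:
    """Bucket best-first search hits by `key`, capping both the number of
    groups and the number of hits kept per group."""
    grouped: Dict[str, List[Dict]] = {}
    for obj in results:
        group_key = obj[key]
        if group_key not in grouped:
            if len(grouped) >= groups:
                continue  # already have enough groups, drop this hit
            grouped[group_key] = []
        if len(grouped[group_key]) < objects_per_group:
            grouped[group_key].append(obj)
    return grouped


# Example: five chunk hits from three books, keep 2 books and 2 chunks per book.
hits = [
    {"book_id": "a", "text_chunk": "1"},
    {"book_id": "a", "text_chunk": "2"},
    {"book_id": "b", "text_chunk": "3"},
    {"book_id": "a", "text_chunk": "4"},
    {"book_id": "c", "text_chunk": "5"},
]
top = group_results(hits, "book_id", groups=2, objects_per_group=2)
```

Here `top` ends up with two groups, one for book "a" (its first two chunks) and one for book "b".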

The downside of Option A is that you would insert the same title many times, which would take a bit more space. But in exchange, you would get a solution that is simpler to follow, and since you won’t be doing any reference joins, queries should be faster too.

Option B - Books + Chunks

The other approach is what you proposed yourself. But I would skip chapters and go straight from books to text chunks.
Then your book object could contain all the metadata about the book, like the title, author, etc., while each chunk could have a reference to the parent book object.

This approach would save you from data repetition, but you would need to perform reference joins, which will make the solution more complex.
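As a rough picture of what the reference join means here, the plain-Python sketch below stores book metadata once and has each chunk point at its parent by id. The field name `book_ref`, the function `attach_book_metadata`, and the sample data are all made up for illustration. In Weaviate the link would be a cross-reference property resolved at query time, not a dict lookup.

```python
from typing import Dict, List

# Book metadata is stored once per book (hypothetical sample data).
books: Dict[str, Dict] = {
    "book-1": {"title": "Example Book", "author": "A. Author"},
}

# Each chunk carries only its text and a reference to the parent book.
chunks: List[Dict] = [
    {"book_ref": "book-1", "text_chunk": "First chunk of text."},
    {"book_ref": "book-1", "text_chunk": "Second chunk of text."},
]


def attach_book_metadata(chunk_hits: List[Dict], books: Dict[str, Dict]) -> List[Dict]:
    """Follow each chunk's reference and merge in the parent book's metadata."""
    resolved: List[Dict] = []
    for hit in chunk_hits:
        book = books[hit["book_ref"]]  # the "join" step
        resolved.append({**hit, "title": book["title"], "author": book["author"]})
    return resolved


results = attach_book_metadata(chunks, books)
```

The title and author live in one place, so updating a book is a single write, but every search that needs them pays for the extra lookup.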

You might still get multiple results per book, so again you could use group_by to merge the results per book.

Summary

There might be other possible solutions, which might depend on your exact needs.
But I hope this helps you find the right path :slight_smile:
