Hi all.
I wanted to know if anyone has tried using an LLM to perform the chunking process. The data we're processing doesn't really work at the paragraph or sentence level, as it's event-driven, and each event can span a few paragraphs.
Does anyone have any feedback on this? Or a non-LLM solution?
Cost aside, the only issue I can see is hallucinations, but what's the likelihood of that when the context window is small (under 1,000 tokens) and the information is very clear?
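For context, this is roughly the kind of thing I was picturing (just a sketch; the OpenAI client, model name, and prompt are placeholders for whatever you'd actually use):

```python
# Sketch of LLM-driven chunking: ask the model to split a short text into
# event-level chunks, returned verbatim and separated by a delimiter.
from openai import OpenAI

client = OpenAI()

CHUNK_PROMPT = (
    "Split the following text into self-contained event chunks. "
    "Each event may span several paragraphs. Return the chunks verbatim, "
    "separated by a line containing only '---'. Do not rewrite or summarize."
)

def llm_chunk(text: str) -> list[str]:
    # The input here is well under 1,000 tokens, which should limit the
    # room for hallucination.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model; any capable model works
        messages=[
            {"role": "system", "content": CHUNK_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,  # deterministic splitting, no creative rewriting
    )
    return [c.strip() for c in resp.choices[0].message.content.split("---") if c.strip()]
```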
Hi! Chunking is indeed a subject without a one-size-fits-all answer, as different kinds of content may require different approaches.
There are some chunking techniques that take the semantics of the content into consideration, like keeping paragraphs together and keeping the title related to a paragraph alongside the vectorized content, as in the sketch below.
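A minimal sketch of that title-plus-paragraph idea, assuming markdown-style "#" headings mark sections (the heading format is an assumption):

```python
# Keep each paragraph together with its section title so the title is
# vectorized alongside the content.
def chunk_with_titles(text: str) -> list[str]:
    chunks, title = [], ""
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            title = block.lstrip("# ").strip()
        else:
            # Prepend the current title so it gets embedded with the paragraph.
            chunks.append(f"{title}\n{block}" if title else block)
    return chunks
```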
Maybe you could circumvent your event-spanning problem by adding metadata to your chunks.
So if you have an event that will be split into, say, four objects, and you have a way to keep some metadata about that event, you could add that info to each chunk as metadata that gets vectorized as part of the content.
This gives you vectors that sit closer to that event and can yield better results.
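A sketch of what that could look like; the field names here are made up:

```python
# One event split into several chunks, each carrying the same event metadata
# so every vector stays "close" to the event. Fields are illustrative only.
event_meta = {"event_id": "evt-042", "title": "Q3 outage postmortem", "date": "2024-07-18"}

def enrich(chunks: list[str], meta: dict) -> list[dict]:
    enriched = []
    for i, chunk in enumerate(chunks):
        enriched.append({
            # Embedding the metadata as part of the text means it is
            # vectorized together with the content.
            "text": f"{meta['title']} ({meta['date']})\n{chunk}",
            "metadata": {**meta, "chunk_index": i},
        })
    return enriched
```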
I have never seen LLMs used to summarize or help with chunking. That could be interesting.
There is, however, a technique called generative feedback loops, which indexes generated content back into the vector space, but that happens after the chunking.
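Roughly, in sketch form (summarize, embed, and vector_store are hypothetical stand-ins for your own components):

```python
# Generative feedback loop: generate content from existing chunks and index
# it back into the vector space alongside the originals.
def feedback_loop(chunks: list[str], summarize, embed, vector_store) -> None:
    for chunk in chunks:
        summary = summarize(chunk)  # e.g., an LLM call
        vector_store.add(           # hypothetical store API
            text=summary,
            vector=embed(summary),
            metadata={"source": "generated", "parent": chunk[:80]},
        )
```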