Size of split text leads to tensor size mismatch

I am currently using LangChain to split up the text of a PDF. Something like this:

    import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"; // import path may differ depending on your LangChain version

    // Split the loaded PDF documents into ~1000-character chunks with 200 characters of overlap
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200
    });

    return await textSplitter.splitDocuments(docs);

I try to create a new item in my local database, but I run into an error like this:

    Error: usage error (500): {"error":[{"message":"update vector: fail with status 500: The size of tensor a (106) must match the size of tensor b (77) at non-singleton dimension 1"}]}

I was able to remove this error by lowering chunkSize and chunkOverlap to something like 250 and 25, respectively. I just wanted to know why this change removes the error, as I am confused as to why my chunking would cause this type of issue.
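
For reference, this is the adjusted configuration that no longer triggers the error (same splitter as above, just with the smaller values I mentioned):

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 250,
      chunkOverlap: 25
    });

    return await textSplitter.splitDocuments(docs);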

Hi @nastdev ! Welcome to our community :hugs:

I believe this question is more related to chunking techniques and LangChain, so it is a bit out of scope for this forum :frowning:

However, having huge chunks is usually not a best practice, as each chunk will carry too much meaning, leading to imprecise search results.

Your error message probably comes from your inference model. It likely has a limit on how much text it can vectorize at once, so your larger chunks may exceed that limit while the smaller ones fit, which would explain why lowering chunkSize makes the error go away.
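
If you want to check this before importing, here is a rough sketch. Note the assumptions: `splitDocs` stands for the output of `splitDocuments()`, and the token limit and the characters-per-token heuristic below are only placeholders, so check your vectorizer's documentation for the real numbers:

    // Rough sanity check: estimate tokens per chunk and flag chunks that may
    // exceed the embedding model's input limit before importing them.
    // MODEL_TOKEN_LIMIT and the ~4-characters-per-token heuristic are assumptions;
    // replace them with your model's actual values.
    const MODEL_TOKEN_LIMIT = 256;
    const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

    for (const doc of splitDocs) {
      const tokens = estimateTokens(doc.pageContent);
      if (tokens > MODEL_TOKEN_LIMIT) {
        console.warn(`Chunk may be too long for the vectorizer: ~${tokens} tokens`);
      }
    }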

There is a lot going on around chunking techniques. Check out this nice video from our own Erika Cardenas: https://www.youtube.com/watch?v=h5id4erwD4s&ab_channel=Weaviate•VectorDatabase

Let me know if this helps :slight_smile: