Size of split text leads to tensor size mismatch

I am currently using LangChain to split up the text of a PDF. Something like this:

    import { RecursiveCharacterTextSplitter } from "langchain/text_splitter"; // import path may differ depending on your LangChain version

    // Split the loaded PDF documents into ~1000-character chunks with 200 characters of overlap
    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 1000,
      chunkOverlap: 200
    });

    return await textSplitter.splitDocuments(docs);

I try to create a new item in my local database, but I run into an error like this:

    Error: usage error (500): {"error":[{"message":"update vector: fail with status 500: The size of tensor a (106) must match the size of tensor b (77) at non-singleton dimension 1"}]}

I was able to remove this error by lowering chunkSize and chunkOverlap to something like 250 and 25, respectively. I just wanted to know why this change removes the error, as I am confused as to why my chunking would cause this type of issue.
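
For reference, this is the adjusted configuration that no longer triggers the error (same splitter as above, just with the smaller values I mentioned):

    const textSplitter = new RecursiveCharacterTextSplitter({
      chunkSize: 250,
      chunkOverlap: 25
    });

    return await textSplitter.splitDocuments(docs);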

Hi @nastdev ! Welcome to our community :hugs:

I believe this question is more related to chunking techniques and LangChain, so it is a bit out of scope for this forum :frowning:

However, having huge chunks is usually not a best practice, as each chunk will carry too much meaning, leading to imprecise search results.

Your error message probably comes from your inference model. It likely has a limit on how much text it can vectorize at once, so your larger chunks may exceed that limit while the smaller ones fit, which would explain why lowering chunkSize makes the error go away.
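
If you want to check this before importing, here is a rough sketch. Note the assumptions: `splitDocs` stands for the output of `splitDocuments()`, and the token limit and the characters-per-token heuristic below are only placeholders, so check your vectorizer's documentation for the real numbers:

    // Rough sanity check: estimate tokens per chunk and flag chunks that may
    // exceed the embedding model's input limit before importing them.
    // MODEL_TOKEN_LIMIT and the ~4-characters-per-token heuristic are assumptions;
    // replace them with your model's actual values.
    const MODEL_TOKEN_LIMIT = 256;
    const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

    for (const doc of splitDocs) {
      const tokens = estimateTokens(doc.pageContent);
      if (tokens > MODEL_TOKEN_LIMIT) {
        console.warn(`Chunk may be too long for the vectorizer: ~${tokens} tokens`);
      }
    }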

There is a lot going on around chunking techniques. Check out this nice video from our own Erika Cardenas: https://www.youtube.com/watch?v=h5id4erwD4s&ab_channel=Weaviate•VectorDatabase

Let me know if this helps :slight_smile: