Errors: text too long for vectorization. Tokens for text: 10440, max tokens per batch: 8192, ApiKey absolute token limit: 1000000

Description

Hi there, I am trying to generate some vector embeddings via Mistral AI, but I am having a few issues. First of all, I was getting a 429 error on object insertion; I fixed it by limiting the number of requests per second. Now I am getting this:

errors: text too long for vectorization. Tokens for text: 10440, max tokens per batch: 8192, ApiKey absolute token limit: 1000000

from weaviate.classes.config import Configure

client.collections.create(
    "Embeddings",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_mistral(
            name="filecontent",
            source_properties=["filecontent"],
            model="mistral-embed"
        )
    ],
    # Additional parameters not shown
)
import time

from weaviate.util import generate_uuid5

collection = client.collections.get("Embeddings")
for row in rows:
    original_name = row.OriginalName
    full_text = row.FullText

    # Create object directly
    data_row = {
        "filename": original_name,
        "filecontent": full_text
    }
    print(f"Passing file : {original_name}")
    with collection.batch.rate_limit(requests_per_minute=30) as batch:
        obj_uuid = generate_uuid5(data_row)
        batch.add_object(
            properties=data_row,
            uuid=obj_uuid  # pass the deterministic UUID that was generated
        )

    if len(collection.batch.failed_objects) > 0:
        print(collection.batch.failed_objects)
    time.sleep(30)

cursor.close()
connection.close()

Any additional Information

hi @Muhammad_Ashir !

Welcome to our community :slight_smile:

What versions are you using for both the server and client?

Can you paste the entire traceback?

This message can happen if you pass content from an object that is too big to be vectorized.


Hi @DudaNogueira, I am using the cloud version of Weaviate.
Database version: 1.26.6
As for the server, is there any solution for handling big content? I am already using batch import and limiting requests with sleep time to handle the Mistral API limits. Is there any way to limit the tokens?

Passing file : AccountTransactionsf.pdf
{'message': 'Failed to send 1 objects in a batch of 1. Please inspect client.batch.failed_objects or collection.batch.failed_objects for the failed objects.'}
[ErrorObject(message="WeaviateInsertManyAllFailedError('Every object failed during insertion. Here is the set of all errors: text too long for vectorization. Tokens for text: 10440, max tokens per batch: 8192, ApiKey absolute token limit: 1000000')", object_=_BatchObject(collection='ECMEmbeddings', vector=None, uuid='4e512aff-441d-4dbd-b31b-69f5c2e69aa1', properties={'filename': 'AccountTransactionsf.pdf', 'filecontent': 'my content is large here'}, tenant=None, references=None, index=0, retry_count=0), original_uuid=None)]

Hey, how big is your input text? You can have a look here: https://platform.openai.com/tokenizer
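Or, to check it in code, something like this gives a rough count (a sketch using OpenAI's tiktoken package; Mistral uses its own tokenizer, so treat the number as an estimate):

import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
print(f"Approximate token count: {len(encoding.encode(full_text))}")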


Hi @Dirk, it's:
Tokens: 8,230
Characters: 18,316

We do have some very large files that are expected to have more than 50,000 tokens.

Mistral has a hard limit of 8192 tokens. Any text that should be vectorized by Mistral cannot be larger than that.

Have a look into chunking to work around that: A brief introduction to chunking | Weaviate

chunk_collection_definition = {
    "class": "DEmbeddings",
    "vectorizer": "text2vec-mistral",
    "moduleConfig": {
        "generative-mistral": {}
    },
    "properties": [
        {
            "name": "chunk",
            "dataType": ["text"],
        },
        {
            "name": "filename",
            "dataType": ["text"],
        },
        {
            "name": "chunking_strategy",
            "dataType": ["text"],
            "tokenization": "field",
        }
    ]
}


client.schema.create_class(chunk_collection_definition)

I have tried this, but it says that there is no schema inside client. One more thing: right now I am using this:

# client.collections.create(
#     "DEmbeddings",
#     vectorizer_config=[
#         Configure.NamedVectors.text2vec_mistral(
#             name="filecontent",
#             source_properties=["filecontent"],
#             model="mistral-embed",
#         )
#     ],
#     # Additional parameters not shown
# )

Is there any way to define the chunking?

hi @Muhammad_Ashir !

You need to chunk your content before ingesting it into the database.
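Also, since you are on the v4 Python client (which no longer has client.schema), the v3 definition you pasted translates roughly to something like this (a sketch, not tested against your setup):

from weaviate.classes.config import Configure, DataType, Property, Tokenization

client.collections.create(
    "DEmbeddings",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_mistral(
            name="chunk",
            source_properties=["chunk"],
            model="mistral-embed",
        )
    ],
    properties=[
        Property(name="chunk", data_type=DataType.TEXT),
        Property(name="filename", data_type=DataType.TEXT),
        Property(
            name="chunking_strategy",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD,
        ),
    ],
)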

I got you, but I am concerned about fetching, because of the techniques you have mentioned in the documentation that I am following: if I save it as chunks, then in the case of a search I have to fetch the other chunks as well. I am just trying to follow this: Example part 1 - Chunking | Weaviate

hi @Muhammad_Ashir !

Not sure I understood.

If you chunk your document into, let's say, 3 chunks, you will get only the chunk that is closest to your query.
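For example, something like this with the v4 client (a sketch; the collection name and query text are placeholders):

chunks = client.collections.get("DEmbeddings")
response = chunks.query.near_text(
    query="account transactions",  # placeholder query
    limit=1,  # returns only the single closest chunk
)
print(response.objects[0].properties["chunk"])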

Hi @DudaNogueira, that's not my case here. What I want: if I have three chunks and my search is nearest to one of the chunks, then I want to get all three chunks.

We are working on a file-based system, so if we have, let's say, a large file, we convert it into three chunks: a, b, c.
Now if my query matches b, then I have to fetch the whole file to do something with it. In your example this is even explained: if you convert an object into chunks, on a match you will get the whole object. But I am not sure about the tokenization, because the method mentioned in the documentation is not working for me, as I have pasted in the code above.
I am following this: Example part 1 - Chunking | Weaviate. Can you have a look and let me know please :wink:

Tokenization and Chunking are different things.

Tokenization is about how an object property will be tokenized to be indexed.

Let's say you have a property url that is set up with tokenization word (the default).

When you create an object with the value google.com, for example, Weaviate will tokenize this value per word. This means that you will end up with both google and com.

Now, when you filter in Weaviate by property url EqualTo google.com, you will not find that object, because it doesn't have a google.com token, only google and com.

If you set the tokenization to field, then Weaviate will treat the whole value as a single token. Now you can search only for EqualTo google.com.
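In v4 client terms, such a filter would look something like this (a sketch; the collection name and the url property are just the hypothetical example from above):

from weaviate.classes.query import Filter

collection = client.collections.get("SomeCollection")  # placeholder name

# With tokenization "word", this does NOT match an object whose url is
# "google.com" (it is indexed as the tokens "google" and "com").
# With tokenization "field", it matches, since the whole value is one token.
response = collection.query.fetch_objects(
    filters=Filter.by_property("url").equal("google.com"),
)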

Now, chunking is how you will separate a big corpus of text into smaller ones. Weaviate will not do that for you.

You need to chunk it before ingesting into the database. So instead of ingesting 1 big corpus, you chunk it up into smaller ones, and each of those chunks will be an object in Weaviate.
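A minimal sketch of what that could look like, assuming simple fixed-size word chunking (the sizes are placeholders; tune them so each chunk stays well under the 8192-token limit):

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping word-based chunks."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

chunks_collection = client.collections.get("DEmbeddings")
with chunks_collection.batch.rate_limit(requests_per_minute=30) as batch:
    for i, chunk in enumerate(chunk_text(full_text)):
        batch.add_object(properties={
            "filename": original_name,
            "chunk": chunk,
        })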

Now, when you do a hybrid search, you will leverage both the vector and those tokens indexed as word and field, in order to get the best possible search.

In your case, you could, for example, group the results per document and pass them over to the front end. If the user requests it (or you can do it beforehand), you load up the surrounding chunks of that document so you can present them to the user.
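A rough sketch of that second step, fetching every chunk of a matched file (the filename value is just the example from your logs):

from weaviate.classes.query import Filter

chunks_collection = client.collections.get("DEmbeddings")
response = chunks_collection.query.fetch_objects(
    filters=Filter.by_property("filename").equal("AccountTransactionsf.pdf"),
    limit=100,
)
# Note: add something like a chunk_index property at ingest time if you
# need to reassemble the chunks in their original order.
full_document = " ".join(obj.properties["chunk"] for obj in response.objects)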

I believe Verba does something similar: GitHub - weaviate/Verba: Retrieval Augmented Generation (RAG) chatbot powered by Weaviate

Let me know if this helps :slight_smile: