Why am I getting a malformed vector error when trying to add text metadata?

Good Afternoon -

I have a local Docker-run Weaviate instance with auto-schema set to OFF (false), and the following two properties on a specific collection.

Property(name='topic_id', description=None, data_type=<DataType.TEXT: 'text'>, index_filterable=True, index_searchable=True, nested_properties=None, tokenization=<Tokenization.WORD: 'word'>, vectorizer_config=_PropertyVectorizerConfig(skip=False, vectorize_property_name=False), vectorizer='text2vec-huggingface')
Property(name='date_added', description=None, data_type=<DataType.DATE: 'date'>, index_filterable=True, index_searchable=False, nested_properties=None, tokenization=None, vectorizer_config=_PropertyVectorizerConfig(skip=False, vectorize_property_name=True), vectorizer='text2vec-huggingface')

When I attempt to enter data into these fields via my Python code,

datetime_now = datetime.now()
rfcc = datetime_now.strftime("%Y-%m-%dT%H:%M:%S+00:00")
object_id = user_collection.data.insert(properties={code_id: [TEXT], "date_added": rfcc}, vector=json_object)

I keep getting the following error:

weaviate.exceptions.WeaviateInvalidInputError: Invalid input provided: The vector you supplied was malformatted! Vector: [TEXT]

Where [TEXT] is the text I am trying to enter.

What am I doing wrong?

Thank you!

Hi @Ken_Tola,

The vector field should be a list of numerical values representing the embeddings.

The problem is that json_object (which contains text) is being inserted into the vector field. The vector field should contain embeddings.

See this example: Bring your own vectors | Weaviate - Vector Database.

Furthermore, check the insert function in the client: Insert Function Definition
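As a minimal sketch of what a well-formed vector looks like (the helper names below are hypothetical, not part of the Weaviate client): the `vector` argument must be the embedding itself, a non-empty list of numbers, never the text you want embedded.

```python
# Sketch only: `is_valid_vector` and `insert_with_own_vector` are hypothetical
# helpers; `collection` stands for a collection handle such as
# client.collections.get(name).
from typing import Sequence


def is_valid_vector(v) -> bool:
    """Return True if v looks like an embedding Weaviate would accept."""
    return (
        isinstance(v, (list, tuple))
        and len(v) > 0
        and all(isinstance(x, (int, float)) and not isinstance(x, bool) for x in v)
    )


def insert_with_own_vector(collection, properties: dict, embedding: Sequence[float]):
    """Insert an object with a caller-supplied embedding."""
    if not is_valid_vector(embedding):
        raise ValueError("vector must be a non-empty list of numbers, got: %r" % (embedding,))
    return collection.data.insert(properties=properties, vector=list(embedding))
```

Passing a string (the text itself) instead of such a list is exactly what produces the "malformatted" error above.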

I hope this helps. Have a lovely week! :slightly_smiling_face:

Thank you for your response! Are you saying that I need to create my embeddings prior to submitting everything?

I do not understand how the vector is being pulled for the data in the example provided.

I have this code, but it is just chunking the JSON - is that sufficient, or do I literally need to create the embeddings even though I have Weaviate set up to do the embeddings for me?

json_splitter = RecursiveJsonSplitter(max_chunk_size=2000)
json_docs = json_splitter.split_json(json_object, True)

When I ran the code with this added, I got the error

struct.error: required argument is not a float

Here is the overall method right now.

def add_to_vectorstore(database_name: str, collection_name: str, json_object: dict, state, property_field_name: str,
                       property_value_type: weaviate_datatypes, property_field_value: str, object_ids: list = None) -> tuple:
    """
    This method stores a JSON document in the collection provided

    Parameters
    ----------
    database_name : str
Name of the database. Should be set to the user token retrieved from MongoDB during login
    collection_name: str
        Name of the collection to which this attachment should be added
    json_object: json
        The document to be stored - must be valid JSON
    state: TypedDict
        Current state object
    property_field_name: str
        The name of the field to add
    property_value_type: weaviate_datatypes
        The value type of the field to add
    property_field_value: str
        The value of the metadata field to add
    object_ids: list, optional
        If this is an update, provide the original UUIDs

    Returns
    -------
    int
        Difference in document count after the operation
    list
        Unique IDs of the documents added/updated
    dict
        Current state object

    """
    diff_count = 0
    weaviate_client = weaviate.connect_to_local()
    internal_name = clean_field_name(database_name + collection_name)
    try:
        if not weaviate_client.collections.exists(internal_name):
            state["persistent_logs"].append("Creating the collection " + internal_name)
            logging.debug("Creating the collection " + internal_name)
            user_collection = weaviate_client.collections.create(
                name=internal_name,
                vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_huggingface(),
                properties=[
                    Property(name=property_field_name, data_type=get_weaviate_type(property_value_type), vectorize_property_name=False),
                    Property(name="date_added", data_type=get_weaviate_type(weaviate_datatypes.DateTime), vectorize_property_name=True),
                ],
                # Configure the vector index
                vector_index_config=wvc.config.Configure.VectorIndex.hnsw(  # Or `flat` or `dynamic`
                    distance_metric=wvc.config.VectorDistances.COSINE,
                    quantizer=wvc.config.Configure.VectorIndex.Quantizer.bq(),
                ),
                # Configure the inverted index
                inverted_index_config=wvc.config.Configure.inverted_index(
                    index_null_state=True,
                    index_property_length=True,
                    index_timestamps=True,
                )

            )
            initial_count = 0
        else:
            state["persistent_logs"].append("Collection already exists.")
            logging.debug("Collection already exists.")
            user_collection = weaviate_client.collections.get(internal_name)
            aggregation = user_collection.aggregate.over_all(total_count=True)
            initial_count = aggregation.total_count

        datetime_now = datetime.now()
        rfcc = datetime_now.strftime("%Y-%m-%dT%H:%M:%S+00:00")
        json_splitter = RecursiveJsonSplitter(max_chunk_size=2000)
        json_docs = json_splitter.split_json(json_object, True)

        if object_ids is None:
            doc_objs = list()
            for doc in json_docs:
                doc_objs.append(wvc.data.DataObject(
                    properties={property_field_name: property_field_value, "date_added": rfcc},
                    vector=doc
                ))
            state["persistent_logs"].append("Adding document to Weaviate")
            logging.debug("Adding document to Weaviate")
            object_ids = user_collection.data.insert_many(doc_objs)
        else:
            state["persistent_logs"].append("Updating document in Weaviate")
            logging.debug("Updating document in Weaviate")
            if len(object_ids) != len(json_docs):
                if len(object_ids) > len(json_docs):
                    while len(object_ids) > len(json_docs):
                        this_obj = object_ids.pop()
                        remove_by_id(database_name, collection_name, this_obj, state)
                else:
                    current_count = len(object_ids)
                    while len(object_ids) < len(json_docs):
                        object_ids.insert(current_count, uuid.uuid4())
                        current_count += 1
            count = 0
            while count < len(object_ids):
                user_collection.data.replace(uuid=object_ids[count], properties={property_field_name: property_field_value, "date_added": rfcc}, vector=json_docs[count])
                count += 1

        aggregation = user_collection.aggregate.over_all(total_count=True)
        final_count = aggregation.total_count
        diff_count = final_count - initial_count
        state["persistent_logs"].append("Started with " + str(initial_count) + " documents, now have " + str(final_count) + " documents")
        logging.debug("Started with " + str(initial_count) + " documents, now have " + str(final_count) + " documents")
    except Exception:
        trace_back = traceback.format_exc()
        logging.error("An unexpected error occurred attempting to add document to Weaviate collection: " + internal_name +
                      "\nHere is the document that failed: " + write_object_to_prompt(json_object) + " \nWith the error:\n " + trace_back)
        state["persistent_logs"].append(
            "An unexpected error occurred attempting to add document to Weaviate Collection: " + internal_name +
            "\nHere is the document that failed: " + write_object_to_prompt(json_object) + " \nWith the error:\n " + trace_back)
    finally:
        weaviate_client.close()
    return diff_count, object_ids, state

And the test call that fails:

state = {
    "errors": "",
    "persistent_logs": [],
}
        
test_topic = {
    "topic_id": str(uuid.uuid4()), 
    "topic_name": "some random topic", 
    "topic_summary": " a test summary",
    "conversations": [{
        "conversation": {
            "conversation_id": str(uuid.uuid4()),
            "converation_sender": "email@noreply.com",
            "conversation_text": "Some long boring conversation..."
        }
}]}
add_number, doc_ids, state = add_to_vectorstore("user-x", "topics", test_topic, state, "topic_id",
                                               weaviate_datatypes.Text, test_topic["topic_id"])

Hi @Ken_Tola !!

If you want to let Weaviate vectorize your object for you (I see you have defined text2vec_huggingface() for vectorizer) you shouldn’t provide the vector while inserting, here:

                doc_objs.append(wvc.data.DataObject(
                    properties={property_field_name: property_field_value, "date_added": rfcc},
                    # vector=doc <---- you are providing the vector here.
                ))

The error you are seeing is probably raised because you are passing something different from a vector as doc.

Now, if you want to provide the vectors yourself, you need to make sure that the vector parameter is a vector, something like [1, 2, 3, 4], so a list of floats that you get from the vectorizer yourself.

I believe you don’t want to “Bring your own vectors”, so you can comment out that part of the code. This will trigger Weaviate to vectorize your objects for you while ingesting :slight_smile:
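To make that concrete, here is a sketch of building the batch without any `vector` argument. Plain dicts stand in for `wvc.data.DataObject`, and `chunk_text` is a hypothetical property you would add to the collection schema to hold each chunk's text:

```python
# Sketch: with text2vec-huggingface configured on the collection, omit `vector`
# entirely and Weaviate embeds the text properties at ingest time.
def build_objects_for_vectorizer(chunks, field_name, field_value, date_added):
    return [
        {
            "properties": {
                field_name: field_value,
                "chunk_text": str(chunk),  # the text the vectorizer will embed
                "date_added": date_added,
            }
            # note: no "vector" key -- the vectorizer supplies the embedding
        }
        for chunk in chunks
    ]
```

In the real call you would wrap each entry in `wvc.data.DataObject(properties=...)` and pass the list to `insert_many`.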

Let us know if this helps!

Thanks!

Thank you so much for your help! I am close but I am now running into window size issues - doesn’t weaviate handle chunking or do I have to do that on my own? Here is the error:

This model's maximum context length is 8192 tokens, however you requested 18140 tokens (18140 in your prompt; 0 for the completion). Please reduce your prompt; or completion length.

Hi!

This message usually comes directly from the vectorizer service (there is probably something near this message in the logs that will point to the vectorizer).

It indicates that you are passing too much context to your generative integration.

Weaviate doesn’t handle chunking. It will store the chunks you generate, and take care of their vectorization.

You can reduce the limit parameter of your generative query or reduce the chunk size. Either way, you need to control the size of this context window you are passing for the generative phase.
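Since chunking happens on the client side, a naive character-based splitter sketches the idea (a stand-in for a tokenizer-aware splitter such as your RecursiveJsonSplitter; character counts only approximate tokens, so stay well under the model's 8192-token limit):

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list:
    """Split text into fixed-size chunks with some overlap between neighbors."""
    if overlap >= max_chars:
        raise ValueError("overlap must be smaller than max_chars")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Each chunk is then inserted as its own object, so no single vectorization request exceeds the model's context window.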

If you want to learn more on chunking, check here:

Let me know if this helps!

Thanks!