Using DataType.TEXT_ARRAY as a datatype for a feature/column causes problems with LangChain's Document and retrieval systems

kumaran14 · January 12, 2025, 8:07pm

Description


import weaviate
import os
import weaviate.classes as wvc
import weaviate.classes.config as wc
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_ollama import ChatOllama
from langchain_weaviate.vectorstores import WeaviateVectorStore

weaviate_client = weaviate.connect_to_local()


weaviate_client.collections.create(
    name="DataStored",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT_ARRAY),
        
    ],
    # Define the vectorizer module (none, as we will add our own vectors)
    vectorizer_config=wc.Configure.Vectorizer.none(),
    generative_config=wc.Configure.Generative.mistral()
    
)

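# `embed` is an embeddings object defined elsewhere (not shown in this post)
db = WeaviateVectorStore(
    client=weaviate_client,
    index_name="DataStored",  # Your existing class name
    text_key="text",  # The field containing your text data
    embedding=embed
)

# Assumed: the retriever used in the chain below comes from the vector store
retriever = db.as_retriever()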


template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)


llm = ChatOllama(
    model="mistral:latest",
    temperature=0,
    num_predict=256,
    # other params...
)


rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is the use case")

So when I run this code I get an error stating:

ValidationError: 1 validation error for Document
page_content
  Input should be a valid string [type=string_type, input_value=['Changes in visionLow bl...it'], input_type=list]

When I change the property to DataType.TEXT, I don’t get any error and the RAG retrieval works perfectly fine. My question is: why are LangChain’s Document and retrieval systems designed to work with text and not text_array?

Mohamed_Shahin · January 14, 2025, 2:24pm

Hi @kumaran14,

It’s lovely to have you here—welcome to the community!

LangChain’s Document class is designed to handle individual text entries, with the page_content attribute specifically expecting a single string (str).

https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html

This design choice aligns with the framework’s focus on processing and analyzing individual documents or text segments efficiently.

When integrating with Weaviate, if your data is stored as a TEXT_ARRAY, it doesn’t match the input type that LangChain’s Document class expects. This mismatch leads to the validation error you are seeing: page_content receives a list where it expects a single string.
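
One possible workaround, as a rough sketch rather than anything built into either library, is to join the array items into a single string before the value reaches Document. The helper name to_page_content, the newline separator, and the sample values below are illustrative assumptions.

```python
from typing import Optional, Sequence, Union


def to_page_content(
    value: Optional[Union[str, Sequence[str]]], sep: str = "\n"
) -> str:
    """Collapse a TEXT or TEXT_ARRAY property value into one string."""
    if value is None:
        return ""
    if isinstance(value, str):
        # A plain TEXT value is already a valid page_content.
        return value
    # A TEXT_ARRAY value arrives as a list of strings; join it.
    return sep.join(str(item) for item in value)


# Hypothetical properties, shaped like an object with a TEXT_ARRAY "text" field.
props = {"text": ["first entry", "second entry"]}
page_content = to_page_content(props["text"])
print(page_content)
```

The joined string can then be used as page_content. Alternatively, the same join can be applied at import time so the value is stored in a DataType.TEXT property, which is the setup that already works with text_key.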

Regards,
Mohamed Shahin,
Weaviate Support Engineer

kumaran14 · January 26, 2025, 5:40pm

Thanks for the info @Mohamed_Shahin
