Using DataType.TEXT_ARRAY as a datatype for a feature/column causes problems with LangChain's Document and retrieval systems

Description


import weaviate
import weaviate.classes.config as wc
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_ollama import ChatOllama
from langchain_weaviate.vectorstores import WeaviateVectorStore

weaviate_client = weaviate.connect_to_local()


weaviate_client.collections.create(
    name="DataStored",
    properties=[
        wc.Property(name="text", data_type=wc.DataType.TEXT_ARRAY),
    ],
    # Define the vectorizer module (none, as we will add our own vectors)
    vectorizer_config=wc.Configure.Vectorizer.none(),
    generative_config=wc.Configure.Generative.mistral(),
)

# Note: `embed` is the embeddings object; its definition is not shown in this post
db = WeaviateVectorStore(
    client=weaviate_client,
    index_name="DataStored",  # Your existing class name
    text_key="text",  # The field containing your text data
    embedding=embed,
)


template = """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)


llm = ChatOllama(
    model="mistral:latest",
    temperature=0,
    num_predict=256,
    # other params...
)


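# `retriever` was not defined in the original post; assuming the vector store's default retriever
retriever = db.as_retriever()

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}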
    | prompt
    | llm
    | StrOutputParser()
)

rag_chain.invoke("What is the use case")

When I run this code, I get the following error:

ValidationError: 1 validation error for Document
page_content
  Input should be a valid string [type=string_type, input_value=['Changes in visionLow bl...it'], input_type=list]

When I change the column/feature to DataType.TEXT, I don’t get any error and the RAG retrieval works perfectly fine. My question is: why are LangChain’s Document and retrieval systems designed to work with text and not text_array?

Hi @kumaran14,

It’s lovely to have you here—welcome to the community!

LangChain’s Document class is designed to handle individual text entries, with the page_content attribute specifically expecting a single string (str).

https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html

This design choice aligns with the framework’s focus on processing and analyzing individual documents or text segments efficiently.

When integrating with Weaviate, if your data is stored as a TEXT_ARRAY, it doesn’t directly match the expected input type for LangChain’s Document class. This mismatch leads to validation errors, because page_content must be a single string and the stored value arrives as a list.
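
If you need to keep the list-shaped data, one possible workaround is to normalize the value before it reaches LangChain, either at import time (storing the result in a TEXT property) or in your own post-processing. The sketch below is only an illustration under that assumption: the helper names `to_page_content` and `split_into_records` are hypothetical and are not part of Weaviate or LangChain, and it uses only the standard library.

```python
from typing import Dict, Iterable, List, Union

TextValue = Union[str, Iterable[str]]


def to_page_content(value: TextValue, separator: str = "\n") -> str:
    """Collapse a TEXT or TEXT_ARRAY style value into one string."""
    if isinstance(value, str):
        return value
    # Skip empty entries so the separator is not doubled.
    return separator.join(str(item) for item in value if item)


def split_into_records(value: TextValue) -> List[Dict[str, str]]:
    """Alternative: produce one {"text": ...} record per array element."""
    items = [value] if isinstance(value, str) else list(value)
    return [{"text": str(item)} for item in items if item]


# Hypothetical TEXT_ARRAY value, standing in for the stored property
stored = ["first chunk", "second chunk"]
single_string = to_page_content(stored)      # one string, usable as page_content
per_element = split_into_records(stored)     # one record per element
```

Joining keeps one object per source row, while splitting gives each element its own object (and its own vector). Which one fits depends on how you want retrieval to behave.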

Regards,
Mohamed Shahin,
Weaviate Support Engineer

Thanks for the info @Mohamed_Shahin
