What is text_key supposed to be when using LangChain?

Hello, Weaviate noob here. I created a vectorstore using LangChain that works fine, and it persists to the cloud after creation. But suppose at some later date I want to get that vectorstore back without rebuilding it? I believe the line would be:

vectorstore = Weaviate(client=client, index_name="SamsungS23_v2", text_key=???)

My question is: what should the "text_key" value be? I created the class very simply like this:

class_obj = {
    "class": "SamsungS23_v2",
}
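(and, if I remember right, registered it with the v3 client's schema API; the exact call below is from memory:)

# create the class in the Weaviate schema (v3 Python client)
client.schema.create_class(class_obj)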

and created the vectorstore like this:

vectorstore = Weaviate.from_documents(pages, hf_embeddings, client=client, by_text=False)
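For reference, in my case pages and hf_embeddings were built roughly like this (the PDF path and model name below are just placeholders):

from langchain.document_loaders import PyPDFLoader
from langchain.embeddings import HuggingFaceEmbeddings

# load a PDF and split it into per-page Documents (placeholder file name)
pages = PyPDFLoader("samsung_s23_manual.pdf").load_and_split()

# any sentence-transformers model works here (placeholder model name)
hf_embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")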

How can I "recover" the vectorstore in another run without rebuilding it? It always complains if I omit "text_key", but I don't know what to put there.

THANKS SO MUCH for any LangChain expert here who may have the solution!

Hi @Scott_M ! Welcome to our community :hugs:

I am far from being a LangChain expert. On top of that, I am also a Weaviate noob myself, as I recently joined Weaviate, hehehe.

What I could find is that, while you do need to specify text_key when instantiating the main Weaviate class, it has no effect when passed to from_documents(): there it is hardcoded to "text". text_key is the property where the text will be stored.

I am not sure hardcoding "text" in from_texts() is a good thing, because it ties every from_documents() and from_texts() import to that property, leaving no other option when importing content.

So it will depend on how you imported your data: if it was with from_documents(), your text_key will be "text".

So here is something that has worked for me:

from langchain.vectorstores import Weaviate
import weaviate

# assuming you already have docs, embeddings, dependencies, etc.
WEAVIATE_URL = "http://localhost:8080"

# first run: build the index and import the documents
db = Weaviate.from_documents(docs, embeddings, weaviate_url=WEAVIATE_URL, by_text=False, index_name="MyIndex")

# on a later run, you can reconnect to the existing index without re-importing:
client = weaviate.Client(WEAVIATE_URL)
db = Weaviate(client=client, index_name="MyIndex", text_key="text")
db.similarity_search_by_text(query="health")
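If you just want to sanity-check what comes back, a plain similarity_search works too. A small sketch (the query and k are arbitrary; each result is a LangChain Document whose page_content comes from the "text" property):

# run a vector search and inspect the returned Documents
docs = db.similarity_search("health", k=3)
for d in docs:
    print(d.page_content[:100])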

Let me know if that helps :slight_smile:

Thanks!

With the latest version of LangChain, text_key is the field name you want to search on.

If you have created your vectors like this:

import weaviate.classes as wvc  # helper classes for the v4 Python client

# build one DataObject per record, attaching its precomputed vector
question_objs = list()
for i, d in enumerate(data):
    question_objs.append(wvc.DataObject(
        properties={
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        },
        vector=d["vector"]
    ))

# batch-insert the objects into the collection
questions = client.collections.get("my_class")
questions.data.insert_many(question_objs)

then you can use "answer" as your text_key:

vectorstore = Weaviate(client, "my_class", "answer", embedding=embeddings)
retriever = vectorstore.as_retriever(search_type="mmr")
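and then query it, for example (the query string here is just a placeholder):

# MMR retrieval over objects whose text lives in the "answer" property
docs = retriever.get_relevant_documents("biology")
for d in docs:
    print(d.page_content)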

Thanks, it works, even with a BGE HuggingFace embedding, but I see that LangChain creates two classes, i.e. "MyIndex" and "LangChain". However, it does NOT work for the WeaviateHybridSearchRetriever if you define:
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="MyIndex",
    text_key="text",
    attributes=[],
    embedding=BGEembedding,
    create_schema_if_missing=True
)
and response = retriever.get_relevant_documents(query="some question?") gives an error:
ValueError: Error during query: [{'locations': [{'column': 6, 'line': 1}], 'message': 'get vector input from modules provider: VectorFromInput was called without vectorizer', 'path': ['Get', 'MyIndex']}]

By default, the "LangChain" class uses the OpenAI ada embedding.

Thanks in advance