What is text_key supposed to be when using LangChain?

Hello, Weaviate noob here. I created a vector store using LangChain, and it works fine. It also persists to the cloud after creation. But suppose at some later date I want to get the vector store back without rebuilding it? I believe the line would be:

vectorstore = Weaviate(client=client, index_name="SamsungS23_v2", text_key=???)

My question is what should the “text_key” value be? I created the client very simply like this:

class_obj = {
    "class": "SamsungS23_v2",
}

and created the vector store like this:

vectorstore = Weaviate.from_documents(pages, hf_embeddings, client=client, by_text=False)

How can I “recover” the vector store in another run without rebuilding it? It always barks if I omit “text_key”, but I don’t know what to put for it.

THANKS SO MUCH for any LangChain expert here who may have the solution!

Hi @Scott_M ! Welcome to our community :hugs:

I am far from being a LangChain expert. On top of that, I am also a Weaviate noob myself, as I recently joined Weaviate, hehehe.

What I could find is that, while you need to specify text_key when instantiating the main class, it has no effect when you pass it to from_documents(). There, it is hardcoded to text.

text_key is the property where the text will be stored.

I am not sure if hardcoding text in from_texts() is a good thing, because it ties every from_documents() and from_texts() import to that property, leaving no other option while importing content.

So now it depends on how you imported your data: if you used from_documents, your text_key will be text.

So here is something that has worked for me:

from langchain.vectorstores import Weaviate
import weaviate
# considering you have docs, embeddings, dependencies, etc
WEAVIATE_URL = "http://localhost:8080"
db = Weaviate.from_documents(docs, embeddings, weaviate_url=WEAVIATE_URL, by_text=False, index_name="MyIndex")
# now, you can:
client = weaviate.Client(WEAVIATE_URL)
db = Weaviate(client=client, index_name="MyIndex", text_key="text")
db.similarity_search_by_text(query="health")
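
If you are not sure which property an existing class uses for its text, one option is to look at the class schema and list its text properties. The snippet below is only a sketch: it assumes the v3 Python client, where client.schema.get("MyIndex") returns a dict with a "properties" list, and the helper name text_properties and the sample schema are made up for illustration.

```python
from typing import Dict, List


def text_properties(class_schema: Dict) -> List[str]:
    """Return the names of properties whose dataType is text or string."""
    names = []
    for prop in class_schema.get("properties", []):
        data_types = prop.get("dataType", [])
        if any(t in ("text", "string") for t in data_types):
            names.append(prop["name"])
    return names


# Hand-written stand-in for what client.schema.get("MyIndex") might return
# (assumption: v3 client; the property names here are illustrative only).
sample_schema = {
    "class": "MyIndex",
    "properties": [
        {"name": "text", "dataType": ["text"]},
        {"name": "source", "dataType": ["text"]},
        {"name": "page", "dataType": ["number"]},
    ],
}

candidates = text_properties(sample_schema)
print(candidates)
```

Whichever of the listed properties holds the main document text is the one to pass as text_key.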

Let me know if that helps :slight_smile:

Thanks!

With the latest version of LangChain, text_key is the name of the field you want to search.

If you have created vectors like this:

question_objs = list()
for i, d in enumerate(data):
    question_objs.append(wvc.DataObject(
        properties={
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        },
        vector=d["vector"]
    ))

questions = client.collections.get("my_class")
questions.data.insert_many(question_objs) 

then you can use 'answer' as your text_key:

vectorstore = Weaviate(client, "my_class", "answer", embedding=embeddings)
retriever = vectorstore.as_retriever(search_type="mmr")

Thanks, it works, even with a BGE Hugging Face embedding, but I see that LangChain creates two classes, i.e. "MyIndex" and "LangChain". However, it does NOT work for the "WeaviateHybridSearchRetriever" if you define:
retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="MyIndex",
    text_key="text",
    attributes=[],
    embedding=BGEembedding,
    create_schema_if_missing=True
)
Then response = retriever.get_relevant_documents(query="some question?") gives this error:
ValueError: Error during query: [{'locations': [{'column': 6, 'line': 1}], 'message': 'get vector input from modules provider: VectorFromInput was called without vectorizer', 'path': ['Get', 'MyIndex']}]

By default, the "LangChain" class uses the OpenAI ada embedding.

Thanks in advance

Hello. If I have two parameters that need to be passed to text_key, how should I handle it? The actual problem I am facing is that I want to return both content and source. Thanks.

Hi @Bob1 ! Welcome to our community!

text_key only takes one value. It is the property where the main text is stored in Weaviate.

Why exactly would you need two text_key values?

The source can be filled in automatically, depending on how you ingested the content using the LangChain text splitter.
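
To illustrate the idea: as far as I understand, the property named by text_key becomes the document content, and any other properties you ask for (for example via the attributes argument) come back as metadata. Here is a plain-Python sketch of that split. It is not LangChain's actual code; the helper to_content_and_metadata and the sample values are made up.

```python
from typing import Dict, Tuple


def to_content_and_metadata(result: Dict, text_key: str) -> Tuple[str, Dict]:
    """Split one returned object into its main text and the remaining properties."""
    props = dict(result)           # copy, so the caller's dict is left untouched
    content = props.pop(text_key)  # the single text_key property is the content
    return content, props          # everything else is treated as metadata


# Illustrative object with a main text property and a source property.
sample = {"text": "The S23 has a 50 MP camera.", "source": "s23_manual.pdf"}

content, metadata = to_content_and_metadata(sample, text_key="text")
print(content)
print(metadata["source"])
```

So one text_key is enough: the source travels alongside the content as metadata.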

Let me know if this helps.

Thanks!