Langchain WeaviateHybridSearchRetriever with filters?

I am currently building a Q&A interface with Streamlit and LangChain. Our initial vector database was in Pinecone. We have documents about the same topic but for different industries. Pure embedding search is not optimal, as it matches the same concepts across industries. So we built a simple selector where users pick their industry and then ask their question. In Pinecone each industry had its own namespace, and we simply filtered on that:

vectorstore = PineconeVectorStore(index_name=index_name, embedding=embeddings, namespace=namespace)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 3})

Hybrid search with Pinecone is not as convenient as with Weaviate, and since we noticed better performance with hybrid search, we are switching to Weaviate. The downside is that it is not clear how to apply filters with the Weaviate retriever.

retriever = WeaviateHybridSearchRetriever(
        client=client,
        index_name=WEAVIATE_INDEX_NAME,
        text_key="page_content",
        k=5,
        alpha=0.75,
        attributes=["file_name", "industry"],
        create_schema_if_missing=False,
    )

Our LangChain chain looks similar to this one (langchain/templates/hybrid-search-weaviate/hybrid_search_weaviate/chain.py at master · langchain-ai/langchain · GitHub):

# RAG prompt
template = """Answer the question based only on the following context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)

# RAG
model = ChatOpenAI()
chain = (
    RunnableParallel({"context": retriever, "question": RunnablePassthrough()})
    | prompt
    | model
    | StrOutputParser()
)

The docs do show this:

retriever.invoke(
    "AI integration in society",
    where_filter={
        "path": ["author"],
        "operator": "Equal",
        "valueString": "Prof. Jonathan K. Sterling",
    },
)

Does anyone know how/where to add the where_filter parameter for Weaviate hybrid search in the Chain?
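One possible way to do this, sketched under assumptions rather than taken from the docs: wrap the retriever in a small function that always forwards where_filter to retriever.invoke, as in the docs snippet above, and use that function in place of the retriever in the chain. LCEL coerces plain callables into runnables, so it can sit in the "context" slot of RunnableParallel. The helper names industry_filter and filtered_retriever and the value "healthcare" are hypothetical; the filter dict mirrors the shape of the docs example.

```python
from typing import Any, Callable, Dict, List


def industry_filter(industry: str) -> Dict[str, Any]:
    """Build a where filter dict in the same shape as the docs example."""
    return {
        "path": ["industry"],
        "operator": "Equal",
        "valueString": industry,
    }


def filtered_retriever(
    retriever: Any, where_filter: Dict[str, Any]
) -> Callable[[str], List[Any]]:
    """Return a callable that passes where_filter on every retriever call."""

    def _retrieve(question: str) -> List[Any]:
        # Assumes retriever.invoke accepts where_filter as a keyword
        # argument, as shown in the docs snippet.
        return retriever.invoke(question, where_filter=where_filter)

    return _retrieve


# Hypothetical usage inside the chain from the question:
# context = filtered_retriever(retriever, industry_filter("healthcare"))
# chain = (
#     RunnableParallel({"context": context, "question": RunnablePassthrough()})
#     | prompt
#     | model
#     | StrOutputParser()
# )
```

Because the filter is bound when the callable is created, the chain would need to be rebuilt (or the callable recreated) whenever the user picks a different industry.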

hi @Just_Guide7361 !!

Welcome to our community :hugs:

Sorry, your topic was stuck on some anti spam check :frowning:

We have a recipe that you will probably benefit from:

For instance, this is how you can use langchain and filters:

from weaviate import classes as wvc
# change below to get chunks per different files / countries
source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = wvc.query.Filter.by_property("source").equal(source_file)
docs = db.similarity_search("traditional food", filters=where_filter)
print(docs)

Let me know if this helps!

Thanks!

Hi!

I have just updated that langchain recipe as it had some deprecations.

Here is the part you are interested in:

import os

from langchain.chains import RetrievalQA
from langchain_openai import OpenAI
from weaviate import classes as wvc

# Let's answer some question
#source_file = "brazil-wikipedia-article-text.pdf"
source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = wvc.query.Filter.by_property("source").equal(source_file)

# we want our retriever to filter the results
retriever = db.as_retriever(search_kwargs={"filters": where_filter})

qa = RetrievalQA.from_chain_type(llm=OpenAI(openai_api_key=os.environ.get("OPENAI_API_KEY")),
                                 chain_type="stuff", 
                                 retriever=retriever, 
                                 chain_type_kwargs=chain_type_kwargs, 
                                 return_source_documents=True)
                                 
answer = qa({"query": "What is the traditional food of this country?"})
print(answer)

While this example only uses one operand filter, you can easily add more logic.

For example multiple operands:

And nested filters:

Hope this helps!

Thanks!

@DudaNogueira thank you for the quick reply. However, using RetrievalQA is not ideal, as it is deprecated in newer versions (langchain.chains.retrieval_qa.base.RetrievalQA — 🦜🔗 LangChain 0.2.12).

They recommend using create_retrieval_chain (langchain.chains.retrieval.create_retrieval_chain — 🦜🔗 LangChain 0.2.12), which follows the LCEL principles.

Are there any plans to update the recipe/examples with this?

Hi @Just_Guide7361 !!

Thanks for pointing it out!!

I will take the opportunity to also write a recipe using the multi-tenancy feature with LangChain.

Here is working code using create_retrieval_chain (I will update the recipe later today):

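# ...
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore
from weaviate.classes.query import Filter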

# client = weaviate.connect_to_weaviate_cloud(...)

embeddings = OpenAIEmbeddings()
db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name="WikipediaLangChain")

source_file = "brazil-wikipedia-article-text.pdf"
#source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = Filter.by_property("source").equal(source_file)

# we want our retriever to filter the results
retriever = db.as_retriever(search_kwargs={"filters": where_filter})

system_prompt = (
    "You are an assistant for question-answering tasks. "
    "Use the following pieces of retrieved context to answer "
    "the question. If you don't know the answer, say that you "
    "don't know. Use three sentences maximum and keep the "
    "answer concise."
    "\n\n"
    "{context}"
)

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

llm = ChatOpenAI(model="gpt-4o-mini")
question_answer_chain = create_stuff_documents_chain(llm, prompt)
rag_chain = create_retrieval_chain(retriever, question_answer_chain)

response = rag_chain.invoke({"input": "What is the traditional food of this country?"})
print(response["answer"])

By the way, we host a lot of online and in-person webinars and workshops. Check them out: Online Workshops & Events | Weaviate - Vector Database

Thanks and hope you are enjoying your “Weaviate journey”!!

Never mind, fixed it.

Hey, thank you for the quick replies. I have tried your example, but sadly it does not work. When initialising the db, I get a “list index out of range” error.

Here is my code:

from langchain_cohere import CohereEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore
import weaviate
from weaviate.classes.init import Auth
from weaviate.classes.query import Filter

embeddings = CohereEmbeddings(model=EMBEDDINGS_MODEL, cohere_api_key=COHERE_API_KEY)

headers = {
    "X-Cohere-Api-Key": COHERE_API_KEY,
}

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=WEAVIATE_URL,  
    auth_credentials=Auth.api_key(WEAVIATE_API_KEY), 
    headers=headers,
)

db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name=index_name)

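The line above should become:

# Connect to the existing collection instead of calling from_documents with an empty list.
# Assumption: text_key="text" matches the default that from_documents used when inserting.
db = WeaviateVectorStore(client=client, index_name=index_name, text_key="text", embedding=embeddings)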

where_filter = Filter.by_property(property_to_filter).equal(selected_property_by_user)
retriever = db.as_retriever(search_kwargs={"filters": where_filter, "alpha": 0.8})
retrieved_files = retriever.invoke(user_query)

I’ve inserted my documents as follows:

embeddings = CohereEmbeddings(
    model=EMBEDDINGS_MODEL,
    cohere_api_key=COHERE_API_KEY,
)

db = WeaviateVectorStore.from_documents(langchain_document, embeddings, client=client, index_name=index_name)




Using the Weaviate client I am able to retrieve documents. When I initialise the db with the langchain_document I am also able to retrieve them, but when I initialise it with an empty list it does not work.

Ideally, of course, I would not have to pass the langchain_document to the db each time I want to use the Weaviate db.

Can you point out where I am going wrong?

Thanks!

Hi @Just_Guide7361 !

I understand you were able to make it work, right?

Let me know if there is any other blocker we can help you with.

We are here to help you on this journey :slight_smile:

Thanks!

Hi again @Just_Guide7361 !!

I believe this thread is related to the issue you had:

Thanks for sharing the solution! :heart: