Need help combining weaviate with langchain

ananthan-123 · September 10, 2023, 10:15am

I have a use case where each user has many documents. No user may access any other user's documents. Each user can also select which of their files to search.
I am using the weaviate-python client and langchain (RetrievalQAWithSourcesChain).

First I tried creating a single class “Data” with the properties “content” and “source”, so that the user can filter the data by the “source” property. But this method has a problem: even after filtering, a user is able to access other users' files.

Then I tried another method: a class for each user, and inside the user class a “data” field that links to the “Data” class.
Below is the schema.

    "classes": [
        {
            "class": username,
            "description": f"Class for user {username}",
            "properties": [
                {
                    "name": "username",
                    "description": "Username of the user",
                    "dataType": ["text"]
                },
                {
                    "name": "data",
                    "description": "Data associated with the user",
                    "dataType": ["Data"]
                }
            ]
        },
        {
            "class": "Data",
            "description": "Documents/data in the system",
            "vectorizer": "text2vec-openai",
            "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
            "properties": [
                {
                    "name": "content",
                    "description": "The content of the paragraph",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": False,
                            "vectorizePropertyName": False,
                        }
                    },
                },
                {
                    "name": "source",
                    "description": "The link to the document",
                    "dataType": ["text"]
                }
            ],
        }
    ]

I am using the code below to create a vectorstore.

vectorstore = Weaviate(client, user, "data{ ... on Data { source content }}", attributes=["data { ... on Data { source } }"], embedding=embed)

I am getting the error below:

KeyError: 'data{ ... on Data { source content }}'

How can I retrieve specific data from the user class using the “source” property? Is filtering a good approach?

Can anyone help me with this? Thanks in advance.

DudaNogueira · September 11, 2023, 6:08pm

Hi @ananthan-123 !

Welcome to our community

This is a great use case for the new multi-tenancy feature in Weaviate. Each user becomes a tenant of the class, and each tenant's data is isolated from the others.
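As a rough sketch of the schema side (assuming Weaviate 1.20 or later; the helper name `build_multi_tenant_data_class` is made up for illustration), multi-tenancy is switched on per class through `multiTenancyConfig`. The property names are taken from the `Data` class in the question:

```python
from typing import Any, Dict


def build_multi_tenant_data_class() -> Dict[str, Any]:
    """Return a class definition for "Data" with multi-tenancy enabled.

    Hypothetical helper for illustration only; the property names come
    from the schema posted in the question.
    """
    return {
        "class": "Data",
        "description": "Documents/data in the system",
        # With this enabled, each tenant (here: each user) is stored separately.
        "multiTenancyConfig": {"enabled": True},
        "properties": [
            {"name": "content", "dataType": ["text"]},
            {"name": "source", "dataType": ["text"]},
        ],
    }


data_class = build_multi_tenant_data_class()
print(data_class["multiTenancyConfig"])
```

With the Python client, a dict like this would be passed to `client.schema.create_class(...)`. Tenants are then added to the class, one per user, and every read and write has to name the tenant it belongs to.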

However, multi-tenancy is not yet supported in LangChain.
I have opened a PR and an issue for that here:

github.com/langchain-ai/langchain

Multi Tenant Support for Weaviate

opened 08:56PM - 29 Aug 23 UTC

dudanogueira

area: vector store auto:enhancement auto:improvement

### Feature request
Weaviate introduced multi-tenancy support in version 1.20 …https://weaviate.io/blog/multi-tenancy-vector-search

### Motivation
This can help users using Langchain + Weaviate at scale, ingesting documents and attaching tenants to it.

### Your contribution
I have implemented, but would need some help to check if everything is ok and in accordance with LangChain. Also, I would like help on the as_retriver, as I was not able to implement multitenant on it, Yet. the code is living here: https://github.com/dudanogueira/langchain/tree/weaviate-multitenant

With that said, filtering is a possible solution.

Creating one class per user is not the best approach. It results in a separate vector space for each user, which makes it hard to scale.

This is how you would get relevant documents with filtering in the current LangChain, extracted from here:

github.com

langchain-ai/langchain/blob/8b5662473f4c7daeef1ad7dbbb95b758acbfcd43/libs/langchain/tests/integration_tests/retrievers/test_weaviate_hybrid_search.py#L113C9-L113C81

    where_filter = {"path": ["page"], "operator": "Equal", "valueNumber": 0}

from weaviate import Client
from langchain.docstore.document import Document
from langchain.retrievers.weaviate_hybrid_search import WeaviateHybridSearchRetriever

texts = ["foo", "bar", "baz"]
metadatas = [{"page": i} for i in range(len(texts))]

client = Client("http://localhost:8080")

retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="TestLangRetriever",
    text_key="text",
    attributes=["page"],
)
for i, text in enumerate(texts):
    retriever.add_documents(
        [Document(page_content=text, metadata=metadatas[i])]
    )
where_filter = {"path": ["page"], "operator": "Equal", "valueNumber": 0}

output = retriever.get_relevant_documents("foo", where_filter=where_filter)
print(output)

Output: [Document(page_content='foo', metadata={'page': 0})]
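
For the original use case, the filter has to pin both the user and the files that user selected. A minimal sketch, assuming each object also stores an `owner` text property (this is not in the schema from the question; it is an assumption for this example) and using a hypothetical helper `build_user_filter`:

```python
from typing import Any, Dict, List


def build_user_filter(owner: str, sources: List[str]) -> Dict[str, Any]:
    """Build a Weaviate where filter: owner == X AND source is one of `sources`.

    `owner` is an assumed property; `source` is from the original schema.
    """
    if not sources:
        raise ValueError("at least one source is required")
    owner_clause = {"path": ["owner"], "operator": "Equal", "valueText": owner}
    source_clauses = [
        {"path": ["source"], "operator": "Equal", "valueText": s} for s in sources
    ]
    # A single source needs no Or wrapper.
    if len(source_clauses) == 1:
        source_clause = source_clauses[0]
    else:
        source_clause = {"operator": "Or", "operands": source_clauses}
    return {"operator": "And", "operands": [owner_clause, source_clause]}


where_filter = build_user_filter("alice", ["a.pdf", "b.pdf"])
print(where_filter)
```

The resulting dict can be passed as `where_filter` in the same way as in the retriever example. The `owner` value should come from the authenticated session on the server side and not from anything the user types, otherwise one user can still request another user's files.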

Let me know if this helps.

Lucia_Urcelay · January 17, 2024, 3:29pm

Hi! I want to filter my Weaviate instance, but I am using a ConversationalRetrievalChain defined as follows:

rag_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    memory=memory,
    retriever=vectorstore.as_retriever(k=k_top_chunks),
    verbose=True,
    combine_docs_chain_kwargs={'prompt': prompt_template},
    get_chat_history=lambda h: h,
)

How can I integrate the following line in my chain?

retriever.get_relevant_documents("foo", where_filter=where_filter)

Also, is there any way to filter my vectorstore using the simple dense search retriever from LangChain instead of WeaviateHybridSearchRetriever?

DudaNogueira · January 17, 2024, 3:35pm

Hi!

I believe that this will help you:

github.com

weaviate/recipes/blob/main/integrations/langchain/loading-data/langchain-simple-pdf.ipynb

Notebook title: Multilanguage RAG filtering by multiple PDFs with Langchain and Cohere (the preview shows weaviate-client 3.25.3 and langchain 0.0.335 installed).

Also note that a new langchain integration is under development here:

Let me know if that helps!

Thanks!

Lucia_Urcelay · January 17, 2024, 3:49pm

Hi!
Thanks for the fast response. That helps a lot, but I wanted to pass the filter parameters when I run the chain, not when I instantiate it. Is there any way to do that?
Thanks in advance!

DudaNogueira · January 17, 2024, 6:18pm

In that example I did it like this:

from langchain.chains import RetrievalQA
from langchain.llms import Cohere

# Let's answer some questions
# source_file = "brazil-wikipedia-article-text.pdf"
source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = {
    "operator": "Equal",
    "path": ["source"],
    "valueText": source_file,
}

# We want our retriever to filter the results.
# db and chain_type_kwargs are defined earlier in the notebook.
retriever = db.as_retriever(search_kwargs={"where_filter": where_filter})

qa = RetrievalQA.from_chain_type(
    llm=Cohere(model="command-nightly", temperature=0),
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs=chain_type_kwargs,
    return_source_documents=True,
)

answer = qa({"query": "What is the traditional food of this country?"})
print(answer)

Do you mean you would like to pass the filter while running the qa?

Lucia_Urcelay · January 19, 2024, 11:14am

Exactly, I mean passing the filter while running the qa. Something like this:

answer = qa(
{"query": "What is the traditional food of this country?"}, 
search_kwargs={"where_filter": where_filter}
)

Is there a way to do this?
In my application the user should be able to make queries while selecting different properties to filter on. It would be convenient if I didn't have to re-initialize the qa chain every time the user changes the filter properties.

DudaNogueira · January 22, 2024, 9:42pm

I don’t believe this is possible.

Does it add a lot of overhead?

AFAIK this should all be lazy, meaning it will only run the queries at the last moment.
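
If re-creating the chain by hand is the concern, one option is to wrap the rebuild in a small function so the calling code only passes the query and the filter. This is only a sketch: `make_filtered_runner` and `build_chain` are hypothetical names, `build_chain` would be something like `lambda r: RetrievalQA.from_chain_type(llm=llm, retriever=r)`, and `db` is the vector store from the example above.

```python
from typing import Any, Callable, Dict


def make_filtered_runner(
    db: Any, build_chain: Callable[[Any], Callable[[Dict[str, Any]], Any]]
) -> Callable[[str, Dict[str, Any]], Any]:
    """Return run(query, where_filter) that rebuilds retriever and chain per call."""

    def run(query: str, where_filter: Dict[str, Any]) -> Any:
        # Same call as in the notebook example: the filter goes into search_kwargs.
        retriever = db.as_retriever(search_kwargs={"where_filter": where_filter})
        chain = build_chain(retriever)
        return chain({"query": query})

    return run
```

The vector store and the LLM are created once and reused; only the lightweight retriever and chain objects are rebuilt for each request.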

hari0205 · April 5, 2024, 1:34pm

You can try this

from weaviate.classes.query import Filter  # weaviate-client v4

# db, userid and query are assumed to be defined elsewhere
search_filter = Filter.by_property("userid").equal(userid)
retriever = db.as_retriever(search_kwargs={"filters": search_filter})
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
res = qa_chain.invoke({"query": query})
