Need help combining weaviate with langchain

I have a use case where users will have many documents. No user should be able to access any other user's documents. Each user can also select which of their files to access.
I am using the weaviate-python client and langchain (RetrievalQAWithSourcesChain).

First I tried creating a single class “Data” with the properties “content” and “source”, so that the user can filter the data using the “source” property. But this method has a problem: even after filtering, the user is able to access other users' files.

Then I tried another method: a class for each user, and inside the user class a “data” field that links to the “Data” class.
Below is the schema.

    "classes": [
        {
            "class": username,
            "description": f"Class for user {username}",
            "properties": [
                {
                    "name": "username",
                    "description": "Username of the user",
                    "dataType": ["text"]
                },
                {
                    "name": "data",
                    "description": "Data associated with the user",
                    "dataType": ["Data"]
                }
            ]
        },
        {
            "class": "Data",
            "description": "Documents/data in the system",
            "vectorizer": "text2vec-openai",
            "moduleConfig": {"text2vec-openai": {"model": "ada", "type": "text"}},
            "properties": [
                {
                    "name": "content",
                    "description": "The content of the paragraph",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": False,
                            "vectorizePropertyName": False,
                        }
                    },
                },
                {
                    "name": "source",
                    "description": "The link to the document",
                    "dataType": ["text"]
                }
            ],
        }
    ]

I am using the code below to create a vectorstore.

vectorstore = Weaviate(client, user, "data{ ... on Data { source content }}", attributes=['data { ... on Data { source } }'], embedding=embed)

  1. I am getting the below error:
KeyError: 'data{ ... on Data { source content }}'
  2. How can I retrieve specific data using the “source” from the user class? Is filtering a good approach?

Can anyone help me with this? Thanks in advance.

Hi @ananthan-123 !

Welcome to our community :hugs:

This is a great use case for the new multi-tenancy feature in Weaviate. Each user becomes a tenant of the same class, with each tenant's data isolated from the others.

However, multi-tenancy is not yet supported in LangChain.
I have started a PR and Issue here for that:

With that said, filtering is a possible solution.

Creating one class per user is not the best approach. It results in a separate vector space for each user, which makes it hard to scale.
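
A minimal sketch of the filtering approach, assuming every object in the single “Data” class also carries a hypothetical `owner` text property (not part of the schema in the question) that stores the username. Filtering on `source` alone does not isolate users, so the filter below requires both the owner and one of the selected sources. The helper name `build_user_filter` is illustrative only.

```python
from typing import Dict, List


def build_user_filter(owner: str, sources: List[str]) -> Dict:
    """Build a Weaviate `where` filter limited to one owner and chosen sources.

    `owner` is a hypothetical text property on the Data class; it is an
    assumption of this sketch, not something the original schema defines.
    """
    if not sources:
        raise ValueError("at least one source must be selected")

    owner_clause = {"path": ["owner"], "operator": "Equal", "valueText": owner}
    source_clauses = [
        {"path": ["source"], "operator": "Equal", "valueText": s} for s in sources
    ]
    # A single selected source needs no "Or" wrapper.
    source_clause = (
        source_clauses[0]
        if len(source_clauses) == 1
        else {"operator": "Or", "operands": source_clauses}
    )
    # Both conditions must hold: right owner AND one of the chosen sources.
    return {"operator": "And", "operands": [owner_clause, source_clause]}
```

The resulting dictionary can be passed as `where_filter`. The owner value should come from the authenticated session on the server side, not from anything the client sends, otherwise the isolation can be bypassed.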

This is how you would get relevant documents with filtering in the current LangChain integration, extracted from here:

from weaviate import Client
from langchain.docstore.document import Document
from langchain.retrievers.weaviate_hybrid_search import WeaviateHybridSearchRetriever

texts = ["foo", "bar", "baz"]
metadatas = [{"page": i} for i in range(len(texts))]

client = Client("http://localhost:8080")

retriever = WeaviateHybridSearchRetriever(
    client=client,
    index_name="TestLangRetriever",
    text_key="text",
    attributes=["page"],
)
for i, text in enumerate(texts):
    retriever.add_documents(
        [Document(page_content=text, metadata=metadatas[i])]
    )
where_filter = {"path": ["page"], "operator": "Equal", "valueNumber": 0}

output = retriever.get_relevant_documents("foo", where_filter=where_filter)
print(output)

Output: [Document(page_content='foo', metadata={'page': 0})]

Let me know if this helps :slight_smile:

Hi! I want to filter my Weaviate instance, but I am using a ConversationalRetrievalChain defined as follows:

rag_chain = ConversationalRetrievalChain.from_llm(
        llm = llm,
        memory = memory,
        retriever = vectorstore.as_retriever(k=k_top_chunks),
        verbose = True,
        combine_docs_chain_kwargs={'prompt': prompt_template},
        get_chat_history = lambda h : h
    )

How can I integrate the following line in my chain?

retriever.get_relevant_documents("foo", where_filter=where_filter)

Also, is there any way to filter my vectorstore using the simple dense search retriever from LangChain instead of WeaviateHybridSearchRetriever?

Hi!

I believe that this will help you:

Also note that a new LangChain integration is under development here:

Let me know if that helps!

Thanks!

Hi!
Thanks for the fast response. That helps a lot, but I want to pass the filter parameters when I run the chain, not when I instantiate it. Is there any way to do that?
Thanks in advance!

In that example I did it like this:

# Let's answer some question
#source_file = "brazil-wikipedia-article-text.pdf"
source_file = "netherlands-wikipedia-article-text.pdf"
where_filter = {
    "operator": "Equal",
    "path": ["source"],
    "valueText": source_file
}

# we want our retriever to filter the results
retriever = db.as_retriever(search_kwargs={"where_filter": where_filter})

qa = RetrievalQA.from_chain_type(llm=Cohere(model="command-nightly", temperature=0), 
                                 chain_type="stuff", 
                                 retriever=retriever, 
                                 chain_type_kwargs=chain_type_kwargs, 
                                 return_source_documents=True)
                                 
answer = qa({"query": "What is the traditional food of this country?"})
print(answer)

You mean you would like to filter while running the qa?

Exactly, I mean passing the filter while running the qa. Something like this:

answer = qa(
{"query": "What is the traditional food of this country?"}, 
search_kwargs={"where_filter": where_filter}
)

Is there a way to do this?
In my application the user should be able to make queries while selecting different properties to filter on, so it would be convenient not to re-initialize the qa chain every time the user changes the filter properties.

I don’t believe this is possible.

Does it add a lot of overhead?

AFAIK this should all be lazy, meaning it will only run the queries at the last moment.
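
As a hedged sketch of that idea: since constructing a retriever only stores its arguments and does not query Weaviate, one option is to build a fresh retriever for each request with that request's filter. The helper name `retriever_for_request` is hypothetical, and `db` is assumed to be a LangChain vector store whose `as_retriever` accepts `search_kwargs`, as in the example above.

```python
from typing import Any, Dict


def retriever_for_request(db: Any, where_filter: Dict, k: int = 4) -> Any:
    """Return a retriever bound to this request's filter.

    Assumption: `db.as_retriever(search_kwargs=...)` forwards these kwargs to
    the similarity search at query time, so nothing is fetched here.
    """
    return db.as_retriever(search_kwargs={"where_filter": where_filter, "k": k})
```

The returned retriever can then be handed to `RetrievalQA.from_chain_type` for each request, while the expensive objects (the Weaviate client, the vector store and the LLM) are created once and reused.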

You can try this

from weaviate.classes.query import Filter

search_filter = Filter.by_property("userid").equal(userid)
retriever = db.as_retriever(search_kwargs={"filters": search_filter})
llm = ChatOpenAI(model="gpt-4-turbo-preview", temperature=0)

qa_chain = RetrievalQA.from_chain_type(llm, retriever=retriever)
res = qa_chain.invoke({"query": query})