How to access/search data ingested through Weaviate client in langchain / langchain-weaviate?

Description

I have written some code that ingests data into Weaviate from a Postgres database or a pandas DataFrame. This code relies mostly on methods in weaviate-client (4.6.1). Below are the basics of how it works (this is not a comprehensive working example):

import logging
from typing import List, Optional, Union

import pandas as pd
from sqlalchemy import create_engine
from weaviate import WeaviateClient as Client
from weaviate.classes import config as wc


class BatchStream:
    '''
    a class that allows us to batch-stream
    data from a postgres db into weaviate.
    '''
    type_mapping = {
        'int': wc.DataType.INT,
        'str': wc.DataType.TEXT,
        'float': wc.DataType.NUMBER,
        'bool': wc.DataType.BOOL,
        'datetime': wc.DataType.DATE
    }
    def __init__(
            self, 
            client: Client,
            vectorizer_config: wc.Configure.Vectorizer,
            generative_config: wc.Configure.Generative,
            postgres_string: str = 'test',
            weaviate_batch_size: int = 100,
            n_workers: int = 4,
        ) -> None:
        self.client = client
        self.vectorizer_config = vectorizer_config
        self.generative_config = generative_config
        self.postgres_string = postgres_string
        self.weaviate_batch_size = weaviate_batch_size
        self.n_workers = n_workers

        logging.info("connecting to postgres db...")
        self.engine = create_engine(self.postgres_string)
        logging.info("connected to database")


    def stream_insert_many(
            self,
            collection: str,
            create_collection: bool = True,
            query: str | None = None,
            df: pd.DataFrame | None = None,
            properties: Optional[Union[List[wc.Property], str]] = 'infer',
    ) -> None:
        '''
        method that streams data retrieved from a
        postgres query, or a df, into weaviate.
        '''
        # compare against None explicitly: `if df and query` would raise,
        # because the truth value of a DataFrame is ambiguous
        if df is not None and query is not None:
            raise ValueError("Supply either a dataframe or a sql query, not both.")
        
        if df is None and query is None:
            raise ValueError("Supply either a dataframe or a sql query.")
        
        if properties == 'infer':
            properties = self._infer_weaviate_properties(df, query)

        if create_collection:
            coll = self.client.collections.create(
                name=collection,
                properties=properties,

                # vectorizer
                vectorizer_config=self.vectorizer_config,
                # generative module
                generative_config=self.generative_config
            )
        else:
            coll = self.client.collections.get(collection)

        insert_obj = []
        if df is None and query is not None:
            for chunk in self._sql_query_to_chunked_df(query):
                insert_obj.extend(chunk.to_dict(orient='records'))
            
        else:
            insert_obj = df.to_dict(orient='records')
        
        coll.data.insert_many(insert_obj)
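
For completeness, this is roughly how I invoke it (the vectorizer/generative configs, connection string, and table/collection names below are placeholder values, not my real setup):

# rough usage sketch; configs, connection string and names are placeholders
streamer = BatchStream(
    client=client,
    vectorizer_config=wc.Configure.Vectorizer.text2vec_openai(),
    generative_config=wc.Configure.Generative.openai(),
    postgres_string="postgresql://user:password@localhost:5432/mydb",
)
streamer.stream_insert_many(
    collection="my_collection",
    query="SELECT id, my_target_key FROM my_table",
)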

Now, I can easily access this data through weaviate-client, e.g. count the objects or run a near-text similarity search on them.
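
For reference, this is roughly how I verify the ingested data through weaviate-client (a minimal sketch; the collection name matches the index_name I pass to LangChain below):

# minimal sketch of how I check the ingested data with weaviate-client
collection = client.collections.get("my_collection")

# count the objects in the collection
print(collection.aggregate.over_all(total_count=True).total_count)

# run a near-text search directly against Weaviate
response = collection.query.near_text(query="test", limit=3)
for obj in response.objects:
    print(obj.properties)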

For the purpose of building a RAG application, I want to query it via LangChain. LangChain seems pretty straightforward and is well integrated with Weaviate via the langchain-weaviate package for Python. So, following various tutorials, e.g. 1, 2 or 3, I am able to init all the required objects using langchain-weaviate:

from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_core.messages import HumanMessage
from langchain_openai import OpenAIEmbeddings, ChatOpenAI, OpenAI
from langchain.chains import ChatVectorDBChain

client = . . .

client.is_ready()
>>> True

embeddings = OpenAIEmbeddings()

vectorstore = WeaviateVectorStore(
    embedding=embeddings,
    client=client,
    index_name="my_collection",
    text_key="my_target_key"
)

Now, I want to query my collection via the methods on langchain_weaviate.vectorstores.WeaviateVectorStore. I attempt it like so:

vectorstore.similarity_search('test')

However, this returns an empty list [].

If instead I run collection.query.near_text() from the weaviate client, I get relevant documents returned.

Hence the question:

How can I perform queries on a weaviate collection via langchain if data was ingested using the weaviate client library?

Server Setup Information

  • Weaviate Server Version: 1.24.13
  • Deployment Method: Weaviate Cloud
  • Multi Node? Number of Running Nodes:
  • Client Language and Version: Python 3.11.3, weaviate-client 4.6.1
  • Multitenancy?: No

Hi @nik !!

I believe this is an interesting recipe to add here:

We are looking for contributors, by the way :wink:

Here is some code I believe can help you:

PS: I used this to install the required libs:

!pip3 install -U weaviate-client langchain-weaviate langchain-openai
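
You also need a connected v4 client before running the snippet below. A minimal sketch, assuming a local Weaviate instance and an OPENAI_API_KEY in the environment (needed by the text2vec-openai vectorizer); on Weaviate Cloud you would use the cloud connection helper instead:

import os
import weaviate

# assumes a local instance and an OPENAI_API_KEY in the environment
client = weaviate.connect_to_local(
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
)
assert client.is_ready()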

from weaviate.classes import config

# let's first create our collection and import data

client.collections.delete("MyCollection")
collection = client.collections.create(
    "MyCollection",
    vectorizer_config=config.Configure.Vectorizer.text2vec_openai(),
    properties=[
        config.Property(name="text", data_type=config.DataType.TEXT),
        config.Property(name="source", data_type=config.DataType.TEXT)
    ]
)

collection.data.insert({"text": "something about cats", "source": "document1"})
collection.data.insert({"text": "something about tiger", "source": "document1"})
collection.data.insert({"text": "something about lion", "source": "document1"})

collection.data.insert({"text": "something about dogs", "source": "document2"})
collection.data.insert({"text": "something about wolf", "source": "document2"})
collection.data.insert({"text": "something about coyotes", "source": "document2"})

Now this is how you would search using Weaviate directly:

collection = client.collections.get("MyCollection")
response = collection.query.near_text(query="pet animals")
for obj in response.objects:
    print(obj.properties)

this will output something like:

{'text': 'something about dogs', 'source': 'document2'}
{'text': 'something about cats', 'source': 'document1'}
{'text': 'something about coyotes', 'source': 'document2'}
{'text': 'something about lion', 'source': 'document1'}
{'text': 'something about tiger', 'source': 'document1'}
{'text': 'something about wolf', 'source': 'document2'}

Now, in order to search using this same data with LangChain, you can:

from langchain_openai import OpenAIEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore

embeddings = OpenAIEmbeddings()
db = WeaviateVectorStore.from_documents([], embeddings, client=client, index_name="MyCollection")

# perform a query
docs = db.similarity_search("pet animals")
print(docs)

And this would be the output:

[Document(page_content='something about dogs', metadata={'source': 'document2'}), Document(page_content='something about cats', metadata={'source': 'document1'}), Document(page_content='something about coyotes', metadata={'source': 'document2'}), Document(page_content='something about lion', metadata={'source': 'document1'})]
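
Since the goal is a RAG application, you can also wrap this vector store as a retriever. Here is a minimal sketch (the prompt wiring is just one simple way to do it, and the ChatOpenAI model name is an assumption):

from langchain_openai import ChatOpenAI

# expose the vector store as a retriever that returns the 3 closest documents
retriever = db.as_retriever(search_kwargs={"k": 3})

question = "Which of these animals are pets?"
context = "\n".join(doc.page_content for doc in retriever.invoke(question))

# stuff the retrieved context into a prompt and ask the LLM
llm = ChatOpenAI(model="gpt-4o-mini")  # model name is an assumption
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(llm.invoke(prompt).content)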

Let me know if this helps :slight_smile:

2 Likes

@DudaNogueira , once again thank you very much for helping me out. I’d be happy to share a recipe on filling Weaviate with Postgres data once I’ve got my solution working properly :smiley: !

Out of interest, one more question regarding the behaviour of collection.query.near_text('animal', limit=3):

  • When I run this (in an interactive session) for the first time, it works really well and returns the closest documents.
  • But when I modify the query string to e.g. 'football', i.e. something both semantically and orthographically completely different, it returns exactly the same documents.
  • The only way I can get the 'correct' documents for my new query is by restarting my interactive session.

Is there a workaround for this? Is this the intended behaviour?

Thanks!

hi @nik !!

What is this interactive session you mentioned? Not sure I follow it :thinking:

Note that a similarity search will always return objects. They may not be close / related to your query; in those cases the distance will be high, but the objects will still be there.

So, for example, if you only have ten objects, it will always return those ten, ordered from closest to farthest from your query.

If you could reproduce this interactively with the given example, I would then be able to understand what is happening :slight_smile:

Let me know if this helps :slight_smile:

hi @DudaNogueira

Thanks for getting back to me. What I meant by 'interactive' is that you're running code in e.g. IPython, the regular Python REPL, or Jupyter; that is, you're typing things and getting results back as you go along, rather than having a script or a server/app running.

In such a session, if you run a search like response = collection.query.near_text('animal', limit=3), print the results as you normally would, and then run e.g. response = collection.query.near_text('football', limit=3) and print the objects in the new response, it will show the same objects as the 'animal' search (which are likely quite close to what you're looking for with the first query, but not the second).
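
Concretely, a minimal sketch of what I'm running in that session (reusing the MyCollection example from above):

collection = client.collections.get("MyCollection")

response = collection.query.near_text(query="animal", limit=3)
print([obj.properties["text"] for obj in response.objects])

response = collection.query.near_text(query="football", limit=3)
# this prints the same objects as the 'animal' query until I restart the session
print([obj.properties["text"] for obj in response.objects])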

I wonder whether this behaviour is expected or a bug, and whether there is any way to circumvent it. Thanks!

Oh, I see.

I believe this will depend on the objects you have and the query.

Small queries (like a single word) may not carry enough meaning to change the ordering in a limited dataset (assuming you don’t have a lot of objects).

Can you also print the distance for that query?

something like:

import weaviate.classes as wvc

response = collection.query.near_text('animal', limit=3, return_metadata=wvc.query.MetadataQuery(distance=True))
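
and then print the distance for each returned object, something like:

for obj in response.objects:
    # distance is populated because we requested it via MetadataQuery(distance=True)
    print(obj.properties, obj.metadata.distance)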

Thanks!

Hello @DudaNogueira !

I would like to add to this question: what if the collection was divided into tenants using the weaviate client?

Currently, if I run code such as this:

from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain_community.embeddings import OllamaEmbeddings

import weaviate

client = weaviate.connect_to_local() 

embeddings = OllamaEmbeddings(model = 'mxbai-embed-large')
db = WeaviateVectorStore.from_documents(
    [], 
    embeddings, 
    client=client, 
    index_name="Books"
)

I get the following error:

  File "/home/.../lib/python3.10/site-packages/langchain_weaviate/vectorstores.py", line 537, in _tenant_context
    raise ValueError("Must use tenant context when multi-tenancy is enabled")   
ValueError: Must use tenant context when multi-tenancy is enabled

How can I enable multi-tenancy in this case?

hi @omarsinno !!

Welcome to our community!

We do have multi-tenancy support in Weaviate. Check the docs here:

For instance, the code example:

db_with_mt = WeaviateVectorStore.from_documents(
    docs, embeddings, client=weaviate_client, tenant="Foo"
)
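
And when searching, I believe you pass the tenant to the search call as well; a quick sketch (assuming the search methods accept a tenant keyword argument, as in the langchain-weaviate multi-tenancy docs):

docs = db_with_mt.similarity_search("pet animals", tenant="Foo")
print(docs)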

Let me know if that helps!

Thanks!