Chroma DB has this functionality (persisting the vector store to disk and reloading it later): Chroma | 🦜️🔗 Langchain
But I can't find equivalent functionality in Weaviate's documentation despite a lot of searching.
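For reference, this is roughly the Chroma pattern I mean (a sketch using the same splitted_docs_list and embeddings as in my code below; the './chroma_db' path is just an example):

from langchain.vectorstores import Chroma

# first run: embed the chunks and write them to disk
chroma_store = Chroma.from_documents(
    splitted_docs_list,
    embeddings,
    persist_directory='./chroma_db'  # example path
)
chroma_store.persist()

# later runs: reload the persisted store without re-embedding anything
chroma_store = Chroma(
    persist_directory='./chroma_db',
    embedding_function=embeddings
)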
Here’s my full code for a local document querying app I’m building:
import os, weaviate
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.vectorstores.weaviate import Weaviate
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# get environment variables
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")  # .get() takes parentheses, not square brackets
WEAVIATE_API_KEY = os.environ.get("WEAVIATE_API_KEY")
# load documents
doc_loader = DirectoryLoader(
    './Docs',  # the relative directory address
    glob='**/*.pdf',  # load all PDF files in every subdirectory
    show_progress=True
)
docs = doc_loader.load()
# split documents
splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=300
)
splitted_docs_list = splitter.split_documents(docs)
auth_config = weaviate.auth.AuthApiKey(api_key=WEAVIATE_API_KEY)
client = weaviate.Client(
    url="https://weaviate-sandbox-cluster-xxxxxxx.weaviate.network",
    auth_client_secret=auth_config,
    additional_headers={
        "X-OpenAI-Api-Key": OPENAI_API_KEY
    }
)
# set index_name and vectorizer for the database
class_obj = {
    "class": "LangChain",
    "vectorizer": "text2vec-openai",
}
try:
    # add the class to the schema
    client.schema.create_class(class_obj)
except weaviate.exceptions.UnexpectedStatusCodeException:
    # the class already exists in the cluster
    print("Class already exists")
embeddings = OpenAIEmbeddings()
# I use 'LangChain' for index_name and 'text' for text_key
vectorstore = Weaviate(client, "LangChain", "text", embedding=embeddings)
# Add text chunks' embeddings to the Weaviate vector database
texts = [d.page_content for d in splitted_docs_list]
metadatas = [d.metadata for d in splitted_docs_list]
# vectorstore.add_texts(texts, metadatas=metadatas, embedding=embeddings)
vectorstore = Weaviate.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    client=client,
    index_name="LangChain",  # without this, from_texts creates a new randomly named class on every run
    text_key="text",
)
# Query the vectorstore with the LLM
llm = ChatOpenAI()
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vectorstore.as_retriever(),
)
query = "What are these documents about?"  # example question
print(retrieval_qa.run(query))
Every time I run a query, all the docs are read, split into chunks, and the vector store is rebuilt from scratch. So, is it possible to save the vector store on disk instead and reuse it for further querying?
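Ideally, on subsequent runs I'd like to skip the load/split/embed steps entirely and just reconnect to the already-populated store, something like this sketch (assuming the embeddings actually persist in the Weaviate cluster between runs):

# reconnect to the existing "LangChain" class instead of re-ingesting
client = weaviate.Client(
    url="https://weaviate-sandbox-cluster-xxxxxxx.weaviate.network",
    auth_client_secret=auth_config,
    additional_headers={"X-OpenAI-Api-Key": OPENAI_API_KEY}
)
vectorstore = Weaviate(client, "LangChain", "text", embedding=embeddings)
retrieval_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type='stuff',
    retriever=vectorstore.as_retriever(),
)
print(retrieval_qa.run(query))

Is that the right approach with Weaviate, or is there a dedicated save/load API I'm missing?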