But I can't find this functionality in Weaviate's documentation despite a lot of searching.
Here's my full code for a local document querying app I'm building:
import os, weaviate
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import DirectoryLoader
from langchain.vectorstores.weaviate import Weaviate
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
# get environment variables
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
WEAVIATE_API_KEY = os.environ.get("WEAVIATE_API_KEY")
# load documents
doc_loader = DirectoryLoader(
    './Docs',  # the relative directory address
    glob='**/*.pdf',  # load all pdf files in every subdirectory
    show_progress=True
)
docs = doc_loader.load()
# split documents
splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=300
)
splitted_docs_list = splitter.split_documents(docs)
auth_config = weaviate.auth.AuthApiKey(api_key=os.environ.get('WEAVIATE_API_KEY'))
client = weaviate.Client(
url="https://weaviate-sandbox-cluster-xxxxxxx.weaviate.network",
auth_client_secret=auth_config,
additional_headers={
"X-OpenAI-Api-Key": os.environ.get('OPENAI_API_KEY')
}
)
# set index_name and vectorizer for the database
class_obj = {
"class": "LangChain",
"vectorizer": "text2vec-openai",
}
try:
    # Add the class to the schema
    client.schema.create_class(class_obj)
except Exception:
    # Assume the class is already present from a previous run
    print("Class already exists")
embeddings = OpenAIEmbeddings()
# I use 'LangChain' for index_name and 'text' for text_key
vectorstore = Weaviate(client, "LangChain", "text", embedding=embeddings)
# Add text chunks' embeddings to the Weaviate vector database
texts = [d.page_content for d in splitted_docs_list]
metadatas = [d.metadata for d in splitted_docs_list]
# vectorstore.add_texts(texts, metadatas=metadatas, embedding=embeddings)
vectorstore = Weaviate.from_texts(
    texts,
    embeddings,
    metadatas=metadatas,
    client=client,
)
# Query the vectorstore with the LLM
llm = ChatOpenAI()
retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=vectorstore.as_retriever(),
)
print(retrieval_qa.run(query))
Every time I run a query, all the docs are read and split into chunks, and the vector store is created again and again. So, is it possible to save the vector store to disk instead and reuse it for further querying?
I can't speak to the LangChain specifics, but whenever you run code to ingest objects into WCS, the created objects will persist unless you specifically delete them.
If you just run the query part of that code snippet, what happens?
If the vectorstore is persistent within the Weaviate cluster, could you please show a snippet of just how to access or use it for querying? When I go to https://console.weaviate.cloud/dashboard and try to open my cluster, it can't be opened and I can't see its contents. So I don't even know what names my vector stores are saved under internally.
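For reference, one way to see which classes the ingestion actually created is to ask the cluster for its schema with the v3 Python client. This is only a sketch reusing the same placeholder cluster URL and environment variables as the code above:
import os, weaviate

# Connect with the same credentials used during ingestion
auth_config = weaviate.auth.AuthApiKey(api_key=os.environ.get("WEAVIATE_API_KEY"))
client = weaviate.Client(
    url="https://weaviate-sandbox-cluster-xxxxxxx.weaviate.network",
    auth_client_secret=auth_config,
)

# List every class currently stored in the cluster;
# the index_name used by the LangChain wrapper shows up here as a class name
schema = client.schema.get()
print([c["class"] for c in schema.get("classes", [])])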
This can be done. Two points need attention:
1. When you create the "vector database", you should specify "index_name" and "text_key", which correspond to the Weaviate "class" and "property name". For example:
import weaviate
from weaviate.embedded import EmbeddedOptions
client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        'X-OpenAI-Api-Key': 'sk-xxx'
    }
)
class_obj = {
"class": "Article",
"properties": [
{
"name": "body",
"dataType": ["text"],
},
],
"vectorizer": "text2vec-openai"
}
client.schema.create_class(class_obj)
texts = [d.page_content for d in splitted_docs_list]
metadatas = [d.metadata for d in splitted_docs_list]
vectorstore = Weaviate.from_texts(
    texts,
    OpenAIEmbeddings(),
    metadatas=metadatas,
    client=client,
    by_text=False,
    index_name="Article",
    text_key="body",
)
2. The next time we want to reuse the "database", we only need to initialize "Weaviate" with the correct "index_name" and "text_key". For example:
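(A minimal sketch of that reuse step, assuming the same embedded client setup and the "Article"/"body" names from point 1; the query string is only a placeholder.)
import weaviate
from weaviate.embedded import EmbeddedOptions
from langchain.vectorstores.weaviate import Weaviate
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Reconnect to the same Weaviate instance; no loading, splitting,
# or re-ingestion of documents happens here
client = weaviate.Client(
    embedded_options=EmbeddedOptions(),
    additional_headers={
        'X-OpenAI-Api-Key': 'sk-xxx'
    }
)

# Point the wrapper at the existing class ("Article") and property ("body")
vectorstore = Weaviate(
    client,
    index_name="Article",
    text_key="body",
    embedding=OpenAIEmbeddings(),
    by_text=False,
)

retrieval_qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(),
    chain_type='stuff',
    retriever=vectorstore.as_retriever(),
)
print(retrieval_qa.run("What are these documents about?"))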
Fortunately, there is now a new LangChain integration:
Also, here is a nice recipe with that new integration that also shows how to do some other tricks and uses the new Weaviate Python client v4:
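A rough sketch of what that new integration can look like with the v4 client, in case it helps. This is untested here: it assumes the "langchain-weaviate" package, the placeholder cluster URL from earlier in the thread, and the "Article"/"body" names from above, and the connect helper name may differ slightly between client versions:
import os, weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain.embeddings import OpenAIEmbeddings

# v4-style connection to a Weaviate Cloud cluster (placeholder URL)
client = weaviate.connect_to_wcs(
    cluster_url="https://weaviate-sandbox-cluster-xxxxxxx.weaviate.network",
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)

# Re-attach to the existing class instead of re-ingesting documents
vectorstore = WeaviateVectorStore(
    client=client,
    index_name="Article",
    text_key="body",
    embedding=OpenAIEmbeddings(),
)
results = vectorstore.similarity_search("example question", k=4)
client.close()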
Hi @DudaNogueira, thanks again for supplying the v4 API info. However, because the "Weaviate" class in "langchain.vectorstores" only supports the v3 API, I have to use the v3 weaviate "Client" class. I've updated the Python packages "langchain" and "weaviate-client" to the latest versions: