Description
I am building a RAG pipeline using LlamaIndex and Weaviate as the vector database.
LlamaIndex has a built-in function to refresh documents and update the affected document chunks, so that there is always only one unique entry per document at a time.
(Reference: Document Management - LlamaIndex)
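For context, the refresh flow I mean looks roughly like this (a simplified sketch; the ./data folder is a placeholder):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# filename_as_id gives each document a stable doc id based on its file path,
# which refresh_ref_docs uses to detect new or updated documents.
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
index = VectorStoreIndex.from_documents(documents)

# After editing test.txt, re-load and refresh; only changed docs are re-indexed.
updated_docs = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
index.refresh_ref_docs(updated_docs)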
When I integrated Weaviate, this LlamaIndex functionality no longer works, failing with the following error:
NotImplementedError: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet.
I am facing an issue where, say, I create an index with 1 document named “test.txt” with the contents “My favourite fruit is apple.”, and then update this document to “My favourite fruit is orange.”.
This creates 2 entries for the exact same file “test.txt”, which is not the behaviour I want. It should update the database based on the file_path property of the document with the new contents, instead of creating a duplicate copy.
Does anyone know of a solution to this?
Server Setup Information
- Weaviate Server Version: 1.24.0
- Deployment Method: Docker
- Multi Node? Number of Running Nodes: Single Node (node 1)
- Running with LlamaIndex version 0.10.13.post1
Hi @SoftwearEnginear !
Welcome to our community!
On the Weaviate side, if you insert an object and specify an existing UUID, it will update the object instead of creating a new one.
Here we describe how this can be done directly in Weaviate:
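For illustration, here is a minimal sketch with the v4 Python client (collection and property names are just examples); to my knowledge, batch inserts that reuse an existing UUID overwrite the stored object:

import weaviate
from weaviate.util import generate_uuid5

client = weaviate.connect_to_local()
collection = client.collections.get("SoftwearEnginear")

# A deterministic UUID derived from the file path: the same path always
# maps to the same object, so re-imports update instead of duplicate.
uuid = generate_uuid5("test.txt")

with collection.batch.dynamic() as batch:
    batch.add_object(properties={"content": "My favourite fruit is apple."}, uuid=uuid)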
As you are using LlamaIndex, I was checking its code (I have not used LlamaIndex much):
AFAIK, you will need to make sure that each of your nodes/chunks has its node.node_id defined.
I was able to accomplish this:
import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

client.collections.delete("SoftwearEnginear")
collection = client.collections.create(
    "SoftwearEnginear",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
)
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from weaviate.util import generate_uuid5

# Define the first node; generate_uuid5 derives a deterministic UUID from the file path
node1 = TextNode(text="My favourite fruit is apple.", id_=generate_uuid5("test.txt"))
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import StorageContext

vector_store = WeaviateVectorStore(weaviate_client=client, index_name="SoftwearEnginear", text_key="content")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# We initiate our index
index = VectorStoreIndex.from_documents([], storage_context=storage_context)

# Now insert our first node:
index.insert_nodes([node1])
print(collection.aggregate.over_all(total_count=True).total_count)
print(collection.query.fetch_objects().objects[0].properties)
# 1
# {'_node_type': 'TextNode', 'content': 'My favourite fruit is apple.', ...
# Now we define different content, but the same UUID:
node2 = TextNode(text="My favourite fruit is orange.", id_=generate_uuid5("test.txt"))

# Insert it
index.insert_nodes([node2])

# We still have one object, now updated:
print(collection.aggregate.over_all(total_count=True).total_count)
print(collection.query.fetch_objects().objects[0].properties)
# 1
# {'_node_type': 'TextNode', 'content': 'My favourite fruit is orange.', ...
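If your files are split into multiple chunks, one possible extension (a sketch, assuming you chunk each file deterministically; the file_path#index key is my own convention) is to derive one stable UUID per chunk:

from llama_index.core.schema import TextNode
from weaviate.util import generate_uuid5

def nodes_for_file(file_path: str, chunks: list[str]) -> list[TextNode]:
    # The same file path and chunk position always yield the same UUID,
    # so re-ingesting a file overwrites its chunks instead of duplicating them.
    return [
        TextNode(text=chunk, id_=generate_uuid5(f"{file_path}#{i}"))
        for i, chunk in enumerate(chunks)
    ]

index.insert_nodes(nodes_for_file("test.txt", ["My favourite fruit is orange."]))

Note that if a file shrinks to fewer chunks, the leftover chunks from the previous version stay behind and would need to be deleted separately.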
Let me know if this helps
Thank you so much for the solution! I have to manage over 10,000 files, but I will try this approach first.
As I am currently using a local LLM and embedding model saved from Hugging Face and integrated into the LlamaIndex pipeline, I would like to check whether I can ignore the vectorizer_config and generative_config.
I am assuming that vectorizer_config functions as the vector generator and generative_config as the LLM response generator.
collection = client.collections.create(
    "SoftwearEnginear",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
)
Alternatively, I am open to exploring configs that do not rely on any external APIs, as I am running in an air-gapped environment.
Sure, you can skip that part of the collection configuration.
You can optionally use the Hugging Face module if you wish to define your local vectorizer.
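For example, here is a minimal sketch of an air-gapped setup (the llama-index-embeddings-huggingface package and the BAAI/bge-small-en-v1.5 model name are my assumptions; any locally cached embedding model works the same way):

import weaviate
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()

# No vectorizer_config / generative_config: Weaviate just stores the
# vectors that LlamaIndex computes locally.
collection = client.collections.create("SoftwearEnginear")

# Local embedding model; LlamaIndex vectorizes documents and queries itself.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="SoftwearEnginear", text_key="content"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents([], storage_context=storage_context)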
The downside of not defining vectorizer_config is that you will not be able to easily run a near_text query directly against Weaviate, as near_text relies on that configuration to vectorize the query.
Also, not having generative_config means you will not be able to generate content directly from Weaviate.
Of course, you can always vectorize your query yourself, perform a vector search, grab the results, and send them to an LLM yourself. That's what LlamaIndex is doing for you under the hood.
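For example, a rough sketch of that manual flow directly against Weaviate (again, the model name is an assumption; it must match the model that produced the stored vectors):

import weaviate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

client = weaviate.connect_to_local()
collection = client.collections.get("SoftwearEnginear")

# Vectorize the query with the same local model used at ingest time.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
query_vector = embed_model.get_query_embedding("What is my favourite fruit?")

# Plain vector search; works without any vectorizer module configured.
results = collection.query.near_vector(near_vector=query_vector, limit=3)
for obj in results.objects:
    print(obj.properties["content"])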
Let me know if this helps