Description
I am building a RAG pipeline using LlamaIndex and Weaviate as the vector database.
LlamaIndex has a built-in function to refresh documents and update the affected document chunks, so that there is always only one unique entry per document at a time.
(Reference: Document Management - LlamaIndex)
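For context, the refresh flow I mean looks roughly like this (a simplified sketch; the ./data folder is a placeholder):

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

# filename_as_id gives each document a stable doc id based on its file path,
# which refresh_ref_docs uses to detect new or updated documents.
documents = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
index = VectorStoreIndex.from_documents(documents)

# After editing test.txt, re-load and refresh; only changed docs are re-indexed.
updated_docs = SimpleDirectoryReader("./data", filename_as_id=True).load_data()
index.refresh_ref_docs(updated_docs)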
When I integrated Weaviate, this LlamaIndex functionality no longer works, failing with the following error:
NotImplementedError: Vector store integrations that store text in the vector store are not supported by ref_doc_info yet.
I am facing an issue where, say, I create an index with 1 document named “test.txt” with the contents “My favourite fruit is apple.”, and then update this document to “My favourite fruit is orange.”.
This creates 2 entries for the exact same file “test.txt”, which is not the behaviour I want. It should update the database based on the file_path property of the document with the new contents, instead of creating a duplicate copy.
Does anyone know of a solution to this?
Server Setup Information
- Weaviate Server Version: 1.24.0
- Deployment Method: Docker
- Multi Node? Number of Running Nodes: Single Node (node 1)
- Running with LlamaIndex version 0.10.13.post1
Hi @SoftwearEnginear !
Welcome to our community!
On the Weaviate side, if you insert an object and specify an existing UUID, it will update the object instead of creating a new one.
Here we describe how this can be done directly in Weaviate:
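For illustration, here is a minimal sketch with the v4 Python client (collection and property names are just examples); to my knowledge, batch inserts that reuse an existing UUID overwrite the stored object:

import weaviate
from weaviate.util import generate_uuid5

client = weaviate.connect_to_local()
collection = client.collections.get("SoftwearEnginear")

# A deterministic UUID derived from the file path: the same path always
# maps to the same object, so re-imports update instead of duplicate.
uuid = generate_uuid5("test.txt")

with collection.batch.dynamic() as batch:
    batch.add_object(properties={"content": "My favourite fruit is apple."}, uuid=uuid)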
As you are using LlamaIndex, I was checking its code (I have not used LlamaIndex much):
AFAIK, you will need to make sure that each of your nodes/chunks has its node.node_id defined.
I was able to accomplish this:
import weaviate
from weaviate import classes as wvc

client = weaviate.connect_to_local()

client.collections.delete("SoftwearEnginear")
collection = client.collections.create(
    "SoftwearEnginear",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
)
from llama_index.core import VectorStoreIndex
from llama_index.core.schema import TextNode
from weaviate.util import generate_uuid5

# Define the first node; generate_uuid5 derives a deterministic UUID from the file path
node1 = TextNode(text="My favourite fruit is apple.", id_=generate_uuid5("test.txt"))
from llama_index.vector_stores.weaviate import WeaviateVectorStore
from llama_index.core import StorageContext

vector_store = WeaviateVectorStore(weaviate_client=client, index_name="SoftwearEnginear", text_key="content")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# We initiate our index
index = VectorStoreIndex.from_documents([], storage_context=storage_context)

# Now insert our first node:
index.insert_nodes([node1])
print(collection.aggregate.over_all(total_count=True).total_count)
print(collection.query.fetch_objects().objects[0].properties)
# 1
# {'_node_type': 'TextNode', 'content': 'My favourite fruit is apple.', ...
# Now we define different content, but the same UUID:
node2 = TextNode(text="My favourite fruit is orange.", id_=generate_uuid5("test.txt"))

# Insert it
index.insert_nodes([node2])

# We still have one object, now updated:
print(collection.aggregate.over_all(total_count=True).total_count)
print(collection.query.fetch_objects().objects[0].properties)
# 1
# {'_node_type': 'TextNode', 'content': 'My favourite fruit is orange.', ...
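If your files are split into multiple chunks, one possible extension (a sketch, assuming you chunk each file deterministically; the file_path#index key is my own convention) is to derive one stable UUID per chunk:

from llama_index.core.schema import TextNode
from weaviate.util import generate_uuid5

def nodes_for_file(file_path: str, chunks: list[str]) -> list[TextNode]:
    # The same file path and chunk position always yield the same UUID,
    # so re-ingesting a file overwrites its chunks instead of duplicating them.
    return [
        TextNode(text=chunk, id_=generate_uuid5(f"{file_path}#{i}"))
        for i, chunk in enumerate(chunks)
    ]

index.insert_nodes(nodes_for_file("test.txt", ["My favourite fruit is orange."]))

Note that if a file shrinks to fewer chunks, the leftover chunks from the previous version stay behind and would need to be deleted separately.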
Let me know if this helps
Thank you so much for the solution! I have to manage over 10,000 files, but I will try this approach first.
As I am currently using a local LLM and embedding model saved from Hugging Face and integrated into the LlamaIndex pipeline, I would like to check whether I can ignore the vectorizer_config and generative_config.
I am assuming that vectorizer_config functions as the vector generator and generative_config as the LLM response generator.
collection = client.collections.create(
    "SoftwearEnginear",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    generative_config=wvc.config.Configure.Generative.openai(),
)
Alternatively, I am open to exploring configs that do not rely on any external APIs, as I am running in an air-gapped environment.
Sure, you can skip that part of the collection configuration.
You can optionally use the Hugging Face module if you wish to define your local vectorizer.
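For example, here is a minimal sketch of an air-gapped setup (the llama-index-embeddings-huggingface package and the BAAI/bge-small-en-v1.5 model name are my assumptions; any locally cached embedding model works the same way):

import weaviate
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.vector_stores.weaviate import WeaviateVectorStore

client = weaviate.connect_to_local()

# No vectorizer_config / generative_config: Weaviate just stores the
# vectors that LlamaIndex computes locally.
collection = client.collections.create("SoftwearEnginear")

# Local embedding model; LlamaIndex vectorizes documents and queries itself.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

vector_store = WeaviateVectorStore(
    weaviate_client=client, index_name="SoftwearEnginear", text_key="content"
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents([], storage_context=storage_context)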
The downside of not defining vectorizer_config is that you will not be able to easily run a near_text query directly against Weaviate, as near_text relies on that configuration to vectorize the query.
Also, not having generative_config means you will not be able to generate content directly from Weaviate.
Of course, you can always vectorize your query yourself, perform a vector search, grab the results, and send them to an LLM yourself. That's what LlamaIndex is doing for you under the hood.
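For example, a rough sketch of that manual flow directly against Weaviate (again, the model name is an assumption; it must match the model that produced the stored vectors):

import weaviate
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

client = weaviate.connect_to_local()
collection = client.collections.get("SoftwearEnginear")

# Vectorize the query with the same local model used at ingest time.
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
query_vector = embed_model.get_query_embedding("What is my favourite fruit?")

# Plain vector search; works without any vectorizer module configured.
results = collection.query.near_vector(near_vector=query_vector, limit=3)
for obj in results.objects:
    print(obj.properties["content"])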
Let me know if this helps