How to load an existing db for similarity search?

weaviate-client==4.7.1
langchain-weaviate==0.0.2
langchain==0.2.11

I am able to write a simple example that creates a ‘db’ and uses that db to do inference in one flow:

import weaviate
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_weaviate.vectorstores import WeaviateVectorStore

from bge import bge_m3_embedding

print('Read in text ...')
loader = TextLoader('state_of_the_union.txt')
documents = loader.load()

print('Split text ...')
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

print('Load embedding model ...')
embedding_model = bge_m3_embedding

print('Embed docs ...')
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents(docs, embedding_model, client=weaviate_client, index_name='test')

print('Perform search ...')
query = 'What did the president say about Ketanji Brown Jackson'
results = db.similarity_search_with_score(query, alpha=1)
for i, (doc, score) in enumerate(results):
    print(f'{i}--->{score:.3f}')
print(results[0])

weaviate_client.close()

This all works fine. The db is created and similar docs are retrieved. However, if I now want to use this existing ‘db’ to run the same query, I get an out-of-index error:

import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore

from bge import bge_m3_embedding

print('Load embedding model ...')
embedding_model = bge_m3_embedding

print('Load embedded docs ...')
weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore.from_documents([], embedding_model, client=weaviate_client, index_name='test')

print('Perform search ...')
query = 'What did the president say about Ketanji Brown Jackson'
results = db.similarity_search_with_score(query, alpha=1)
for i, (doc, score) in enumerate(results):
    print(f'{i}--->{score:.3f}')
print(results[0])

And the error message is below:

Traceback (most recent call last):
  File "/Users/I747411/ai/lc_weaviate.py", line 22, in <module>
    db = WeaviateVectorStore.from_documents([], embedding_model, client=weaviate_client, index_name='test')
  File "/Users/I747411/ai/venv/lib/python3.10/site-packages/langchain_core/vectorstores/base.py", line 1058, in from_documents
    return cls.from_texts(texts, embedding, metadatas=metadatas, **kwargs)
  File "/Users/I747411/ai/venv/lib/python3.10/site-packages/langchain_weaviate/vectorstores.py", line 487, in from_texts
    weaviate_vector_store.add_texts(texts, metadatas, tenant=tenant, **kwargs)
  File "/Users/I747411/ai/venv/lib/python3.10/site-packages/langchain_weaviate/vectorstores.py", line 165, in add_texts
    embeddings = self._embedding.embed_documents(list(texts))
  File "/Users/I747411/ai/venv/lib/python3.10/site-packages/langchain_community/embeddings/huggingface.py", line 331, in embed_documents
    embeddings = self.client.encode(
  File "/Users/I747411/ai/venv/lib/python3.10/site-packages/sentence_transformers/SentenceTransformer.py", line 565, in encode
    if all_embeddings[0].dtype == torch.bfloat16:
IndexError: list index out of range
/Users/I747411/ai/venv/lib/python3.10/site-packages/weaviate/warnings.py:303: ResourceWarning: Con004: The connection to Weaviate was not closed properly. This can lead to memory leaks.
            Please make sure to close the connection using `client.close()`.

Please see the error message: “IndexError: list index out of range”.

What’s the proper way to use an existing vector db to do inference? Please help!

Adding a little more info: I am using the BGE-M3 embedding:

from langchain_community.embeddings import HuggingFaceBgeEmbeddings

encode_kwargs = {"device": "cpu"}
bge_m3_embedding = HuggingFaceBgeEmbeddings(model_name="BAAI/bge-m3", encode_kwargs=encode_kwargs)

As said, this works if the vector store is built and queried in the same flow. However, it reports the error if I do the querying only:
db = WeaviateVectorStore.from_documents([], embedding_model, client=weaviate_client, index_name='test')

On the other hand, if I use the OpenAI ‘text-embedding-ada-002’ embedding, there is no issue in either case. Does it have something to do with the embedding method?

Hope this additional info helps

Hi @MartinMin !!

I believe you are facing the same issue as in this thread:

This error seems to come from the Hugging Face interface in LangChain:

langchain_community/embeddings/huggingface.py", line 331,

Can you try this?

db = WeaviateVectorStore.from_documents(embeddings=embedding_model, client=weaviate_client, index_name='test')

My guess is that the Hugging Face interface in LangChain doesn’t check whether the documents passed as a parameter are an empty list.
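
To illustrate my guess, here is a minimal sketch of the failing path (assuming the sentence-transformers version from your traceback; newer releases may guard against the empty input):

from sentence_transformers import SentenceTransformer

# from_documents([]) ends up calling embed_documents([]), which forwards the
# empty list to SentenceTransformer.encode; encode then reads all_embeddings[0]
# from an empty result list.
model = SentenceTransformer('BAAI/bge-m3')
model.encode([])  # IndexError: list index out of range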

I think that is almost it. This should do the trick!

db = WeaviateVectorStore(embedding=embeddings, client=client, index_name=index_name)

The initializer approach mostly works, but ‘text_key’ needs to be added:
db = WeaviateVectorStore(embedding=embedding_model, client=weaviate_client, index_name='test3', text_key='text')
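
For reference, here is the full query-only flow that works for me now (a minimal sketch; it assumes the collection was built with the same embedding model and with text_key='text'):

import weaviate
from langchain_weaviate.vectorstores import WeaviateVectorStore

from bge import bge_m3_embedding  # same embedding model used at ingestion

weaviate_client = weaviate.connect_to_local()
db = WeaviateVectorStore(embedding=bge_m3_embedding, client=weaviate_client, index_name='test3', text_key='text')

query = 'What did the president say about Ketanji Brown Jackson'
results = db.similarity_search_with_score(query, alpha=1)
for i, (doc, score) in enumerate(results):
    print(f'{i}--->{score:.3f}')
weaviate_client.close()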

Another question: what exactly does ‘text_key’ mean? I understand it should be the textual field that you want to index, for example:

{
  'title': 'a test',
  'content': 'This is a paragraph'
}

In this case, I would set text_key='content'. However, if you look at this example, it sets text_key='text', but the document doesn’t have a ‘text’ field at all. It only has ‘page_content’, so I am confused!
https://python.langchain.com/v0.2/docs/integrations/retrievers/weaviate-hybrid/

Hi!

For LangChain, text_key is the property under which the actual content chunk is stored in the vector store.

In your case, you can set text_key to “content”, and title goes in as metadata.
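
Something like this (a minimal sketch; the collection name is just a placeholder, and embedding_model / weaviate_client are the objects from your snippets above):

from langchain.schema import Document
from langchain_weaviate.vectorstores import WeaviateVectorStore

doc = Document(
    page_content='This is a paragraph',  # stored under text_key='content'
    metadata={'title': 'a test'}         # stored as its own property
)
db = WeaviateVectorStore.from_documents(
    [doc], embedding_model, client=weaviate_client,
    index_name='TitleContent', text_key='content'
)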

Check this topic:

It’s a little bit “old”, as the code for Weaviate + LangChain has changed, but I have dug into this in the past.

But in the example I pasted above, why does it still work?

The document doesn't have a ‘text’ field, but text_key='text' still works?

This is because text_key is only used at ingestion, as the name of the property where the content is stored.

Take this, for example:

# single insertion
import weaviate
from langchain_openai import OpenAIEmbeddings
from langchain_weaviate.vectorstores import WeaviateVectorStore
from langchain.schema import Document

client = weaviate.connect_to_local()
embeddings = OpenAIEmbeddings()
doc1 = Document(
    page_content="this is the page content",
    metadata={"metadata1": "something", "metadata2": "other thing"}
)
db = WeaviateVectorStore.from_documents([doc1], embeddings, client=client, index_name="TestCollection")
print(client.collections.get("TestCollection").query.fetch_objects().objects[0].properties)

this will print:

{'text': 'this is the page content',
 'metadata2': 'other thing',
 'metadata1': 'something'}

now if we change text_key, this is what will happen:

db = WeaviateVectorStore.from_documents([doc1], embeddings, client=client, index_name="TestCollection2", text_key="text_goes_here")
print(client.collections.get("TestCollection2").query.fetch_objects().objects[0].properties)

and the output:

{'metadata2': 'other thing',
 'text_goes_here': 'this is the page content',
 'metadata1': 'something'}

Note that, as you passed some metadata (and changed the text_key), you can now use the hybrid weights to lean your query towards a specific metadata field.

When you are specifying the query properties, you will need to specify the very same text_key that you set for your collection.

For example, if you set text_key as text_goes_here:

db.similarity_search("thing", query_properties=["text_goes_here", "metadata1^2", "metadata2"])

If you don’t provide query_properties at query time, Weaviate will look at all searchable properties.
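
That is, a plain call like this (sketch) would search across text_goes_here, metadata1, and metadata2 alike:

db.similarity_search("thing")  # no query_properties: all searchable properties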

More on setting hybrid query weights on property values here:

Let me know if this helps!

Yes, it definitely helps. Now I understand that the ‘page_content’ field is silently stored under the ‘text’ field after the DB is created, and I think that clears up my confusion.
