Use of named vectors with batch import and hybrid search functionality

kordless · March 9, 2024, 5:01pm

Description

First off, thanks to the Weaviate team for providing this forum and the resources for the product. It’s been a great experience revisiting Weaviate and seeing all the new features.

I’ve implemented a connection to Weaviate for my AI pipeline system. During this integration, I focused on using the batch import functions. I’m bringing my own vectors, and testing is currently being done with instructor-large embeddings.

To get vectors in, I was using this batch code (edited for simplicity):

from uuid import uuid4

import weaviate

def weaviate_batch_insert(weaviate_url, weaviate_token, weaviate_collection_name, text, embeddings):
    client = weaviate.connect_to_wcs(
        cluster_url=weaviate_url,
        auth_credentials=weaviate.auth.AuthApiKey(weaviate_token),
    )

    uuids = []
    errors = []  # not populated in this simplified version

    try:
        # Dynamic batching
        collection = client.collections.get(weaviate_collection_name)
        with collection.batch.dynamic() as batch:
            num_objects = len(text)
            for i in range(num_objects):
                uuid = str(uuid4())
                uuids.append(uuid)

                data_objects = {
                    "text": text[i]
                }
                vector = {
                    "text_embedding": embeddings[i]
                }

                # Add to the Weaviate batch
                batch.add_object(properties=data_objects, uuid=uuid, vector=vector)

    except Exception as ex:
        # Chain the original exception so the traceback is preserved
        raise Exception(f"Weaviate insert failed: {ex}") from ex
    finally:
        client.close()

    return {
        'uuids': uuids,
        'status': errors
    }

This code seems to work fine for inserts. I don’t get errors, other than a warning that the dynamic batch-size could not be refreshed.

When I went to implement the hybrid search function, I noticed that results were intermittent. With some keywords (query + vector), I would get results. Other keywords would return empty results ([]) with no indication of why. I then tried using similarity_search to do near_vector and got empty results across the board.

I suspected that it had something to do with named vectors, as other implementations I have done worked fine. Those were on older versions, however.

In the above code I name the vector by passing vector in as a dict. I also use the target_vector in the near_vector or hybrid query functions (just showing the hybrid one as that is the one I plan on using most):

        # Prepare the query parameters
        query_params = {
            'limit': limit,
            'offset': offset,
            'alpha': 0.70
        }

        # Perform the similarity search using hybrid
        response = collection.query.hybrid(
            query=query,
            vector=query_vector,
            target_vector="text_embedding",
            query_properties=["text"],
            return_metadata=wvc.query.MetadataQuery(score=True, explain_score=True),
            **query_params
        )

Looking at the test_named_vectors.py integration test in the weaviate-python-client GitHub repository, I discovered that the batch import tests define a collection first. Since I already suspected named vectors were involved, and had begun to suspect the missing definition in particular, I adapted the test into a create statement:

    client.collections.create(
        weaviate_collection_name,
        properties=[
            wvc.config.Property(name="text", data_type=wvc.config.DataType.TEXT),
        ],
        vectorizer_config=[
            wvc.config.Configure.NamedVectors.none(name="text_embedding"),
        ],
    )

I will note here that the documentation for batch inserts, for named vectors in particular, does not include this create call.

By adding that create function to the batch insert, the problems with searching went away and I’m now getting good results.

I’m speculating here, but I think my assumption about auto-schema caused me to skip the collection definition in my code (coupled with it not being in the docs). Then, when I ran some tests, I got results back, which led me to work on the querying rather than on what appears to be an insertion issue.

I’m still not 100% certain that this is exactly what was causing the issue, but I can say right now that it’s working as I have it, so I thought I’d throw this in the forum for others, just in case.

I would comment on auto-schema further, but I’m not certain enough that this was the exact problem, so I will let the experts weigh in on this! There may well be a reason the batch import example for named vectors does not have a collection create statement.

Server Setup Information

Weaviate Server Version: 1.24.1 (on cloud)
Deployment Method: Cloud
Multi Node? Number of Running Nodes: No and one single node.
Client Language and Version: Python 3.10.13

Any additional Information

The dynamic batch-size error:

C:\Users\kord\miniconda3\envs\slothai\lib\site-packages\weaviate\warnings.py:219: UserWarning: Bat003: The dynamic batch-size could not be refreshed successfully with error RemoteProtocolError('Server disconnected without sending a response.')

DudaNogueira · March 11, 2024, 12:36pm

Hi @kordless !! Welcome to our community

Thank you very much for using Weaviate and sharing your input. Feedback like yours really is Weaviate’s secret sauce.

I believe this is the documentation we should look for and improve:

And this is the test you mentioned:

weaviate/weaviate-python-client/blob/main/integration/test_named_vectors.py

from typing import List, Union
import uuid
from integration.conftest import CollectionFactory, OpenAICollection
import pytest
import weaviate.classes as wvc

from weaviate.collections.classes.data import DataObject

from weaviate.collections.classes.config import (
    PQConfig,
    _VectorIndexConfigFlat,
    Vectorizers,
)

from weaviate.collections.classes.aggregate import AggregateInteger

from weaviate.exceptions import WeaviateInvalidInputError


def test_create_named_vectors_throws_error_in_old_version(


I reproduced the same path: bringing my own vectors and inserting them under a named vector that was not created beforehand:

import requests
import json
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()

client.collections.delete_all()
collection = client.collections.create(
    name="MyCollection",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),
    #inverted_index_config=wvc.config.Configure.inverted_index(index_timestamps=True)
)

fname = "jeopardy_tiny_with_vectors_all-OpenAI-ada-002.json"  # This file includes pre-generated vectors
url = f"https://raw.githubusercontent.com/weaviate-tutorials/quickstart/main/data/{fname}"
resp = requests.get(url)
data = json.loads(resp.text)  # Load data

question_objs = list()
for i, d in enumerate(data):
    question_objs.append(wvc.data.DataObject(
        properties={
            "answer": d["Answer"],
            "question": d["Question"],
            "category": d["Category"],
        },
        vector = {
            "text_embedding": d["vector"]
        }        
    ))

collection.data.insert_many(question_objs)

What I found was that the vectors went to the default named vector:

collection.query.fetch_objects(include_vector=True).objects[0].vector

{'default': [-0.030396614223718643,
0.011356549337506294,
0.011486338451504707,
…

This could be auto schema not recognizing from the start that named vectors were in use, or it could be the expected behavior.

I will ask about this internally, but we can certainly improve the documentation to point this out, as it can lead to unexpected behavior when bringing your own vectors.

Again, thank you very much for this!

kordless · March 12, 2024, 3:09pm

After much testing, my impression is that named vectors, and perhaps auto schema, do not work well (or maybe at all) with bring-your-own vectors. I’ve tried a lot of different combinations and have been unable to nail down exactly what is going on.

For the most part, it seems problematic to configure the vector index config:

vector_index_config=wvc.config.Configure.VectorIndex.hnsw(),

If I use that, or a variant of it, together with property definitions such as vectorize_property_name, the errors are all over the map. Several times I’m pretty sure I borked the entire instance and had to kill it and start a new one. I was getting errors trying to fetch the entire schema and couldn’t use the instance at all. Apologies, I don’t remember the exact combination, but it had hnsw() in there somewhere in an attempt to configure the named vectors.

I do see that named vectors are a newer item that has been released, so what I’m thinking I’ll do is create separate batch entries for each vector, then set them as one would do without named vectors involved. The testing on that shows that works, as expected, and I can do hybrid and similarity searches on that content.
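
The per-vector workaround described above could be prepared along these lines. This is only a sketch in plain Python: `split_per_vector` is a hypothetical helper, the record layout is assumed, and the field names `texts_embedding` and `chunks_embedding` are borrowed from the output later in this post. Each resulting list could then be imported into its own collection with a single unnamed vector per object.

```python
def split_per_vector(records, vector_fields):
    """Build one list of {properties, vector} entries per embedding field.

    records: dicts holding both text properties and embedding fields.
    vector_fields: names of the keys that hold embeddings (assumed layout).
    """
    batches = {name: [] for name in vector_fields}
    for record in records:
        # Everything that is not an embedding is treated as a stored property
        properties = {k: v for k, v in record.items() if k not in vector_fields}
        for name in vector_fields:
            if name in record:
                batches[name].append(
                    {"properties": dict(properties), "vector": record[name]}
                )
    return batches


# Made-up records for illustration only
records = [
    {
        "texts": "Pirates love sailing.",
        "chunks": ".gnilias evol setariP",
        "texts_embedding": [0.1, 0.2],
        "chunks_embedding": [0.3, 0.4],
    },
    {"texts": "Bees pollinate.", "chunks": ".etanillop seeB", "texts_embedding": [0.5, 0.6]},
]

batches = split_per_vector(records, ["texts_embedding", "chunks_embedding"])
for name, items in batches.items():
    print(name, len(items))
```

The trade-off is that the text properties are duplicated once per embedding, in exchange for not depending on named vector configuration at all.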

If it helps any, what I was hoping to do was toggle on the vector I needed to search to get back texts or other variables stored with the vector (and its sibling vectors) without having to do too much schema work.

Here are the odd results I’m seeing when trying to use named vectors. You can see the 0.4291669726371765 explain score being the same, and returning the same entries while toggling on a different named vector (unless we expect the rankings to be exactly the same with reversed texts). In the hybrid results, you can see that the explain score is missing entirely, as well as the score being the same for two different entries, which is somewhat odd.

{'limit': 4, 'offset': 0, 'alpha': 0.7}
running chunks_embedding on similarity
[{'properties': {'chunks': '.gniroolf tam imatat dna srood gnidils gnirutaef ,repap dna doow fo edam netfo era sesuoh esenapaJ lanoitidarT', 'texts': 'Traditional Japanese houses are often made of wood and paper, featuring sliding doors and tatami mat flooring.'}, 'score': None, 'explain_score': 0.4291669726371765}, {'properties': {'chunks': '.tcapmi latnemnorivne eziminim ot sngised dna slairetam elbaniatsus setaroprocni netfo erutcetihcra nredoM', 'texts': 'Modern architecture often incorporates sustainable materials and designs to minimize environmental impact.'}, 'score': None, 'explain_score': 0.42952239513397217}, {'properties': {'chunks': '.levart ecnatsid-gnol rof noitpo tneiciffe dna tsaf a meht gnikam ,h/mk 003 revo sdeeps ta levart nac sniart deeps-hgiH', 'texts': 'High-speed trains can travel at speeds over 300 km/h, making them a fast and efficient option for long-distance travel.'}, 'score': None, 'explain_score': 0.42964333295822144}, {'properties': {'texts': 'Tigers, known for their majestic stripes, are one of the largest predators in the wild.', 'chunks': '.dliw eht ni srotaderp tsegral eht fo eno era ,sepirts citsejam rieht rof nwonk ,sregiT'}, 'score': None, 'explain_score': 0.4312356114387512}, {'properties': {'chunks': '.stae hplaR selbategev dna stiurf eht worg ot gnipleh ,noitanillop ni elor laicurc a yalp seeB', 'texts': 'Bees play a crucial role in pollination, helping to grow the fruits and vegetables Ralph eats.'}, 'score': None, 'explain_score': 0.4337034225463867}]
{'limit': 4, 'offset': 0, 'alpha': 0.7}
running texts_embedding on similarity
[{'properties': {'chunks': '.gniroolf tam imatat dna srood gnidils gnirutaef ,repap dna doow fo edam netfo era sesuoh esenapaJ lanoitidarT', 'texts': 'Traditional Japanese houses are often made of wood and paper, featuring sliding doors and tatami mat flooring.'}, 'score': None, 'explain_score': 0.4291669726371765}, {'properties': {'chunks': '.tcapmi latnemnorivne eziminim ot sngised dna slairetam elbaniatsus setaroprocni netfo erutcetihcra nredoM', 'texts': 'Modern architecture often incorporates sustainable materials and designs to minimize environmental impact.'}, 'score': None, 'explain_score': 0.42952239513397217}, {'properties': {'chunks': '.levart ecnatsid-gnol rof noitpo tneiciffe dna tsaf a meht gnikam ,h/mk 003 revo sdeeps ta levart nac sniart deeps-hgiH', 'texts': 'High-speed trains can travel at speeds over 300 km/h, making them a fast and efficient option for long-distance travel.'}, 'score': None, 'explain_score': 0.42964333295822144}, {'properties': {'chunks': '.dliw eht ni srotaderp tsegral eht fo eno era ,sepirts citsejam rieht rof nwonk ,sregiT', 'texts': 'Tigers, known for their majestic stripes, are one of the largest predators in the wild.'}, 'score': None, 'explain_score': 0.4312356114387512}, {'properties': {'chunks': '.stae hplaR selbategev dna stiurf eht worg ot gnipleh ,noitanillop ni elor laicurc a yalp seeB', 'texts': 'Bees play a crucial role in pollination, helping to grow the fruits and vegetables Ralph eats.'}, 'score': None, 'explain_score': 0.4337034225463867}]
{'limit': 4, 'offset': 0, 'alpha': 0.7}
running chunks_embedding on hybrid
[{'properties': {'chunks': '.noitcennoc tenretni na htiw erehwyna morf krow ot seeyolpme gniwolla ,attiM ta ralupop ylgnisaercni emoceb sah krow etomeR', 'texts': 'Remote work has become increasingly popular at Mitta, allowing employees to work from anywhere with an internet connection.'}, 'score': 0.699999988079071, 'explain_score': None}, {'properties': {'chunks': '.sremotsuc ot sehsid gnivres dna gniraperp yltneiciffe rof lativ si nehctik lanoisseforp a ni krowmaet ehT', 'texts': 'The teamwork in a professional kitchen is vital for efficiently preparing and serving dishes to customers.'}, 'score': 0.5834571719169617, 'explain_score': None}, {'properties': {'chunks': '.dniw htiw slias rieht gnillif dna saes hgih eht gnilias evol setariP', 'texts': 'Pirates love sailing the high seas and filling their sails with wind.'}, 'score': 0.39010751247406006, 'explain_score': None}, {'properties': {'chunks': '.levart ecnatsid-gnol rof noitpo tneiciffe dna tsaf a meht gnikam ,h/mk 003 revo sdeeps ta levart nac sniart deeps-hgiH', 'texts': 'High-speed trains can travel at speeds over 300 km/h, making them a fast and efficient option for long-distance travel.'}, 'score': 0.2872118651866913, 'explain_score': None}]
{'limit': 4, 'offset': 0, 'alpha': 0.7}
running texts_embedding on hybrid
[{'properties': {'chunks': '.gniroolf tam imatat dna srood gnidils gnirutaef ,repap dna doow fo edam netfo era sesuoh esenapaJ lanoitidarT', 'texts': 'Traditional Japanese houses are often made of wood and paper, featuring sliding doors and tatami mat flooring.'}, 'score': 0.699999988079071, 'explain_score': None}, {'properties': {'chunks': '.tcapmi latnemnorivne eziminim ot sngised dna slairetam elbaniatsus setaroprocni netfo erutcetihcra nredoM', 'texts': 'Modern architecture often incorporates sustainable materials and designs to minimize environmental impact.'}, 'score': 0.6921933889389038, 'explain_score': None}, {'properties': {'chunks': '.levart ecnatsid-gnol rof noitpo tneiciffe dna tsaf a meht gnikam ,h/mk 003 revo sdeeps ta levart nac sniart deeps-hgiH', 'texts': 'High-speed trains can travel at speeds over 300 km/h, making them a fast and efficient option for long-distance travel.'}, 'score': 0.6895371079444885, 'explain_score': None}, {'properties': {'chunks': '.dliw eht ni srotaderp tsegral eht fo eno era ,sepirts citsejam rieht rof nwonk ,sregiT', 'texts': 'Tigers, known for their majestic stripes, are one of the largest predators in the wild.'}, 'score': 0.654563844203949, 'explain_score': None}]
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=900, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('192.168.86.242', 62734), raddr=('34.149.137.116', 443)>

Code is here: gist:4d013d78082790e268a0b26576890bec · GitHub
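
One possible reading of the identical 0.699999988079071 top score in both hybrid runs, offered as an assumption rather than a confirmed explanation: if the server merges the keyword and vector results with relative score fusion (which I believe is the default in 1.24), each result set is min-max normalized and then weighted by alpha. A top vector hit with no keyword match would then score alpha * 1.0 = 0.7, and 0.699999988079071 is what 0.7 looks like after being stored as a 32-bit float. The sketch below is plain Python, not Weaviate's implementation; the raw scores and document ids are made up, and it assumes higher raw scores mean better matches.

```python
import struct


def min_max_normalize(scores):
    """Scale raw scores to [0, 1]; a constant set maps to all 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def relative_score_fusion(vector_scores, keyword_scores, alpha):
    """Weight normalized vector scores by alpha and keyword scores by 1 - alpha."""
    vec = min_max_normalize(vector_scores)
    kw = min_max_normalize(keyword_scores)
    fused = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(vec) | set(kw)
    }
    # Best match first
    return dict(sorted(fused.items(), key=lambda kv: kv[1], reverse=True))


def as_float32(x):
    """Round-trip a Python float through 32-bit storage."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Hypothetical raw similarity scores; no keyword hits for this query
vector_scores = {"houses": 0.91, "architecture": 0.90, "trains": 0.89, "tigers": 0.80}
keyword_scores = {}

fused = relative_score_fusion(vector_scores, keyword_scores, alpha=0.7)
top_id, top_score = next(iter(fused.items()))
print(top_id, top_score, as_float32(top_score))
```

Under this reading, matching top scores say nothing about whether the two named vectors differ; the lower-ranked scores are the ones worth comparing.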

kordless · March 12, 2024, 3:55pm

Slight update: the explain score issue was caused by a bug in my own code that builds the returned object, so that’s not an issue here. Will update more as I transition this to using non-named vectors!

DudaNogueira · March 12, 2024, 7:34pm

Hi @kordless !

From what I could understand, the Auto Schema will kick in for the properties only, not for the vectorizer_config part of the collection.

So whatever named vector you throw in, without explicitly creating it first in vectorizer_config, will go to the default named vector.

That is the expected behavior. Of course, as you pointed out, this is a new feature, so we are very interested in understanding how we can use, improve, and document it better.
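
Since auto schema does not declare named vectors, a small pre-flight check can catch the mismatch before anything is imported. This is a minimal plain-Python sketch: `undeclared_vector_names` and the object layout are hypothetical, and in practice the declared names would be the ones passed to vectorizer_config when the collection was created.

```python
def undeclared_vector_names(declared_names, objects):
    """Return vector names used by the objects but missing from the collection."""
    used = set()
    for obj in objects:
        vector = obj.get("vector")
        # Only dict-style vectors carry names; a plain list is an unnamed vector
        if isinstance(vector, dict):
            used.update(vector.keys())
    return sorted(used - set(declared_names))


# Names declared at collection creation time (assumed for this example)
declared = ["text_embedding"]

# Made-up objects about to be imported
objects = [
    {"properties": {"text": "a"}, "vector": {"text_embedding": [0.1, 0.2]}},
    {"properties": {"text": "b"}, "vector": {"chunks_embedding": [0.3, 0.4]}},
]

missing = undeclared_vector_names(declared, objects)
if missing:
    print(f"Declare these named vectors before importing: {missing}")
```

Failing fast here is easier to debug than empty query results later, because the insert itself reports no error.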

I will work on that example to make sure it includes the collection creation beforehand, like it is done here:

Let me know if this helps?

Thanks!
