Multimodal search with Bring your own vector

Hi, we are trying to create a schema that will support multimodal search, where a user issues text queries and performs semantic search across columns containing text or vectors.

Below is the schema. image_embeddings is a Bring Your Own Vector column: we generate the embeddings for an image ourselves and don't want Weaviate to create the vectors, but this column needs to be part of a multimodal search together with other fields like filename, tags, and mime_type. What is the correct way to define a schema for this multimodal search with Bring Your Own Vector?

import weaviate.classes.config as wc

client.collections.create(
    name="SemanticSchema",  # The name of the collection
    properties=[
        wc.Property(name="lcid", data_type=wc.DataType.TEXT),
        wc.Property(name="checksum", data_type=wc.DataType.TEXT),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
        wc.Property(name="mime_type", data_type=wc.DataType.TEXT),
        wc.Property(name="person_names", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="location", data_type=wc.DataType.TEXT),
        wc.Property(name="image_embeddings", data_type=wc.DataType.NUMBER_ARRAY),
    ],
    # Define & configure the vectorizer module
    vectorizer_config=[
        wc.Configure.NamedVectors.multi2vec_clip(
            name="filename", text_fields=["filename"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="tags", text_fields=["tags"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="mime_type", text_fields=["mime_type"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="location", text_fields=["location"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="image_filename_tags",
            image_fields=[
                wc.Multi2VecField(name="image_embeddings")
            ],
            text_fields=[
                wc.Multi2VecField(name="filename"),
                wc.Multi2VecField(name="tags"),
                wc.Multi2VecField(name="mime_type"),
                wc.Multi2VecField(name="location"),
            ],
        ),
    ],
    # Define the generative module
    # generative_config=wc.Configure.Generative.openai(),

    # Add sharding configuration
    sharding_config=wc.Configure.sharding(
        virtual_per_physical=128,
        desired_count=2,
        desired_virtual_count=128,
    ),
    replication_config=wc.Configure.replication(
        factor=2,
        async_enabled=True,
    ),
)

hi @Krishna_C !!

Welcome back to our forums :hugs:

There are some problems with this schema, and a couple of things you should understand about how our multi2vec_clip works so you can make the most of it.

First, I see you are storing the image embeddings as a property. This is far from optimal, as the inverted index will create keyword indexes for your vectors.

That will lead to unnecessary memory usage, as you don't want to perform keyword searches on your vector dimensions :crazy_face:

If you really want to do that, you need to make sure to specify that this property shouldn't be searchable or filterable, like so:

        wc.Property(
            name="image_embeddings",
            data_type=wc.DataType.NUMBER_ARRAY,
            index_filterable=False,
            index_searchable=False,
        ),

The second issue I see is this:

            image_fields=[
                wc.Multi2VecField(name="image_embeddings")
            ],

And the problem here is that the image field must be a blob (the base64-encoded image itself), not the image vector, as we state in our docs.

Now, the way our multi2vec_clip works is that Weaviate will generate a vector for all text fields and image fields, and combine those vectors into one.

So you may not need one named vector for each isolated property. If you have a single vector covering all the text and image fields, that vector will be a representation of all of that content.
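
For example, here is a minimal sketch of that approach, assuming the image is stored base64-encoded in a BLOB property named image; the collection name and the field weights are illustrative assumptions, not prescriptions:

import weaviate
import weaviate.classes.config as wc

client = weaviate.connect_to_local()  # assumes a local instance with the multi2vec-clip module enabled

client.collections.create(
    name="SemanticSchemaClip",  # hypothetical collection name
    properties=[
        # the image itself, base64-encoded, stored as a BLOB property
        wc.Property(name="image", data_type=wc.DataType.BLOB),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
    ],
    vectorizer_config=[
        # one named vector combining the image and the text fields;
        # the weights below are illustrative, not prescriptive
        wc.Configure.NamedVectors.multi2vec_clip(
            name="image_filename_tags",
            image_fields=[wc.Multi2VecField(name="image", weight=0.9)],
            text_fields=[
                wc.Multi2VecField(name="filename", weight=0.05),
                wc.Multi2VecField(name="tags", weight=0.05),
            ],
        ),
    ],
)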

Let me know if this clarifies and how I can help you further.

Thanks!

thanks @DudaNogueira ! I still have the questions below, please clarify:

In Weaviate, if I use the schema below and later fill the vector field as Bring Your Own Vector, using the SentenceTransformer('clip-ViT-B-32') model, then when searching with a text query (query="button"), should I vectorize the query before searching so that the search can run across the different vectors in the schema?
Will the result be a union of matches across the different vectors (tags, the multi2vec_clip vectors defined in the schema, and the vector set at data import with vector=image_embeddings.tolist())?

search code:

from weaviate.classes.query import MetadataQuery

semanticSchema = client.collections.get("SemanticSchema")
response = semanticSchema.query.hybrid(
    query="button",
    target_vector="image_filename_tags",
    return_metadata=MetadataQuery(score=True, explain_score=True),
    limit=3,
)

schema definition:

client.collections.create(
    name="SemanticSchema_BYOV",
    properties=[
        wc.Property(name="lcid", data_type=wc.DataType.TEXT),
        wc.Property(name="checksum", data_type=wc.DataType.TEXT),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT, vectorizer="multi2vec-clip"),
        wc.Property(name="mime_type", data_type=wc.DataType.TEXT),
        wc.Property(name="person_names", data_type=wc.DataType.TEXT_ARRAY, vectorizer="multi2vec-clip"),
        wc.Property(name="location", data_type=wc.DataType.TEXT),
        wc.Property(
            name="image_embeddings",
            data_type=wc.DataType.NUMBER_ARRAY,
            index_filterable=False,
            index_searchable=False,
        ),
    ],
    # Define & configure the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.multi2vec_clip(
        text_fields=[
            wc.Multi2VecField(name="filename"),
            wc.Multi2VecField(name="tags"),
            wc.Multi2VecField(name="mime_type"),
            wc.Multi2VecField(name="location"),
        ]
    ),
    # Define the generative module
    # generative_config=wc.Configure.Generative.openai(),

    # Add sharding configuration
    sharding_config=wc.Configure.sharding(
        virtual_per_physical=128,
        desired_count=2,
        desired_virtual_count=128,
    ),
    replication_config=wc.Configure.replication(
        factor=2,
        async_enabled=True,
    ),
)

Data insert of Bring Your Own Vector:

import json

from PIL import Image
from sentence_transformers import SentenceTransformer
from weaviate.util import generate_uuid5

# Define the function to generate embeddings using CLIP
def generateEmbeddingsForImage(img_path):
    model = SentenceTransformer("clip-ViT-B-32")
    image = Image.open(img_path).convert("RGB")
    embeddings = model.encode(image)
    return embeddings

try:
    # Generate image embeddings for the provided image
    image_embeddings = generateEmbeddingsForImage("./pics/Albert-Einstein.jpg")

    # Build the single record to insert into Weaviate
    data_object = {
        "lcid": "a510a7badc1849eb997555073e3953fe1",
        "checksum": "checksum1",
        "filename": "Albert-Einstein",
        "mime_type": "jpg",
        "person_names": ["Albert Einstein"],
        "tags": json.dumps([
            {"name": "human face", "confidence": 0.9927250146865845},
            {"name": "clothing", "confidence": 0.9834420680999756},
            {"name": "person", "confidence": 0.9827311038970947},
            {"name": "wrinkle", "confidence": 0.9579571485519409},
            {"name": "portrait", "confidence": 0.954142689704895},
            {"name": "forehead", "confidence": 0.924140453338623},
            {"name": "chin", "confidence": 0.9210121631622314},
            {"name": "senior citizen", "confidence": 0.8994851112365723},
            {"name": "human", "confidence": 0.8762668371200562},
            {"name": "jaw", "confidence": 0.8676725625991821},
            {"name": "indoor", "confidence": 0.758019089698791},
        ]),
        "location": "Germany",
        "image_embeddings": image_embeddings.tolist(),  # Convert embeddings to a list for insertion
    }

    # Insert the object into the "SemanticSchema_BYOV" collection
    semanticSchema = client.collections.get("SemanticSchema_BYOV")
    uuid = semanticSchema.data.insert(
        properties=data_object,
        vector=image_embeddings.tolist(),
        uuid=generate_uuid5(data_object),
    )
    print("Data successfully inserted into Weaviate.")
except Exception as e:
    print(f"Failed to insert data: {e}")

hi!

If you define the vectorizer properly, you can still provide your own vector, and use near_text.

What will happen "under the hood" is that Weaviate will vectorize your query.

If you do not define a vectorizer for your named vector, Weaviate has no way of vectorizing your data.

So you will need to vectorize your query yourself and provide it when querying: instead of relying on query alone, you pass your own query vector (for example via near_vector, or the vector argument of a hybrid query).
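
For example, a minimal sketch of self-vectorized querying, assuming the same clip-ViT-B-32 model used at import time and the SemanticSchema_BYOV collection defined above:

from sentence_transformers import SentenceTransformer

# The query must be embedded with the same model that produced the stored
# vectors, otherwise the distances are meaningless.
model = SentenceTransformer("clip-ViT-B-32")
query_vector = model.encode("button").tolist()

semanticSchema = client.collections.get("SemanticSchema_BYOV")

# Pure vector search with a self-provided query vector
response = semanticSchema.query.near_vector(
    near_vector=query_vector,
    limit=3,
)

# Hybrid search: BM25 keyword matching on the query text, plus vector
# search using the self-provided query vector
response = semanticSchema.query.hybrid(
    query="button",
    vector=query_vector,
    limit=3,
)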

Let me know if this helps.

But how do I search over my Bring Your Own Vector and also the named vectors in a single search query?

@DudaNogueira please advise how to modify the schema below and perform a hybrid search for the Bring Your Own Vector scenario. Also, if I have another field like tags that I need to vectorize using a Weaviate vectorizer, how do I define the schema and search? Basically I need:
a) an image_embeddings field that stores a Bring Your Own Vector, and
b) another vector on a combination of one or more fields like (filename, tags),

AND

c) the ability to search across these vectors, like a multi-vector search or hybrid search. Please provide the altered schema, data insert, and search code.

Schema:

client.collections.create(
    name="SemanticSchema_BYOV1",
    properties=[
        wc.Property(name="lcid", data_type=wc.DataType.TEXT),
        wc.Property(name="checksum", data_type=wc.DataType.TEXT),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
        wc.Property(name="mime_type", data_type=wc.DataType.TEXT),
        wc.Property(name="person_names", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="location", data_type=wc.DataType.TEXT),
        wc.Property(
            name="image_embeddings",
            data_type=wc.DataType.NUMBER_ARRAY,
            vectorizer_config=wc.Configure.Vectorizer.none(),
        ),
    ],
    # Configure the multi2vec-clip vectorizer for text fields
    vectorizer_config=wc.Configure.Vectorizer.multi2vec_clip(
        text_fields=[
            wc.Multi2VecField(name="filename"),
            wc.Multi2VecField(name="tags"),
            wc.Multi2VecField(name="mime_type"),
            wc.Multi2VecField(name="location"),
        ]
    ),
    # Define the generative module
    # generative_config=wc.Configure.Generative.openai(),

    # Add sharding configuration
    sharding_config=wc.Configure.sharding(
        virtual_per_physical=128,
        desired_count=3,
        desired_virtual_count=128,
    ),
    replication_config=wc.Configure.replication(
        factor=2,
        async_enabled=True,
    ),
)

Insert Data code:

# Insert the single record into Weaviate
data_object = {
    "lcid": lcid,
    "checksum": checksum,
    "filename": file_name,
    "mime_type": mime_type,
    "person_names": ["person"],
    "tags": json.dumps(tags),
    "location": location,
    "image_embeddings": image_embeddings,  # Bring Your Own Vector
}

# Insert the object
uuid = semanticSchema.data.insert(
    properties=data_object,
    uuid=generate_uuid5(data_object),
)

@DudaNogueira please provide a schema for the above? I am unable to find a solution for a combination of Bring Your Own Vector and fields that use Weaviate vectorizers. I need to make a presentation today. Your inputs will help me! Thanks!

hi @Krishna_C !

Sorry for the delay.

This is how you would insert and query a named vector with bring your own vector:

import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()  # connect however suits your deployment

# create the collection
client.collections.delete("NamedVectorCollection")
collection = client.collections.create(
    name="NamedVectorCollection",
    vectorizer_config=[
        wvc.config.Configure.NamedVectors.none(name="text_vector"),
        wvc.config.Configure.NamedVectors.none(name="title_vector")
    ],
    properties=[
        wvc.config.Property(
            name="text",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True
        ),
        wvc.config.Property(
            name="title",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True
        ),
    ]
)
# now we insert data
collection.data.insert({
        "text": "this is a text",
        "title": "this is a title"
    },
    vector={
        "text_vector": [1,2,3,4,5],
        "title_vector": [1,2,3,4,5,6,7,8,9,10]
    }
)
# now we query
query = collection.query.near_vector(
    target_vector=["text_vector"],
    near_vector=[5,4,3,2,1],
    return_metadata=wvc.query.MetadataQuery(distance=True)
)
print(query.objects[0].properties)
print(query.objects[0].metadata.distance)

This was my output:

{'text': 'this is a text', 'title': 'this is a title'}
0.3636362552642822
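
And for a hybrid search against one of these named vectors, a minimal sketch (you still embed the query yourself, since the named vectors have no vectorizer; the example vector is illustrative):

# Hybrid: BM25 keyword matching on the query text, plus vector search
# against the chosen named vector using a self-provided query vector
response = collection.query.hybrid(
    query="this is a text",
    vector=[5, 4, 3, 2, 1],  # must match the dimensionality of "text_vector"
    target_vector="text_vector",
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=3,
)
print(response.objects[0].properties)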

Let me know if that helps!

Thanks for your response @DudaNogueira ! Could you please provide code for a multi-vector search across a Bring Your Own Vector field together with a named vector that uses the multi2vec-clip vectorizer? The search should happen across these vectors. Is this possible? The Bring Your Own Vector should not be part of the multi2vec-clip config, and the search should span two or more vectors: one of them the Bring Your Own Vector AND (the multi2vec vector OR another named vector).