Multimodal search with Bring your own vector

Hi, we are trying to create a schema that will support multimodal search, where a user issues text queries and performs semantic search across columns containing text or vectors.

Below is the schema. image_embeddings is a Bring Your Own Vector column: we generate the embeddings for an image ourselves and don't want Weaviate to create the vectors, but this column needs to be part of a multimodal search together with other fields like filename, tags, and mime_type. What is the correct way to define a schema for this multimodal search with Bring Your Own Vector?

import weaviate.classes.config as wc

client.collections.create(
    name="SemanticSchema",  # The name of the collection
    properties=[
        wc.Property(name="lcid", data_type=wc.DataType.TEXT),
        wc.Property(name="checksum", data_type=wc.DataType.TEXT),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
        wc.Property(name="mime_type", data_type=wc.DataType.TEXT),
        wc.Property(name="person_names", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="location", data_type=wc.DataType.TEXT),
        wc.Property(name="image_embeddings", data_type=wc.DataType.NUMBER_ARRAY),
    ],
    # Define & configure the vectorizer module
    vectorizer_config=[
        wc.Configure.NamedVectors.multi2vec_clip(
            name="filename", text_fields=["filename"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="tags", text_fields=["tags"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="mime_type", text_fields=["mime_type"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="location", text_fields=["location"]
        ),
        wc.Configure.NamedVectors.multi2vec_clip(
            name="image_filename_tags",
            image_fields=[
                wc.Multi2VecField(name="image_embeddings")
            ],
            text_fields=[
                wc.Multi2VecField(name="filename"),
                wc.Multi2VecField(name="tags"),
                wc.Multi2VecField(name="mime_type"),
                wc.Multi2VecField(name="location"),
            ],
        ),
    ],
    # Define the generative module
    # generative_config=wc.Configure.Generative.openai(),

    # Add sharding configuration
    sharding_config=wc.Configure.sharding(
        virtual_per_physical=128,
        desired_count=2,
        desired_virtual_count=128,
    ),
    replication_config=wc.Configure.replication(
        factor=2,
        async_enabled=True,
    ),
)

hi @Krishna_C !!

Welcome back to our forums :hugs:

There are some problems with this schema, and a couple of things you should understand about how our multi2vec_clip works so you can make the most of it.

First, I see you are storing the image embeddings as a property. This is far from optimal, as the inverted index will create keyword indexes for your vectors.

That will lead to unnecessary memory usage, as you don't want to perform keyword searches on your vector dimensions :crazy_face:

If you really want to do that, you need to make sure to specify that this property shouldn't be searchable or filterable, like so:

        wc.Property(
            name="image_embeddings",
            data_type=wc.DataType.NUMBER_ARRAY,
            index_filterable=False,
            index_searchable=False,
        ),

The second issue I see is this:

            image_fields=[
                wc.Multi2VecField(name="image_embeddings")
            ],

And the problem here is that the image field must be a blob (the base64-encoded image itself), not the image vector, as we state in our docs.

Now, the way our multi2vec_clip works is that Weaviate will generate a vector for all text fields and image fields, and combine those vectors into one.

So you may not need one named vector for each isolated property. If you have a single vector covering all the text and image fields, that vector will be a representation of all of that content.
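
For example, here is a minimal sketch of that approach, assuming the image is stored base64-encoded in a BLOB property named image; the collection name and the field weights are illustrative assumptions, not prescriptions:

import weaviate
import weaviate.classes.config as wc

client = weaviate.connect_to_local()  # assumes a local instance with the multi2vec-clip module enabled

client.collections.create(
    name="SemanticSchemaClip",  # hypothetical collection name
    properties=[
        # the image itself, base64-encoded, stored as a BLOB property
        wc.Property(name="image", data_type=wc.DataType.BLOB),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
    ],
    vectorizer_config=[
        # one named vector combining the image and the text fields;
        # the weights below are illustrative, not prescriptive
        wc.Configure.NamedVectors.multi2vec_clip(
            name="image_filename_tags",
            image_fields=[wc.Multi2VecField(name="image", weight=0.9)],
            text_fields=[
                wc.Multi2VecField(name="filename", weight=0.05),
                wc.Multi2VecField(name="tags", weight=0.05),
            ],
        ),
    ],
)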

Let me know if this clarifies and how I can help you further.

Thanks!

thanks @DudaNogueira ! I still have the questions below, please clarify:

In Weaviate, if I use the schema below and later fill the vector field as Bring Your Own Vector, using the SentenceTransformer('clip-ViT-B-32') model, then when searching with a text query (query="button"), should I vectorize the query before searching so that the search can run across the different vectors in the schema?
Will the result be a union of matches across the different vectors (tags, the multi2vec_clip vectors defined in the schema, and the vector set at data import with vector=image_embeddings.tolist())?

search code:

from weaviate.classes.query import MetadataQuery

semanticSchema = client.collections.get("SemanticSchema")
response = semanticSchema.query.hybrid(
    query="button",
    target_vector="image_filename_tags",
    return_metadata=MetadataQuery(score=True, explain_score=True),
    limit=3,
)

schema definition:

client.collections.create(
    name="SemanticSchema_BYOV",
    properties=[
        wc.Property(name="lcid", data_type=wc.DataType.TEXT),
        wc.Property(name="checksum", data_type=wc.DataType.TEXT),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT, vectorizer="multi2vec-clip"),
        wc.Property(name="mime_type", data_type=wc.DataType.TEXT),
        wc.Property(name="person_names", data_type=wc.DataType.TEXT_ARRAY, vectorizer="multi2vec-clip"),
        wc.Property(name="location", data_type=wc.DataType.TEXT),
        wc.Property(
            name="image_embeddings",
            data_type=wc.DataType.NUMBER_ARRAY,
            index_filterable=False,
            index_searchable=False,
        ),
    ],
    # Define & configure the vectorizer module
    vectorizer_config=wc.Configure.Vectorizer.multi2vec_clip(
        text_fields=[
            wc.Multi2VecField(name="filename"),
            wc.Multi2VecField(name="tags"),
            wc.Multi2VecField(name="mime_type"),
            wc.Multi2VecField(name="location"),
        ]
    ),
    # Define the generative module
    # generative_config=wc.Configure.Generative.openai(),

    # Add sharding configuration
    sharding_config=wc.Configure.sharding(
        virtual_per_physical=128,
        desired_count=2,
        desired_virtual_count=128,
    ),
    replication_config=wc.Configure.replication(
        factor=2,
        async_enabled=True,
    ),
)

Data insert of Bring Your Own Vector:

import json

from PIL import Image
from sentence_transformers import SentenceTransformer
from weaviate.util import generate_uuid5

# Define the function to generate embeddings using CLIP
def generateEmbeddingsForImage(img_path):
    model = SentenceTransformer("clip-ViT-B-32")
    image = Image.open(img_path).convert("RGB")
    embeddings = model.encode(image)
    return embeddings

try:
    # Generate image embeddings for the provided image
    image_embeddings = generateEmbeddingsForImage("./pics/Albert-Einstein.jpg")

    # Build the single record to insert into Weaviate
    data_object = {
        "lcid": "a510a7badc1849eb997555073e3953fe1",
        "checksum": "checksum1",
        "filename": "Albert-Einstein",
        "mime_type": "jpg",
        "person_names": ["Albert Einstein"],
        "tags": json.dumps([
            {"name": "human face", "confidence": 0.9927250146865845},
            {"name": "clothing", "confidence": 0.9834420680999756},
            {"name": "person", "confidence": 0.9827311038970947},
            {"name": "wrinkle", "confidence": 0.9579571485519409},
            {"name": "portrait", "confidence": 0.954142689704895},
            {"name": "forehead", "confidence": 0.924140453338623},
            {"name": "chin", "confidence": 0.9210121631622314},
            {"name": "senior citizen", "confidence": 0.8994851112365723},
            {"name": "human", "confidence": 0.8762668371200562},
            {"name": "jaw", "confidence": 0.8676725625991821},
            {"name": "indoor", "confidence": 0.758019089698791},
        ]),
        "location": "Germany",
        "image_embeddings": image_embeddings.tolist(),  # Convert embeddings to a list for insertion
    }

    # Insert the object into the "SemanticSchema_BYOV" collection
    semanticSchema = client.collections.get("SemanticSchema_BYOV")
    uuid = semanticSchema.data.insert(
        properties=data_object,
        vector=image_embeddings.tolist(),
        uuid=generate_uuid5(data_object),
    )
    print("Data successfully inserted into Weaviate.")
except Exception as e:
    print(f"Failed to insert data: {e}")

hi!

If you define the vectorizer properly, you can still provide your own vector, and use near_text.

What will happen "under the hood" is that Weaviate will vectorize your query.

If you do not define a vectorizer for your named vector, Weaviate has no way of vectorizing your data.

So you will need to vectorize your query yourself and provide it when querying: instead of relying on query alone, you pass your own query vector (for example via near_vector, or the vector argument of a hybrid query).
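
For example, a minimal sketch of self-vectorized querying, assuming the same clip-ViT-B-32 model used at import time and the SemanticSchema_BYOV collection defined above:

from sentence_transformers import SentenceTransformer

# The query must be embedded with the same model that produced the stored
# vectors, otherwise the distances are meaningless.
model = SentenceTransformer("clip-ViT-B-32")
query_vector = model.encode("button").tolist()

semanticSchema = client.collections.get("SemanticSchema_BYOV")

# Pure vector search with a self-provided query vector
response = semanticSchema.query.near_vector(
    near_vector=query_vector,
    limit=3,
)

# Hybrid search: BM25 keyword matching on the query text, plus vector
# search using the self-provided query vector
response = semanticSchema.query.hybrid(
    query="button",
    vector=query_vector,
    limit=3,
)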

Let me know if this helps.

But how do I search over my Bring Your Own Vector and also the named vectors in a single search query?

@DudaNogueira please advise how to modify the schema below and perform a hybrid search for the Bring Your Own Vector scenario. Also, if I have another field like tags that I need to vectorize using a Weaviate vectorizer, how do I define the schema and search? Basically I need:
a) an image_embeddings field that stores a Bring Your Own Vector, and
b) another vector on a combination of one or more fields like (filename, tags),

AND

c) the ability to search across these vectors, like a multi-vector search or hybrid search. Please provide the altered schema, data insert, and search code.

Schema:

client.collections.create(
    name="SemanticSchema_BYOV1",
    properties=[
        wc.Property(name="lcid", data_type=wc.DataType.TEXT),
        wc.Property(name="checksum", data_type=wc.DataType.TEXT),
        wc.Property(name="filename", data_type=wc.DataType.TEXT),
        wc.Property(name="tags", data_type=wc.DataType.TEXT),
        wc.Property(name="mime_type", data_type=wc.DataType.TEXT),
        wc.Property(name="person_names", data_type=wc.DataType.TEXT_ARRAY),
        wc.Property(name="location", data_type=wc.DataType.TEXT),
        wc.Property(
            name="image_embeddings",
            data_type=wc.DataType.NUMBER_ARRAY,
            vectorizer_config=wc.Configure.Vectorizer.none(),
        ),
    ],
    # Configure the multi2vec-clip vectorizer for text fields
    vectorizer_config=wc.Configure.Vectorizer.multi2vec_clip(
        text_fields=[
            wc.Multi2VecField(name="filename"),
            wc.Multi2VecField(name="tags"),
            wc.Multi2VecField(name="mime_type"),
            wc.Multi2VecField(name="location"),
        ]
    ),
    # Define the generative module
    # generative_config=wc.Configure.Generative.openai(),

    # Add sharding configuration
    sharding_config=wc.Configure.sharding(
        virtual_per_physical=128,
        desired_count=3,
        desired_virtual_count=128,
    ),
    replication_config=wc.Configure.replication(
        factor=2,
        async_enabled=True,
    ),
)

Insert Data code:

# Insert the single record into Weaviate
data_object = {
    "lcid": lcid,
    "checksum": checksum,
    "filename": file_name,
    "mime_type": mime_type,
    "person_names": ["person"],
    "tags": json.dumps(tags),
    "location": location,
    "image_embeddings": image_embeddings,  # Bring Your Own Vector
}

# Insert the object
uuid = semanticSchema.data.insert(
    properties=data_object,
    uuid=generate_uuid5(data_object),
)

@DudaNogueira please provide a schema for the above? I am unable to find a solution for a combination of Bring Your Own Vector and fields that use Weaviate vectorizers. I need to make a presentation today. Your inputs will help me! Thanks!

hi @Krishna_C !

Sorry for the delay.

This is how you would insert and query a named vector with bring your own vector:

import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()  # connect however suits your deployment

# create the collection
client.collections.delete("NamedVectorCollection")
collection = client.collections.create(
    name="NamedVectorCollection",
    vectorizer_config=[
        wvc.config.Configure.NamedVectors.none(name="text_vector"),
        wvc.config.Configure.NamedVectors.none(name="title_vector")
    ],
    properties=[
        wvc.config.Property(
            name="text",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True
        ),
        wvc.config.Property(
            name="title",
            data_type=wvc.config.DataType.TEXT,
            vectorize_property_name=True
        ),
    ]
)
# now we insert data
collection.data.insert({
        "text": "this is a text",
        "title": "this is a title"
    },
    vector={
        "text_vector": [1,2,3,4,5],
        "title_vector": [1,2,3,4,5,6,7,8,9,10]
    }
)
# now we query
query = collection.query.near_vector(
    target_vector=["text_vector"],
    near_vector=[5,4,3,2,1],
    return_metadata=wvc.query.MetadataQuery(distance=True)
)
print(query.objects[0].properties)
print(query.objects[0].metadata.distance)

This was my output:

{'text': 'this is a text', 'title': 'this is a title'}
0.3636362552642822
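
And for a hybrid search against one of these named vectors, a minimal sketch (you still embed the query yourself, since the named vectors have no vectorizer; the example vector is illustrative):

# Hybrid: BM25 keyword matching on the query text, plus vector search
# against the chosen named vector using a self-provided query vector
response = collection.query.hybrid(
    query="this is a text",
    vector=[5, 4, 3, 2, 1],  # must match the dimensionality of "text_vector"
    target_vector="text_vector",
    return_metadata=wvc.query.MetadataQuery(score=True),
    limit=3,
)
print(response.objects[0].properties)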

Let me know if that helps!

Thanks for your response @DudaNogueira ! Could you please provide code for a multi-vector search across a Bring Your Own Vector field together with a named vector that uses the multi2vec-clip vectorizer? The search should happen across these vectors. Is this possible? The Bring Your Own Vector should not be part of the multi2vec-clip config, and the search should span two or more vectors: one of them the Bring Your Own Vector AND (the multi2vec vector OR another named vector).