Hi @justin.godden,
Your workflow should work. That said, I would recommend using the NamedVectors syntax to configure your vectorizers, as it gives you a much cleaner way to define what should be used for vectorization.
Let me walk through it step by step.
Create a simple collection
First, create a collection with a named vector and specify source_properties.
source_properties is the list of properties that should be used for vectorization (when a vector is not provided). This syntax is a lot easier to follow than using skip_vectorization.
from weaviate.classes.config import Configure

client.collections.create(
    "Article",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="content_vector",
            model=EMBEDDING_MODEL_NAME,
            source_properties=["title"],  # the list of properties used for vectorization
        ),
    ],
)
Notes on NamedVectors.text2vec_huggingface
name: this is the name of your vector space. Since it looks like you will only work with one vector per object, the name doesn't matter too much.
source_properties: this is the list of properties used for vectorization.
source_properties will only be used if you insert or update an object without providing a vector. So, for your initial import, it will be ignored, but it will be used when you add objects without vectors afterwards.
Also, you don't need to set vectorize_property_name=False and vectorize_collection_name=False, as these are set to false by default.
Create a simple collection with (optional) property schema
You can also provide the property schema together with named vectors, but that won't affect the source_properties defined in the named vectors.
Also, skip_vectorization in the property schema will be ignored, as source_properties takes precedence.
from weaviate.classes.config import Configure, Property, DataType

client.collections.create(
    "Article",
    vectorizer_config=[
        Configure.NamedVectors.text2vec_huggingface(
            name="content_vector",
            model=EMBEDDING_MODEL_NAME,
            source_properties=["title", "body"],  # the list of properties used for vectorization
        ),
    ],
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="body", data_type=DataType.TEXT),
        Property(
            name="author",
            data_type=DataType.TEXT,
            # this will get ignored, as source_properties already
            # define what should be used for vectorization
            skip_vectorization=False,
        ),
    ],
)
Initial data load
Then you can insert your data with your vectors – and since you will provide your vectors, the vectorizer will not be used.
Here is the example in the docs.
articles = client.collections.get("Article")

with articles.batch.dynamic() as batch:
    for item in your_data_list:
        batch.add_object(
            properties={  # pass the properties of your objects
                "title": item["title"],
                "body": item["body"],
                "author": item["author"],
            },
            vector={  # together with the vector
                "content_vector": item["vector"],  # `content_vector` is the name of the vector space
            },
        )
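For later imports, where some objects may arrive without a vector, a small helper can build the arguments for batch.add_object. This is only a sketch: build_add_object_kwargs is a hypothetical name, and I'm assuming each item is a dict with title, body, author and an optional vector key, as in the loop above.

```python
from typing import Any, Dict

VECTOR_NAME = "content_vector"  # must match the named vector in the collection config


def build_add_object_kwargs(item: Dict[str, Any]) -> Dict[str, Any]:
    """Build keyword arguments for batch.add_object from one item (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "properties": {
            "title": item["title"],
            "body": item["body"],
            "author": item["author"],
        }
    }
    # Only pass a vector when the item has one; otherwise leave it out so
    # the configured vectorizer can generate it from source_properties.
    vector = item.get("vector")
    if vector is not None:
        kwargs["vector"] = {VECTOR_NAME: vector}
    return kwargs
```

Inside the batch loop you would then call batch.add_object(**build_add_object_kwargs(item)), so the same code path covers both the initial load and later inserts without vectors.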
Query
Then you can run a query on your collection, and Weaviate will generate a vector embedding from the provided query.
Note that the query is not affected by source_properties.
articles = client.collections.get("Article")

response = articles.query.near_text(
    query="a sweet German white wine",
    limit=2,
)
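To read the results, you can loop over response.objects. Here is a minimal sketch of that, wrapped in a hypothetical helper called search_titles. It assumes the collection has a title property and passes target_vector explicitly, which, as far as I know, some Weaviate versions require when a collection uses named vectors; with a single named vector you may be able to drop it.

```python
from typing import Any, List


def search_titles(collection: Any, query: str, limit: int = 2) -> List[str]:
    """Run a near_text query and return the titles of the matches (hypothetical helper)."""
    response = collection.query.near_text(
        query=query,
        limit=limit,
        target_vector="content_vector",  # the named vector to search in
    )
    # Each returned object exposes its stored properties as a dict.
    return [obj.properties["title"] for obj in response.objects]
```

You would call it as search_titles(articles, "a sweet German white wine").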