I am new to Weaviate and would appreciate some clarification. Here’s a bit of background about my situation:
I have been working with vector databases, specifically testing Pinecone for a personal project. In my project, I want to implement hybrid search, but with Pinecone, I face a challenge. To perform hybrid search there, I need to create a sparse vector beforehand. This requires access to the full corpus to create a featurizer for the sparse vector. The problem arises because I don’t have access to the entire corpus upfront — data is inserted incrementally as it becomes available.
I understand that Weaviate handles hybrid search differently. Specifically, it seems like I don’t need to explicitly create or insert a sparse vector during data insertion, unlike with other databases like Pinecone. I would like to confirm if this understanding is correct.
My Questions:
Data Insertion for Hybrid Search: Is it true that when using Weaviate, I don’t need to specify any additional arguments related to hybrid search during data insertion?
How Weaviate Implements Hybrid Search: Could you explain how Weaviate manages hybrid search without needing the sparse vector to be explicitly provided or created ahead of time?
When you create a collection in Weaviate, you can specify a vectorizer, like so:
import os
import weaviate
from weaviate import classes as wvc

wcd_url = os.environ["WCD_DEMO_URL"]
wcd_api_key = os.environ["WCD_DEMO_RO_KEY"]
openai_api_key = os.environ["OPENAI_APIKEY"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=wcd_url,  # Replace with your Weaviate Cloud URL
    auth_credentials=wvc.init.Auth.api_key(wcd_api_key),  # Replace with your Weaviate Cloud key
    headers={"X-OpenAI-Api-Key": openai_api_key},  # Replace with the appropriate header key/value pair for the required API
)

questions = client.collections.create(
    name="Question",
    vectorizer_config=wvc.config.Configure.Vectorizer.text2vec_openai(),  # If set to "none" you must always provide vectors yourself; any other "text2vec-*" module also works
    generative_config=wvc.config.Configure.Generative.openai(),  # Ensure the `generative-openai` module is used for generative queries
)
Now Weaviate has everything it needs to vectorize your data, both at ingestion and at query time, including the API key for the embedding service (in this example, OpenAI).
That alone will allow you to do hybrid search, either providing your own query vector or letting Weaviate vectorize the query for you.
For example, here we are doing a keyword search for “food” and passing a vector for the vector search part:
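A minimal sketch of what such a query could look like with the v4 Python client (the function name and parameter defaults are illustrative; `client` is assumed to be an open connection and `query_vector` an embedding you computed elsewhere):

```python
# Illustrative sketch: hybrid search on the keyword "food" while supplying
# our own vector for the dense (similarity) part of the query.

def hybrid_food_search(client, query_vector, alpha=0.5, limit=5):
    questions = client.collections.get("Question")
    return questions.query.hybrid(
        query="food",         # keyword (BM25) part of the hybrid query
        vector=query_vector,  # dense part; omit to let Weaviate vectorize "food"
        alpha=alpha,          # 0 = pure keyword, 1 = pure vector
        limit=limit,
    )
```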
My understanding is that to calculate sparse embeddings (such as TF-IDF or BM25), the vector database internally builds a featurizer from the ingested data. This could happen either periodically or every time new data is added, with the sparse embeddings for the corpus then being recalculated.
Question: Is this understanding correct? If so, how often are these recalculations performed — every time new data is ingested, or on a periodic basis? Also, can you point me to any documentation or code in the Git repo that would help me understand this better?
So when you ingest or update data, Weaviate vectorizes it and indexes it into the vector index (e.g., HNSW). That gives you the vector side of your search.
It also builds the necessary inverted index for BM25 search. Because BM25 scoring works off that inverted index rather than precomputed sparse embeddings, there is no corpus-wide featurizer to retrain; the index is simply updated as objects are inserted. With both indexes in place, you can run hybrid search, combining keyword and similarity search.
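To illustrate the ingestion side described above, here is a hedged sketch (assuming the "Question" collection was created with a text2vec vectorizer as shown earlier, and `client` is an open connection): objects are inserted with properties only, and Weaviate handles both the embedding and the inverted-index update.

```python
# Illustrative sketch: inserting data without pre-computed vectors.
# Weaviate embeds each object on ingest (via the configured vectorizer)
# and updates the BM25 inverted index automatically.

def insert_questions(client, rows):
    questions = client.collections.get("Question")
    with questions.batch.dynamic() as batch:
        for row in rows:
            batch.add_object(properties=row)  # no `vector=` needed
```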
One resource I really like is our Python recipes, as they give you reproducible examples of using Weaviate directly and with other integrations:
I have a question related to hybrid search. We are testing Weaviate locally, doing the embedding outside the database and passing the vector in to insert and query data. Because of this, if I fill the `query` parameter in the following call, it raises an error: the database can't embed the query, since no vectorizer is configured. So I don't know how to pass the "keyword" part to the hybrid search.
results = self.collection.query.hybrid(
    query=None,  # Can't pass a str: it fails when the DB tries to embed it, since no vectorizer is configured
    vector=vector,
    return_metadata=weaviate.classes.query.MetadataQuery(distance=True, score=True, explain_score=True),
    alpha=alpha,
    limit=top_k,
)
After reading the forum, I understand that the sparse vector is computed inside the database. Is any configuration needed? How can we pass the text so the database obtains the sparse vector for the hybrid search?
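As I understand Weaviate's hybrid search, when you pass both a query string and an explicit vector, the string feeds only the keyword (BM25) side and is never embedded server-side, so no vectorizer module is required. A hedged sketch of that pattern (the function name is illustrative; `collection` is your collection handle and `vector` your externally computed embedding):

```python
# Illustrative sketch: hybrid search with external embeddings and no
# server-side vectorizer configured.

def hybrid_with_own_vector(collection, text, vector, alpha=0.5, top_k=5):
    # `text` feeds only the keyword (BM25) side here, because a dense
    # vector is supplied explicitly; the server never embeds `text`.
    return collection.query.hybrid(
        query=text,
        vector=vector,
        alpha=alpha,
        limit=top_k,
    )
```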