How to vectorize PDF content for semantic search?

Umesh_Narayanan · September 25, 2025, 9:28am

Mohamed_Shahin · September 25, 2025, 10:26am

Great to have you with us in the community!

Here’s a blog post from the Weaviate team that explains how to vectorize PDFs and walks through different chunking strategies. Once you’ve gone through it, you can choose the approach that works best for you:

Wishing you a great week.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)
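
Fixed-size chunking with overlap is one widely used strategy of the kind mentioned above. The following is a minimal sketch in plain Python, not necessarily the approach the blog post recommends. The function name `chunk_text` and the default sizes are assumptions chosen for illustration, and the sketch assumes the PDF text has already been extracted into a string.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made up only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, at the cost of storing some text twice.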

Umesh_Narayanan · September 28, 2025, 1:53pm

Hello @Mohamed_Shahin

Thanks for the details.
As an initial step toward building the vector DB in Weaviate, the challenge I am facing is finding the right API to create the collection and embed the chunks.
Below is the simple code, which is still failing after I applied the multiple fixes suggested by the exceptions.
~~~
import os

import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.init import Auth


# Client instance
def get_client():
    weaviate_url = os.environ["WEAVIATE_URL"]
    weaviate_api_key = os.environ["WEAVIATE_API_KEY"]
    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=weaviate_url,
        auth_credentials=Auth.api_key(weaviate_api_key),
    )
    return client

# This creates the client.


# Collection creation
def create_collection():
    client = get_client()
    # Drop any existing collection so it is recreated from scratch
    if client.collections.exists("claim_collection"):
        client.collections.delete("claim_collection")
    client.collections.create(
        name="claim_collection",
        properties=[
            Property(name="chunk", data_type=DataType.TEXT),
            Property(name="page", data_type=DataType.INT),
            Property(name="source", data_type=DataType.TEXT),
        ],
        vector_config=Configure.Vectors.text2vec_openai(),
    )
    client.close()


# Embed data
# chunk_data holds the chunked PDF documents (its creation is not shown in this post)
client = get_client()
collection = client.collections.get("claim_collection")
for i, doc in enumerate(chunk_data, 1):
    collection.data.insert({
        "chunk": doc.page_content,
        "page": doc.metadata.get("page", i),  # if page info exists, else use i
        "source": "health_data.pdf",
    })
client.close()

# This time I am getting the error below:
# 'vectorize target vector default: update vector: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY'}]}
~~~

The challenge here is that I am not able to find the correct documentation for building a vector database using the collection API parameters. As far as I can see, the schema structure has already been removed in version 4.16.10, so can you please help me understand the issue and the right solution?

It would be much appreciated if your knowledge base were updated with the latest API changes.

Thanks

Mohamed_Shahin · September 29, 2025, 9:44am

Good morning @Umesh_Narayanan

You’ll need to provide an API key for the vectorizer (here, an OpenAI key) so that objects can be vectorized.

Example:

import weaviate
from weaviate.classes.init import Auth
import os

weaviate_url = os.environ["WEAVIATE_URL"]
weaviate_api_key = os.environ["WEAVIATE_API_KEY"]
openai_api_key = os.environ["OPENAI_APIKEY"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,  # Your Weaviate Cloud URL
    auth_credentials=Auth.api_key(weaviate_api_key),  # Your Weaviate Cloud key
    headers={"X-OpenAI-Api-key": openai_api_key}  # Your OpenAI API key
)
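
With the OpenAI key passed in the headers, the insert loop from the question and a follow-up semantic search could be sketched as below. This is a sketch under stated assumptions, not an official recipe: `build_objects`, `ingest`, and `search` are hypothetical helper names, chunks are assumed to be `(page, text)` pairs, and `collection` is assumed to be the object returned by `client.collections.get("claim_collection")`. The helpers rely on the v4 Python client's `collection.data.insert_many` and `collection.query.near_text`.

```python
def build_objects(chunks, source):
    """Turn (page, text) pairs into property dicts matching the collection's properties."""
    return [
        {"chunk": text, "page": page, "source": source}
        for page, text in chunks
        if text.strip()  # skip empty chunks; there is nothing to vectorize
    ]


def ingest(collection, objects):
    """Insert all objects in one request; vectorization happens on the Weaviate side."""
    if not objects:
        return 0
    result = collection.data.insert_many(objects)
    if result.has_errors:
        raise RuntimeError(f"Some objects failed to insert: {result.errors}")
    return len(objects)


def search(collection, query, limit=3):
    """Run a near_text (semantic) query and return the text of the matching chunks."""
    response = collection.query.near_text(query=query, limit=limit)
    return [obj.properties["chunk"] for obj in response.objects]
```

A typical call sequence would be `ingest(collection, build_objects(chunks, "health_data.pdf"))` followed by `search(collection, "some question about the document")`. Both need a live cluster and a valid OpenAI key, because the query text is vectorized by the same module as the stored chunks.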
