How to vectorize PDF content for semantic search?

Hi @Umesh_Narayanan

Great to have you with us in the community :hugs:

Here’s a blog post from our experts on the Weaviate team that explains how to vectorize PDFs and covers different chunking strategies. Once you’ve gone through it, you can choose the approach that works best for you:

Wishing you a great week.

Best regards,

Mohamed Shahin
Weaviate Support Engineer
(Ireland, UTC±00:00/+01:00)
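
As a rough illustration of the chunking step mentioned above (this is not taken from the blog post), here is a minimal sketch of fixed-size chunking with overlap. It assumes the PDF text has already been extracted into a plain string by some PDF library; the function name `chunk_text` and the default sizes are hypothetical and only there for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so we do not emit a trailing chunk that is pure overlap.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    # Tiny example: window of 4 characters, 1 character of overlap.
    print(chunk_text("abcdefghij", chunk_size=4, overlap=1))
```

The overlap keeps a sentence that straddles a chunk boundary partly visible in both chunks, which can help retrieval. Character-based splitting is the simplest option; splitting on sentences or paragraphs is a common alternative.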

Hello @Mohamed_Shahin

Thanks for the details.
As an initial step in building the vector DB in Weaviate, the challenge I am facing is finding the right API to create the collection and embed the chunks.
Below is the simple code base, which is still failing after multiple fixes suggested by the exceptions.
~~~
import os

import weaviate
from weaviate.classes.config import Configure, DataType, Property
from weaviate.classes.init import Auth


# Client instance: this creates the client
def get_client():
    weaviate_url = os.environ["WEAVIATE_URL"]
    weaviate_api_key = os.environ["WEAVIATE_API_KEY"]
    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=weaviate_url,
        auth_credentials=Auth.api_key(weaviate_api_key),
    )
    return client


# Collection creation
def create_collection():
    client = get_client()
    if client.collections.exists("claim_collection"):
        client.collections.delete("claim_collection")
    client.collections.create(
        name="claim_collection",
        properties=[
            Property(name="chunk", data_type=DataType.TEXT),
            Property(name="page", data_type=DataType.INT),
            Property(name="source", data_type=DataType.TEXT),
        ],
        vector_config=Configure.Vectors.text2vec_openai(),
    )
    client.close()


# Embed data (chunk_data holds the chunks produced from the PDF)
client = get_client()  # create_collection() closed its own client
collection = client.collections.get("claim_collection")
for i, doc in enumerate(chunk_data, 1):
    collection.data.insert({
        "chunk": doc.page_content,
        "page": doc.metadata.get("page", i),  # if page info exists, else use i
        "source": "health_data.pdf",
    })
client.close()
~~~

This time I am getting the error below:

‘vectorize target vector default: update vector: API Key: no api key found neither in request header: X-Openai-Api-Key nor in environment variable under OPENAI_APIKEY’}]}

The challenge here is that I am not able to find the correct documentation for building a vector database using the collection API parameters. As far as I can see, the schema structure has already been removed in version 4.16.10, so can you please help me understand the issue and the right solution?

It would be much appreciated if your knowledge base were updated with the latest API changes.

Thanks

Good morning @Umesh_Narayanan

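You’ll need to provide a vectorizer API key, such as an OpenAI key, so that objects can be vectorized. The error shows that Weaviate found no key in the X-Openai-Api-Key request header, so pass it in the headers when you connect.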

Example:

~~~
import os

import weaviate
from weaviate.classes.init import Auth

weaviate_url = os.environ["WEAVIATE_URL"]
weaviate_api_key = os.environ["WEAVIATE_API_KEY"]
openai_api_key = os.environ["OPENAI_APIKEY"]

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=weaviate_url,  # Your Weaviate Cloud URL
    auth_credentials=Auth.api_key(weaviate_api_key),  # Your Weaviate Cloud key
    headers={"X-OpenAI-Api-Key": openai_api_key},  # Your OpenAI API key
)
~~~
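
To tie this back to the insert loop from the question, here is a minimal sketch of one way to fail early when the OpenAI key is missing and then insert all chunks in one call. It assumes the `claim_collection` collection already exists with the `chunk`, `page` and `source` properties, and that the chunks are available as `(page, text)` pairs. The helper names `build_vectorizer_headers`, `chunks_to_objects` and `insert_chunks` are hypothetical, not part of the Weaviate client.

```python
import os


def build_vectorizer_headers(env=None):
    """Return the request headers carrying the OpenAI key, or raise if it is missing."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_APIKEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_APIKEY is not set; text2vec-openai cannot vectorize objects without it"
        )
    return {"X-OpenAI-Api-Key": key}


def chunks_to_objects(chunks, source):
    """Map (page, text) pairs onto the chunk/page/source properties of the collection."""
    return [{"chunk": text, "page": page, "source": source} for page, text in chunks]


def insert_chunks(chunks, source="health_data.pdf"):
    """Connect with the vectorizer header set and insert all chunks in one request."""
    # Imported lazily so the helpers above work without the client installed.
    import weaviate
    from weaviate.classes.init import Auth

    client = weaviate.connect_to_weaviate_cloud(
        cluster_url=os.environ["WEAVIATE_URL"],
        auth_credentials=Auth.api_key(os.environ["WEAVIATE_API_KEY"]),
        headers=build_vectorizer_headers(),  # raises before connecting if the key is absent
    )
    try:
        collection = client.collections.get("claim_collection")
        return collection.data.insert_many(chunks_to_objects(chunks, source))
    finally:
        client.close()  # always release the connection, even if the insert fails
```

Calling `insert_chunks([(1, "first chunk"), (2, "second chunk")])` would then send both objects together. Checking the key before connecting turns the server-side "no api key found" error into an immediate, local one.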