How to perform semantic search on PDF files that have been converted to text

I have taken multiple PDF files, converted them into text, and stored each one separately in the schema. Now I have to perform semantic search. How should I do that?

Hi @john_wick123! Welcome to our community! :hugs:

You can only search one collection at a time.

That said, you can do a near_text search or a hybrid search.

Here is how you can do a similarity search:

reviews = client.collections.get("WineReviewNV")
response = reviews.query.near_text(
    query="a sweet German white wine",
    target_vector="title_country",  # Specify the target vector for named vector collections
)

And here is how you can do a hybrid search:

reviews = client.collections.get("WineReviewNV")
response = reviews.query.hybrid(
    query="A French Riesling", target_vector="title_country", limit=3
)

We also have a repository with some recipes that can help you with more examples:

Let me know if this helps!


import weaviate
import json
import os
import PyPDF2
from PyPDF2 import PdfFileReader 
import torch
from transformers import AutoTokenizer, AutoModel

# Connecting locally
client = weaviate.Client(
    url="http://localhost:8080",  # assumption: default local Weaviate address
    timeout_config=(30, 600),
)

# Path of pdf files
path = r"C:\Users\User\Desktop\Weaviate\PDF_Files"

# Checking whether the files are present in directory if present then storing in list
pdf_files = []
for file in os.listdir(path):
    if file.endswith(".pdf"):
        pdf_files.append(os.path.join(path, file))

# Models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Function text to vector
def get_text_vector(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    vector = torch.mean(outputs.last_hidden_state, dim=1).squeeze(0).numpy().tolist()
    return vector

# Function to create a class
def create_schema(class_name):
    schema = {
        "class": class_name,
        "vectorizeClassName": False,
        "inferenceUrl": "http://t2v-transformers:8081",
        "properties": [
            # Text
            {"name": "text", "dataType": ["text"], "indexSearchable": True},
            # Source
            {"name": "source", "dataType": ["text"]},
        ],
        "vectorIndexType": "hnsw",
    }
    return schema

# Function to import data into Weaviate
def import_data(pdf_files):
    for source in pdf_files:
        # Extracting class name from file name
        class_name = os.path.splitext(os.path.basename(source))[0]

        # Creating schema for the individual class
        schema = create_schema(class_name)
        print(f"Schema for class '{class_name}' created.")

        # Extracting text from each PDF and importing data locally 
        read = PyPDF2.PdfReader(source)
        num_pages = len(read.pages)
        all_data = []
        with client.batch as batch:
            for i in range(num_pages):
                text = read.pages[i].extract_text()
                page_no = i + 1
                text_vector = get_text_vector(text)
                data = {
                    "text": text,
                    "page_no": page_no,
                    "source": os.path.basename(source),
                }
                batch.add_data_object(data, class_name)
                all_data.append(data)  # keep a copy for the JSON dump below
                print(json.dumps(data, indent=2)) 
        output_file = os.path.splitext(source)[0] + '.json'
        with open(output_file, 'w') as json_file:
            json.dump(all_data, json_file, indent=2)    

# Import data into Weaviate
import_data(pdf_files)

This is my code for importing multiple PDF files, converting each PDF into text, and turning each text into vectors. Now I have to run a query, for example "when was Messi born?". Which search should I use? I am using hybrid search, but I am not getting accurate results.


First, you are adding each PDF to its own collection. I'm not sure this is what you really want, but be aware that you will only be able to search a single PDF file at a time.
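
If you would rather search across all PDFs in one query, one option is to put every page into a single collection and keep the file name in the source property. Here is a minimal sketch of just the data-shaping step, with no Weaviate calls. The collection name "PdfPage", the helper build_page_objects and the sample input are made up for illustration:

```python
from typing import Dict, List

# Hypothetical name for the one collection that holds pages from every PDF.
COLLECTION_NAME = "PdfPage"


def build_page_objects(pages_by_file: Dict[str, List[str]]) -> List[dict]:
    """Flatten {file name: [page texts]} into objects for a single collection."""
    objects = []
    for source, pages in pages_by_file.items():
        for page_no, text in enumerate(pages, start=1):
            if not text or not text.strip():
                continue  # skip pages where no text could be extracted
            objects.append({"text": text, "page_no": page_no, "source": source})
    return objects


# Made-up sample input standing in for the text extracted from two PDFs.
pages_by_file = {
    "a.pdf": ["first page of a", "", "third page of a"],
    "b.pdf": ["first page of b"],
}
objects = build_page_objects(pages_by_file)
for obj in objects:
    print(COLLECTION_NAME, obj["source"], obj["page_no"])
```

Each object could then be added to that one collection inside your batch loop, and a filter on source should let you narrow a query to a single file when you need to.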

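Second, you are defining a vectorizer (text2vec-transformers) while also passing your own vectors. The vectors you generate with get_text_vector at import time must come from the same model that is served at http://t2v-transformers:8081, because Weaviate will use that vectorizer to embed your query. If the two models differ, the query vector and the stored vectors are not comparable and the results will be poor.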
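
To see why mixing models hurts, here is a toy sketch. toy_embed is a made-up stand-in that derives a deterministic vector from a hash; it is not a real embedding model and has nothing to do with Weaviate or BERT. It only shows that the same text embedded by two different "models" ends up in unrelated places:

```python
import hashlib
import math
from typing import List


def toy_embed(text: str, model_name: str, dim: int = 8) -> List[float]:
    """Toy stand-in for an embedding model: deterministic per (model_name, word)."""
    vec = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.sha256(f"{model_name}:{word}".encode()).digest()
        for i in range(dim):
            vec[i] += digest[i] / 255.0 - 0.5  # map each byte to [-0.5, 0.5]
    return vec


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


text = "when was messi born"
# Same "model" at import time and at query time: vectors line up.
same_model = cosine(toy_embed(text, "model-a"), toy_embed(text, "model-a"))
# Different "models": identical text, but the vectors no longer match.
mixed_models = cosine(toy_embed(text, "model-a"), toy_embed(text, "model-b"))
print(f"same model: {same_model:.3f}, mixed models: {mixed_models:.3f}")
```

With real models the effect is the same in spirit: a query embedded by one model cannot be meaningfully compared with documents embedded by another.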

Third, you are adding your vectors as a property, and not as vectors.

This is the correct way, considering your code:

data = {
    "text": text,
    "page_no": page_no,
    "source": os.path.basename(source),
}
batch.add_data_object(data, class_name, vector=text_vector)

Check here the link for that:

Let me know if this helps.