How to perform semantic search on PDF files converted to text

I have taken multiple PDF files, converted each PDF into text, and stored each one in its own schema. Now I have to perform semantic search on them. How should I do that?

Hi @john_wick123! Welcome to our community! :hugs:

You can only search one collection at a time.

With that said, you can do a near_text search or a hybrid search.

Here is how you can do a similarity search with near_text:

from weaviate.classes.query import MetadataQuery

reviews = client.collections.get("WineReviewNV")
response = reviews.query.near_text(
    query="a sweet German white wine",
    limit=2,
    target_vector="title_country",  # Specify the target vector for named vector collections
    return_metadata=MetadataQuery(distance=True)
)
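
If it helps, here is one way to read those results back (assuming your objects have a title property; adjust the property names to your own schema):

for o in response.objects:
    # print the matched property and how close it was to the query
    print(o.properties["title"], o.metadata.distance)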

And here is how you can do a hybrid search:

reviews = client.collections.get("WineReviewNV")
response = reviews.query.hybrid(
    query="A French Riesling", target_vector="title_country", limit=3
)
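
You can also pass the alpha parameter to balance the keyword (BM25) and vector sides of a hybrid search; a minimal sketch (0 is pure keyword, 1 is pure vector):

response = reviews.query.hybrid(
    query="A French Riesling",
    target_vector="title_country",
    alpha=0.5,  # 0 = pure keyword (BM25), 1 = pure vector search
    limit=3,
)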

We also have a repository with some recipes that can help you with more examples:

Let me know if this helps!

Thanks!

import weaviate
import json
import os
import PyPDF2
import torch
from transformers import AutoTokenizer, AutoModel

# Connecting locally
client = weaviate.Client(
    url="http://localhost:8080",
    timeout_config=(30, 600),
)

# Path of pdf files
path = r"C:\Users\User\Desktop\Weaviate\PDF_Files"

# Collect every PDF file in the directory into a list
pdf_files = []
for file in os.listdir(path):
    if file.endswith(".pdf"):
        pdf_files.append(os.path.join(path, file))
print(pdf_files)

# Models
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Convert text to a vector by mean-pooling BERT's last hidden state
def get_text_vector(text):
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    vector = torch.mean(outputs.last_hidden_state, dim=1).squeeze(0).numpy().tolist()
    return vector


# Function to create a class for one PDF file
def create_schema(class_name):
    schema = {
        "class": class_name,
        "vectorizer": "text2vec-transformers",
        "moduleConfig": {
            "vectorizeClassName": False,
            "inferenceUrl": "http://t2v-transformers:8081",
        },
        "properties": [
            # Text
            {
                "name": "text",
                "dataType": ["text"],
                "indexSearchable": True,
            },
            # Source
            {
                "name": "source",
                "dataType": ["text"],
            },
            # Page number
            {
                "name": "page_no",
                "dataType": ["int"],
            },
            # Vector
            {
                "name": "text_vector",
                "dataType": ["number[]"],
                "vectorIndexType": "hnsw",
            },
        ],
    }
    return schema

# Function to import data into Weaviate
def import_data(pdf_files):
    for source in pdf_files:
        # Extracting class name from file name
        class_name = os.path.splitext(os.path.basename(source))[0]

        # Creating schema for the individual class
        schema = create_schema(class_name)
        client.schema.create_class(schema)
        print(f"Schema for class '{class_name}' created.")

        # Extracting text from each PDF and importing data locally 
        read = PyPDF2.PdfReader(source)
        num_pages = len(read.pages)
        all_data = []
        client.batch.configure(batch_size=20) 
        with client.batch as batch:
            for i in range(num_pages):
                text = read.pages[i].extract_text()
                page_no = i + 1
                text_vector = get_text_vector(text)
                data = {
                    "text": text,
                    "page_no": page_no,
                    "source": os.path.basename(source),
                    "text_vector":text_vector,
                }
                batch.add_data_object(data, class_name)
                all_data.append(data)
                print(json.dumps(data, indent=2)) 
        output_file = os.path.splitext(source)[0] + '.json'
        with open(output_file, 'w') as json_file:
            json.dump(all_data, json_file, indent=2)    

# Import data into Weaviate
import_data(pdf_files)


This is my code for importing multiple PDF files, converting each PDF into text, and turning each text into vectors. Now I have to perform a query, for example "When was Messi born?". What search should I use? I am using hybrid search, but I am not getting accurate results.

Hi!

So first, you are adding each PDF to its own collection. Not sure this is what you really want, but be aware that you will only be able to search a single PDF file at a time.
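
If you want to search across all PDFs at once, a common alternative is one shared collection for every page, with a source property you can filter on. A minimal sketch with the v3 client you are using (the class name, file name, and filter values here are assumptions, adjust them to your data):

# Search one shared collection, optionally restricted to a single PDF
# via a `where` filter on the `source` property.
response = (
    client.query.get("PdfPage", ["text", "page_no", "source"])
    .with_near_vector({"vector": get_text_vector("When was Messi born?")})
    .with_where({
        "path": ["source"],
        "operator": "Equal",
        "valueText": "messi_biography.pdf",  # hypothetical file name
    })
    .with_limit(3)
    .do()
)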

Second, you are defining a vectorizer (text2vec-transformers) while passing your own vectors. The vectors you ingest, generated using get_text_vector, must be the same ones that come out of http://t2v-transformers:8081, otherwise the vectors you stored and the query vectors Weaviate generates at search time will live in different vector spaces and will not match.
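
One way to keep them consistent is to fetch the embedding from the inference container itself instead of running your own BERT model; a minimal sketch, assuming the container's /vectors endpoint is reachable at that host/port from your script:

import requests

def get_text_vector(text):
    # Ask the same transformers inference container that Weaviate uses,
    # so stored vectors and query vectors come from the same model.
    resp = requests.post(
        "http://localhost:8081/vectors",  # assumed host/port mapping
        json={"text": text},
    )
    resp.raise_for_status()
    return resp.json()["vector"]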

Third, you are adding your vectors as a property, not as the object's vector.

This is the correct way, considering your code:

data = {
    "text": text,
    "page_no": page_no,
    "source": os.path.basename(source),
}
batch.add_data_object(data, class_name, vector=text_vector)
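
Once the vectors are ingested that way, you can query with the same embedding function via near_vector; a minimal sketch with the v3 client (the class name is hypothetical, use whatever class your PDF ended up in):

query_vector = get_text_vector("When was Messi born?")
response = (
    client.query.get("MessiBiography", ["text", "page_no"])  # hypothetical class name
    .with_near_vector({"vector": query_vector})
    .with_limit(3)
    .do()
)
print(json.dumps(response, indent=2))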

Here is the link for that:

Let me know if this helps.

Thanks!