How to ingest PDFs into Weaviate and perform RAG

SergioEanX · July 25, 2024, 9:44am

I’m trying to ingest data into Weaviate: a mix of text data and other formats like PDF, which I convert to text batches using “unstructured”.
I’m basically following what is reported in “ingesting PDF”, but I suppose I’m missing something and/or doing something wrong.
If I query the data as follows:

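from weaviate.classes.query import MetadataQuery

response = coll.query.bm25(
    query="metal oxide",
    limit=2,
    # BM25 is a keyword search, so request its relevance score rather than a vector distance
    return_metadata=MetadataQuery(score=True),
)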

I get a result, while using:

res = coll.generate.near_text(
    query="metal oxide",
    limit=2,
    # return_metadata=MetadataQuery(distance=True),
    single_prompt="Summarize {coll_name}, use a maximum of 20 words."
)

I get nothing.
I would like to perform semantic search + RAG on the property “files” (DataType.TEXT_ARRAY), which contains the text batches extracted using partition_pdf from unstructured.
Schema is the following:

DudaNogueira · July 25, 2024, 7:33pm

hi @SergioEanX !!

Welcome to our community

Check this recipe as it shows how to use Langchain to ingest some pdfs:

https://github.com/weaviate/recipes/tree/main/integrations/langchain/loading-data

While you may not use Langchain entirely, it will give you some hints on how to use unstructured. That recipe specifically doesn’t use unstructured, but there are a lot of docs covering this, like here:

Also, you can load not just a single PDF but an entire folder of content, like in here:

Let me know if this helps.
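
In case a concrete starting point is useful, here is a minimal sketch of one way to go from partition_pdf output to a generative query with the v4 Python client. It is not taken from the recipe above and makes several assumptions: the collection name "PdfChunk", the property names "text", "source" and "chunk_index", a local Weaviate instance, and the OpenAI vectorizer and generative modules are all illustrative choices. It stores one object per extracted element instead of a TEXT_ARRAY, and the single_prompt placeholder refers to a property name ("text" here), since placeholders in curly braces are filled from object properties.

```python
import os
from typing import Iterable


def build_objects(texts: Iterable[str], source: str) -> list[dict]:
    """Turn extracted text pieces into one Weaviate object per non-empty piece."""
    objects = []
    for i, text in enumerate(texts):
        cleaned = " ".join(text.split())  # collapse whitespace left over from PDF extraction
        if cleaned:
            # "text", "source" and "chunk_index" are hypothetical property names
            objects.append({"text": cleaned, "source": source, "chunk_index": i})
    return objects


def ingest_and_summarize(path: str, collection_name: str = "PdfChunk") -> None:
    """Partition a PDF, insert its elements, then run a generative near_text query."""
    # Imported lazily so build_objects can be used without these packages installed.
    import weaviate
    from weaviate.classes.config import Configure, DataType, Property
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=path)
    objects = build_objects((el.text for el in elements), source=path)

    # Assumes a local instance with the OpenAI modules enabled and OPENAI_API_KEY set.
    client = weaviate.connect_to_local(
        headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]}
    )
    try:
        if not client.collections.exists(collection_name):
            client.collections.create(
                name=collection_name,
                properties=[
                    Property(name="text", data_type=DataType.TEXT),
                    Property(name="source", data_type=DataType.TEXT),
                    Property(name="chunk_index", data_type=DataType.INT),
                ],
                vectorizer_config=Configure.Vectorizer.text2vec_openai(),
                generative_config=Configure.Generative.openai(),
            )
        coll = client.collections.get(collection_name)
        coll.data.insert_many(objects)

        res = coll.generate.near_text(
            query="metal oxide",
            limit=2,
            # {text} must match a property name of the stored objects
            single_prompt="Summarize {text}, use a maximum of 20 words.",
        )
        for obj in res.objects:
            print(obj.generated)
    finally:
        client.close()
```

Calling ingest_and_summarize("my_file.pdf") (a hypothetical path) would print one generated summary per returned object. Note that near_text and generate need a vectorizer and a generative module configured on the collection, whereas bm25 works without them.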

Also, check this Academy course we have on chunking, as this is not a “one size fits all”, and some changes can be done for each use case to improve the overall quality of your results:
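
As a rough illustration of the kind of choice involved in chunking, here is a small sketch of fixed-size chunking with overlap, counted in words. The function name and the default sizes are arbitrary assumptions, not recommendations from the course; other strategies (by paragraph, by heading, by sentence) may suit your PDFs better.

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to chunk_size words, sharing overlap words between neighbours."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Each resulting chunk could then be stored as its own object, so that a search returns the most relevant passage rather than a whole document.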

Let me know if this helps!

Thanks!
