I’m trying to upload a PDF so the LLM can use it as retrieval context. I’m running Weaviate from a docker-compose file, with Verba and Ollama (Llama 3) running locally, and I’ve already imported multiple PDFs successfully.
I’m on a MacBook Pro running macOS Ventura 13.17.1.
The import loses its connection on some PDFs; some are large, others are only a few KB in size.
Is there a max file size for importing files?
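For context, my stack follows the usual Verba + Weaviate docker-compose pattern, with Ollama running natively on the Mac rather than inside Docker. The sketch below is illustrative rather than my exact file: the image tags and the `WEAVIATE_URL_VERBA` variable are assumptions based on the sample compose file in the Verba repo, while `OLLAMA_URL` matches the variable named in the config dump further down.

```yaml
# Illustrative sketch only, not my exact compose file.
services:
  verba:
    image: semitechnologies/verba            # Verba web UI / API
    ports:
      - "8000:8000"
    environment:
      - WEAVIATE_URL_VERBA=http://weaviate:8080        # the Weaviate container below
      - OLLAMA_URL=http://host.docker.internal:11434   # Ollama on the Mac host (Docker Desktop)
    depends_on:
      - weaviate

  weaviate:
    image: semitechnologies/weaviate:1.25.10           # placeholder tag
    ports:
      - "8080:8080"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      CLUSTER_HOSTNAME: "node1"
    volumes:
      - weaviate_data:/var/lib/weaviate

volumes:
  weaviate_data: {}
```

In the debug output below, the import fails at the EMBEDDING step with "Connection was interrupted", and Ollama (snowflake-arctic-embed) is the selected embedder.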
Server Setup Information
- Weaviate Server Version:
- Deployment Method:
- Multi Node? Number of Running Nodes:
- Client Language and Version:
- Multitenancy?:
Any additional Information
Error message:
Debugging File Configuration
{
"fileID": "cpt7-nasm-candidate-handbook.pdf",
"filename": "cpt7-nasm-candidate-handbook.pdf",
"extension": "pdf",
"status_report": {
"STARTING": {
"fileID": "cpt7-nasm-candidate-handbook.pdf",
"status": "STARTING",
"message": "Starting Import",
"took": 0
},
"LOADING": {
"fileID": "cpt7-nasm-candidate-handbook.pdf",
"status": "LOADING",
"message": "Loaded cpt7-nasm-candidate-handbook.pdf",
"took": 3.97
},
"CHUNKING": {
"fileID": "cpt7-nasm-candidate-handbook.pdf",
"status": "CHUNKING",
"message": "Split cpt7-nasm-candidate-handbook.pdf into 96 chunks",
"took": 0.04
},
"EMBEDDING": {
"fileID": "cpt7-nasm-candidate-handbook.pdf",
"status": "EMBEDDING",
"message": "",
"took": 0
},
"ERROR": {
"fileID": "cpt7-nasm-candidate-handbook.pdf",
"status": "ERROR",
"message": "Connection was interrupted",
"took": 0
}
},
"source": "",
"isURL": false,
"metadata": "",
"overwrite": false,
"content": "File Content",
"labels": [
"Document"
],
"rag_config": {
"Reader": {
"selected": "Default",
"components": {
"Default": {
"name": "Default",
"variables": [],
"library": [
"pypdf",
"docx",
"spacy"
],
"description": "Ingests text, code, PDF, and DOCX files",
"config": {},
"type": "FILE",
"available": true
},
"HTML": {
"name": "HTML",
"variables": [],
"library": [
"markdownify",
"beautifulsoup4"
],
"description": "Downloads and ingests HTML from a URL, with optional recursive fetching.",
"config": {
"URLs": {
"type": "multi",
"value": "",
"description": "Add URLs to retrieve data from",
"values": []
},
"Convert To Markdown": {
"type": "bool",
"value": 0,
"description": "Should the HTML be converted into markdown?",
"values": []
},
"Recursive": {
"type": "bool",
"value": 0,
"description": "Fetch linked pages recursively",
"values": []
},
"Max Depth": {
"type": "number",
"value": 3,
"description": "Maximum depth for recursive fetching",
"values": []
}
},
"type": "URL",
"available": false
},
"Git": {
"name": "Git",
"variables": [],
"library": [],
"description": "Downloads and ingests all files from a GitHub or GitLab Repo.",
"config": {
"Platform": {
"type": "dropdown",
"value": "GitHub",
"description": "Select the Git platform",
"values": [
"GitHub",
"GitLab"
]
},
"Owner": {
"type": "text",
"value": "",
"description": "Enter the repo owner (GitHub) or group/user (GitLab)",
"values": []
},
"Name": {
"type": "text",
"value": "",
"description": "Enter the repo name",
"values": []
},
"Branch": {
"type": "text",
"value": "main",
"description": "Enter the branch name",
"values": []
},
"Path": {
"type": "text",
"value": "",
"description": "Enter the path or leave it empty to import all",
"values": []
},
"Git Token": {
"type": "password",
"value": "",
"description": "You can set your GitHub/GitLab Token here if you haven't set it up as environment variable `GITHUB_TOKEN` or `GITLAB_TOKEN`",
"values": []
}
},
"type": "URL",
"available": true
},
"Unstructured IO": {
"name": "Unstructured IO",
"variables": [
"UNSTRUCTURED_API_KEY"
],
"library": [],
"description": "Uses the Unstructured API to import multiple file types such as plain text and documents",
"config": {
"Strategy": {
"type": "dropdown",
"value": "auto",
"description": "Set the extraction strategy",
"values": [
"auto",
"hi_res",
"ocr_only",
"fast"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "Set your Unstructured API Key here or set it as an environment variable `UNSTRUCTURED_API_KEY`",
"values": []
},
"API URL": {
"type": "text",
"value": "https://api.unstructured.io/general/v0/general",
"description": "Set the base URL to the Unstructured API or set it as an environment variable `UNSTRUCTURED_API_URL`",
"values": []
}
},
"type": "FILE",
"available": false
},
"Firecrawl": {
"name": "Firecrawl",
"variables": [],
"library": [],
"description": "Use Firecrawl to scrape websites and ingest them into Verba",
"config": {
"Mode": {
"type": "dropdown",
"value": "Scrape",
"description": "Switch between scraping and crawling. Note that crawling can take some time.",
"values": [
"Crawl",
"Scrape"
]
},
"URLs": {
"type": "multi",
"value": "",
"description": "Add URLs to retrieve data from",
"values": []
},
"Firecrawl API Key": {
"type": "password",
"value": "",
"description": "You can set your Firecrawl API Key or set it as environment variable `FIRECRAWL_API_KEY`",
"values": []
}
},
"type": "URL",
"available": true
}
}
},
"Chunker": {
"selected": "Token",
"components": {
"Token": {
"name": "Token",
"variables": [],
"library": [],
"description": "Splits documents based on word tokens",
"config": {
"Tokens": {
"type": "number",
"value": 250,
"description": "Choose how many Token per chunks",
"values": []
},
"Overlap": {
"type": "number",
"value": 50,
"description": "Choose how many Tokens should overlap between chunks",
"values": []
}
},
"type": "",
"available": true
},
"Sentence": {
"name": "Sentence",
"variables": [],
"library": [],
"description": "Splits documents based on word tokens",
"config": {
"Sentences": {
"type": "number",
"value": 5,
"description": "Choose how many Sentences per chunks",
"values": []
},
"Overlap": {
"type": "number",
"value": 1,
"description": "Choose how many Sentences should overlap between chunks",
"values": []
}
},
"type": "",
"available": true
},
"Recursive": {
"name": "Recursive",
"variables": [],
"library": [
"langchain_text_splitters "
],
"description": "Recursively split documents based on predefined characters using LangChain",
"config": {
"Chunk Size": {
"type": "number",
"value": 500,
"description": "Choose how many characters per chunks",
"values": []
},
"Overlap": {
"type": "number",
"value": 100,
"description": "Choose how many characters per chunks",
"values": []
},
"Seperators": {
"type": "multi",
"value": "",
"description": "Select seperators to split the text",
"values": [
"\n\n",
"\n",
" ",
".",
",",
"",
",",
"、",
".",
"。",
""
]
}
},
"type": "",
"available": false
},
"Semantic": {
"name": "Semantic",
"variables": [],
"library": [
"sklearn"
],
"description": "Split documents based on semantic similarity or max sentences",
"config": {
"Breakpoint Percentile Threshold": {
"type": "number",
"value": 80,
"description": "Percentile Threshold to split and create a chunk, the lower the more chunks you get",
"values": []
},
"Max Sentences Per Chunk": {
"type": "number",
"value": 20,
"description": "Maximum number of sentences per chunk",
"values": []
}
},
"type": "",
"available": true
},
"HTML": {
"name": "HTML",
"variables": [],
"library": [
"langchain_text_splitters "
],
"description": "Split documents based on HTML tags using LangChain",
"config": {},
"type": "",
"available": false
},
"Markdown": {
"name": "Markdown",
"variables": [],
"library": [
"langchain_text_splitters"
],
"description": "Split documents based on markdown formatting using LangChain",
"config": {},
"type": "",
"available": true
},
"Code": {
"name": "Code",
"variables": [],
"library": [
"langchain_text_splitters "
],
"description": "Split code based on programming language using LangChain",
"config": {
"Language": {
"type": "dropdown",
"value": "python",
"description": "Select programming language",
"values": [
"cpp",
"go",
"java",
"kotlin",
"js",
"ts",
"php",
"proto",
"python",
"rst",
"ruby",
"rust",
"scala",
"swift",
"markdown",
"latex",
"html",
"sol",
"csharp",
"cobol",
"c",
"lua",
"perl",
"haskell",
"elixir"
]
},
"Chunk Size": {
"type": "number",
"value": 500,
"description": "Choose how many characters per chunk",
"values": []
},
"Chunk Overlap": {
"type": "number",
"value": 50,
"description": "Choose how many characters overlap between chunks",
"values": []
}
},
"type": "",
"available": false
},
"JSON": {
"name": "JSON",
"variables": [],
"library": [
"langchain_text_splitters "
],
"description": "Split json files using LangChain",
"config": {
"Chunk Size": {
"type": "number",
"value": 500,
"description": "Choose how many characters per chunks",
"values": []
}
},
"type": "",
"available": false
}
}
},
"Embedder": {
"selected": "Ollama",
"components": {
"Ollama": {
"name": "Ollama",
"variables": [],
"library": [],
"description": "Vectorizes documents and queries using Ollama. If your Ollama instance is not running on http://localhost:11434, you can change the URL by setting the OLLAMA_URL environment variable.",
"config": {
"Model": {
"type": "dropdown",
"value": "snowflake-arctic-embed:latest",
"description": "Select a installed Ollama model from http://localhost:11434. You can change the URL by setting the OLLAMA_URL environment variable. ",
"values": [
"snowflake-arctic-embed:latest",
"llama3:latest",
"codegemma:latest",
"llama3.2:latest"
]
}
},
"type": "",
"available": true
},
"SentenceTransformers": {
"name": "SentenceTransformers",
"variables": [],
"library": [
"sentence_transformers"
],
"description": "Embeds and retrieves objects using SentenceTransformer",
"config": {
"Model": {
"type": "dropdown",
"value": "all-MiniLM-L6-v2",
"description": "Select an HuggingFace Embedding Model",
"values": [
"all-MiniLM-L6-v2",
"mixedbread-ai/mxbai-embed-large-v1",
"all-mpnet-base-v2",
"BAAI/bge-m3",
"all-MiniLM-L12-v2",
"paraphrase-MiniLM-L6-v2"
]
}
},
"type": "",
"available": false
},
"Weaviate": {
"name": "Weaviate",
"variables": [],
"library": [],
"description": "Vectorizes documents and queries using Weaviate's In-House Embedding Service.",
"config": {
"Model": {
"type": "dropdown",
"value": "Embedding Service",
"description": "Select a Weaviate Embedding Service Model",
"values": [
"Embedding Service"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "Weaviate Embedding Service Key (or set EMBEDDING_SERVICE_KEY env var)",
"values": []
},
"URL": {
"type": "text",
"value": "",
"description": "Weaviate Embedding Service URL (if different from default)",
"values": []
}
},
"type": "",
"available": true
},
"VoyageAI": {
"name": "VoyageAI",
"variables": [],
"library": [],
"description": "Vectorizes documents and queries using VoyageAI",
"config": {
"Model": {
"type": "dropdown",
"value": "voyage-2",
"description": "Select a VoyageAI Embedding Model",
"values": [
"voyage-2",
"voyage-large-2",
"voyage-finance-2",
"voyage-multilingual-2",
"voyage-law-2",
"voyage-code-2"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "OpenAI API Key (or set OPENAI_API_KEY env var)",
"values": []
},
"URL": {
"type": "text",
"value": "https://api.voyageai.com/v1",
"description": "OpenAI API Base URL (if different from default)",
"values": []
}
},
"type": "",
"available": true
},
"Cohere": {
"name": "Cohere",
"variables": [],
"library": [],
"description": "Vectorizes documents and queries using Cohere",
"config": {
"Model": {
"type": "dropdown",
"value": "embed-english-v3.0",
"description": "Select a Cohere Embedding Model",
"values": [
"embed-english-v3.0",
"embed-multilingual-v3.0",
"embed-english-light-v3.0",
"embed-multilingual-light-v3.0"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "You can set your Cohere API Key here or set it as environment variable `COHERE_API_KEY`",
"values": []
}
},
"type": "",
"available": true
},
"OpenAI": {
"name": "OpenAI",
"variables": [],
"library": [],
"description": "Vectorizes documents and queries using OpenAI",
"config": {
"Model": {
"type": "dropdown",
"value": "text-embedding-3-small",
"description": "Select an OpenAI Embedding Model",
"values": [
"text-embedding-ada-002",
"text-embedding-3-small",
"text-embedding-3-large"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "OpenAI API Key (or set OPENAI_API_KEY env var)",
"values": []
},
"URL": {
"type": "text",
"value": "https://api.openai.com/v1",
"description": "OpenAI API Base URL (if different from default)",
"values": []
}
},
"type": "",
"available": true
}
}
},
"Retriever": {
"selected": "Advanced",
"components": {
"Advanced": {
"name": "Advanced",
"variables": [],
"library": [],
"description": "Retrieve relevant chunks from Weaviate",
"config": {
"Suggestion": {
"type": "bool",
"value": 1,
"description": "Enable Autocomplete Suggestions",
"values": []
},
"Search Mode": {
"type": "dropdown",
"value": "Hybrid Search",
"description": "Switch between search types.",
"values": [
"Hybrid Search"
]
},
"Limit Mode": {
"type": "dropdown",
"value": "Autocut",
"description": "Method for limiting the results. Autocut decides automatically how many chunks to retrieve, while fixed sets a fixed limit.",
"values": [
"Autocut",
"Fixed"
]
},
"Limit/Sensitivity": {
"type": "number",
"value": 1,
"description": "Value for limiting the results. Value controls Autocut sensitivity and Fixed Size",
"values": []
},
"Chunk Window": {
"type": "number",
"value": 1,
"description": "Number of surrounding chunks of retrieved chunks to add to context",
"values": []
},
"Threshold": {
"type": "number",
"value": 80,
"description": "Threshold of chunk score to apply window technique (1-100)",
"values": []
}
},
"type": "",
"available": true
}
}
},
"Generator": {
"selected": "Ollama",
"components": {
"Ollama": {
"name": "Ollama",
"variables": [],
"library": [],
"description": "Generate answers using Ollama. If your Ollama instance is not running on http://localhost:11434, you can change the URL by setting the OLLAMA_URL environment variable.",
"config": {
"System Message": {
"type": "text",
"value": "You are Fit T Centenarian, a chatbot for Retrieval Augmented Generation (RAG). You will receive a user query and context pieces that have a semantic similarity to that query. Please answer these user queries only with the provided context. Mention documents you used from the context if you use them to reduce hallucination. If the provided documentation does not provide enough information, say so. If the user asks questions about you as a chatbot specifially, answer them naturally. If the answer requires code examples encapsulate them with ```programming-language-name ```. Don't do pseudo-code.",
"description": "System Message",
"values": []
},
"Model": {
"type": "dropdown",
"value": "llama3.2:latest",
"description": "Select an installed Ollama model from http://localhost:11434.",
"values": [
"snowflake-arctic-embed:latest",
"llama3:latest",
"codegemma:latest",
"llama3.2:latest"
]
}
},
"type": "",
"available": true
},
"OpenAI": {
"name": "OpenAI",
"variables": [],
"library": [],
"description": "Using OpenAI LLM models to generate answers to queries",
"config": {
"System Message": {
"type": "text",
"value": "You are Verba, a chatbot for Retrieval Augmented Generation (RAG). You will receive a user query and context pieces that have a semantic similarity to that query. Please answer these user queries only with the provided context. Mention documents you used from the context if you use them to reduce hallucination. If the provided documentation does not provide enough information, say so. If the user asks questions about you as a chatbot specifially, answer them naturally. If the answer requires code examples encapsulate them with ```programming-language-name ```. Don't do pseudo-code.",
"description": "System Message",
"values": []
},
"Model": {
"type": "dropdown",
"value": "gpt-4o",
"description": "Select an OpenAI Embedding Model",
"values": [
"gpt-4o",
"gpt-3.5-turbo"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "You can set your OpenAI API Key here or set it as environment variable `OPENAI_API_KEY`",
"values": []
},
"URL": {
"type": "text",
"value": "https://api.openai.com/v1",
"description": "You can change the Base URL here if needed",
"values": []
}
},
"type": "",
"available": true
},
"Anthropic": {
"name": "Anthropic",
"variables": [],
"library": [],
"description": "Using Anthropic LLM models to generate answers to queries",
"config": {
"System Message": {
"type": "text",
"value": "You are Verba, a chatbot for Retrieval Augmented Generation (RAG). You will receive a user query and context pieces that have a semantic similarity to that query. Please answer these user queries only with the provided context. Mention documents you used from the context if you use them to reduce hallucination. If the provided documentation does not provide enough information, say so. If the user asks questions about you as a chatbot specifially, answer them naturally. If the answer requires code examples encapsulate them with ```programming-language-name ```. Don't do pseudo-code.",
"description": "System Message",
"values": []
},
"Model": {
"type": "dropdown",
"value": "claude-3-5-sonnet-20240620",
"description": "Select an Anthropic Model",
"values": [
"claude-3-5-sonnet-20240620"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "You can set your Anthropic API Key here or set it as environment variable `ANTHROPIC_API_KEY`",
"values": []
}
},
"type": "",
"available": true
},
"Cohere": {
"name": "Cohere",
"variables": [],
"library": [],
"description": "Generator using Cohere's command-r-plus model",
"config": {
"System Message": {
"type": "text",
"value": "You are Verba, a chatbot for Retrieval Augmented Generation (RAG). You will receive a user query and context pieces that have a semantic similarity to that query. Please answer these user queries only with the provided context. Mention documents you used from the context if you use them to reduce hallucination. If the provided documentation does not provide enough information, say so. If the user asks questions about you as a chatbot specifially, answer them naturally. If the answer requires code examples encapsulate them with ```programming-language-name ```. Don't do pseudo-code.",
"description": "System Message",
"values": []
},
"Model": {
"type": "dropdown",
"value": "embed-english-v3.0",
"description": "Select a Cohere Embedding Model",
"values": [
"embed-english-v3.0",
"embed-multilingual-v3.0",
"embed-english-light-v3.0",
"embed-multilingual-light-v3.0"
]
},
"API Key": {
"type": "password",
"value": "",
"description": "You can set your Cohere API Key here or set it as environment variable `COHERE_API_KEY`",
"values": []
}
},
"type": "",
"available": true
}
}
}
},
"file_size": 502504,`Preformatted text`
"status": "ERROR"
}