Knowledge Universe API — populate Weaviate with scored, multi-source knowledge in one call

Hey Weaviate community,

I built Knowledge Universe API and wanted to share a pattern that might be useful for anyone building RAG pipelines on Weaviate.

The problem it solves: getting fresh, structured knowledge into your Weaviate collection without writing individual crawlers for every source.

One API call retrieves from arXiv, GitHub, Wikipedia, StackOverflow, HuggingFace, Semantic Scholar and 8 more official sources simultaneously. Every result is scored across 5 dimensions (content quality, freshness, pedagogical fit, trust, social proof) before it reaches you.

The Weaviate integration:

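import requests
import weaviate  # v3 Python client (weaviate.Client / client.batch API)
from weaviate.util import generate_uuid5

WEAVIATE_URL = 'http://localhost:8080'  # assumption: replace with your own instance URL

# 1. Get scored + embedded knowledge
response = requests.post(
    'YOUR_API_URL/v1/discover',
    json={
        'topic': 'vector search optimization',
        'output_format': 'embeddings'  # returns 384-dim vectors
    },
    timeout=30,
)
response.raise_for_status()  # stop here on HTTP errors

# 2. Upsert directly into Weaviate, supplying the returned vectors
client = weaviate.Client(WEAVIATE_URL)
with client.batch as batch:
    for item in response.json()['embeddings']:
        batch.add_data_object(
            data_object={
                'title': item['title'],
                'url': item['url'],
                'platform': item['source_platform'],
                'quality_score': item.get('quality_score', 0),
            },
            class_name='KnowledgeSource',
            uuid=generate_uuid5(item['url']),  # same URL -> same ID, so re-runs overwrite
            vector=item['vector']
        )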

You get pre-scored, multi-source knowledge in your vector store. Quality score is stored as metadata so you can filter by it at query time.
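For that filter to behave predictably, it can help to declare the class up front instead of relying on auto-schema. The following is a minimal sketch, assuming the v3 Python client used above. The property types and the helper name `ensure_knowledge_source_class` are illustrative assumptions, not something the API prescribes.

```python
# Sketch of an explicit class definition for the objects inserted above.
# Assumes the weaviate-client v3 schema API (client.schema.get / create_class).
KNOWLEDGE_SOURCE_CLASS = {
    'class': 'KnowledgeSource',
    'vectorizer': 'none',  # vectors come from the API, so Weaviate should not embed
    'properties': [
        {'name': 'title', 'dataType': ['text']},
        {'name': 'url', 'dataType': ['text']},
        {'name': 'platform', 'dataType': ['text']},
        # Declared as a number so it can be range-filtered at query time.
        {'name': 'quality_score', 'dataType': ['number']},
    ],
}


def ensure_knowledge_source_class(client):
    """Create the class if it is missing; `client` is a weaviate.Client (v3)."""
    existing = {c['class'] for c in client.schema.get().get('classes', [])}
    if KNOWLEDGE_SOURCE_CLASS['class'] not in existing:
        client.schema.create_class(KNOWLEDGE_SOURCE_CLASS)
```

Calling `ensure_knowledge_source_class(client)` once before the batch import is enough; it does nothing if the class already exists.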

Open source, MIT licensed: VLSiddarth/Knowledge-Universe on GitHub ("Find the best knowledge sources across the internet. For learning, research, and AI. 🌌")
Free tier: 100 calls/month, no credit card.

Two questions for the community:

  1. Does the embedding schema above work well for your Weaviate setup, or would a different metadata structure be more useful?
  2. Are there knowledge sources you wish were covered that aren’t in the list above?

Happy to adjust the output format based on what actually fits Weaviate workflows.

Update: Knowledge Universe is now live in production.

Since the original post, we’ve shipped several significant
updates worth sharing with this community.


What’s new:

Live API (free tier, no credit card):

Python SDK:
pip install knowledge-universe

npm package:
npm install knowledge-universe


Performance on real queries:

Cold query across 14 sources: 3.1 seconds
Redis cache hit: 220ms
Faster than Tavily on cold queries (Tavily: 5.4s)


The Weaviate integration now works end-to-end:

import requests
import weaviate  # v3 Python client (weaviate.Client / client.batch API)
from weaviate.util import generate_uuid5

WEAVIATE_URL = 'http://localhost:8080'  # assumption: replace with your own instance URL

# 1. Get scored + embedded knowledge (384-dim vectors)
response = requests.post(
    'https://huggingface.co/spaces/vlsiddarth/Knowledge-Universe/v1/discover',
    headers={'X-API-Key': 'your_key'},
    json={
        'topic': 'vector search optimization',
        'output_format': 'embeddings'
    },
    timeout=30,
)
response.raise_for_status()  # stop here on HTTP errors

# 2. Upsert directly into Weaviate with quality metadata
client = weaviate.Client(WEAVIATE_URL)
with client.batch as batch:
    for item in response.json()['embeddings']:
        batch.add_data_object(
            data_object={
                'title':         item['title'],
                'url':           item['url'],
                'platform':      item['source_platform'],
                'quality_score': item.get('quality_score', 0),
                'freshness':     item.get('freshness', 0.5),
                'decay_label':   item.get('decay_label', 'unknown'),
            },
            class_name='KnowledgeSource',
            uuid=generate_uuid5(item['url']),  # same URL -> same ID, so re-runs overwrite
            vector=item['vector']
        )

The freshness score in metadata lets you filter at
query time: retrieve only sources above a freshness
threshold, so stale material never reaches your LLM.
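A rough sketch of that query-time filter, again assuming the v3 Python client. The helper names and the 0.6 default threshold are assumptions for illustration. The query vector has to come from the same 384-dim embedding model as the stored vectors.

```python
def freshness_filter(min_freshness, min_quality=None):
    """Build a Weaviate `where` filter on the stored metadata fields."""
    fresh = {
        'path': ['freshness'],
        'operator': 'GreaterThanEqual',
        'valueNumber': min_freshness,
    }
    if min_quality is None:
        return fresh
    quality = {
        'path': ['quality_score'],
        'operator': 'GreaterThanEqual',
        'valueNumber': min_quality,
    }
    # Both conditions must hold for an object to be returned.
    return {'operator': 'And', 'operands': [fresh, quality]}


def search_fresh(client, query_vector, min_freshness=0.6, limit=5):
    """Vector search restricted to sufficiently fresh sources (v3 client)."""
    return (
        client.query
        .get('KnowledgeSource', ['title', 'url', 'platform', 'freshness'])
        .with_near_vector({'vector': query_vector})
        .with_where(freshness_filter(min_freshness))
        .with_limit(limit)
        .do()
    )
```

Passing `min_quality` as well combines both scores in a single `And` filter, so the gating happens inside Weaviate rather than in application code.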


New feature: Coverage Confidence Score

Every query now returns a confidence signal:

"coverage_intelligence": {
  "confidence": 0.72,
  "confidence_label": "high",
  "coverage_warning": false
}

When confidence is low, the API suggests better queries
that your agent can retry automatically. This pairs well
with Weaviate’s hybrid search use cases: if the
initial retrieval doesn’t match intent, the agent
knows to reformulate before storing anything.
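A minimal sketch of that retry loop on the agent side. The `coverage_intelligence` and `confidence` keys come from the response shown above; the `suggested_queries` key is a hypothetical name for wherever the suggestions are returned, and `fetch` stands in for whatever function calls the API and returns the parsed JSON.

```python
def discover_with_retry(fetch, topic, min_confidence=0.5, max_attempts=3):
    """Call `fetch(topic)` until coverage confidence is acceptable.

    `fetch` is any callable that returns the parsed JSON response as a dict.
    Returns a tuple (topic_used, response).
    """
    response = fetch(topic)
    for _ in range(max_attempts - 1):
        coverage = response.get('coverage_intelligence', {})
        if coverage.get('confidence', 0.0) >= min_confidence:
            break  # good enough, keep this response
        # ASSUMPTION: 'suggested_queries' is a hypothetical field name.
        suggestions = coverage.get('suggested_queries') or []
        if not suggestions:
            break  # nothing to retry with
        topic = suggestions[0]
        response = fetch(topic)
    return topic, response
```

Only the response returned here would then be written to Weaviate, so low-confidence results are reformulated before anything is stored.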


Full technical writeup (architecture, decay formula,
performance benchmarks):


Still very interested in community feedback on
two questions from the original post:

  1. Is there a metadata schema that works better
    for your Weaviate collections? (We can add fields)
  2. Are there knowledge sources missing that would
    be useful for your RAG pipelines?

Knowledge Universe: From Data Ingest to Deterministic Layered Architecture

In April, I shared the v1 launch of the Knowledge Universe API, a system for scoring and embedding multi-source knowledge directly into Weaviate.

Since then, the ecosystem has completely evolved. We are no longer just building a data pipeline; we are building infrastructure for Retrieval-Augmented Generation (RAG), agentic systems, and deterministic knowledge discovery.

The new architecture moves past basic embeddings and introduces a strict, multi-layered governance model. Here is the structural breakdown of the new KU Ecosystem:

Layer 1: Semantic Retrieval (The Weaviate Layer)

  • Function: High-recall, high-speed retrieval of multi-source knowledge vectors.
  • General Equation: $\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$
    (Standard cosine similarity between two vectors $A$ and $B$, for example a query vector and a stored document vector.)

  • Reaction: The system returns a broad set of contextually relevant documents. This layer prioritizes semantic matching, pulling the raw data from the vector database before any compliance rules are applied.
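As a small worked example of the Layer 1 formula, here is a plain-Python cosine similarity. In practice Weaviate computes this inside the vector index; the function below is only an illustration of the equation.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) = (A . B) / (||A|| * ||B||) for two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same dimension')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine similarity is undefined for zero vectors')
    return dot / (norm_a * norm_b)
```

Note that Weaviate reports cosine distance, which is 1 minus this value, so a smaller distance means a closer match.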

Layer 2: Deterministic Governance (The KU API Layer)

  • Function: Post-retrieval hard-gating. This layer mathematically filters the retrieved payloads before they ever reach an LLM’s context window.
  • General Equation: $\text{Governance\_Score} = f(R, C, V)$
    (Where $R$ is baseline relevance, $C$ is the domain-specific compliance multiplier, and $V$ is domain velocity.) If $\text{Governance\_Score} < \text{Threshold}_{\text{domain}}$, the vector is dropped.
  • Reaction: Pure deterministic logic. The system either passes the data to the agent (Passed) or blocks it (Blocked). No LLM is involved in the scoring layer, so there is nothing to hallucinate: the gate is a strict threshold comparison.
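The gating step can be sketched as follows. The post does not spell out $f$, so the scoring function is passed in as a parameter; `placeholder_f` (a plain product of the three inputs) and the `R`, `C`, `V` dictionary keys are assumptions made purely for illustration, not the formula the KU API uses.

```python
def gate(candidates, score_fn, threshold):
    """Split candidates into (passed, blocked) using a fixed threshold.

    Each candidate is a dict carrying 'R', 'C' and 'V' values.
    A candidate is blocked when its score is below the threshold.
    """
    passed, blocked = [], []
    for item in candidates:
        score = score_fn(item['R'], item['C'], item['V'])
        scored = {**item, 'governance_score': score}
        if score < threshold:
            blocked.append(scored)
        else:
            passed.append(scored)
    return passed, blocked


# ASSUMPTION: placeholder only; the real f(R, C, V) is not given in the post.
def placeholder_f(r, c, v):
    return r * c * v
```

For example, `gate(docs, placeholder_f, 0.6)` on two documents with the same relevance but different compliance multipliers passes one and blocks the other, with no model call involved in the decision.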

Layer 3: Autonomous Orchestration (The Agentic Layer)

  • Function: Executing continuous ReAct (Reason + Act) loops on pristine, governed context.
  • Reaction: Because Layer 2 filters out non-compliant data, the agent acts with maximum confidence. It avoids the logic failures and context degradation that typically kill fully autonomous systems in production.
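A bare-bones sketch of such a loop, assuming the context has already passed Layer 2. The `reason` and `act` callables are hypothetical stand-ins: in a real agent `reason` would wrap an LLM call and `act` would dispatch to tools.

```python
def react_loop(reason, act, context, max_steps=5):
    """Alternate reason and act steps over already-governed context.

    `reason(context, history)` returns ('finish', answer) to stop,
    or (tool_name, tool_input) to request an action.
    `act(tool_name, tool_input)` returns an observation.
    Returns (answer, history); answer is None if the step budget runs out.
    """
    history = []
    for _ in range(max_steps):
        action, payload = reason(context, history)
        if action == 'finish':
            return payload, history
        observation = act(action, payload)
        history.append((action, payload, observation))
    return None, history
```

The step budget keeps the loop bounded even if the reasoning step never decides to finish.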

The Ecosystem in Production Today:

  • KU API (api.knowledgeuniverse.tech): Live and serving deterministic scoring requests.
  • ku-weaviate-integration: The core two-layer RAG pipeline natively pairing Weaviate’s Layer 1 retrieval with KU’s Layer 2 hard-gating.
  • Enterprise Wrappers (ku-forex & ku-obsidian-wiki): Applying this layered architecture to high-velocity financial market pipelines and local knowledge bases.

I am officially dropping back into the terminal to dogfood this exact layered architecture for a private, highly advanced autonomous agent project.

If your enterprise RAG pipeline is struggling with data governance and context control, the infrastructure is live.

api.knowledgeuniverse.tech

#KnowledgeUniverseAPI #RAG #AgenticAI #EnterpriseAI #AIGovernance #Weaviate