Knowledge Universe API — populate Weaviate with scored, multi-source knowledge in one call

Hey Weaviate community,

I built Knowledge Universe API and wanted to share a pattern that might be useful for anyone building RAG pipelines on Weaviate.

The problem it solves: getting fresh, structured knowledge into your Weaviate collection without writing individual crawlers for every source.

One API call retrieves from arXiv, GitHub, Wikipedia, Stack Overflow, Hugging Face, Semantic Scholar, and 8 more official sources simultaneously. Every result is scored across 5 dimensions (content quality, freshness, pedagogical fit, trust, social proof) before it reaches you.

The Weaviate integration:

import weaviate
import requests

# 1. Get scored + embedded knowledge
response = requests.post(
    'YOUR_API_URL/v1/discover',
    json={
        'topic': 'vector search optimization',
        'output_format': 'embeddings'  # returns 384-dim vectors
    }
)

# 2. Upsert directly into Weaviate
client = weaviate.Client(WEAVIATE_URL)  # v3 Python client; e.g. WEAVIATE_URL = 'http://localhost:8080'
with client.batch as batch:
    for item in response.json()['embeddings']:
        batch.add_data_object(
            data_object={
                'title': item['title'],
                'url': item['url'],
                'platform': item['source_platform'],
                'quality_score': item.get('quality_score', 0),
            },
            class_name='KnowledgeSource',
            vector=item['vector']
        )

You get pre-scored, multi-source knowledge in your vector store. Quality score is stored as metadata so you can filter by it at query time.
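A query-time quality filter could look like the sketch below. This is illustrative only: the 0.7 threshold is a placeholder, and it assumes the v3 Python client and the `quality_score` property upserted above.

```python
def quality_where(threshold: float) -> dict:
    """Build a v3-client `with_where` filter keeping high-quality sources."""
    return {
        'path': ['quality_score'],
        'operator': 'GreaterThanEqual',
        'valueNumber': threshold,
    }

# Passed to a query like:
#   client.query.get('KnowledgeSource', ['title', 'url', 'quality_score']) \
#       .with_near_text({'concepts': ['vector search optimization']}) \
#       .with_where(quality_where(0.7)) \
#       .with_limit(5).do()
print(quality_where(0.7))
```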

Open source, MIT licensed: VLSiddarth/Knowledge-Universe on GitHub ("Find the best knowledge sources across the internet. For learning, research, and AI.")
Free tier: 100 calls/month, no credit card.

Two questions for the community:

  1. Does the embedding schema above work well for your Weaviate setup, or would a different metadata structure be more useful?
  2. Are there knowledge sources you wish were covered that aren’t in the list above?

Happy to adjust the output format based on what actually fits Weaviate workflows.

Update: Knowledge Universe is now live in production.

Since the original post, we’ve shipped several significant updates worth sharing with this community.


What’s new:

Live API (free tier, no credit card):

Python SDK:
pip install knowledge-universe

npm package:
npm install knowledge-universe


Performance on real queries:

Cold query across 14 sources: 3.1 seconds
Redis cache hit: 220ms
Faster than Tavily on cold queries (Tavily: 5.4s)


The Weaviate integration now works end-to-end:

import weaviate
import requests

# 1. Get scored + embedded knowledge (384-dim vectors)
response = requests.post(
    'https://huggingface.co/spaces/vlsiddarth/Knowledge-Universe/v1/discover',
    headers={'X-API-Key': 'your_key'},
    json={
        'topic': 'vector search optimization',
        'output_format': 'embeddings'
    }
)

# 2. Upsert directly into Weaviate with quality metadata
client = weaviate.Client(WEAVIATE_URL)  # v3 Python client; e.g. WEAVIATE_URL = 'http://localhost:8080'
with client.batch as batch:
    for item in response.json()['embeddings']:
        batch.add_data_object(
            data_object={
                'title':         item['title'],
                'url':           item['url'],
                'platform':      item['source_platform'],
                'quality_score': item.get('quality_score', 0),
                'freshness':     item.get('freshness', 0.5),
                'decay_label':   item.get('decay_label', 'unknown'),
            },
            class_name='KnowledgeSource',
            vector=item['vector']
        )

The freshness score in metadata lets you filter at query time: only retrieve sources above a freshness threshold before they reach your LLM.
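Such a filter could combine the `freshness` and `decay_label` properties stored above. A minimal sketch (thresholds are illustrative, and it assumes the v3 client's GraphQL `where` shape):

```python
def fresh_sources_where(min_freshness: float) -> dict:
    """Filter keeping sources above a freshness floor with a known decay label."""
    return {
        'operator': 'And',
        'operands': [
            {
                'path': ['freshness'],
                'operator': 'GreaterThanEqual',
                'valueNumber': min_freshness,
            },
            {
                'path': ['decay_label'],
                'operator': 'NotEqual',
                'valueString': 'unknown',
            },
        ],
    }

# Usable as: ...get('KnowledgeSource', [...]).with_where(fresh_sources_where(0.6)).do()
print(fresh_sources_where(0.6))
```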


New feature: Coverage Confidence Score

Every query now returns a confidence signal:

"coverage_intelligence": {
  "confidence": 0.72,
  "confidence_label": "high",
  "coverage_warning": false
}

When confidence is low, the API suggests better queries your agent can retry automatically. This maps directly to Weaviate’s hybrid search use cases: if the initial retrieval doesn’t match intent, the agent knows to reformulate before storing.
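An agent-side retry loop driven by that signal might look like this. The `suggested_queries` field name is an assumption about the response shape, and `fetch` stands in for the POST /v1/discover call:

```python
def discover_with_retry(fetch, topic, min_confidence=0.5, max_retries=2):
    """Re-query with a suggested reformulation while coverage confidence is low."""
    result = fetch(topic)
    for _ in range(max_retries):
        coverage = result.get('coverage_intelligence', {})
        if coverage.get('confidence', 0.0) >= min_confidence:
            break
        suggestions = result.get('suggested_queries') or []  # assumed field name
        if not suggestions:
            break
        result = fetch(suggestions[0])  # retry with the suggested query
    return result

# Demo with a stubbed fetch: a vague topic triggers one reformulation.
def fake_fetch(topic):
    if topic == 'vectors':
        return {
            'coverage_intelligence': {'confidence': 0.2},
            'suggested_queries': ['vector search optimization'],
        }
    return {'coverage_intelligence': {'confidence': 0.8}}

final = discover_with_retry(fake_fetch, 'vectors')
print(final['coverage_intelligence']['confidence'])  # 0.8
```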


Full technical writeup (architecture, decay formula, performance benchmarks):

Still very interested in community feedback on the two questions from the original post:

  1. Is there a metadata schema that works better for your Weaviate collections? (We can add fields.)
  2. Are there knowledge sources missing that would be useful for your RAG pipelines?