Data
Purpose
The Data directory contains curated, ground-truth datasets about important aspects of human life, society, and progress, along with documentation for external data sources. This is a collection of reliable, parseable data that can be used for analysis, research, and informed decision-making.
🎯 "Answer First" Schema
All Substrate datasets follow the "Answer First" schema. Every dataset has a SUMMARY.md file that puts the best estimate at the top.
Quick Reference
🇺🇸 United States
| Dataset | Best Estimate | One-liner |
|---|---|---|
| Knowledge Worker Compensation | $35-50T global, $6-12T US | Global knowledge workers earn $35-50T annually |
| US GDP | $23.77T (Q2 2025) | U.S. real GDP is $23.77T, growing 3.8% quarterly |
| US Inflation | 2.5% YoY | U.S. inflation is ~2.5% with CPI at 323.4 |
| Presidential Approval | ~41% (Trump Nov 2025) | Trump approval averages ~41% (net -13) |
| COVID Wastewater | HIGH (5.6 log10) | California COVID wastewater is HIGH |
| US Common Metrics | 60+ indicators | Real-time dashboard of U.S. economic indicators |
| Pulitzer Winners | 249 winners | Complete Arts & Letters database (1918-2024) |
🇩🇪 Germany
| Dataset | Best Estimate | One-liner |
|---|---|---|
| DE Common Metrics | 29 indicators | Real-time dashboard of German economic indicators |
| DE Wealth Distribution | Gini 72.4% | Top 10% hold 54% of wealth, bottom 50% just 2.4% |
| DE Social Mobility | 61% vs 14% | Gymnasium rate gap by parental education |
| DE Democracy Metrics | Multiple | Democratic resilience indicators |
| DE Social Isolation | Multiple | Social connection and loneliness metrics |
| DE Mental Health | Multiple | Mental health indicators |
| DE Wellbeing | Multiple | Wellbeing indicators |
| DE World Values | Multiple | Value orientations (WVS/EVS) |
| DE Platform Media | Multiple | Platform dependency metrics |
| DE Epistemic Competence | Multiple | Media literacy indicators |
| DE Parliament Activity | Multiple | Bundestag legislative activity |
| DE Lobby Transparency | Multiple | Lobbyregister metrics |
| DE Federal Budget | Multiple | Bundeshaushalt data |
| DE Energy Mix | Multiple | Energy transition metrics |
| DE Church Exits | Multiple | Kirchenaustritte |
| DE Wastewater | Multiple | SARS-CoV-2 wastewater levels |
🇪🇺 European Union
| Dataset | Best Estimate | One-liner |
|---|---|---|
| EU Wealth Inequality | Gini 50.8–72.6 | Germany highest wealth Gini in Eurozone, Slovakia lowest |
Schema Structure
Every SUMMARY.md follows this structure:
# [Dataset Title]: Executive Summary
## 🎯 BEST ESTIMATE
| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **[Primary Metric]** | **[VALUE]** | [X%] | [DATE] |
**One-liner:** [12 words max - the quotable answer]
**Caveat:** [Single most important limitation]
---
## The Big Picture
[2-3 sentences: What this is, why it matters, major uncertainty]
## Why This Number Matters
[Context for why this metric is important]
## How the Number Is Calculated
[Methodology summary]
## Confidence Assessment
[What we know well vs. what's uncertain]
## Alternative Estimates & Why We Differ
[When applicable: other approaches and why we chose ours]
## Data Sources
[Links to authoritative sources]
## Supporting Documentation
[Links to detailed data files]
## Changelog
[When estimates changed and why]
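Because the best estimate always sits in a fixed table directly under the BEST ESTIMATE heading, it can be read programmatically. The following is a minimal sketch, not part of the repository: the function name `parse_best_estimate` and the sample values are illustrative assumptions, and it only handles the table layout shown in the structure above.

```python
def parse_best_estimate(markdown: str) -> dict:
    """Return the first data row of the BEST ESTIMATE table as a dict.

    Hypothetical helper: assumes the layout of the SUMMARY.md structure
    (a '## ... BEST ESTIMATE' heading followed by a pipe table).
    """
    lines = markdown.splitlines()
    # Locate the level-2 heading that contains "BEST ESTIMATE".
    start = next(
        (i for i, line in enumerate(lines)
         if line.startswith("## ") and "BEST ESTIMATE" in line),
        None,
    )
    if start is None:
        raise ValueError("no BEST ESTIMATE section found")

    # Collect the contiguous table rows that follow the heading.
    rows = []
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped.startswith("|"):
            rows.append([cell.strip() for cell in stripped.strip("|").split("|")])
        elif rows or stripped.startswith("## "):
            break  # the table ended, or the next section began without a table

    if len(rows) < 3:
        raise ValueError("expected a header row, a separator row, and a data row")

    header, data = rows[0], rows[2]
    # Drop Markdown bold markers so values are plain strings.
    return dict(zip(header, (cell.replace("**", "") for cell in data)))


# Placeholder content for illustration only, not real dataset values.
SAMPLE = """# Example Dataset: Executive Summary
## 🎯 BEST ESTIMATE
| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **Example Metric** | **42** | 90% | 2025-01-01 |
**One-liner:** An example one-liner.
"""

print(parse_best_estimate(SAMPLE))
```

Values come back as strings exactly as written in the table; converting units or percentages is left to the caller.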
Confidence Level Guidelines
| Level | Percentage | When to Use |
|---|---|---|
| Very High | 95%+ | Official government data, single authoritative source |
| High | 85-94% | Multiple corroborating sources, minor definitional variation |
| Medium | 65-84% | Extrapolated from good sources, definitional uncertainty |
| Low | <65% | Limited data, significant methodological issues |
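The table above can be expressed as a small lookup. This is a sketch under one assumption: the table lists whole-number bands (85-94%, 65-84%), so fractional values such as 94.5 are assigned here to the lower band. The function name `confidence_level` is hypothetical.

```python
def confidence_level(percent: float) -> str:
    """Map a confidence percentage to the level names in the guidelines table."""
    if not 0 <= percent <= 100:
        raise ValueError("confidence must be between 0 and 100")
    if percent >= 95:
        return "Very High"
    if percent >= 85:
        return "High"
    if percent >= 65:
        return "Medium"
    return "Low"


for value in (97, 90, 70, 50):
    print(value, confidence_level(value))
```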
Creating New Datasets
Use the DATASET-TEMPLATE.md when creating new datasets.
Mandatory Sections:
- 🎯 BEST ESTIMATE - Must be first content section after title
- One-liner - 12 words max, quotable
- Caveat - Single most important limitation
- Methodology Summary - How the estimate was derived
- Sources - Authoritative links
- Changelog - Track revisions with reasons
Recommended Section:
- Alternative Estimates & Why We Differ - When other estimates exist
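A new SUMMARY.md can be checked against the mandatory sections before it is committed. The sketch below is illustrative only: `check_summary` is a hypothetical helper, and it assumes that "Methodology Summary" and "Sources" correspond to the "How the Number Is Calculated" and "Data Sources" headings in the schema structure.

```python
import re

# Assumed mapping from the mandatory list to the schema's heading text.
REQUIRED_HEADINGS = ["How the Number Is Calculated", "Data Sources", "Changelog"]


def check_summary(markdown: str) -> list:
    """Return a list of problems found; an empty list means the file passes."""
    problems = []
    headings = [
        m.group(1).strip()
        for m in re.finditer(r"^## +(.+)$", markdown, re.MULTILINE)
    ]

    # BEST ESTIMATE must be the first section after the title.
    if not headings or "BEST ESTIMATE" not in headings[0]:
        problems.append("BEST ESTIMATE must be the first section after the title")

    for name in REQUIRED_HEADINGS:
        if name not in headings:
            problems.append(f"missing section: {name}")

    # One-liner: present and at most 12 words.
    one_liner = re.search(r"^\*\*One-liner:\*\*[ \t]*(.*)$", markdown, re.MULTILINE)
    if one_liner is None or not one_liner.group(1).strip():
        problems.append("missing One-liner")
    elif len(one_liner.group(1).split()) > 12:
        problems.append("One-liner exceeds 12 words")

    # Caveat: present with some text after the label.
    if re.search(r"^\*\*Caveat:\*\*[ \t]*\S", markdown, re.MULTILINE) is None:
        problems.append("missing Caveat")

    return problems
```

Returning a list of problems rather than raising on the first one lets a contributor fix everything in a single pass.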
Directory Structure
Data/
├── DATASET-TEMPLATE.md # Schema template for new datasets
├── README.md # This file
├── sources/ # External data source catalog (DS-00001 – DS-00020)
│
│ # 🇺🇸 US Datasets
├── Bay-Area-COVID-Wastewater/ # COVID wastewater surveillance
├── Knowledge-Worker-Global-Salaries/ # Knowledge economy compensation
├── Pulitzer-Prize-Winners/ # Arts & Letters Pulitzer data
├── US-Common-Metrics/ # 60+ US economic indicators
├── US-GDP/ # US GDP data (1929–2025)
├── US-Inflation/ # CPI/inflation data (1947–2025)
├── US-Presidential-Approval/ # Approval ratings (1937–2025)
│
│ # 🇩🇪 Germany Datasets (DE-Plan)
├── DE-Church-Exits/ # Kirchenaustritte
├── DE-Common-Metrics/ # 29 economic/demographic indicators
├── DE-Democracy-Metrics/ # Democratic resilience indicators
├── DE-Energy-Mix/ # Energy transition (SMARD)
├── DE-Epistemic-Competence/ # Media literacy (PISA/ICILS)
├── DE-Federal-Budget/ # Bundeshaushalt data
├── DE-Lobby-Transparency/ # Lobbyregister Bundestag
├── DE-Mental-Health/ # Mental health indicators
├── DE-Parliament-Activity/ # Bundestag legislative activity (DIP)
├── DE-Platform-Media/ # Platform dependency metrics
├── DE-Social-Isolation/ # Social connection metrics
├── DE-Social-Mobility/ # Education stratification
├── DE-Wastewater-Surveillance/ # RKI AMELAG SARS-CoV-2
├── DE-Wealth-Distribution/ # Wealth Gini, top shares, inheritance
├── DE-Wellbeing/ # Wellbeing indicators
├── DE-World-Values/ # Value orientations (WVS/EVS)
│
│ # 🇪🇺 EU Datasets
└── EU-Wealth-Inequality/ # Cross-country wealth Gini comparison
Start with SUMMARY.md in any dataset directory: it gives you the answer first.
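Given the layout above, the datasets and their summary files can be enumerated with the standard library. This is a minimal sketch that assumes every dataset is a direct subdirectory of Data/ and that sources/ is the only non-dataset directory; `list_datasets` is a hypothetical name.

```python
from pathlib import Path


def list_datasets(data_root):
    """Yield (dataset_name, summary_path_or_None) for each dataset directory."""
    for entry in sorted(Path(data_root).iterdir()):
        # Skip plain files (README.md, DATASET-TEMPLATE.md), hidden
        # directories, and the external source catalog.
        if not entry.is_dir() or entry.name.startswith(".") or entry.name == "sources":
            continue
        summary = entry / "SUMMARY.md"
        yield entry.name, (summary if summary.is_file() else None)
```

Calling `list(list_datasets("Data"))` from the repository root would return one pair per dataset; a `None` in the second position flags a directory that has no SUMMARY.md yet.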
Dataset Categories
Economic Indicators
- US GDP - Gross Domestic Product (1929–2025)
- US Inflation - CPI data (1947–2025)
- US Common Metrics - 60+ economic indicators dashboard
- DE Common Metrics - 29 German economic/demographic indicators
- DE Federal Budget - Bundeshaushalt data
- Knowledge Worker Compensation - Global and US compensation estimates
Wealth & Inequality
- DE Wealth Distribution - Wealth Gini 72.4%, top shares, inheritance
- EU Wealth Inequality - Cross-country Gini comparison (50.8–72.6)
- DE Social Mobility - Education stratification by parental background
Democracy & Governance
- DE Democracy Metrics - Democratic resilience indicators
- DE Parliament Activity - Bundestag legislative activity
- DE Lobby Transparency - Lobbyregister Bundestag
- Presidential Approval - US approval ratings (1937–2025)
Media & Knowledge
- DE Platform Media - Platform dependency metrics
- DE Epistemic Competence - Media literacy (PISA/ICILS)
- Pulitzer Winners - Arts & Letters awards (1918–2024)
Social & Health
- DE Social Isolation - Social connection metrics
- DE Mental Health - Mental health indicators
- DE Wellbeing - Wellbeing indicators
- DE World Values - Value orientations (WVS/EVS)
- DE Church Exits - Kirchenaustritte
Health & Public Safety
- COVID Wastewater (Bay Area) - California wastewater surveillance
- DE Wastewater Surveillance - RKI AMELAG SARS-CoV-2
Energy
- DE Energy Mix - Energy transition metrics (SMARD)
Philosophy
Answer First: Every dataset puts the best estimate at the top. Don't make people hunt for the number.
Ground Truth: All datasets come from authoritative, verifiable sources. We prioritize data quality and transparency over volume.
Human-Readable + Machine-Parseable: Data is stored in CSV and Markdown formats—no opaque databases. Anyone (human or AI) can read, understand, and analyze these datasets with minimal friction.
Confidence-Aware: Every estimate includes a confidence level. We distinguish between what we know well (Very High, 95%+) and what is uncertain (Low, below 65%).
Traceable: Every number links to its authoritative source. Changes are logged with reasons.
Data Quality Standards
Mandatory Requirements
- Confidence level - Every estimate needs uncertainty bounds
- Last updated - When data was most recently validated
- Source links - Authoritative URLs for verification
- Changelog - Track revisions with reasons
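The "Last updated" requirement lends itself to an automated freshness check. The sketch below rests on two assumptions that this README does not specify: dates are written in ISO format (YYYY-MM-DD), and 180 days is an acceptable maximum age. Both the `is_stale` name and the threshold are hypothetical.

```python
from datetime import date, timedelta


def is_stale(last_updated: str, today: date, max_age_days: int = 180) -> bool:
    """Return True if the validation date is older than max_age_days.

    Assumes last_updated is an ISO date string such as "2025-01-01".
    """
    validated = date.fromisoformat(last_updated)
    return today - validated > timedelta(days=max_age_days)


# Illustrative dates only.
print(is_stale("2025-01-01", today=date(2025, 12, 31)))
print(is_stale("2025-11-01", today=date(2025, 12, 1)))
```

Passing `today` explicitly keeps the function deterministic and easy to test; a caller would normally supply `date.today()`.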
Quality Indicators
- Accuracy: Data from verified, authoritative sources
- Completeness: Gaps and missing data documented
- Timeliness: Update frequency and freshness noted
- Transparency: Methodology documented and reproducible
Contributing Datasets
When adding new datasets:
- Use the template - Start with DATASET-TEMPLATE.md
- Answer first - Create SUMMARY.md with 🎯 BEST ESTIMATE at top
- Verify sources - Use authoritative, primary sources
- Set confidence - Use the confidence level guidelines
- Document changes - Include changelog from day one
- Link thoroughly - Every number should trace to a source
Anti-Patterns to Avoid
- Burying the answer - Never make someone scroll to find the number
- No confidence level - Every estimate needs uncertainty bounds
- Stale dates - Always show when last validated
- Methodology before answer - People want the answer first
- No changelog - Revisions without history erode trust
Integration with Substrate
Data sources support other Substrate components:
- Claims can be backed by datasets with linked evidence
- Arguments can reference specific metrics and sources
- Solutions can be evaluated using ground-truth indicators
- Plans can track progress with authoritative data
Relationship with Research Projects
The Data directory works with research/ to maintain traceability between research and resulting datasets.
Research → Data Workflow:
- Input: Research projects use Data/sources/ for external APIs
- Analysis: Research performs synthesis and investigation
- Output: Curated datasets stored in Data/ with SUMMARY.md
- Documentation: Methodology and sources fully documented
Key Principles:
- Each dataset includes source.md documenting origin
- Research projects document which sources they used
- Bidirectional links maintain complete traceability
- Changes tracked in both research notes and dataset changelogs
Mission: Build a trusted foundation of ground-truth data to support human understanding and progress.