Substrate/Data/README.md
svemagie e76ab2d43a feat: add DE-Wealth-Distribution, EU-Wealth-Inequality datasets + update repo docs
Two new datasets with source catalogs (DS-00019, DS-00020):
- DE-Wealth-Distribution: Wealth Gini 72.4% (PHF 2023), top shares, inheritance
- EU-Wealth-Inequality: Cross-country Gini comparison (50.8–72.6)

Updated README.md, Data/README.md, QUICK_REFERENCE.md to document all 24 datasets
(7 US + 16 DE + 1 EU), 20 source catalogs, and DE-Plan integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 12:03:44 +02:00


Data

Purpose

The Data directory contains curated, ground-truth datasets about important aspects of human life, society, and progress, along with documentation for external data sources. This is a collection of reliable, parseable data that can be used for analysis, research, and informed decision-making.


🎯 "Answer First" Schema

All Substrate datasets follow the "Answer First" schema. Every dataset has a SUMMARY.md file that puts the best estimate at the top.

Quick Reference

🇺🇸 United States

| Dataset | Best Estimate | One-liner |
|---------|---------------|-----------|
| Knowledge Worker Compensation | $35-50T global, $6-12T US | Global knowledge workers earn $35-50T annually |
| US GDP | $23.77T (Q2 2025) | U.S. real GDP is $23.77T, growing 3.8% quarterly |
| US Inflation | 2.5% YoY | U.S. inflation is ~2.5% with CPI at 323.4 |
| Presidential Approval | ~41% (Trump Nov 2025) | Trump approval averages ~41% (net -13) |
| COVID Wastewater | HIGH (5.6 log10) | California COVID wastewater is HIGH |
| US Common Metrics | 60+ indicators | Real-time dashboard of U.S. economic indicators |
| Pulitzer Winners | 249 winners | Complete Arts & Letters database (1918-2024) |

🇩🇪 Germany

| Dataset | Best Estimate | One-liner |
|---------|---------------|-----------|
| DE Common Metrics | 29 indicators | Real-time dashboard of German economic indicators |
| DE Wealth Distribution | Gini 72.4% | Top 10% hold 54% of wealth, bottom 50% just 2.4% |
| DE Social Mobility | 61% vs 14% | Gymnasium rate gap by parental education |
| DE Democracy Metrics | Multiple | Democratic resilience indicators |
| DE Social Isolation | Multiple | Social connection and loneliness metrics |
| DE Mental Health | Multiple | Mental health indicators |
| DE Wellbeing | Multiple | Wellbeing indicators |
| DE World Values | Multiple | Value orientations (WVS/EVS) |
| DE Platform Media | Multiple | Platform dependency metrics |
| DE Epistemic Competence | Multiple | Media literacy indicators |
| DE Parliament Activity | Multiple | Bundestag legislative activity |
| DE Lobby Transparency | Multiple | Lobbyregister metrics |
| DE Federal Budget | Multiple | Bundeshaushalt data |
| DE Energy Mix | Multiple | Energy transition metrics |
| DE Church Exits | Multiple | Kirchenaustritte |
| DE Wastewater | Multiple | SARS-CoV-2 wastewater levels |

🇪🇺 European Union

| Dataset | Best Estimate | One-liner |
|---------|---------------|-----------|
| EU Wealth Inequality | Gini 50.8–72.6 | Germany highest wealth Gini in Eurozone, Slovakia lowest |

Schema Structure

Every SUMMARY.md follows this structure:

# [Dataset Title]: Executive Summary

## 🎯 BEST ESTIMATE

| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **[Primary Metric]** | **[VALUE]** | [X%] | [DATE] |

**One-liner:** [12 words max - the quotable answer]

**Caveat:** [Single most important limitation]

---

## The Big Picture
[2-3 sentences: What this is, why it matters, major uncertainty]

## Why This Number Matters
[Context for why this metric is important]

## How the Number Is Calculated
[Methodology summary]

## Confidence Assessment
[What we know well vs. what's uncertain]

## Alternative Estimates & Why We Differ
[When applicable: other approaches and why we chose ours]

## Data Sources
[Links to authoritative sources]

## Supporting Documentation
[Links to detailed data files]

## Changelog
[When estimates changed and why]
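
Because every SUMMARY.md shares this layout, the headline figure can be read programmatically. The following is a minimal Python sketch, not tooling that ships with the repository: the function names and the sample text are hypothetical, and the sketch assumes the BEST ESTIMATE table is a pipe table whose first data row holds the primary metric.

```python
from __future__ import annotations


def _split_row(line: str) -> list[str]:
    """Split one pipe-table row into cells, dropping bold markers."""
    return [cell.strip().strip("*").strip() for cell in line.strip().strip("|").split("|")]


def parse_best_estimate(summary_text: str) -> dict[str, str]:
    """Extract the first BEST ESTIMATE row plus the one-liner and caveat."""
    lines = summary_text.splitlines()
    start = next(
        (i for i, line in enumerate(lines)
         if line.startswith("## ") and "BEST ESTIMATE" in line),
        None,
    )
    if start is None:
        raise ValueError("no BEST ESTIMATE section found")

    rows: list[list[str]] = []
    result: dict[str, str] = {}
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped.startswith("## "):
            break  # the next section begins here
        if stripped.startswith("|"):
            cells = _split_row(stripped)
            # Skip the |----|----| separator row.
            if not all(set(cell) <= set("-:") for cell in cells):
                rows.append(cells)
        for label in ("One-liner", "Caveat"):
            prefix = f"**{label}:**"
            if stripped.startswith(prefix):
                result[label] = stripped[len(prefix):].strip()

    if len(rows) >= 2:
        header, first_row = rows[0], rows[1]
        result.update(zip(header, first_row))
    return result


# Hypothetical sample: every value below is a placeholder, not real data.
SAMPLE = """# Example Dataset: Executive Summary

## 🎯 BEST ESTIMATE

| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **Example Metric** | **42** | 90% | 2025-01-01 |

**One-liner:** The example metric is forty-two.

**Caveat:** Placeholder values for illustration only.

---

## The Big Picture
Not parsed.
"""

if __name__ == "__main__":
    print(parse_best_estimate(SAMPLE))
```

The parser stops at the next `## ` heading, so everything below the answer block is ignored. That matches the schema's intent that the answer block stands on its own.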

Confidence Level Guidelines

| Level | Percentage | When to Use |
|-------|------------|-------------|
| Very High | 95%+ | Official government data, single authoritative source |
| High | 85-94% | Multiple corroborating sources, minor definitional variation |
| Medium | 65-84% | Extrapolated from good sources, definitional uncertainty |
| Low | <65% | Limited data, significant methodological issues |
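
The bands above translate directly into a lookup. This is a small Python sketch with a hypothetical function name. It treats each band's lower bound as a threshold, so a fractional value such as 94.5 falls into "High". That handling of fractional values is an assumption, since the guidelines list whole-number ranges only.

```python
def confidence_level(percent: float) -> str:
    """Map a confidence percentage to the level names used in the guidelines."""
    if not 0 <= percent <= 100:
        raise ValueError(f"confidence must be between 0 and 100, got {percent}")
    if percent >= 95:
        return "Very High"
    if percent >= 85:
        return "High"
    if percent >= 65:
        return "Medium"
    return "Low"


if __name__ == "__main__":
    for value in (99, 90, 70, 40):
        print(value, confidence_level(value))
```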

Creating New Datasets

Use the DATASET-TEMPLATE.md when creating new datasets.

Mandatory Sections:

  1. 🎯 BEST ESTIMATE - Must be first content section after title
  2. One-liner - 12 words max, quotable
  3. Caveat - Single most important limitation
  4. Methodology Summary - How the estimate was derived
  5. Sources - Authoritative links
  6. Changelog - Track revisions with reasons

Recommended Section:

  • Alternative Estimates & Why We Differ - When other estimates exist
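
The mandatory sections lend themselves to an automated check. The sketch below is illustrative only: `check_summary` and `REQUIRED_MARKERS` are hypothetical names, and the mapping from "Methodology Summary" and "Sources" to the headings "How the Number Is Calculated" and "Data Sources" is an assumption based on the schema structure shown earlier.

```python
from __future__ import annotations

import re

# Assumed mapping from each mandatory item to the text that marks it in SUMMARY.md.
REQUIRED_MARKERS = {
    "BEST ESTIMATE": r"^## .*BEST ESTIMATE",
    "One-liner": r"^\*\*One-liner:\*\*",
    "Caveat": r"^\*\*Caveat:\*\*",
    "Methodology Summary": r"^## How the Number Is Calculated",
    "Sources": r"^## Data Sources",
    "Changelog": r"^## Changelog",
}
MAX_ONE_LINER_WORDS = 12


def check_summary(text: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    for name, pattern in REQUIRED_MARKERS.items():
        if not re.search(pattern, text, flags=re.MULTILINE):
            problems.append(f"missing: {name}")

    # BEST ESTIMATE must be the first second-level heading after the title.
    headings = re.findall(r"^## (.+)$", text, flags=re.MULTILINE)
    if headings and "BEST ESTIMATE" not in headings[0]:
        problems.append("BEST ESTIMATE is not the first section")

    # The one-liner is capped at 12 whitespace-separated words.
    match = re.search(r"^\*\*One-liner:\*\*(.*)$", text, flags=re.MULTILINE)
    if match and len(match.group(1).split()) > MAX_ONE_LINER_WORDS:
        problems.append("one-liner exceeds 12 words")
    return problems
```

A check like this could run before a dataset is committed, so that a missing changelog or a buried answer is caught early.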

Directory Structure

Data/
├── DATASET-TEMPLATE.md                    # Schema template for new datasets
├── README.md                              # This file
├── sources/                               # External data source catalog (DS-00001 – DS-00020)
│
│   # 🇺🇸 US Datasets
├── Bay-Area-COVID-Wastewater/             # COVID wastewater surveillance
├── Knowledge-Worker-Global-Salaries/      # Knowledge economy compensation
├── Pulitzer-Prize-Winners/                # Arts & Letters Pulitzer data
├── US-Common-Metrics/                     # 60+ US economic indicators
├── US-GDP/                                # US GDP data (1929–2025)
├── US-Inflation/                          # CPI/inflation data (1947–2025)
├── US-Presidential-Approval/              # Approval ratings (1937–2025)
│
│   # 🇩🇪 Germany Datasets (DE-Plan)
├── DE-Church-Exits/                       # Kirchenaustritte
├── DE-Common-Metrics/                     # 29 economic/demographic indicators
├── DE-Democracy-Metrics/                  # Democratic resilience indicators
├── DE-Energy-Mix/                         # Energy transition (SMARD)
├── DE-Epistemic-Competence/               # Media literacy (PISA/ICILS)
├── DE-Federal-Budget/                     # Bundeshaushalt data
├── DE-Lobby-Transparency/                 # Lobbyregister Bundestag
├── DE-Mental-Health/                      # Mental health indicators
├── DE-Parliament-Activity/                # Bundestag legislative activity (DIP)
├── DE-Platform-Media/                     # Platform dependency metrics
├── DE-Social-Isolation/                   # Social connection metrics
├── DE-Social-Mobility/                    # Education stratification
├── DE-Wastewater-Surveillance/            # RKI AMELAG SARS-CoV-2
├── DE-Wealth-Distribution/                # Wealth Gini, top shares, inheritance
├── DE-Wellbeing/                          # Wellbeing indicators
├── DE-World-Values/                       # Value orientations (WVS/EVS)
│
│   # 🇪🇺 EU Datasets
└── EU-Wealth-Inequality/                  # Cross-country wealth Gini comparison

Start with the main .md file in any dataset directory — it gives you the answer first.
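
Given this layout, a short script can confirm that each dataset directory carries a SUMMARY.md. This is a hedged Python sketch: `find_missing_summaries`, `NON_DATASET_DIRS`, and `demo` are hypothetical names, the assumption is that `sources/` is the only non-dataset subdirectory, and the demo builds a throwaway tree with made-up directory names instead of touching the real repository.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

# Assumption: sources/ holds the source catalog and is not itself a dataset.
NON_DATASET_DIRS = {"sources"}


def find_missing_summaries(data_root: Path) -> list[str]:
    """Return names of dataset directories that lack a SUMMARY.md file."""
    missing = []
    for entry in sorted(data_root.iterdir()):
        if not entry.is_dir():
            continue  # skip README.md, DATASET-TEMPLATE.md, and other files
        if entry.name in NON_DATASET_DIRS or entry.name.startswith("."):
            continue
        if not (entry / "SUMMARY.md").is_file():
            missing.append(entry.name)
    return missing


def demo() -> list[str]:
    """Build a temporary tree with hypothetical dataset names and scan it."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "sources").mkdir()
        (root / "Example-Complete").mkdir()
        (root / "Example-Complete" / "SUMMARY.md").write_text("# Example\n", encoding="utf-8")
        (root / "Example-Incomplete").mkdir()
        (root / "README.md").write_text("", encoding="utf-8")
        return find_missing_summaries(root)


if __name__ == "__main__":
    print(demo())
```

Against a real checkout, the call would be `find_missing_summaries(Path("Data"))`, run from the directory that contains `Data/`.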


Dataset Categories

Economic Indicators

Wealth & Inequality

Democracy & Governance

Media & Knowledge

Social & Health

Health & Public Safety

Energy


Philosophy

Answer First: Every dataset puts the best estimate at the top. Don't make people hunt for the number.

Ground Truth: All datasets come from authoritative, verifiable sources. We prioritize data quality and transparency over volume.

Human-Readable + Machine-Parseable: Data is stored in CSV and Markdown formats—no opaque databases. Anyone (human or AI) can read, understand, and analyze these datasets with minimal friction.

Confidence-Aware: Every estimate includes a confidence level. We distinguish between what we know well (95%+) and what is uncertain (below 65%), in line with the confidence level guidelines.

Traceable: Every number links to its authoritative source. Changes are logged with reasons.


Data Quality Standards

Mandatory Requirements

  • Confidence level - Every estimate needs uncertainty bounds
  • Last updated - When data was most recently validated
  • Source links - Authoritative URLs for verification
  • Changelog - Track revisions with reasons

Quality Indicators

  • Accuracy: Data from verified, authoritative sources
  • Completeness: Gaps and missing data documented
  • Timeliness: Update frequency and freshness noted
  • Transparency: Methodology documented and reproducible

Contributing Datasets

When adding new datasets:

  1. Use the template - Start with DATASET-TEMPLATE.md
  2. Answer first - Create SUMMARY.md with 🎯 BEST ESTIMATE at top
  3. Verify sources - Use authoritative, primary sources
  4. Set confidence - Use the confidence level guidelines
  5. Document changes - Include changelog from day one
  6. Link thoroughly - Every number should trace to a source

Anti-Patterns to Avoid

  1. Burying the answer - Never make someone scroll to find the number
  2. No confidence level - Every estimate needs uncertainty bounds
  3. Stale dates - Always show when last validated
  4. Methodology before answer - People want the answer first
  5. No changelog - Revisions without history erode trust
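
The "stale dates" anti-pattern can be flagged mechanically once the Last Updated value is known. The following Python sketch makes two assumptions that this README does not state: dates are written in ISO format (YYYY-MM-DD), and 180 days is a reasonable staleness threshold. Both are placeholders to adjust per dataset, and `is_stale` is a hypothetical name.

```python
from datetime import date


def is_stale(last_updated: str, today: date, max_age_days: int = 180) -> bool:
    """Return True when the last validation is older than max_age_days.

    Assumes last_updated is an ISO date string such as "2025-01-01".
    """
    validated = date.fromisoformat(last_updated)
    return (today - validated).days > max_age_days


if __name__ == "__main__":
    print(is_stale("2025-01-01", today=date(2025, 12, 1)))
```

Passing `today` explicitly keeps the check deterministic, which makes it easier to test than a function that reads the system clock.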

Integration with Substrate

Data sources support other Substrate components:

  • Claims can be backed by datasets with linked evidence
  • Arguments can reference specific metrics and sources
  • Solutions can be evaluated using ground-truth indicators
  • Plans can track progress with authoritative data

Relationship with Research Projects

The Data directory works with research/ to maintain traceability between research and resulting datasets.

Research → Data Workflow:

  1. Input: Research projects use Data/sources/ for external APIs
  2. Analysis: Research performs synthesis and investigation
  3. Output: Curated datasets stored in Data/ with SUMMARY.md
  4. Documentation: Methodology and sources fully documented

Key Principles:

  • Each dataset includes source.md documenting origin
  • Research projects document which sources they used
  • Bidirectional links maintain complete traceability
  • Changes tracked in both research notes and dataset changelogs

Mission: Build a trusted foundation of ground-truth data to support human understanding and progress.