Substrate/Data/README.md
svemagie e76ab2d43a feat: add DE-Wealth-Distribution, EU-Wealth-Inequality datasets + update repo docs
Two new datasets with source catalogs (DS-00019, DS-00020):
- DE-Wealth-Distribution: Wealth Gini 72.4% (PHF 2023), top shares, inheritance
- EU-Wealth-Inequality: Cross-country Gini comparison (50.8–72.6)

Updated README.md, Data/README.md, QUICK_REFERENCE.md to document all 24 datasets
(7 US + 16 DE + 1 EU), 20 source catalogs, and DE-Plan integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-04-23 12:03:44 +02:00

# Data
## Purpose
The Data directory contains curated, ground-truth datasets about important aspects of human life, society, and progress, along with documentation for external data sources. This is a collection of reliable, parseable data that can be used for analysis, research, and informed decision-making.
---
## 🎯 "Answer First" Schema
**All Substrate datasets follow the "Answer First" schema.** Every dataset has a `SUMMARY.md` file that puts the best estimate at the top.
### Quick Reference
**🇺🇸 United States**
| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [Knowledge Worker Compensation](./Knowledge-Worker-Global-Salaries/SUMMARY.md) | $35-50T global, $6-12T US | Global knowledge workers earn $35-50T annually |
| [US GDP](./US-GDP/SUMMARY.md) | $23.77T (Q2 2025) | U.S. real GDP is $23.77T, growing at a 3.8% annualized rate |
| [US Inflation](./US-Inflation/SUMMARY.md) | 2.5% YoY | U.S. inflation is ~2.5% with CPI at 323.4 |
| [Presidential Approval](./US-Presidential-Approval/SUMMARY.md) | ~41% (Trump Nov 2025) | Trump approval averages ~41% (net -13) |
| [COVID Wastewater](./Bay-Area-COVID-Wastewater/SUMMARY.md) | HIGH (5.6 log10) | California COVID wastewater is HIGH |
| [US Common Metrics](./US-Common-Metrics/SUMMARY.md) | 60+ indicators | Real-time dashboard of U.S. economic indicators |
| [Pulitzer Winners](./Pulitzer-Prize-Winners/SUMMARY.md) | 249 winners | Complete Arts & Letters database (1918-2024) |
**🇩🇪 Germany**
| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [DE Common Metrics](./DE-Common-Metrics/DE-Common-Metrics.md) | 29 indicators | Real-time dashboard of German economic indicators |
| [DE Wealth Distribution](./DE-Wealth-Distribution/DE-Wealth-Distribution.md) | Gini 72.4% | Top 10% hold 54% of wealth, bottom 50% just 2.4% |
| [DE Social Mobility](./DE-Social-Mobility/DE-Social-Mobility.md) | 61% vs 14% | Gymnasium rate gap by parental education |
| [DE Democracy Metrics](./DE-Democracy-Metrics/DE-Democracy-Metrics.md) | Multiple | Democratic resilience indicators |
| [DE Social Isolation](./DE-Social-Isolation/) | Multiple | Social connection and loneliness metrics |
| [DE Mental Health](./DE-Mental-Health/) | Multiple | Mental health indicators |
| [DE Wellbeing](./DE-Wellbeing/) | Multiple | Wellbeing indicators |
| [DE World Values](./DE-World-Values/) | Multiple | Value orientations (WVS/EVS) |
| [DE Platform Media](./DE-Platform-Media/) | Multiple | Platform dependency metrics |
| [DE Epistemic Competence](./DE-Epistemic-Competence/) | Multiple | Media literacy indicators |
| [DE Parliament Activity](./DE-Parliament-Activity/) | Multiple | Bundestag legislative activity |
| [DE Lobby Transparency](./DE-Lobby-Transparency/) | Multiple | Lobbyregister metrics |
| [DE Federal Budget](./DE-Federal-Budget/) | Multiple | Bundeshaushalt data |
| [DE Energy Mix](./DE-Energy-Mix/) | Multiple | Energy transition metrics |
| [DE Church Exits](./DE-Church-Exits/) | Multiple | Kirchenaustritte |
| [DE Wastewater](./DE-Wastewater-Surveillance/) | Multiple | SARS-CoV-2 wastewater levels |
**🇪🇺 European Union**
| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [EU Wealth Inequality](./EU-Wealth-Inequality/EU-Wealth-Inequality.md) | Gini 50.8–72.6 | Germany highest wealth Gini in Eurozone, Slovakia lowest |
### Schema Structure
Every `SUMMARY.md` follows this structure:
```markdown
# [Dataset Title]: Executive Summary
## 🎯 BEST ESTIMATE
| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **[Primary Metric]** | **[VALUE]** | [X%] | [DATE] |
**One-liner:** [12 words max - the quotable answer]
**Caveat:** [Single most important limitation]
---
## The Big Picture
[2-3 sentences: What this is, why it matters, major uncertainty]
## Why This Number Matters
[Context for why this metric is important]
## How the Number Is Calculated
[Methodology summary]
## Confidence Assessment
[What we know well vs. what's uncertain]
## Alternative Estimates & Why We Differ
[When applicable: other approaches and why we chose ours]
## Data Sources
[Links to authoritative sources]
## Supporting Documentation
[Links to detailed data files]
## Changelog
[When estimates changed and why]
```
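Because the 🎯 BEST ESTIMATE table has a fixed column layout, the headline number can be read programmatically. The following is a minimal sketch, not part of the repository: `parse_best_estimate` is a hypothetical helper, and it assumes the table directly follows the heading and uses the four columns shown in the schema above.

```python
def parse_best_estimate(summary_text: str) -> dict:
    """Return the first data row of the BEST ESTIMATE table as a dict.

    Hypothetical helper: assumes the table is the first Markdown table
    after a heading that contains the words "BEST ESTIMATE".
    """
    lines = summary_text.splitlines()
    # Locate the heading that introduces the best-estimate table.
    start = next(
        (i for i, line in enumerate(lines)
         if line.startswith("#") and "BEST ESTIMATE" in line),
        None,
    )
    if start is None:
        raise ValueError("no BEST ESTIMATE heading found")

    # Collect the contiguous block of table rows that follows the heading.
    rows = []
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped.startswith("|"):
            cells = stripped.strip("|").split("|")
            # Drop surrounding whitespace and Markdown bold markers.
            rows.append([c.strip().strip("*").strip() for c in cells])
        elif rows:
            break  # the table has ended

    if len(rows) < 3:
        raise ValueError("BEST ESTIMATE table needs a header, separator, and data row")

    header, _separator, first_row = rows[0], rows[1], rows[2]
    return dict(zip(header, first_row))


# Illustrative input only; the metric name and values are placeholders.
sample = """\
# Example Dataset: Executive Summary
## 🎯 BEST ESTIMATE
| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **Example Metric** | **42** | 90% | 2025-01-01 |
"""
print(parse_best_estimate(sample))
```

The parser deliberately stops at the first non-table line, so the one-liner and caveat that follow the table are not mistaken for rows.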
### Confidence Level Guidelines
| Level | Percentage | When to Use |
|-------|------------|-------------|
| **Very High** | 95%+ | Official government data, single authoritative source |
| **High** | 85-94% | Multiple corroborating sources, minor definitional variation |
| **Medium** | 65-84% | Extrapolated from good sources, definitional uncertainty |
| **Low** | <65% | Limited data, significant methodological issues |
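The bands in this table can be expressed as a small lookup. This is a sketch under one assumption the table does not state: the bands are treated as contiguous, so a fractional value such as 94.5% falls into **High** rather than into a gap. The function name `confidence_level` is hypothetical.

```python
def confidence_level(percent: float) -> str:
    """Map a confidence percentage to the level names used in the table above.

    Assumption: bands are contiguous (95+, 85 to <95, 65 to <85, <65).
    """
    if not 0 <= percent <= 100:
        raise ValueError(f"confidence must be between 0 and 100, got {percent}")
    if percent >= 95:
        return "Very High"
    if percent >= 85:
        return "High"
    if percent >= 65:
        return "Medium"
    return "Low"
```

For example, `confidence_level(90)` returns `"High"` and `confidence_level(64)` returns `"Low"`.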
### Creating New Datasets
Use the [DATASET-TEMPLATE.md](./DATASET-TEMPLATE.md) when creating new datasets.
**Mandatory Sections:**
1. **🎯 BEST ESTIMATE** - Must be first content section after title
2. **One-liner** - 12 words max, quotable
3. **Caveat** - Single most important limitation
4. **Methodology Summary** - How the estimate was derived
5. **Sources** - Authoritative links
6. **Changelog** - Track revisions with reasons
**Recommended Section:**
- **Alternative Estimates & Why We Differ** - When other estimates exist
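The mandatory sections lend themselves to an automated check. The sketch below is a hypothetical linter, not a tool that ships with the repository. It assumes the heading names from the Schema Structure block above (for example, "How the Number Is Calculated" for the methodology summary and "Data Sources" for sources), so the patterns would need adjusting if a dataset words its headings differently.

```python
import re

# Assumed markers, taken from the SUMMARY.md schema shown earlier.
REQUIRED_MARKERS = {
    "BEST ESTIMATE": r"^#+ .*BEST ESTIMATE",
    "One-liner": r"^\*\*One-liner:\*\*",
    "Caveat": r"^\*\*Caveat:\*\*",
    "Methodology": r"^#+ How the Number Is Calculated",
    "Sources": r"^#+ Data Sources",
    "Changelog": r"^#+ Changelog",
}


def check_summary(text: str) -> list[str]:
    """Return a list of problems found in a SUMMARY.md body (empty if none)."""
    problems = []

    # 1. Every mandatory marker must appear at the start of some line.
    for name, pattern in REQUIRED_MARKERS.items():
        if not re.search(pattern, text, flags=re.MULTILINE):
            problems.append(f"missing: {name}")

    # 2. The one-liner is limited to 12 words.
    match = re.search(r"^\*\*One-liner:\*\*\s*(.+)$", text, flags=re.MULTILINE)
    if match and len(match.group(1).split()) > 12:
        problems.append("One-liner exceeds 12 words")

    # 3. BEST ESTIMATE must be the first section after the title.
    headings = re.findall(r"^#+ .*$", text, flags=re.MULTILINE)
    if len(headings) >= 2 and "BEST ESTIMATE" not in headings[1]:
        problems.append("BEST ESTIMATE is not the first section after the title")

    return problems
```

Calling `check_summary(Path("SUMMARY.md").read_text(encoding="utf-8"))` on a compliant file should return an empty list. Note that the heading scan does not skip fenced code blocks, so headings inside example fences are counted too.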
---
## Directory Structure
```
Data/
├── DATASET-TEMPLATE.md # Schema template for new datasets
├── README.md # This file
├── sources/ # External data source catalog (DS-00001–DS-00020)
│ # 🇺🇸 US Datasets
├── Bay-Area-COVID-Wastewater/ # COVID wastewater surveillance
├── Knowledge-Worker-Global-Salaries/ # Knowledge economy compensation
├── Pulitzer-Prize-Winners/ # Arts & Letters Pulitzer data
├── US-Common-Metrics/ # 60+ US economic indicators
├── US-GDP/ # US GDP data (1929–2025)
├── US-Inflation/ # CPI/inflation data (1947–2025)
├── US-Presidential-Approval/ # Approval ratings (1937–2025)
│ # 🇩🇪 Germany Datasets (DE-Plan)
├── DE-Church-Exits/ # Kirchenaustritte
├── DE-Common-Metrics/ # 29 economic/demographic indicators
├── DE-Democracy-Metrics/ # Democratic resilience indicators
├── DE-Energy-Mix/ # Energy transition (SMARD)
├── DE-Epistemic-Competence/ # Media literacy (PISA/ICILS)
├── DE-Federal-Budget/ # Bundeshaushalt data
├── DE-Lobby-Transparency/ # Lobbyregister Bundestag
├── DE-Mental-Health/ # Mental health indicators
├── DE-Parliament-Activity/ # Bundestag legislative activity (DIP)
├── DE-Platform-Media/ # Platform dependency metrics
├── DE-Social-Isolation/ # Social connection metrics
├── DE-Social-Mobility/ # Education stratification
├── DE-Wastewater-Surveillance/ # RKI AMELAG SARS-CoV-2
├── DE-Wealth-Distribution/ # Wealth Gini, top shares, inheritance
├── DE-Wellbeing/ # Wellbeing indicators
├── DE-World-Values/ # Value orientations (WVS/EVS)
│ # 🇪🇺 EU Datasets
└── EU-Wealth-Inequality/ # Cross-country wealth Gini comparison
```
**Start with the main .md file** in any dataset directory — it gives you the answer first.
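A short script can locate that main file for each dataset. This is a sketch based on the links in the Quick Reference tables, where US datasets point to `SUMMARY.md` and several DE/EU datasets point to a file named after the directory (for example `DE-Common-Metrics/DE-Common-Metrics.md`). Both function names are hypothetical, and directories that follow neither pattern are reported as `None` rather than guessed.

```python
from pathlib import Path
from typing import Optional


def find_entry_point(dataset_dir: Path) -> Optional[Path]:
    """Return the main Markdown file of a dataset directory, if one is found.

    Assumed lookup order: SUMMARY.md first, then <directory-name>.md.
    """
    for name in ("SUMMARY.md", f"{dataset_dir.name}.md"):
        candidate = dataset_dir / name
        if candidate.is_file():
            return candidate
    return None


def list_entry_points(data_root: Path) -> dict[str, Optional[Path]]:
    """Map each dataset directory under Data/ to its entry point (or None)."""
    return {
        d.name: find_entry_point(d)
        for d in sorted(data_root.iterdir())
        # sources/ holds the source catalog, not a dataset.
        if d.is_dir() and d.name != "sources"
    }
```

Running `list_entry_points(Path("Data"))` from the repository root gives a quick overview of which datasets already expose an answer-first file.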
---
## Dataset Categories
### Economic Indicators
- **[US GDP](./US-GDP/SUMMARY.md)** - Gross Domestic Product (1929–2025)
- **[US Inflation](./US-Inflation/SUMMARY.md)** - CPI data (1947–2025)
- **[US Common Metrics](./US-Common-Metrics/SUMMARY.md)** - 60+ economic indicators dashboard
- **[DE Common Metrics](./DE-Common-Metrics/DE-Common-Metrics.md)** - 29 German economic/demographic indicators
- **[DE Federal Budget](./DE-Federal-Budget/)** - Bundeshaushalt data
- **[Knowledge Worker Compensation](./Knowledge-Worker-Global-Salaries/SUMMARY.md)** - Global and US compensation estimates
### Wealth & Inequality
- **[DE Wealth Distribution](./DE-Wealth-Distribution/DE-Wealth-Distribution.md)** - Wealth Gini 72.4%, top shares, inheritance
- **[EU Wealth Inequality](./EU-Wealth-Inequality/EU-Wealth-Inequality.md)** - Cross-country Gini comparison (50.8–72.6)
- **[DE Social Mobility](./DE-Social-Mobility/DE-Social-Mobility.md)** - Education stratification by parental background
### Democracy & Governance
- **[DE Democracy Metrics](./DE-Democracy-Metrics/DE-Democracy-Metrics.md)** - Democratic resilience indicators
- **[DE Parliament Activity](./DE-Parliament-Activity/)** - Bundestag legislative activity
- **[DE Lobby Transparency](./DE-Lobby-Transparency/)** - Lobbyregister Bundestag
- **[Presidential Approval](./US-Presidential-Approval/SUMMARY.md)** - US approval ratings (1937–2025)
### Media & Knowledge
- **[DE Platform Media](./DE-Platform-Media/)** - Platform dependency metrics
- **[DE Epistemic Competence](./DE-Epistemic-Competence/)** - Media literacy (PISA/ICILS)
- **[Pulitzer Winners](./Pulitzer-Prize-Winners/SUMMARY.md)** - Arts & Letters awards (1918–2024)
### Social & Health
- **[DE Social Isolation](./DE-Social-Isolation/)** - Social connection metrics
- **[DE Mental Health](./DE-Mental-Health/)** - Mental health indicators
- **[DE Wellbeing](./DE-Wellbeing/)** - Wellbeing indicators
- **[DE World Values](./DE-World-Values/)** - Value orientations (WVS/EVS)
- **[DE Church Exits](./DE-Church-Exits/)** - Kirchenaustritte
### Health & Public Safety
- **[COVID Wastewater (Bay Area)](./Bay-Area-COVID-Wastewater/SUMMARY.md)** - California wastewater surveillance
- **[DE Wastewater Surveillance](./DE-Wastewater-Surveillance/)** - RKI AMELAG SARS-CoV-2
### Energy
- **[DE Energy Mix](./DE-Energy-Mix/)** - Energy transition metrics (SMARD)
---
## Philosophy
**Answer First**: Every dataset puts the best estimate at the top. Don't make people hunt for the number.
**Ground Truth**: All datasets come from authoritative, verifiable sources. We prioritize data quality and transparency over volume.
**Human-Readable + Machine-Parseable**: Data is stored in CSV and Markdown formats—no opaque databases. Anyone (human or AI) can read, understand, and analyze these datasets with minimal friction.
**Confidence-Aware**: Every estimate includes a confidence level. We distinguish between what we know well (95%+) and what is uncertain (below 65%).
**Traceable**: Every number links to its authoritative source. Changes are logged with reasons.
---
## Data Quality Standards
### Mandatory Requirements
- **Confidence level** - Every estimate needs uncertainty bounds
- **Last updated** - When data was most recently validated
- **Source links** - Authoritative URLs for verification
- **Changelog** - Track revisions with reasons
### Quality Indicators
- **Accuracy**: Data from verified, authoritative sources
- **Completeness**: Gaps and missing data documented
- **Timeliness**: Update frequency and freshness noted
- **Transparency**: Methodology documented and reproducible
---
## Contributing Datasets
When adding new datasets:
1. **Use the template** - Start with [DATASET-TEMPLATE.md](./DATASET-TEMPLATE.md)
2. **Answer first** - Create SUMMARY.md with 🎯 BEST ESTIMATE at top
3. **Verify sources** - Use authoritative, primary sources
4. **Set confidence** - Use the confidence level guidelines
5. **Document changes** - Include changelog from day one
6. **Link thoroughly** - Every number should trace to a source
### Anti-Patterns to Avoid
1. **Burying the answer** - Never make someone scroll to find the number
2. **No confidence level** - Every estimate needs uncertainty bounds
3. **Stale dates** - Always show when last validated
4. **Methodology before answer** - People want the answer first
5. **No changelog** - Revisions without history erode trust
---
## Integration with Substrate
Data sources support other Substrate components:
- **Claims** can be backed by datasets with linked evidence
- **Arguments** can reference specific metrics and sources
- **Solutions** can be evaluated using ground-truth indicators
- **Plans** can track progress with authoritative data
---
## Relationship with Research Projects
The Data directory works with `research/` to maintain traceability between research and resulting datasets.
**Research → Data Workflow:**
1. **Input**: Research projects use `Data/sources/` for external APIs
2. **Analysis**: Research performs synthesis and investigation
3. **Output**: Curated datasets stored in `Data/` with SUMMARY.md
4. **Documentation**: Methodology and sources fully documented
**Key Principles:**
- Each dataset includes `source.md` documenting origin
- Research projects document which sources they used
- Bidirectional links maintain complete traceability
- Changes tracked in both research notes and dataset changelogs
---
**Mission**: Build a trusted foundation of ground-truth data to support human understanding and progress.