# Data

## Purpose

The Data directory contains curated, ground-truth datasets about important aspects of human life, society, and progress, along with documentation for external data sources. This is a collection of reliable, parseable data that can be used for analysis, research, and informed decision-making.

---

## 🎯 "Answer First" Schema

**All Substrate datasets follow the "Answer First" schema.** Every dataset has a `SUMMARY.md` file that puts the best estimate at the top.

### Quick Reference

**🇺🇸 United States**

| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [Knowledge Worker Compensation](./Knowledge-Worker-Global-Salaries/SUMMARY.md) | $35-50T global, $6-12T US | Global knowledge workers earn $35-50T annually |
| [US GDP](./US-GDP/SUMMARY.md) | $23.77T (Q2 2025) | U.S. real GDP is $23.77T, growing at a 3.8% annualized rate |
| [US Inflation](./US-Inflation/SUMMARY.md) | 2.5% YoY | U.S. inflation is ~2.5% with CPI at 323.4 |
| [Presidential Approval](./US-Presidential-Approval/SUMMARY.md) | ~41% (Trump Nov 2025) | Trump approval averages ~41% (net -13) |
| [COVID Wastewater](./Bay-Area-COVID-Wastewater/SUMMARY.md) | HIGH (5.6 log10) | California COVID wastewater is HIGH |
| [US Common Metrics](./US-Common-Metrics/SUMMARY.md) | 60+ indicators | Real-time dashboard of U.S. economic indicators |
| [Pulitzer Winners](./Pulitzer-Prize-Winners/SUMMARY.md) | 249 winners | Complete Arts & Letters database (1918–2024) |

**🇩🇪 Germany**

| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [DE Common Metrics](./DE-Common-Metrics/DE-Common-Metrics.md) | 29 indicators | Real-time dashboard of German economic indicators |
| [DE Wealth Distribution](./DE-Wealth-Distribution/DE-Wealth-Distribution.md) | Gini 72.4% | Top 10% hold 54% of wealth, bottom 50% just 2.4% |
| [DE Social Mobility](./DE-Social-Mobility/DE-Social-Mobility.md) | 61% vs 14% | Gymnasium rate gap by parental education |
| [DE Democracy Metrics](./DE-Democracy-Metrics/DE-Democracy-Metrics.md) | Multiple | Democratic resilience indicators |
| [DE Social Isolation](./DE-Social-Isolation/) | Multiple | Social connection and loneliness metrics |
| [DE Mental Health](./DE-Mental-Health/) | Multiple | Mental health indicators |
| [DE Wellbeing](./DE-Wellbeing/) | Multiple | Wellbeing indicators |
| [DE World Values](./DE-World-Values/) | Multiple | Value orientations (WVS/EVS) |
| [DE Platform Media](./DE-Platform-Media/) | Multiple | Platform dependency metrics |
| [DE Epistemic Competence](./DE-Epistemic-Competence/) | Multiple | Media literacy indicators |
| [DE Parliament Activity](./DE-Parliament-Activity/) | Multiple | Bundestag legislative activity |
| [DE Lobby Transparency](./DE-Lobby-Transparency/) | Multiple | Lobbyregister metrics |
| [DE Federal Budget](./DE-Federal-Budget/) | Multiple | Bundeshaushalt data |
| [DE Energy Mix](./DE-Energy-Mix/) | Multiple | Energy transition metrics |
| [DE Church Exits](./DE-Church-Exits/) | Multiple | Kirchenaustritte |
| [DE Wastewater](./DE-Wastewater-Surveillance/) | Multiple | SARS-CoV-2 wastewater levels |

**🇪🇺 European Union**

| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [EU Wealth Inequality](./EU-Wealth-Inequality/EU-Wealth-Inequality.md) | Gini 50.8–72.6 | Germany highest wealth Gini in Eurozone, Slovakia lowest |

### Schema Structure

Every `SUMMARY.md` follows this structure:

```markdown
# [Dataset Title]: Executive Summary

## 🎯 BEST ESTIMATE

| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **[Primary Metric]** | **[VALUE]** | [X%] | [DATE] |

**One-liner:** [12 words max - the quotable answer]

**Caveat:** [Single most important limitation]

---

## The Big Picture
[2-3 sentences: What this is, why it matters, major uncertainty]

## Why This Number Matters
[Context for why this metric is important]

## How the Number Is Calculated
[Methodology summary]

## Confidence Assessment
[What we know well vs. what's uncertain]

## Alternative Estimates & Why We Differ
[When applicable: other approaches and why we chose ours]

## Data Sources
[Links to authoritative sources]

## Supporting Documentation
[Links to detailed data files]

## Changelog
[When estimates changed and why]
```

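
Because the best estimate always sits in the first table under the `BEST ESTIMATE` heading, it can be read programmatically. The following is a minimal sketch, not part of the repository: the function name `parse_best_estimate` and the sample values are hypothetical, and it assumes the table keeps the four-column layout shown in the template.

```python
def parse_best_estimate(markdown_text):
    """Return the first data row of the BEST ESTIMATE table as a dict, or None."""
    lines = markdown_text.splitlines()
    # Locate the level-2 heading that opens the BEST ESTIMATE section.
    start = next(
        (i for i, line in enumerate(lines)
         if line.startswith("## ") and "BEST ESTIMATE" in line),
        None,
    )
    if start is None:
        return None

    # Collect the contiguous run of table rows that follows the heading.
    rows = []
    for line in lines[start + 1:]:
        stripped = line.strip()
        if stripped.startswith("|"):
            rows.append(stripped)
        elif rows or stripped.startswith("#"):
            break  # table ended, or next heading reached before any table

    if len(rows) < 3:
        return None  # need header row, separator row, and one data row

    def cells(row):
        # Split on pipes and drop the bold markers around each cell.
        return [c.strip().strip("*").strip() for c in row.strip("|").split("|")]

    return dict(zip(cells(rows[0]), cells(rows[2])))


# Hypothetical sample, for illustration only.
SAMPLE = """# Example Dataset: Executive Summary

## 🎯 BEST ESTIMATE

| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **Example Metric** | **2.5%** | 90% | 2025-01-01 |

**One-liner:** An illustrative one-liner.
"""

estimate = parse_best_estimate(SAMPLE)
```

With the sample above, `estimate["Value"]` is `"2.5%"` and `estimate["Confidence"]` is `"90%"`.
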
### Confidence Level Guidelines

| Level | Percentage | When to Use |
|-------|------------|-------------|
| **Very High** | 95%+ | Official government data, single authoritative source |
| **High** | 85-94% | Multiple corroborating sources, minor definitional variation |
| **Medium** | 65-84% | Extrapolated from good sources, definitional uncertainty |
| **Low** | <65% | Limited data, significant methodological issues |

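
The bands in the table above can be expressed as a small lookup. This is an illustrative sketch: the function name `confidence_level` is hypothetical, and treating each band's lower bound as an inclusive threshold (so that 94.5% counts as High) is an assumption the table does not spell out.

```python
def confidence_level(percent):
    """Map a confidence percentage to the level name used in the guidelines."""
    if not 0 <= percent <= 100:
        raise ValueError(f"confidence must be between 0 and 100, got {percent}")
    # Thresholds are the lower bounds of each band in the guidelines table.
    if percent >= 95:
        return "Very High"
    if percent >= 85:
        return "High"
    if percent >= 65:
        return "Medium"
    return "Low"
```

For example, `confidence_level(90)` returns `"High"` and `confidence_level(64)` returns `"Low"`.
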
### Creating New Datasets

Use the [DATASET-TEMPLATE.md](./DATASET-TEMPLATE.md) when creating new datasets.

**Mandatory Sections:**
1. **🎯 BEST ESTIMATE** - Must be first content section after title
2. **One-liner** - 12 words max, quotable
3. **Caveat** - Single most important limitation
4. **Methodology Summary** - How the estimate was derived
5. **Sources** - Authoritative links
6. **Changelog** - Track revisions with reasons

**Recommended Section:**
- **Alternative Estimates & Why We Differ** - When other estimates exist

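
The mandatory sections above lend themselves to an automated check. The sketch below is one possible approach, not an existing tool in this repository: `check_summary` and `REQUIRED_MARKERS` are hypothetical names, and mapping "Methodology Summary" and "Sources" to the template headings `How the Number Is Calculated` and `Data Sources` is an assumption based on the schema structure shown earlier.

```python
import re

# Assumed mapping from each mandatory section to the marker used in the template.
REQUIRED_MARKERS = {
    "BEST ESTIMATE": r"^## .*BEST ESTIMATE",
    "One-liner": r"^\*\*One-liner:\*\*",
    "Caveat": r"^\*\*Caveat:\*\*",
    "Methodology": r"^## How the Number Is Calculated",
    "Sources": r"^## Data Sources",
    "Changelog": r"^## Changelog",
}


def check_summary(text):
    """Return a list of problems found in a SUMMARY.md; empty means it passes."""
    problems = []
    for name, pattern in REQUIRED_MARKERS.items():
        if not re.search(pattern, text, flags=re.MULTILINE):
            problems.append(f"missing: {name}")

    # BEST ESTIMATE must be the first level-2 section after the title.
    headings = re.findall(r"^## (.+)$", text, flags=re.MULTILINE)
    if headings and "BEST ESTIMATE" not in headings[0]:
        problems.append("BEST ESTIMATE is not the first section")

    # The one-liner is limited to 12 words.
    match = re.search(r"^\*\*One-liner:\*\*[ \t]*(.*)$", text, flags=re.MULTILINE)
    if match and len(match.group(1).split()) > 12:
        problems.append("one-liner exceeds 12 words")
    return problems
```
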

---

## Directory Structure

```
Data/
├── DATASET-TEMPLATE.md                  # Schema template for new datasets
├── README.md                            # This file
├── sources/                             # External data source catalog (DS-00001 – DS-00020)
│
│   # 🇺🇸 US Datasets
├── Bay-Area-COVID-Wastewater/           # COVID wastewater surveillance
├── Knowledge-Worker-Global-Salaries/    # Knowledge economy compensation
├── Pulitzer-Prize-Winners/              # Arts & Letters Pulitzer data
├── US-Common-Metrics/                   # 60+ US economic indicators
├── US-GDP/                              # US GDP data (1929–2025)
├── US-Inflation/                        # CPI/inflation data (1947–2025)
├── US-Presidential-Approval/            # Approval ratings (1937–2025)
│
│   # 🇩🇪 Germany Datasets (DE-Plan)
├── DE-Church-Exits/                     # Kirchenaustritte
├── DE-Common-Metrics/                   # 29 economic/demographic indicators
├── DE-Democracy-Metrics/                # Democratic resilience indicators
├── DE-Energy-Mix/                       # Energy transition (SMARD)
├── DE-Epistemic-Competence/             # Media literacy (PISA/ICILS)
├── DE-Federal-Budget/                   # Bundeshaushalt data
├── DE-Lobby-Transparency/               # Lobbyregister Bundestag
├── DE-Mental-Health/                    # Mental health indicators
├── DE-Parliament-Activity/              # Bundestag legislative activity (DIP)
├── DE-Platform-Media/                   # Platform dependency metrics
├── DE-Social-Isolation/                 # Social connection metrics
├── DE-Social-Mobility/                  # Education stratification
├── DE-Wastewater-Surveillance/          # RKI AMELAG SARS-CoV-2
├── DE-Wealth-Distribution/              # Wealth Gini, top shares, inheritance
├── DE-Wellbeing/                        # Wellbeing indicators
├── DE-World-Values/                     # Value orientations (WVS/EVS)
│
│   # 🇪🇺 EU Datasets
└── EU-Wealth-Inequality/                # Cross-country wealth Gini comparison
```

**Start with the main .md file** in any dataset directory: it gives you the answer first.

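
Locating the main file of each dataset can be scripted. The sketch below rests on an assumption drawn from the Quick Reference links: a dataset's entry point is either `SUMMARY.md` or a Markdown file named after its directory (for example `DE-Common-Metrics/DE-Common-Metrics.md`). The function name `find_entry_files` is hypothetical, and directories with neither file are reported as `None` and are not guessed at.

```python
from pathlib import Path


def find_entry_files(data_root):
    """Map each dataset directory name to its entry Markdown file, or None."""
    root = Path(data_root)
    entries = {}
    for child in sorted(root.iterdir()):
        # Skip plain files, hidden directories, and the source catalog.
        if not child.is_dir() or child.name == "sources" or child.name.startswith("."):
            continue
        # Assumed naming conventions, checked in order of preference.
        candidates = [child / "SUMMARY.md", child / f"{child.name}.md"]
        entries[child.name] = next((c for c in candidates if c.is_file()), None)
    return entries
```
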

---

## Dataset Categories

### Economic Indicators
- **[US GDP](./US-GDP/SUMMARY.md)** - Gross Domestic Product (1929–2025)
- **[US Inflation](./US-Inflation/SUMMARY.md)** - CPI data (1947–2025)
- **[US Common Metrics](./US-Common-Metrics/SUMMARY.md)** - 60+ economic indicators dashboard
- **[DE Common Metrics](./DE-Common-Metrics/DE-Common-Metrics.md)** - 29 German economic/demographic indicators
- **[DE Federal Budget](./DE-Federal-Budget/)** - Bundeshaushalt data
- **[Knowledge Worker Compensation](./Knowledge-Worker-Global-Salaries/SUMMARY.md)** - Global and US compensation estimates

### Wealth & Inequality
- **[DE Wealth Distribution](./DE-Wealth-Distribution/DE-Wealth-Distribution.md)** - Wealth Gini 72.4%, top shares, inheritance
- **[EU Wealth Inequality](./EU-Wealth-Inequality/EU-Wealth-Inequality.md)** - Cross-country Gini comparison (50.8–72.6)
- **[DE Social Mobility](./DE-Social-Mobility/DE-Social-Mobility.md)** - Education stratification by parental background

### Democracy & Governance
- **[DE Democracy Metrics](./DE-Democracy-Metrics/DE-Democracy-Metrics.md)** - Democratic resilience indicators
- **[DE Parliament Activity](./DE-Parliament-Activity/)** - Bundestag legislative activity
- **[DE Lobby Transparency](./DE-Lobby-Transparency/)** - Lobbyregister Bundestag
- **[Presidential Approval](./US-Presidential-Approval/SUMMARY.md)** - US approval ratings (1937–2025)

### Media & Knowledge
- **[DE Platform Media](./DE-Platform-Media/)** - Platform dependency metrics
- **[DE Epistemic Competence](./DE-Epistemic-Competence/)** - Media literacy (PISA/ICILS)
- **[Pulitzer Winners](./Pulitzer-Prize-Winners/SUMMARY.md)** - Arts & Letters awards (1918–2024)

### Social & Health
- **[DE Social Isolation](./DE-Social-Isolation/)** - Social connection metrics
- **[DE Mental Health](./DE-Mental-Health/)** - Mental health indicators
- **[DE Wellbeing](./DE-Wellbeing/)** - Wellbeing indicators
- **[DE World Values](./DE-World-Values/)** - Value orientations (WVS/EVS)
- **[DE Church Exits](./DE-Church-Exits/)** - Kirchenaustritte

### Health & Public Safety
- **[COVID Wastewater (Bay Area)](./Bay-Area-COVID-Wastewater/SUMMARY.md)** - California wastewater surveillance
- **[DE Wastewater Surveillance](./DE-Wastewater-Surveillance/)** - RKI AMELAG SARS-CoV-2

### Energy
- **[DE Energy Mix](./DE-Energy-Mix/)** - Energy transition metrics (SMARD)

---

## Philosophy

**Answer First**: Every dataset puts the best estimate at the top. Don't make people hunt for the number.

**Ground Truth**: All datasets come from authoritative, verifiable sources. We prioritize data quality and transparency over volume.

**Human-Readable + Machine-Parseable**: Data is stored in CSV and Markdown formats, not opaque databases. Anyone (human or AI) can read, understand, and analyze these datasets with minimal friction.

**Confidence-Aware**: Every estimate includes a confidence level. We distinguish between what we know well (95%+) and what's uncertain (below 65%).

**Traceable**: Every number links to its authoritative source. Changes are logged with reasons.

---

## Data Quality Standards

### Mandatory Requirements
- **Confidence level** - Every estimate needs uncertainty bounds
- **Last updated** - When data was most recently validated
- **Source links** - Authoritative URLs for verification
- **Changelog** - Track revisions with reasons

### Quality Indicators
- **Accuracy**: Data from verified, authoritative sources
- **Completeness**: Gaps and missing data documented
- **Timeliness**: Update frequency and freshness noted
- **Transparency**: Methodology documented and reproducible

---

## Contributing Datasets

When adding new datasets:

1. **Use the template** - Start with [DATASET-TEMPLATE.md](./DATASET-TEMPLATE.md)
2. **Answer first** - Create SUMMARY.md with 🎯 BEST ESTIMATE at top
3. **Verify sources** - Use authoritative, primary sources
4. **Set confidence** - Use the confidence level guidelines
5. **Document changes** - Include changelog from day one
6. **Link thoroughly** - Every number should trace to a source

### Anti-Patterns to Avoid

1. **Burying the answer** - Never make someone scroll to find the number
2. **No confidence level** - Every estimate needs uncertainty bounds
3. **Stale dates** - Always show when last validated
4. **Methodology before answer** - People want the answer first
5. **No changelog** - Revisions without history erode trust

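
The "stale dates" anti-pattern can be caught mechanically once the `Last Updated` value is machine-readable. The sketch below makes two assumptions this README does not state: that dates are written in ISO `YYYY-MM-DD` form, and that 180 days is a reasonable staleness threshold. The function name `is_stale` and the parameter `max_age_days` are hypothetical.

```python
from datetime import date


def is_stale(last_updated, today, max_age_days=180):
    """Return True if the validation date is more than max_age_days old.

    last_updated: ISO date string (assumed format, e.g. "2025-01-01").
    today: datetime.date to compare against, passed in so results are repeatable.
    """
    validated = date.fromisoformat(last_updated)
    return (today - validated).days > max_age_days
```
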

---

## Integration with Substrate

Data sources support other Substrate components:

- **Claims** can be backed by datasets with linked evidence
- **Arguments** can reference specific metrics and sources
- **Solutions** can be evaluated using ground-truth indicators
- **Plans** can track progress with authoritative data

---

## Relationship with Research Projects

The Data directory works with `research/` to maintain traceability between research and resulting datasets.

**Research → Data Workflow:**

1. **Input**: Research projects use `Data/sources/` for external APIs
2. **Analysis**: Research performs synthesis and investigation
3. **Output**: Curated datasets stored in `Data/` with SUMMARY.md
4. **Documentation**: Methodology and sources fully documented

**Key Principles:**
- Each dataset includes `source.md` documenting origin
- Research projects document which sources they used
- Bidirectional links maintain complete traceability
- Changes tracked in both research notes and dataset changelogs

---

**Mission**: Build a trusted foundation of ground-truth data to support human understanding and progress.