# Data

## Purpose

The Data directory contains curated, ground-truth datasets about important aspects of human life, society, and progress, along with documentation for external data sources. It is a collection of reliable, parseable data for analysis, research, and informed decision-making.

---

## 🎯 "Answer First" Schema

**All Substrate datasets follow the "Answer First" schema.** Every dataset has a `SUMMARY.md` file that puts the best estimate at the top.

### Quick Reference

**🇺🇸 United States**

| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [Knowledge Worker Compensation](./Knowledge-Worker-Global-Salaries/SUMMARY.md) | $35-50T global, $6-12T US | Global knowledge workers earn $35-50T annually |
| [US GDP](./US-GDP/SUMMARY.md) | $23.77T (Q2 2025) | U.S. real GDP is $23.77T, growing at a 3.8% annualized rate |
| [US Inflation](./US-Inflation/SUMMARY.md) | 2.5% YoY | U.S. inflation is ~2.5% with CPI at 323.4 |
| [Presidential Approval](./US-Presidential-Approval/SUMMARY.md) | ~41% (Trump Nov 2025) | Trump approval averages ~41% (net -13) |
| [COVID Wastewater](./Bay-Area-COVID-Wastewater/SUMMARY.md) | HIGH (5.6 log10) | California COVID wastewater is HIGH |
| [US Common Metrics](./US-Common-Metrics/SUMMARY.md) | 60+ indicators | Real-time dashboard of U.S. economic indicators |
| [Pulitzer Winners](./Pulitzer-Prize-Winners/SUMMARY.md) | 249 winners | Complete Arts & Letters database (1918–2024) |

**🇩🇪 Germany**

| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [DE Common Metrics](./DE-Common-Metrics/DE-Common-Metrics.md) | 29 indicators | Real-time dashboard of German economic indicators |
| [DE Wealth Distribution](./DE-Wealth-Distribution/DE-Wealth-Distribution.md) | Gini 72.4% | Top 10% hold 54% of wealth, bottom 50% just 2.4% |
| [DE Social Mobility](./DE-Social-Mobility/DE-Social-Mobility.md) | 61% vs 14% | Gymnasium rate gap by parental education |
| [DE Democracy Metrics](./DE-Democracy-Metrics/DE-Democracy-Metrics.md) | Multiple | Democratic resilience indicators |
| [DE Social Isolation](./DE-Social-Isolation/) | Multiple | Social connection and loneliness metrics |
| [DE Mental Health](./DE-Mental-Health/) | Multiple | Mental health indicators |
| [DE Wellbeing](./DE-Wellbeing/) | Multiple | Wellbeing indicators |
| [DE World Values](./DE-World-Values/) | Multiple | Value orientations (WVS/EVS) |
| [DE Platform Media](./DE-Platform-Media/) | Multiple | Platform dependency metrics |
| [DE Epistemic Competence](./DE-Epistemic-Competence/) | Multiple | Media literacy indicators |
| [DE Parliament Activity](./DE-Parliament-Activity/) | Multiple | Bundestag legislative activity |
| [DE Lobby Transparency](./DE-Lobby-Transparency/) | Multiple | Lobbyregister metrics |
| [DE Federal Budget](./DE-Federal-Budget/) | Multiple | Bundeshaushalt data |
| [DE Energy Mix](./DE-Energy-Mix/) | Multiple | Energy transition metrics |
| [DE Church Exits](./DE-Church-Exits/) | Multiple | Kirchenaustritte |
| [DE Wastewater](./DE-Wastewater-Surveillance/) | Multiple | SARS-CoV-2 wastewater levels |

**🇪🇺 European Union**

| Dataset | Best Estimate | One-liner |
|---------|--------------|-----------|
| [EU Wealth Inequality](./EU-Wealth-Inequality/EU-Wealth-Inequality.md) | Gini 50.8–72.6 | Germany highest wealth Gini in Eurozone, Slovakia lowest |

### Schema Structure

Every `SUMMARY.md` follows this structure:

```markdown
# [Dataset Title]: Executive Summary

## 🎯 BEST ESTIMATE

| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **[Primary Metric]** | **[VALUE]** | [X%] | [DATE] |

**One-liner:** [12 words max - the quotable answer]

**Caveat:** [Single most important limitation]

---

## The Big Picture
[2-3 sentences: What this is, why it matters, major uncertainty]

## Why This Number Matters
[Context for why this metric is important]

## How the Number Is Calculated
[Methodology summary]

## Confidence Assessment
[What we know well vs. what's uncertain]

## Alternative Estimates & Why We Differ
[When applicable: other approaches and why we chose ours]

## Data Sources
[Links to authoritative sources]

## Supporting Documentation
[Links to detailed data files]

## Changelog
[When estimates changed and why]
```

### Confidence Level Guidelines

| Level | Percentage | When to Use |
|-------|------------|-------------|
| **Very High** | 95%+ | Official government data, single authoritative source |
| **High** | 85-94% | Multiple corroborating sources, minor definitional variation |
| **Medium** | 65-84% | Extrapolated from good sources, definitional uncertainty |
| **Low** | <65% | Limited data, significant methodological issues |

### Creating New Datasets

Use [DATASET-TEMPLATE.md](./DATASET-TEMPLATE.md) when creating new datasets.

**Mandatory Sections:**

1. **🎯 BEST ESTIMATE** - Must be the first content section after the title
2. **One-liner** - 12 words max, quotable
3. **Caveat** - Single most important limitation
4. **Methodology Summary** - How the estimate was derived
5. **Sources** - Authoritative links
6. **Changelog** - Track revisions with reasons

**Recommended Section:**

- **Alternative Estimates & Why We Differ** - When other estimates exist

---

## Directory Structure

```
Data/
├── DATASET-TEMPLATE.md                 # Schema template for new datasets
├── README.md                           # This file
├── sources/                            # External data source catalog (DS-00001 – DS-00020)
│
│   # 🇺🇸 US Datasets
├── Bay-Area-COVID-Wastewater/          # COVID wastewater surveillance
├── Knowledge-Worker-Global-Salaries/   # Knowledge economy compensation
├── Pulitzer-Prize-Winners/             # Arts & Letters Pulitzer data
├── US-Common-Metrics/                  # 60+ US economic indicators
├── US-GDP/                             # US GDP data (1929–2025)
├── US-Inflation/                       # CPI/inflation data (1947–2025)
├── US-Presidential-Approval/           # Approval ratings (1937–2025)
│
│   # 🇩🇪 Germany Datasets (DE-Plan)
├── DE-Church-Exits/                    # Kirchenaustritte
├── DE-Common-Metrics/                  # 29 economic/demographic indicators
├── DE-Democracy-Metrics/               # Democratic resilience indicators
├── DE-Energy-Mix/                      # Energy transition (SMARD)
├── DE-Epistemic-Competence/            # Media literacy (PISA/ICILS)
├── DE-Federal-Budget/                  # Bundeshaushalt data
├── DE-Lobby-Transparency/              # Lobbyregister Bundestag
├── DE-Mental-Health/                   # Mental health indicators
├── DE-Parliament-Activity/             # Bundestag legislative activity (DIP)
├── DE-Platform-Media/                  # Platform dependency metrics
├── DE-Social-Isolation/                # Social connection metrics
├── DE-Social-Mobility/                 # Education stratification
├── DE-Wastewater-Surveillance/         # RKI AMELAG SARS-CoV-2
├── DE-Wealth-Distribution/             # Wealth Gini, top shares, inheritance
├── DE-Wellbeing/                       # Wellbeing indicators
├── DE-World-Values/                    # Value orientations (WVS/EVS)
│
│   # 🇪🇺 EU Datasets
└── EU-Wealth-Inequality/               # Cross-country wealth Gini comparison
```

**Start with the main .md file** in any dataset directory: it gives you the answer first.

---

## Dataset Categories

### Economic Indicators

- **[US GDP](./US-GDP/SUMMARY.md)** - Gross Domestic Product (1929–2025)
- **[US Inflation](./US-Inflation/SUMMARY.md)** - CPI data (1947–2025)
- **[US Common Metrics](./US-Common-Metrics/SUMMARY.md)** - 60+ economic indicators dashboard
- **[DE Common Metrics](./DE-Common-Metrics/DE-Common-Metrics.md)** - 29 German economic/demographic indicators
- **[DE Federal Budget](./DE-Federal-Budget/)** - Bundeshaushalt data
- **[Knowledge Worker Compensation](./Knowledge-Worker-Global-Salaries/SUMMARY.md)** - Global and US compensation estimates

### Wealth & Inequality

- **[DE Wealth Distribution](./DE-Wealth-Distribution/DE-Wealth-Distribution.md)** - Wealth Gini 72.4%, top shares, inheritance
- **[EU Wealth Inequality](./EU-Wealth-Inequality/EU-Wealth-Inequality.md)** - Cross-country Gini comparison (50.8–72.6)
- **[DE Social Mobility](./DE-Social-Mobility/DE-Social-Mobility.md)** - Education stratification by parental background

### Democracy & Governance

- **[DE Democracy Metrics](./DE-Democracy-Metrics/DE-Democracy-Metrics.md)** - Democratic resilience indicators
- **[DE Parliament Activity](./DE-Parliament-Activity/)** - Bundestag legislative activity
- **[DE Lobby Transparency](./DE-Lobby-Transparency/)** - Lobbyregister Bundestag
- **[Presidential Approval](./US-Presidential-Approval/SUMMARY.md)** - US approval ratings (1937–2025)

### Media & Knowledge

- **[DE Platform Media](./DE-Platform-Media/)** - Platform dependency metrics
- **[DE Epistemic Competence](./DE-Epistemic-Competence/)** - Media literacy (PISA/ICILS)
- **[Pulitzer Winners](./Pulitzer-Prize-Winners/SUMMARY.md)** - Arts & Letters awards (1918–2024)

### Social & Health

- **[DE Social Isolation](./DE-Social-Isolation/)** - Social connection metrics
- **[DE Mental Health](./DE-Mental-Health/)** - Mental health indicators
- **[DE Wellbeing](./DE-Wellbeing/)** - Wellbeing indicators
- **[DE World Values](./DE-World-Values/)** - Value orientations (WVS/EVS)
- **[DE Church Exits](./DE-Church-Exits/)** - Kirchenaustritte

### Health & Public Safety

- **[COVID Wastewater (Bay Area)](./Bay-Area-COVID-Wastewater/SUMMARY.md)** - California wastewater surveillance
- **[DE Wastewater Surveillance](./DE-Wastewater-Surveillance/)** - RKI AMELAG SARS-CoV-2

### Energy

- **[DE Energy Mix](./DE-Energy-Mix/)** - Energy transition metrics (SMARD)

---

## Philosophy

**Answer First**: Every dataset puts the best estimate at the top. Don't make people hunt for the number.

**Ground Truth**: All datasets come from authoritative, verifiable sources. We prioritize data quality and transparency over volume.

**Human-Readable + Machine-Parseable**: Data is stored in CSV and Markdown formats, with no opaque databases. Anyone (human or AI) can read, understand, and analyze these datasets with minimal friction.

**Confidence-Aware**: Every estimate includes a confidence level. We distinguish between what we know well (95%+) and what is uncertain (below 65%), following the Confidence Level Guidelines.

**Traceable**: Every number links to its authoritative source. Changes are logged with reasons.

---

## Data Quality Standards

### Mandatory Requirements

- **Confidence level** - Every estimate needs uncertainty bounds
- **Last updated** - When data was most recently validated
- **Source links** - Authoritative URLs for verification
- **Changelog** - Track revisions with reasons

### Quality Indicators

- **Accuracy**: Data from verified, authoritative sources
- **Completeness**: Gaps and missing data documented
- **Timeliness**: Update frequency and freshness noted
- **Transparency**: Methodology documented and reproducible

---

## Contributing Datasets

When adding new datasets:

1. **Use the template** - Start with [DATASET-TEMPLATE.md](./DATASET-TEMPLATE.md)
2. **Answer first** - Create `SUMMARY.md` with 🎯 BEST ESTIMATE at the top
3. **Verify sources** - Use authoritative, primary sources
4. **Set confidence** - Use the confidence level guidelines
5. **Document changes** - Include a changelog from day one
6. **Link thoroughly** - Every number should trace to a source

### Anti-Patterns to Avoid

1. **Burying the answer** - Never make someone scroll to find the number
2. **No confidence level** - Every estimate needs uncertainty bounds
3. **Stale dates** - Always show when the data was last validated
4. **Methodology before answer** - People want the answer first
5. **No changelog** - Revisions without history erode trust

---

## Integration with Substrate

Datasets support other Substrate components:

- **Claims** can be backed by datasets with linked evidence
- **Arguments** can reference specific metrics and sources
- **Solutions** can be evaluated using ground-truth indicators
- **Plans** can track progress with authoritative data

---

## Relationship with Research Projects

The Data directory works with `research/` to maintain traceability between research and the resulting datasets.

**Research → Data Workflow:**

1. **Input**: Research projects use `Data/sources/` for external APIs
2. **Analysis**: Research performs synthesis and investigation
3. **Output**: Curated datasets are stored in `Data/` with a `SUMMARY.md`
4. **Documentation**: Methodology and sources are fully documented

**Key Principles:**

- Each dataset includes a `source.md` documenting its origin
- Research projects document which sources they used
- Bidirectional links maintain complete traceability
- Changes are tracked in both research notes and dataset changelogs

---

**Mission**: Build a trusted foundation of ground-truth data to support human understanding and progress.
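
## Example: Reading a Best Estimate

Because every `SUMMARY.md` is meant to put a four-column BEST ESTIMATE table first, the headline number can in principle be pulled out with a few lines of code. The sketch below is illustrative only and is not part of Substrate's tooling: the function names `confidence_level` and `parse_best_estimate`, the returned dictionary keys, and the sample text (whose values are made up) are all assumptions. The thresholds follow the Confidence Level Guidelines table, with fractional values such as 94.5% treated as falling into the lower band.

```python
import re
from typing import Optional


def confidence_level(percent: float) -> str:
    """Map a confidence percentage to a level name from the guidelines table."""
    if percent >= 95:
        return "Very High"
    if percent >= 85:
        return "High"
    if percent >= 65:
        return "Medium"
    return "Low"


def parse_best_estimate(summary_text: str) -> Optional[dict]:
    """Return the first data row under the BEST ESTIMATE heading, or None.

    Assumes the template's column order:
    Metric | Value | Confidence | Last Updated.
    """
    in_section = False
    for line in summary_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Track whether we are inside the BEST ESTIMATE section.
            in_section = "BEST ESTIMATE" in stripped
            continue
        if not (in_section and stripped.startswith("|")):
            continue
        # Split the row into cells and drop Markdown bold markers.
        cells = [c.strip().strip("*").strip() for c in stripped.strip("|").split("|")]
        if len(cells) != 4:
            continue
        if cells[0] == "Metric" or set(cells[0]) <= set("-: "):
            continue  # header row or |---| separator row
        match = re.search(r"(\d+(?:\.\d+)?)\s*%", cells[2])
        percent = float(match.group(1)) if match else None
        return {
            "metric": cells[0],
            "value": cells[1],
            "confidence_percent": percent,
            "confidence_level": confidence_level(percent) if percent is not None else None,
            "last_updated": cells[3],
        }
    return None


# Hypothetical sample; the metric, value, and date are placeholders, not real data.
SAMPLE_SUMMARY = """# Example Dataset: Executive Summary

## 🎯 BEST ESTIMATE

| Metric | Value | Confidence | Last Updated |
|--------|-------|------------|--------------|
| **Example Metric** | **42 units** | 90% | 2025-01-01 |

**One-liner:** Placeholder one-liner.
"""

if __name__ == "__main__":
    print(parse_best_estimate(SAMPLE_SUMMARY))
```

A parser like this only works for datasets that actually follow the template; rows whose cells contain a literal `|` or a non-percentage confidence value would need extra handling.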