Files
Daniel Miessler 9066ad477b Add Bay Area COVID wastewater and Pulitzer Prize datasets
Added two comprehensive datasets with full documentation:

1. Bay Area COVID-19 Wastewater Surveillance (2022-2025)
   - California statewide COVID-19 wastewater data
   - 161 weekly data points from CDPH
   - Leading health indicator for viral trends
   - Includes automated update scripts

2. Pulitzer Prize Winners - Arts & Letters (1918-2024)
   - 249 winners across 107 years
   - Poetry, Drama, and General/Special categories
   - High-quality curated data from Wikidata
   - CSV files for each category

Added master Data directory documentation (Data/README.md) describing:
- Data philosophy and quality standards
- All four current datasets
- Contribution guidelines
- File naming conventions

Includes utility commands:
- get-bay-area-covid-status: Analyze current COVID wastewater levels
- get-california-wastewater-data: Fetch latest surveillance data

Updated .gitignore to exclude large raw data files (278MB+).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-16 22:09:43 -07:00

5.2 KiB

Bay Area COVID-19 Wastewater Surveillance Dataset

Overview

This directory contains ground-truth COVID-19 wastewater surveillance data for California (which serves as a proxy for the San Francisco Bay Area). Wastewater monitoring is a leading indicator for disease trends, typically showing viral activity 4-7 days before clinical testing reports.

What's Inside

  • COVID-Wastewater-California-Statewide-2022-2025.csv - Main dataset (161 weekly data points)
  • COVID-Wastewater-SF-Bay-Area-2023-2025.md - Detailed metadata and research documentation
  • README.md - This file
  • UPDATES.md - Change log for data updates

Data Source Research

How This Source Was Identified

I conducted comprehensive parallel research using multiple search strategies:

  1. Research Process:

    • Identified wastewater surveillance as the gold standard for population-level COVID monitoring
    • Searched for authoritative government and academic sources
    • Evaluated California Department of Public Health (CDPH), CDC NWSS, and WastewaterSCAN
    • Verified data accessibility, update frequency, and format quality
  2. Primary Source Selected: California Department of Public Health (CDPH)

  3. Alternative Sources Evaluated:

Why This Source Is Reputable

Authority & Credibility

  1. Official Government Source

    • Published by California Department of Public Health
    • Part of California's official public health surveillance infrastructure
    • Data used by state decision-makers for policy and resource allocation
  2. Scientific Rigor

    • Uses validated qPCR and ddPCR detection methods
    • Data collected from 12+ wastewater treatment plants across Bay Area
    • Flow-adjusted and PMMoV-normalized for accuracy
    • Peer-reviewed methodology
  3. Transparency

    • Public domain data (U.S. government)
    • Direct CSV download available
    • Clear data dictionary and methodology documentation
    • Weekly updates every Friday
  4. Reliability Indicators

    • Temporal Consistency: Uninterrupted weekly updates since 2022
    • Geographic Coverage: Bay Area counties (SF, Alameda, Santa Clara, Contra Costa, Marin, San Mateo)
    • Multiple Sites: Redundant sampling across 12+ treatment plants
    • Validation: Cross-referenced with CDC NWSS and clinical data trends
  5. Leading Indicator Status

    • Wastewater shows trends 4-7 days before clinical testing
    • Captures all cases: symptomatic, asymptomatic, unreported
    • Population-level surveillance (not subject to testing bias)

Dataset Specifications

Coverage

  • Geographic: California Statewide (includes all Bay Area counties)
  • Temporal: July 2022 - August 2025 (ongoing)
  • Frequency: Weekly updates (data released Fridays)

Metrics

  • Primary Measurement: SARS-CoV-2 viral gene copies per milliliter
  • Format: Log10 transformed concentration values
  • Units: log10(copies/mL)

Data Quality

  • Completeness: 161/161 weeks (100% coverage)
  • Reliability: High (government source, multiple sampling sites)
  • Timeliness: Weekly updates maintained consistently
  • Accessibility: Direct CSV download, no authentication required

Geographic Context

Bay Area Counties Monitored

  • San Francisco
  • Alameda (EBMUD)
  • Santa Clara
  • Contra Costa
  • Marin (6 sites)
  • San Mateo

Major Treatment Plants

  • East Bay Municipal Utility District (EBMUD)
  • Central Marin Sanitation Agency
  • Novato Sanitary District
  • Plus 9+ additional sites

Use Cases

This dataset supports:

  • Public Health Analysis: Monitoring disease trends and outbreak detection
  • Policy Research: Evidence-based decision-making for health interventions
  • Trend Analysis: Understanding seasonal patterns and variant emergence
  • Academic Research: Population-level epidemiology studies
  • Substrate Integration: Supporting Claims, Arguments, and Solutions with ground-truth data

Data Interpretation Notes

  1. Log Scale: Values are log10 transformed - each unit increase = 10x viral load
  2. Relative Trends: Focus on directional changes, not absolute values
  3. Seasonal Context: Winter peaks typically higher due to indoor transmission
  4. Leading Indicator: Wastewater rises 4-7 days before case counts
  5. Population-Level: Represents community spread, not individual cases

Current Status (as of 2025-10-07)

  • Latest Reading: 5.60 log10 copies/mL (Week ending 2025-08-02)
  • Trend: Elevated and increasing from spring lows
  • Context: HIGH wastewater activity across California
  • Historical Peak: 18.97 log10 (Week ending 2022-07-09)
  • Recent Low: 1.60 log10 (Week ending 2025-03-15)

Maintenance

See UPDATES.md for detailed change log of data refreshes and updates.


Last Updated: 2025-10-07 Maintained By: Substrate Data Curation Update Frequency: Check weekly for new data (Fridays)