Files
Daniel Miessler 9066ad477b Add Bay Area COVID wastewater and Pulitzer Prize datasets
Added two comprehensive datasets with full documentation:

1. Bay Area COVID-19 Wastewater Surveillance (2022-2025)
   - California statewide COVID-19 wastewater data
   - 161 weekly data points from CDPH
   - Leading health indicator for viral trends
   - Includes automated update scripts

2. Pulitzer Prize Winners - Arts & Letters (1918-2024)
   - 249 winners across 107 years
   - Poetry, Drama, and General/Special categories
   - High-quality curated data from Wikidata
   - CSV files for each category

Added master Data directory documentation (Data/README.md) describing:
- Data philosophy and quality standards
- All four current datasets
- Contribution guidelines
- File naming conventions

Includes utility commands:
- get-bay-area-covid-status: Analyze current COVID wastewater levels
- get-california-wastewater-data: Fetch latest surveillance data

Updated .gitignore to exclude large raw data files (278MB+).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-16 22:09:43 -07:00

3.5 KiB

Dataset Update Log

This file tracks all updates to the Pulitzer Prize Winners dataset.

Update Format

Each entry should include:

  • Date: When the update was made
  • Data Period: Which time period the new data covers
  • Source: URL or reference to the data source
  • Changes: What was added, modified, or corrected
  • Records: Number of records in dataset

2025-10-07 - Initial Arts & Letters Dataset Creation

Data Period: 1918 to 2024 Source: Wikidata SPARQL Query URL: https://query.wikidata.org/ Scope: Arts & Letters Categories (Poetry, Drama, General/Special awards)

Changes

  • Created curated dataset with 249 unique Pulitzer Prize winners in Arts & Letters categories
  • Fetched data via SPARQL query against Wikidata knowledge base
  • Focused on categories with high Wikidata coverage for data quality
  • Processed data:
    • Converted date formats to YYYY
    • Simplified category names (removed "Pulitzer Prize for" prefix)
    • Deduplicated entries
    • Removed work titles appearing as winner names
    • Added data_source column
    • Sorted by year (descending) and category
  • Created category-specific CSV files:
    • category-poetry.csv (105 winners)
    • category-drama.csv (109 winners)
    • category-general.csv (35 winners)

Records

  • Total Winners: 249 unique records
  • Year Range: 1918-2024 (107 years)
  • Categories: Poetry (105), Drama (109), General/Special (35)
  • Completeness: High for included categories (~95%+ coverage of Poetry and Drama)

Data Quality Notes

  • High-quality, curated dataset focusing on Arts & Letters categories
  • Poetry and Drama have excellent coverage across all years (1918-2024)
  • Journalism categories intentionally excluded (low Wikidata coverage)
  • Fiction, History, Biography, Music excluded (incomplete Wikidata coverage)
  • Some entries lack work titles (when not available in Wikidata)
  • Winners are primarily individuals (authors, playwrights, poets)

Files Created

  • Pulitzer-Prize-Winners-Arts-Letters-1918-2024.csv (combined dataset - all categories)
  • category-poetry.csv (Poetry winners only)
  • category-drama.csv (Drama winners only)
  • category-general.csv (General/Special awards only)
  • README.md (dataset documentation with research methodology)
  • RESOURCES.md (data sources)
  • UPDATES.md (this file)

SPARQL Query Used

SELECT ?winner ?winnerLabel ?awardDate ?category ?categoryLabel ?work ?workLabel
WHERE {
  ?winner p:P166 ?awardStatement .
  ?awardStatement ps:P166 ?category .
  ?category (wdt:P279|wdt:P31)* wd:Q46525 .
  OPTIONAL { ?awardStatement pq:P585 ?awardDate . }
  OPTIONAL { ?awardStatement pq:P1686 ?work . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY DESC(?awardDate)

Known Limitations

  • Not comprehensive (Wikidata does not have all Pulitzer winners)
  • Category names simplified for consistency
  • Work titles missing for some entries
  • Does not distinguish between individual/team/organizational winners
  • No finalist data included

Future Expansion Opportunities

  • Add Fiction, History, Biography categories (requires enhanced scraping)
  • Add Music category (completes Arts & Letters collection)
  • Add Journalism categories (requires pulitzer.org scraping, ~1,400+ winners)
  • Add finalist information (available from 1980 onwards)
  • Combine with demographic data for representation analysis

Future Updates

New updates will be added above this line in reverse chronological order (newest first).