Added two comprehensive datasets with full documentation:

1. Bay Area COVID-19 Wastewater Surveillance (2022-2025)
   - California statewide COVID-19 wastewater data
   - 161 weekly data points from CDPH
   - Leading health indicator for viral trends
   - Includes automated update scripts
2. Pulitzer Prize Winners - Arts & Letters (1918-2024)
   - 249 winners across 107 years
   - Poetry, Drama, and General/Special categories
   - High-quality curated data from Wikidata
   - CSV files for each category

Added master Data directory documentation (Data/README.md) describing:
- Data philosophy and quality standards
- All four current datasets
- Contribution guidelines
- File naming conventions

Includes utility commands:
- get-bay-area-covid-status: Analyze current COVID wastewater levels
- get-california-wastewater-data: Fetch latest surveillance data

Updated .gitignore to exclude large raw data files (278MB+).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Dataset Update Log
This file tracks all updates to the Pulitzer Prize Winners dataset.
Update Format
Each entry should include:
- Date: When the update was made
- Data Period: Which time period the new data covers
- Source: URL or reference to the data source
- Changes: What was added, modified, or corrected
- Records: Number of records in dataset
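As an illustration of the format above, a small check can flag entries that omit a required field. This is only a sketch: the `REQUIRED_FIELDS` tuple mirrors the list above, while the `missing_fields` helper and the dictionary representation of an entry are hypothetical and not part of this repository.

```python
# Field names taken from the "Update Format" list above.
REQUIRED_FIELDS = ("Date", "Data Period", "Source", "Changes", "Records")


def missing_fields(entry):
    """Return the required fields that are absent or empty in an entry.

    `entry` is assumed to be a plain dict keyed by field name; this
    representation is an assumption made for the example.
    """
    return [name for name in REQUIRED_FIELDS if not entry.get(name)]


example_entry = {
    "Date": "2025-10-07",
    "Data Period": "1918 to 2024",
    "Source": "https://query.wikidata.org/",
    "Changes": ["Created curated dataset"],
    # "Records" deliberately left out to show the check.
}
print(missing_fields(example_entry))
```

An empty list means the entry carries every field the format asks for.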
2025-10-07 - Initial Arts & Letters Dataset Creation
- Data Period: 1918 to 2024
- Source: Wikidata SPARQL Query
- URL: https://query.wikidata.org/
- Scope: Arts & Letters Categories (Poetry, Drama, General/Special awards)
Changes
- Created curated dataset with 249 unique Pulitzer Prize winners in Arts & Letters categories
- Fetched data via SPARQL query against Wikidata knowledge base
- Focused on categories with high Wikidata coverage for data quality
- Processed data:
  - Converted date formats to YYYY
  - Simplified category names (removed "Pulitzer Prize for" prefix)
  - Deduplicated entries
  - Removed work titles appearing as winner names
  - Added data_source column
  - Sorted by year (descending) and category
- Created category-specific CSV files:
  - category-poetry.csv (105 winners)
  - category-drama.csv (109 winners)
  - category-general.csv (35 winners)
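The processing steps above could be sketched roughly as follows. This is not the script that produced the dataset. The function names, the output column names other than `data_source`, and the deduplication key are assumptions. The rule for spotting work titles in the winner column (dropping any winner whose label also appears as a work label) is likewise a guess at one plausible heuristic.

```python
PREFIX = "Pulitzer Prize for "


def clean_rows(raw_rows, data_source="Wikidata"):
    """Clean raw query rows (dicts keyed by SPARQL variable name)."""
    # Assumed heuristic: a label that appears as a work title anywhere
    # in the data is treated as a work, not a person.
    work_titles = {(r.get("workLabel") or "").strip() for r in raw_rows}
    work_titles.discard("")

    seen = set()
    cleaned = []
    for row in raw_rows:
        winner = row["winnerLabel"].strip()
        if winner in work_titles:
            continue  # work title appearing as a winner name
        # Keep only YYYY from a timestamp such as 2024-01-01T00:00:00Z.
        year = (row.get("awardDate") or "")[:4]
        category = row["categoryLabel"].strip()
        if category.startswith(PREFIX):
            category = category[len(PREFIX):]
        key = (year, category, winner)  # assumed deduplication key
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "year": year,
            "category": category,
            "winner": winner,
            "work": (row.get("workLabel") or "").strip(),
            "data_source": data_source,
        })
    # Two stable sorts: category ascending first, then year descending.
    cleaned.sort(key=lambda r: r["category"])
    cleaned.sort(key=lambda r: r["year"], reverse=True)
    return cleaned


def split_by_category(rows):
    """Group cleaned rows by category, one group per category CSV."""
    groups = {}
    for row in rows:
        groups.setdefault(row["category"], []).append(row)
    return groups
```

Each group returned by `split_by_category` could then be written out with `csv.DictWriter` to produce the per-category files.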
Records
- Total Winners: 249 unique records
- Year Range: 1918-2024 (107 years)
- Categories: Poetry (105), Drama (109), General/Special (35)
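The totals above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in this entry; the `summarize` helper is a hypothetical name introduced for the example.

```python
# Figures copied from the Records list above.
CATEGORY_COUNTS = {"Poetry": 105, "Drama": 109, "General/Special": 35}
FIRST_YEAR, LAST_YEAR = 1918, 2024


def summarize(counts, first_year, last_year):
    """Return the total record count and the inclusive year span."""
    return {
        "total": sum(counts.values()),
        "years": last_year - first_year + 1,  # both endpoints included
    }


summary = summarize(CATEGORY_COUNTS, FIRST_YEAR, LAST_YEAR)
print(summary)  # {'total': 249, 'years': 107}
```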
- Completeness: High for included categories (roughly 95% or more of Poetry and Drama winners covered)
Data Quality Notes
- High-quality, curated dataset focusing on Arts & Letters categories
- Poetry and Drama have excellent coverage across all years (1918-2024)
- Journalism categories intentionally excluded (low Wikidata coverage)
- Fiction, History, Biography, Music excluded (incomplete Wikidata coverage)
- Some entries lack work titles (when not available in Wikidata)
- Winners are primarily individuals (authors, playwrights, poets)
Files Created
- Pulitzer-Prize-Winners-Arts-Letters-1918-2024.csv (combined dataset - all categories)
- category-poetry.csv (Poetry winners only)
- category-drama.csv (Drama winners only)
- category-general.csv (General/Special awards only)
- README.md (dataset documentation with research methodology)
- RESOURCES.md (data sources)
- UPDATES.md (this file)
SPARQL Query Used
SELECT ?winner ?winnerLabel ?awardDate ?category ?categoryLabel ?work ?workLabel
WHERE {
  ?winner p:P166 ?awardStatement .
  ?awardStatement ps:P166 ?category .
  ?category (wdt:P279|wdt:P31)* wd:Q46525 .
  OPTIONAL { ?awardStatement pq:P585 ?awardDate . }
  OPTIONAL { ?awardStatement pq:P1686 ?work . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY DESC(?awardDate)
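One way to run a query like the one above from Python is sketched below, using only the standard library. This is not necessarily how the dataset was fetched. It assumes the public endpoint `https://query.wikidata.org/sparql` and the standard SPARQL JSON results layout; the function names and the `User-Agent` string are placeholders invented for the example.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="pulitzer-dataset-example/0.1"):
    """Build a GET request asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        ENDPOINT + "?" + params,
        headers={
            "User-Agent": user_agent,  # placeholder identifier
            "Accept": "application/sparql-results+json",
        },
    )


def flatten_bindings(payload):
    """Turn SPARQL JSON results into a list of {variable: value} dicts.

    Unbound OPTIONAL variables (awardDate, work) are simply absent
    from a row, so callers should use .get() for them.
    """
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query, timeout=60):
    """Send the query over the network and return flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return flatten_bindings(json.load(resp))
```

Calling `run_query` with the query text above would return one dictionary per result row, with keys such as `winnerLabel` and `categoryLabel`.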
Known Limitations
- Not comprehensive (Wikidata does not have all Pulitzer winners)
- Category names simplified for consistency
- Work titles missing for some entries
- Does not distinguish between individual/team/organizational winners
- No finalist data included
Future Expansion Opportunities
- Add Fiction, History, Biography categories (requires enhanced scraping)
- Add Music category (completes Arts & Letters collection)
- Add Journalism categories (requires pulitzer.org scraping, ~1,400+ winners)
- Add finalist information (available from 1980 onwards)
- Combine with demographic data for representation analysis
Future Updates
New updates will be added above this line in reverse chronological order (newest first).