# Dataset Update Log

This file tracks all updates to the Pulitzer Prize Winners dataset.

## Update Format

Each entry should include:
- **Date**: When the update was made
- **Data Period**: Which time period the new data covers
- **Source**: URL or reference to the data source
- **Changes**: What was added, modified, or corrected
- **Records**: Number of records in dataset

---

## 2025-10-07 - Initial Arts & Letters Dataset Creation

**Data Period**: 1918 to 2024
**Source**: Wikidata SPARQL Query
**URL**: https://query.wikidata.org/
**Scope**: Arts & Letters Categories (Poetry, Drama, General/Special awards)

### Changes
- Created curated dataset with 249 unique Pulitzer Prize winners in Arts & Letters categories
- Fetched data via SPARQL query against Wikidata knowledge base
- Focused on categories with high Wikidata coverage for data quality
- Processed data:
  - Converted date formats to YYYY
  - Simplified category names (removed "Pulitzer Prize for" prefix)
  - Deduplicated entries
  - Removed work titles appearing as winner names
  - Added data_source column
  - Sorted by year (descending) and category
- Created category-specific CSV files:
  - category-poetry.csv (105 winners)
  - category-drama.csv (109 winners)
  - category-general.csv (35 winners)

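
The processing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset: the function name `clean_rows`, the output column names other than `data_source`, and the `"Wikidata"` source value are assumptions, and the input keys simply follow the variable names in the SPARQL query. The step that removed work titles appearing as winner names is left out because this log does not describe how it was done.

```python
from typing import Iterable

# Prefix stripped from category labels, as described in the change list.
PREFIX = "Pulitzer Prize for "


def clean_rows(raw_rows: Iterable[dict]) -> list[dict]:
    """Normalize, deduplicate, and sort raw query rows (hypothetical helper)."""
    seen = set()
    cleaned = []
    for row in raw_rows:
        # Keep only the YYYY part of a timestamp such as 2024-01-01T00:00:00Z.
        # Rows without a date keep an empty year (an assumption).
        year = row.get("awardDate", "")[:4]

        category = row["categoryLabel"]
        if category.startswith(PREFIX):
            category = category[len(PREFIX):]

        # Deduplicate on (winner, year, category); the real key is an assumption.
        key = (row["winnerLabel"], year, category)
        if key in seen:
            continue
        seen.add(key)

        cleaned.append(
            {
                "winner": row["winnerLabel"],
                "year": year,
                "category": category,
                "work": row.get("workLabel", ""),
                "data_source": "Wikidata",  # assumed value for the new column
            }
        )

    # Two stable sorts: category ascending first, then year descending.
    cleaned.sort(key=lambda r: r["category"])
    cleaned.sort(key=lambda r: r["year"], reverse=True)
    return cleaned
```

Because Python's sort is stable, sorting by category first and by year second yields rows ordered by year (newest first) with categories in alphabetical order inside each year.
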
### Records
- **Total Winners**: 249 unique records
- **Year Range**: 1918-2024 (107 years)
- **Categories**: Poetry (105), Drama (109), General/Special (35)
- **Completeness**: High for included categories (an estimated 95% or more coverage of Poetry and Drama)

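
As a quick sanity check on the figures above, the per-category counts should add up to the stated total, and the year range should span 107 years when both endpoints are included. A small sketch, with the constant names chosen here purely for illustration:

```python
# Figures copied from the Records list; the names are illustrative only.
CATEGORY_COUNTS = {"Poetry": 105, "Drama": 109, "General/Special": 35}
FIRST_YEAR, LAST_YEAR = 1918, 2024

total_winners = sum(CATEGORY_COUNTS.values())
years_covered = LAST_YEAR - FIRST_YEAR + 1  # inclusive of both endpoints

print(total_winners, years_covered)  # prints: 249 107
```
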
### Data Quality Notes
- High-quality, curated dataset focusing on Arts & Letters categories
- Poetry and Drama have excellent coverage across all years (1918-2024)
- Journalism categories intentionally excluded (low Wikidata coverage)
- Fiction, History, Biography, Music excluded (incomplete Wikidata coverage)
- Some entries lack work titles (when not available in Wikidata)
- Winners are primarily individuals (authors, playwrights, poets)

### Files Created
- `Pulitzer-Prize-Winners-Arts-Letters-1918-2024.csv` (combined dataset - all categories)
- `category-poetry.csv` (Poetry winners only)
- `category-drama.csv` (Drama winners only)
- `category-general.csv` (General/Special awards only)
- `README.md` (dataset documentation with research methodology)
- `RESOURCES.md` (data sources)
- `UPDATES.md` (this file)

### SPARQL Query Used
```sparql
SELECT ?winner ?winnerLabel ?awardDate ?category ?categoryLabel ?work ?workLabel
WHERE {
  ?winner p:P166 ?awardStatement .
  ?awardStatement ps:P166 ?category .
  ?category (wdt:P279|wdt:P31)* wd:Q46525 .
  OPTIONAL { ?awardStatement pq:P585 ?awardDate . }
  OPTIONAL { ?awardStatement pq:P1686 ?work . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY DESC(?awardDate)
```

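
One way to run a query like this from a script is to send it to the Wikidata Query Service SPARQL endpoint and flatten the JSON results into plain dictionaries. The sketch below uses only the Python standard library. The helper names and the `User-Agent` string are hypothetical, and nothing here is confirmed to be how the dataset was actually fetched.

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query: str) -> urllib.request.Request:
    """Build a GET request asking for SPARQL JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            # Placeholder identifier; replace with a real contact string.
            "User-Agent": "pulitzer-dataset-example/0.1 (hypothetical)",
        },
    )


def flatten_bindings(payload: dict) -> list[dict]:
    """Turn SPARQL JSON bindings into simple {variable: value} rows."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def fetch_rows(query: str) -> list[dict]:
    """Execute the query over the network and return flattened rows."""
    with urllib.request.urlopen(build_request(query), timeout=60) as response:
        return flatten_bindings(json.load(response))
```

Variables bound inside `OPTIONAL` blocks (`?awardDate`, `?work`) are simply absent from a binding when Wikidata has no value, so downstream code should read those keys with `row.get(...)` instead of indexing directly.
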
### Known Limitations
- Not comprehensive (Wikidata does not have all Pulitzer winners)
- Category names simplified for consistency
- Work titles missing for some entries
- Does not distinguish between individual/team/organizational winners
- No finalist data included

### Future Expansion Opportunities
- Add Fiction, History, Biography categories (requires enhanced scraping)
- Add Music category (completes Arts & Letters collection)
- Add Journalism categories (requires pulitzer.org scraping, ~1,400+ winners)
- Add finalist information (available from 1980 onwards)
- Combine with demographic data for representation analysis

---

## Future Updates

New updates will be added above this line in reverse chronological order (newest first).