Daniel Miessler 9066ad477b Add Bay Area COVID wastewater and Pulitzer Prize datasets
Added two comprehensive datasets with full documentation:

1. Bay Area COVID-19 Wastewater Surveillance (2022-2025)
   - California statewide COVID-19 wastewater data
   - 161 weekly data points from CDPH
   - Leading health indicator for viral trends
   - Includes automated update scripts

2. Pulitzer Prize Winners - Arts & Letters (1918-2024)
   - 249 winners across 107 years
   - Poetry, Drama, and General/Special categories
   - High-quality curated data from Wikidata
   - CSV files for each category

Added master Data directory documentation (Data/README.md) describing:
- Data philosophy and quality standards
- All four current datasets
- Contribution guidelines
- File naming conventions

Includes utility commands:
- get-bay-area-covid-status: Analyze current COVID wastewater levels
- get-california-wastewater-data: Fetch latest surveillance data

Updated .gitignore to exclude large raw data files (278MB+).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-10-16 22:09:43 -07:00

# Dataset Update Log
This file tracks all updates to the Pulitzer Prize Winners dataset.
## Update Format
Each entry should include:
- **Date**: When the update was made
- **Data Period**: Which time period the new data covers
- **Source**: URL or reference to the data source
- **Changes**: What was added, modified, or corrected
- **Records**: Number of records in dataset
---
## 2025-10-07 - Initial Arts & Letters Dataset Creation
- **Data Period**: 1918 to 2024
- **Source**: Wikidata SPARQL Query
- **URL**: https://query.wikidata.org/
- **Scope**: Arts & Letters Categories (Poetry, Drama, General/Special awards)
### Changes
- Created curated dataset with 249 unique Pulitzer Prize winners in Arts & Letters categories
- Fetched data via SPARQL query against Wikidata knowledge base
- Focused on categories with high Wikidata coverage for data quality
- Processed data:
  - Converted date formats to YYYY
  - Simplified category names (removed "Pulitzer Prize for" prefix)
  - Deduplicated entries
  - Removed work titles appearing as winner names
  - Added data_source column
  - Sorted by year (descending) and category
- Created category-specific CSV files:
  - `category-poetry.csv` (105 winners)
  - `category-drama.csv` (109 winners)
  - `category-general.csv` (35 winners)
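
The processing steps listed above could be sketched roughly as follows. This is a minimal illustration, not the script actually used: the input keys (`winnerLabel`, `awardDate`, `categoryLabel`, `workLabel`) mirror the SPARQL variable names, while the output column names, the `"Wikidata"` source value, and the deduplication key are assumptions. The step that removes work titles appearing as winner names is not covered.

```python
PREFIX = "Pulitzer Prize for "  # prefix stripped from category names


def clean_rows(rows, data_source="Wikidata"):
    """Apply the cleanup steps to raw query rows (a list of dicts).

    Output column names and the data_source value are hypothetical.
    """
    seen = set()
    cleaned = []
    for row in rows:
        # "2024-01-01T00:00:00Z" -> "2024"; an absent date becomes "".
        year = row.get("awardDate", "")[:4]
        category = row["categoryLabel"]
        if category.startswith(PREFIX):
            category = category[len(PREFIX):]
        # Assumed dedup key: same winner, year, and category.
        key = (row["winnerLabel"], year, category)
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({
            "year": year,
            "category": category,
            "winner": row["winnerLabel"],
            "work": row.get("workLabel", ""),
            "data_source": data_source,
        })
    # Two stable sorts: category ascending first, then year descending.
    cleaned.sort(key=lambda r: r["category"])
    cleaned.sort(key=lambda r: r["year"], reverse=True)
    return cleaned
```

Sorting in two stable passes keeps categories in alphabetical order within each year without having to convert the year to a number.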
### Records
- **Total Winners**: 249 unique records
- **Year Range**: 1918-2024 (107 years)
- **Categories**: Poetry (105), Drama (109), General/Special (35)
- **Completeness**: High for included categories (estimated 95% or better coverage of Poetry and Drama)
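
The record counts above can be checked against the combined CSV with a short script. The sketch below is illustrative only: the `category` column name and the exact category labels stored in the file (in particular `"General/Special"`) are assumptions and may differ from the real data.

```python
import csv
import io
from collections import Counter

# Counts as stated in this log entry; label spellings are assumed.
EXPECTED = {"Poetry": 105, "Drama": 109, "General/Special": 35}


def count_by_category(csv_text, column="category"):
    """Count rows per category in CSV text (column name is hypothetical)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[column] for row in reader)


def mismatches(actual, expected=EXPECTED):
    """Return {category: (actual_count, expected_count)} where they differ."""
    return {
        name: (actual.get(name, 0), want)
        for name, want in expected.items()
        if actual.get(name, 0) != want
    }


# The per-category figures add up to the stated total and year span.
assert sum(EXPECTED.values()) == 249
assert 2024 - 1918 + 1 == 107
```

To use it, read the combined CSV into a string, pass it to `count_by_category`, and hand the result to `mismatches`; an empty dict means the file agrees with the figures logged here.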
### Data Quality Notes
- High-quality, curated dataset focusing on Arts & Letters categories
- Poetry and Drama have excellent coverage across all years (1918-2024)
- Journalism categories intentionally excluded (low Wikidata coverage)
- Fiction, History, Biography, Music excluded (incomplete Wikidata coverage)
- Some entries lack work titles (when not available in Wikidata)
- Winners are primarily individuals (authors, playwrights, poets)
### Files Created
- `Pulitzer-Prize-Winners-Arts-Letters-1918-2024.csv` (combined dataset - all categories)
- `category-poetry.csv` (Poetry winners only)
- `category-drama.csv` (Drama winners only)
- `category-general.csv` (General/Special awards only)
- `README.md` (dataset documentation with research methodology)
- `RESOURCES.md` (data sources)
- `UPDATES.md` (this file)
### SPARQL Query Used
```sparql
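SELECT ?winner ?winnerLabel ?awardDate ?category ?categoryLabel ?work ?workLabel
WHERE {
  # P166 = "award received"; the statement node exposes its qualifiers
  ?winner p:P166 ?awardStatement .
  ?awardStatement ps:P166 ?category .
  # Keep awards that are an instance or subclass of wd:Q46525 (Pulitzer Prize)
  ?category (wdt:P279|wdt:P31)* wd:Q46525 .
  # Qualifiers: P585 = "point in time", P1686 = "for work"
  OPTIONAL { ?awardStatement pq:P585 ?awardDate . }
  OPTIONAL { ?awardStatement pq:P1686 ?work . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" }
}
ORDER BY DESC(?awardDate)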
```
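
The query above can be run programmatically against the public Wikidata SPARQL endpoint. The following is a minimal sketch using only the Python standard library, not the script that produced this dataset: the function names and the `User-Agent` string are placeholders, and the JSON handling assumes the standard SPARQL results layout (`results.bindings`).

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="pulitzer-dataset-example/0.1"):
    """Build a GET request for the endpoint (user agent is a placeholder)."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    return urllib.request.Request(
        f"{ENDPOINT}?{params}",
        headers={
            "Accept": "application/sparql-results+json",
            "User-Agent": user_agent,
        },
    )


def flatten_bindings(payload):
    """Turn SPARQL JSON results into a list of {variable: value} dicts."""
    return [
        {name: cell["value"] for name, cell in binding.items()}
        for binding in payload["results"]["bindings"]
    ]


def run_query(query, timeout=60):
    """Send the query and return flattened rows (needs network access)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return flatten_bindings(json.load(resp))
```

Variables left unbound by an `OPTIONAL` clause are simply absent from a row, so downstream code should read `awardDate` and `workLabel` with `dict.get`.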
### Known Limitations
- Not comprehensive (Wikidata does not have all Pulitzer winners)
- Category names simplified for consistency
- Work titles missing for some entries
- Does not distinguish between individual/team/organizational winners
- No finalist data included
### Future Expansion Opportunities
- Add Fiction, History, Biography categories (requires enhanced scraping)
- Add Music category (completes Arts & Letters collection)
- Add Journalism categories (requires pulitzer.org scraping, ~1,400+ winners)
- Add finalist information (available from 1980 onwards)
- Combine with demographic data for representation analysis
---
## Future Updates
New updates will be added above this line in reverse chronological order (newest first).