Provenance & Validation
The Provenance Chain
All analysis data traces back to source PDFs:
PDF (archival source) → Excel (oDesk digitization 2013-14) → CSV (batch conversion) → Validation → Analysis
Digitization Quality
Some oDesk contractors cut corners, so the CSVs may contain transcription errors introduced during the original digitization. Validate data against the source PDFs before using it in any analysis.
Validation Methods
Internal Consistency Checks
Census tables typically include totals that should equal the sum of their components. Checking these sums is the most reliable validation method:
- District sums vs. regional totals — compare summed district values against published regional aggregates
- Row totals vs. column sums — cross-check within tables
- Year-over-year plausibility — flag suspiciously large changes
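The first and third checks above can be sketched in a few lines of Python. This is a minimal illustration, not the logic of the project's validation scripts: the function names, the `(region, district, value)` row shape, and the 50% change threshold are all assumptions, and the numbers are made up.

```python
from collections import defaultdict


def check_region_totals(district_rows, region_totals, tolerance=0):
    """Compare summed district values against published regional totals.

    district_rows: iterable of (region, district, value) tuples (assumed shape).
    region_totals: dict mapping region -> published total.
    Returns (region, summed, published) for every mismatch beyond tolerance.
    """
    sums = defaultdict(int)
    for region, _district, value in district_rows:
        sums[region] += value
    mismatches = []
    for region, published in region_totals.items():
        summed = sums.get(region, 0)
        if abs(summed - published) > tolerance:
            mismatches.append((region, summed, published))
    return mismatches


def flag_large_changes(series, max_ratio=0.5):
    """Flag consecutive census years whose relative change exceeds max_ratio.

    series: dict mapping year -> value. The 0.5 default is an arbitrary
    illustrative threshold, not a project standard.
    """
    flagged = []
    years = sorted(series)
    for prev, cur in zip(years, years[1:]):
        if series[prev] == 0:
            continue  # relative change is undefined for a zero baseline
        change = (series[cur] - series[prev]) / series[prev]
        if abs(change) > max_ratio:
            flagged.append((prev, cur, change))
    return flagged


# Illustrative values only, not taken from any census table.
rows = [("Region A", "D1", 120), ("Region A", "D2", 80), ("Region B", "D3", 50)]
totals = {"Region A": 200, "Region B": 55}
mismatches = check_region_totals(rows, totals)
flags = flag_large_changes({1956: 1000, 1961: 1100, 1972: 400})
print(mismatches)
print(flags)
```

A mismatch only shows that some cell in the group is wrong; locating the specific cell still requires going back to the source PDF.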
PDF Comparison
For critical data (1974-1976 crisis years), values were compared directly against source PDFs:
- Manual spot-checking of random samples
- Claude Vision API automated extraction (mixed reliability for row alignment)
- Chris Boone's independent verification of 1976 data (160 corrections applied)
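For the manual spot checks, one way to make a random sample repeatable is to draw it with a fixed seed, so a second reviewer can re-draw the same cells. The sketch below assumes cells are addressed by row index and column name; the function name and the column names are hypothetical.

```python
import random


def draw_spot_check_sample(n_rows, columns, sample_size, seed=0):
    """Pick a reproducible random sample of (row, column) cells to check by hand."""
    cells = [(row, col) for row in range(n_rows) for col in columns]
    rng = random.Random(seed)  # fixed seed so the same sample can be re-drawn later
    return sorted(rng.sample(cells, min(sample_size, len(cells))))


# Hypothetical table: 57 rows, two numeric columns.
sample = draw_spot_check_sample(
    n_rows=57, columns=["employees", "wages"], sample_size=10, seed=1976
)
for row, col in sample:
    print(f"check row {row}, column {col!r} against the source PDF")
```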
Validation Scripts
Available in ops_admin/scripts/:
| Script | Domain | Method |
|---|---|---|
| validate_ag_census_content.py | Agriculture | Province totals vs district sums |
| batch_validate_ag_census.py | Agriculture | All years automated |
| batch_validate_mining.py | Mining | Internal consistency |
| batch_validate_manufacturing.py | Manufacturing | Internal consistency |
| validate_strike_sample.py | Strikes | Cross-reference checks |
Validation Results
Agricultural Census
| Year | Status | Error Rate | Notes |
|---|---|---|---|
| 1949 | Validated | 1-6% | Excellent quality |
| 1956 | Validated | 0-23% | Good quality |
| 1961 | Validated | 0-23% | Good quality |
| 1972 | Validated | 37 errors/57 regions | 86% pass rate |
| 1974 | Validated | 39 errors | Crisis start year |
| 1975 | Validated | Some errors | Crisis end year |
| 1976 | Corrected | 160 fixes applied | Chris Boone corrections, Dec 2025 |
| 1978-1983 | Different format | 255 cols | Wide format, not employment data |
Corrections Applied
All corrections are logged in shared_resources/data/ag_census/CORRECTIONS_LOG.md:
- Original files backed up to _originals/ subfolders
- Changes documented with cell-level detail
- Primary error pattern: OCR misreading of digits (3↔5, 0↔O)
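The digit-confusion pattern lends itself to a simple heuristic: when a set of components fails to sum to its published total, test whether swapping a single 3 and 5 in one component closes the gap. The sketch below is an illustration under that assumption, with hypothetical function names; it is not how the logged corrections were produced, and any suggestion it makes is a candidate to confirm against the source PDF, not an automatic fix.

```python
def normalize_letter_o(cell):
    """Replace the letter O with the digit 0 when the result is purely numeric."""
    candidate = cell.replace("O", "0").replace("o", "0")
    return candidate if candidate.isdigit() else cell


def swap_candidates(value):
    """Yield integers obtained by swapping a single 3 for 5 or 5 for 3."""
    digits = str(value)  # assumes a non-negative integer
    for i, ch in enumerate(digits):
        if ch in "35":
            swapped = "5" if ch == "3" else "3"
            yield int(digits[:i] + swapped + digits[i + 1:])


def suggest_fix(components, published_total):
    """Find single-digit 3/5 swaps in one component that reconcile the total."""
    gap = published_total - sum(components)
    fixes = []
    for idx, value in enumerate(components):
        for candidate in swap_candidates(value):
            if candidate - value == gap:
                fixes.append((idx, value, candidate))
    return fixes


# Illustrative values only.
print(normalize_letter_o("1O4"))
print(suggest_fix([130, 250, 75], 475))
```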
Master Provenance Index
As of March 2026, a machine-readable provenance index covers all 1,800+ CSVs:
- PROVENANCE_INDEX.csv: maps every CSV to its source Excel file, source PDF, original publication, and digitization method
- PROVENANCE_INDEX.md: human-readable summary grouped by domain
- Generated by: ops_admin/scripts/build_provenance_index.py (re-run to update)
Coverage: 277 files with exact per-file provenance (CSV matched to source Excel and PDF), 1,525 with domain-level provenance (source publication and method known). 100% of files have documented provenance.
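A coverage summary like the one above could be recomputed from the index with the standard `csv` module. The sketch below runs on an in-memory stand-in: the column names (`csv_path`, `source_excel`, `source_pdf`, `provenance_level`) and the `file`/`domain` level labels are assumptions for illustration and may not match the real PROVENANCE_INDEX.csv schema.

```python
import csv
import io
from collections import Counter

# Stand-in for PROVENANCE_INDEX.csv; column names and rows are hypothetical.
SAMPLE_INDEX = """csv_path,source_excel,source_pdf,provenance_level
ag_census/1949_a.csv,1949_a.xlsx,1949_a.pdf,file
ag_census/1956_b.csv,,,domain
mining/1955_c.csv,,,domain
"""


def summarize_coverage(handle):
    """Count index rows per provenance level and list rows with no level at all."""
    counts = Counter()
    undocumented = []
    for row in csv.DictReader(handle):
        level = (row.get("provenance_level") or "").strip()
        if level:
            counts[level] += 1
        else:
            undocumented.append(row["csv_path"])
    return counts, undocumented


counts, undocumented = summarize_coverage(io.StringIO(SAMPLE_INDEX))
print(dict(counts), undocumented)
```

To run it against the real index, pass an open file handle instead of the `io.StringIO` wrapper and adjust the column names to the actual header.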
Source PDF Locations
Source PDFs are organized in shared_resources/raw_archives/ under 11 numbered domains. Each domain has a comprehensive README with coverage tables and gap analysis (audited March 2026):
| Domain | Location | README |
|---|---|---|
| Agricultural census | 01_production_censuses/agriculture/source_pdfs/ | Year-by-year coverage, pre-1949 gaps noted |
| Mining census | 01_production_censuses/mining/source_pdfs/ | 1955-1988 coverage |
| Manufacturing census | 01_production_censuses/manufacturing/source_pdfs/ | 1963-1985 coverage |
| Rural property | 01_production_censuses/transfer_exports/ | NEW: 1939-1960 farm transfers |
| Elections | 02_elections/ | Two pipelines (oDesk + Stata), 1938-1992 |
| TEBA/NRC records | 06_mining/teba_archive/ | 180 PDFs, 37 CSVs in data |
| Strike working papers | 05_labor/strikes/ | 60 PDFs, WP3-32 (1978-84) not yet digitized |
| SAIRR surveys | 10_social/race_relations/ | 16 PDFs (narrative, not tabular) |
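The PDF counts quoted in the table can be re-checked by counting files on disk. The following is a small sketch using `pathlib`; the `count_pdfs` helper is hypothetical, and the demo builds a throwaway directory tree instead of touching the real archive. In practice `root` would point at shared_resources/raw_archives/.

```python
import tempfile
from pathlib import Path


def count_pdfs(root, locations):
    """Return {location: number of PDFs found recursively}, or None if the folder is missing."""
    counts = {}
    for location in locations:
        folder = Path(root) / location
        if not folder.is_dir():
            counts[location] = None  # folder absent, so nothing to count
            continue
        counts[location] = sum(
            1 for p in folder.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
        )
    return counts


# Demo on a temporary tree with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "05_labor" / "strikes"
    folder.mkdir(parents=True)
    (folder / "wp01.pdf").touch()
    (folder / "WP02.PDF").touch()
    (folder / "notes.txt").touch()
    demo_counts = count_pdfs(tmp, ["05_labor/strikes", "10_social/race_relations"])
print(demo_counts)
```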