
Provenance & Validation

The Provenance Chain

All analysis data traces back to source PDFs:

PDF (archival source) → Excel (oDesk digitization 2013-14) → CSV (batch conversion) → Validation → Analysis

Digitization Quality

The oDesk contractors sometimes cut corners, so CSV data may contain transcription errors carried over from the original digitization. No data should be used in analysis without validation against the source PDFs.

Validation Methods

Internal Consistency Checks

Census tables typically include totals that should equal the sum of components. This is the most reliable validation method:

  • District sums vs. regional totals — compare summed district values against published regional aggregates
  • Row totals vs. column sums — cross-check within tables
  • Year-over-year plausibility — flag suspiciously large changes
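The district-sum and year-over-year checks above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual validation code: the `(region, value)` row layout, the zero tolerance, and the 50% change threshold are all assumptions.

```python
from collections import defaultdict

def check_regional_totals(district_rows, published_totals, tolerance=0):
    """Sum district values per region and return regions whose sum
    disagrees with the published regional aggregate."""
    sums = defaultdict(int)
    for region, value in district_rows:
        sums[region] += value
    return {
        region: (sums[region], published)
        for region, published in published_totals.items()
        if abs(sums[region] - published) > tolerance
    }

def flag_implausible(series, threshold=0.5):
    """series: [(year, value), ...] sorted by year. Flag adjacent
    year pairs whose relative change exceeds the threshold."""
    return [
        (y0, y1)
        for (y0, v0), (y1, v1) in zip(series, series[1:])
        if v0 and abs(v1 - v0) / abs(v0) > threshold
    ]

# A region whose published total disagrees with its district sum:
print(check_regional_totals([("Cape", 120), ("Cape", 80)], {"Cape": 210}))
# → {'Cape': (200, 210)}
# A suspiciously large drop between census years:
print(flag_implausible([(1972, 100), (1974, 40), (1975, 45)]))
# → [(1972, 1974)]
```

In practice a flagged region or year pair is a prompt to check the source PDF, not proof of a digitization error.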

PDF Comparison

For critical data (1974-1976 crisis years), values were compared directly against source PDFs:

  • Manual spot-checking of random samples
  • Claude Vision API automated extraction (mixed reliability for row alignment)
  • Chris Boone's independent verification of 1976 data (160 corrections applied)
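Manual spot-checking is easiest to audit when the random sample is reproducible, so the same rows can be re-checked later. A minimal sketch (the sample size and fixed seed are illustrative assumptions, not the procedure actually used):

```python
import csv
import random

def spot_check_sample(csv_path, n=20, seed=1974):
    """Draw a reproducible random sample of data rows from a CSV
    for manual comparison against the source PDF."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    rng = random.Random(seed)  # fixed seed: the same sample every run
    return header, rng.sample(body, min(n, len(body)))
```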

Validation Scripts

Available in ops_admin/scripts/:

| Script | Domain | Method |
|---|---|---|
| validate_ag_census_content.py | Agriculture | Province totals vs. district sums |
| batch_validate_ag_census.py | Agriculture | All years, automated |
| batch_validate_mining.py | Mining | Internal consistency |
| batch_validate_manufacturing.py | Manufacturing | Internal consistency |
| validate_strike_sample.py | Strikes | Cross-reference checks |

Validation Results

Agricultural Census

| Year | Status | Error Rate | Notes |
|---|---|---|---|
| 1949 | Validated | 1-6% | Excellent quality |
| 1956 | Validated | 0-23% | Good quality |
| 1961 | Validated | 0-23% | Good quality |
| 1972 | Validated | 37 errors / 57 regions | 86% pass rate |
| 1974 | Validated | 39 errors | Crisis start year |
| 1975 | Validated | Some errors | Crisis end year |
| 1976 | Corrected | 160 fixes applied | Chris Boone corrections, Dec 2025 |
| 1978-1983 | Different format | n/a | 255-column wide format; not employment data |

Corrections Applied

All corrections are logged in shared_resources/data/ag_census/CORRECTIONS_LOG.md:

  • Original files backed up to _originals/ subfolders
  • Changes documented with cell-level detail
  • Primary error pattern: OCR misreading of digits (3↔5, 0↔O)
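The 3↔5 confusion suggests a simple diagnostic: when a row fails its total check, test whether a single digit swap in one component reconciles the sum. This is an illustrative probe under that assumption, not the correction procedure actually used on the 1976 data:

```python
OCR_SWAPS = {"3": "5", "5": "3"}  # the digit confusion noted above

def candidate_fixes(components, published_total):
    """Yield (index, fixed_value) pairs where swapping one digit in
    one component makes the row sum to the published total."""
    total = sum(components)
    for i, v in enumerate(components):
        s = str(v)
        for j, ch in enumerate(s):
            if ch in OCR_SWAPS:
                fixed = int(s[:j] + OCR_SWAPS[ch] + s[j + 1:])
                if total - v + fixed == published_total:
                    yield i, fixed

# 135 reconciles the published total if its middle digit was really a 5:
print(list(candidate_fixes([135, 200], 355)))  # → [(0, 155)]
```

A match is only a candidate; each proposed fix still needs confirmation against the source PDF before it is applied and logged.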

Master Provenance Index

As of March 2026, a machine-readable provenance index covers all 1,800+ CSVs:

  • PROVENANCE_INDEX.csv — maps every CSV to its source Excel file, source PDF, original publication, and digitization method
  • PROVENANCE_INDEX.md — human-readable summary grouped by domain
  • Generated by: ops_admin/scripts/build_provenance_index.py (re-run to update)

Coverage: 277 files with exact per-file provenance (CSV matched to source Excel and PDF), 1,525 with domain-level provenance (source publication and method known). 100% of files have documented provenance.
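Because the index is a plain CSV, a provenance lookup can be scripted. A sketch, assuming hypothetical column names ("csv_path", "source_excel", "source_pdf"); check the header of PROVENANCE_INDEX.csv for the real schema:

```python
import csv

def provenance_for(index_path, csv_name):
    """Return the provenance index row for a given CSV file, or None
    if absent. The 'csv_path' column name is an assumption."""
    with open(index_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["csv_path"].endswith(csv_name):
                return row
    return None
```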

Source PDF Locations

Source PDFs are organized in shared_resources/raw_archives/ under 11 numbered domains. Each domain has a comprehensive README with coverage tables and gap analysis (audited March 2026):

| Domain | Location | README notes |
|---|---|---|
| Agricultural census | 01_production_censuses/agriculture/source_pdfs/ | Year-by-year coverage; pre-1949 gaps noted |
| Mining census | 01_production_censuses/mining/source_pdfs/ | 1955-1988 coverage |
| Manufacturing census | 01_production_censuses/manufacturing/source_pdfs/ | 1963-1985 coverage |
| Rural property | 01_production_censuses/transfer_exports/ | NEW: 1939-1960 farm transfers |
| Elections | 02_elections/ | Two pipelines (oDesk + Stata), 1938-1992 |
| TEBA/NRC records | 06_mining/teba_archive/ | 180 PDFs; 37 CSVs in data |
| Strike working papers | 05_labor/strikes/ | 60 PDFs; WP3-32 (1978-84) not yet digitized |
| SAIRR surveys | 10_social/race_relations/ | 16 PDFs; narrative, not tabular |