
Provenance & Validation

The Provenance Chain

All analysis data traces back to source PDFs:

PDF (archival source) → Excel (oDesk digitization 2013-14) → CSV (batch conversion) → Validation → Analysis

Digitization Quality

The oDesk contractors sometimes cut corners, so CSV data may contain transcription errors carried over from the original digitization. No data should be used in analysis without validation against the source PDFs.

Validation Methods

Internal Consistency Checks

Census tables typically include totals that should equal the sum of components. This is the most reliable validation method:

  • District sums vs. regional totals — compare summed district values against published regional aggregates
  • Row totals vs. column sums — cross-check within tables
  • Year-over-year plausibility — flag suspiciously large changes
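The district-sum and year-over-year checks above can be sketched in a few lines of Python. This is an illustrative sketch, not the project's actual validation code: the `(region, value)` row layout, the zero tolerance, and the 50% change threshold are all assumptions.

```python
from collections import defaultdict

def check_regional_totals(district_rows, published_totals, tolerance=0):
    """Sum district values per region and return regions whose sum
    disagrees with the published regional aggregate."""
    sums = defaultdict(int)
    for region, value in district_rows:
        sums[region] += value
    return {
        region: (sums[region], published)
        for region, published in published_totals.items()
        if abs(sums[region] - published) > tolerance
    }

def flag_implausible(series, threshold=0.5):
    """series: [(year, value), ...] sorted by year. Flag adjacent
    year pairs whose relative change exceeds the threshold."""
    return [
        (y0, y1)
        for (y0, v0), (y1, v1) in zip(series, series[1:])
        if v0 and abs(v1 - v0) / abs(v0) > threshold
    ]

# A region whose published total disagrees with its district sum:
print(check_regional_totals([("Cape", 120), ("Cape", 80)], {"Cape": 210}))
# → {'Cape': (200, 210)}
# A suspiciously large drop between census years:
print(flag_implausible([(1972, 100), (1974, 40), (1975, 45)]))
# → [(1972, 1974)]
```

In practice a flagged region or year pair is a prompt to check the source PDF, not proof of a digitization error.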

PDF Comparison

For critical data (1974-1976 crisis years), values were compared directly against source PDFs:

  • Manual spot-checking of random samples
  • Claude Vision API automated extraction (mixed reliability for row alignment)
  • Chris Boone's independent verification of 1976 data (160 corrections applied)
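Manual spot-checking is easiest to audit when the random sample is reproducible, so the same rows can be re-checked later. A minimal sketch (the sample size and fixed seed are illustrative assumptions, not the procedure actually used):

```python
import csv
import random

def spot_check_sample(csv_path, n=20, seed=1974):
    """Draw a reproducible random sample of data rows from a CSV
    for manual comparison against the source PDF."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    header, body = rows[0], rows[1:]
    rng = random.Random(seed)  # fixed seed: the same sample every run
    return header, rng.sample(body, min(n, len(body)))
```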

Validation Scripts

Available in ops_admin/scripts/:

| Script | Domain | Method |
|---|---|---|
| validate_ag_census_content.py | Agriculture | Province totals vs. district sums |
| batch_validate_ag_census.py | Agriculture | All years, automated |
| batch_validate_mining.py | Mining | Internal consistency |
| batch_validate_manufacturing.py | Manufacturing | Internal consistency |
| validate_strike_sample.py | Strikes | Cross-reference checks |

Validation Results

Agricultural Census

| Year | Status | Error Rate | Notes |
|---|---|---|---|
| 1949 | Validated | 1-6% | Excellent quality |
| 1956 | Validated | 0-23% | Good quality |
| 1961 | Validated | 0-23% | Good quality |
| 1972 | Validated | 37 errors / 57 regions | 86% pass rate |
| 1974 | Validated | 39 errors | Crisis start year |
| 1975 | Validated | Some errors | Crisis end year |
| 1976 | Corrected | 160 fixes applied | Chris Boone corrections, Dec 2025 |
| 1978-1983 | Different format | n/a | 255-column wide format; not employment data |

Corrections Applied

All corrections are logged in shared_resources/data/ag_census/CORRECTIONS_LOG.md:

  • Original files backed up to _originals/ subfolders
  • Changes documented with cell-level detail
  • Primary error pattern: OCR misreading of digits (3↔5, 0↔O)
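The 3↔5 confusion suggests a simple diagnostic: when a row fails its total check, test whether a single digit swap in one component reconciles the sum. This is an illustrative probe under that assumption, not the correction procedure actually used on the 1976 data:

```python
OCR_SWAPS = {"3": "5", "5": "3"}  # the digit confusion noted above

def candidate_fixes(components, published_total):
    """Yield (index, fixed_value) pairs where swapping one digit in
    one component makes the row sum to the published total."""
    total = sum(components)
    for i, v in enumerate(components):
        s = str(v)
        for j, ch in enumerate(s):
            if ch in OCR_SWAPS:
                fixed = int(s[:j] + OCR_SWAPS[ch] + s[j + 1:])
                if total - v + fixed == published_total:
                    yield i, fixed

# 135 reconciles the published total if its middle digit was really a 5:
print(list(candidate_fixes([135, 200], 355)))  # → [(0, 155)]
```

A match is only a candidate; each proposed fix still needs confirmation against the source PDF before it is applied and logged.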

Master Provenance Index

As of March 2026, a machine-readable provenance index covers all 1,800+ CSVs:

  • PROVENANCE_INDEX.csv — maps every CSV to its source Excel file, source PDF, original publication, and digitization method
  • PROVENANCE_INDEX.md — human-readable summary grouped by domain
  • Generated by: ops_admin/scripts/build_provenance_index.py (re-run to update)

Coverage: 277 files with exact per-file provenance (CSV matched to source Excel and PDF), 1,525 with domain-level provenance (source publication and method known). 100% of files have documented provenance.
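Because the index is a plain CSV, a provenance lookup can be scripted. A sketch, assuming hypothetical column names ("csv_path", "source_excel", "source_pdf"); check the header of PROVENANCE_INDEX.csv for the real schema:

```python
import csv

def provenance_for(index_path, csv_name):
    """Return the provenance index row for a given CSV file, or None
    if absent. The 'csv_path' column name is an assumption."""
    with open(index_path, newline="") as f:
        for row in csv.DictReader(f):
            if row["csv_path"].endswith(csv_name):
                return row
    return None
```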

Source PDF Locations

Source PDFs are organized in shared_resources/raw_archives/ under 11 numbered domains. Each domain has a comprehensive README with coverage tables and gap analysis (audited March 2026):

| Domain | Location | README notes |
|---|---|---|
| Agricultural census | 01_production_censuses/agriculture/source_pdfs/ | Year-by-year coverage; pre-1949 gaps noted |
| Mining census | 01_production_censuses/mining/source_pdfs/ | 1955-1988 coverage |
| Manufacturing census | 01_production_censuses/manufacturing/source_pdfs/ | 1963-1985 coverage |
| Rural property | 01_production_censuses/transfer_exports/ | NEW: 1939-1960 farm transfers |
| Elections | 02_elections/ | Two pipelines (oDesk + Stata), 1938-1992 |
| TEBA/NRC records | 06_mining/teba_archive/ | 180 PDFs; 37 CSVs in data |
| Strike working papers | 05_labor/strikes/ | 60 PDFs; WP3-32 (1978-84) not yet digitized |
| SAIRR surveys | 10_social/race_relations/ | 16 PDFs; narrative, not tabular |