Source Archives
Raw archival materials are organized in shared_resources/raw_archives/ under 11 numbered domains. The raw archives are read-only; analysis-ready outputs go to shared_resources/data/.
All 11 domains were systematically audited in March 2026. Each has a comprehensive README with coverage tables, gap analysis, and digitization priorities.
Archive Structure
| Code | Domain | Contents | Key Finding (2026-03 audit) |
|---|---|---|---|
| 01 | Production Censuses | Agricultural, manufacturing, mining census PDFs + contractor digitizations | Pre-1949 ag OCR unusable; 1988/1990 ag + Rural Property PDFs undigitized |
| 02 | Elections | Electoral results, delimitation reports, crosswalks | Two pipelines (oDesk + Stata); 1974/1977 gaps now filled |
| 03 | Geography | Maps, shapefiles, district classifications, rainfall | Primarily GIS reference; tabular data promoted to data/ |
| 04 | Population | Population censuses (1960-2001), household surveys, manpower survey | Manpower survey (165K rows, 1965-94) newly promoted |
| 05 | Labor | Strikes, wages, forced removals, unions, public sector employment | Strike WP3-32 (1978-84) not digitized; wages complete |
| 06 | Mining | TEBA/NRC archive | UJ books 1957-69 (37 PDFs) highest priority for panel extension |
| 07 | Industry | UNIDO, input-output tables, firm-level data, enterprise surveys | All structured (Excel/Stata); no digitization needed |
| 08 | Corporate | McGregor Who Owns Whom, Orbis, Who's Who | Fully digitized |
| 09 | Finance | Macro series (SARB, FRED, World Bank), inflation, CPI | All structured; no digitization needed |
| 10 | Social | SAIRR race relations surveys, education, health, opinion surveys | SAIRR is narrative, not tabular; demoted to low priority in the digitization queue |
| 11 | Policy | Legislation, Government Gazettes | Text documents, not tabular data |
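The layout described above (11 numbered domains, each with a README) can be spot-checked with a short script. This is a minimal sketch, not the project's own audit tooling: it assumes each domain is a sub-directory whose name begins with its two-digit code (for example a hypothetical `01_production_censuses`) and that the README file name starts with `README`. Neither naming detail is stated in this page, and `audit_domains` is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumption: domain directories start with a two-digit code, e.g. "01_...".
DOMAIN_DIR = re.compile(r"^(\d{2})(?!\d)")


def audit_domains(raw_root, expected=11):
    """Return (codes_found, missing_codes, codes_without_readme)."""
    root = Path(raw_root)
    found = {}
    for child in sorted(root.iterdir()):
        match = DOMAIN_DIR.match(child.name)
        if child.is_dir() and match:
            found[match.group(1)] = child

    # Codes 01..expected that have no matching directory.
    wanted = [f"{i:02d}" for i in range(1, expected + 1)]
    missing = [code for code in wanted if code not in found]

    # Assumption: any file named README* (README.md, README.txt, ...) counts.
    no_readme = [
        code
        for code, path in found.items()
        if not any(p.is_file() for p in path.glob("README*"))
    ]
    return sorted(found), missing, sorted(no_readme)
```

Calling `audit_domains("shared_resources/raw_archives")` from the repository root would return the domain codes present, any of 01 to 11 that are absent, and any domain lacking a README. The function only reads the directory tree, which fits the read-only rule for the raw archives.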
Digitization Queue (updated March 2026)
| # | Domain | What | Status | Notes |
|---|---|---|---|---|
| 1 | TEBA Archive | UJ books 1957-1969 (37 PDFs) | Pending | Extends recruiting panel to 1950s |
| 2 | Production Censuses | Rural Immovable Property (4 PDFs) | COMPLETE | 4,711 rows, all 4 years (1939-1960) |
| 3 | Production Censuses | 1988 + 1990 agriculture censuses | COMPLETE | 3,338 rows, all table types |
| 4 | Production Censuses | Pre-1949 agriculture re-digitization (8 PDFs) | Pending | Current OCR unusable |
| 5 | Labor | Strike Working Papers 3-32 (1978-1984) | Pending | 32 PDFs, extends pre-1984 strikes |
| 6 | TEBA Archive | Wage files (8 PDFs) | Pending | NRC wage schedules |
| 7 | Production Censuses | Manufacturing 1950-61 summary | Pending | Fills manufacturing gap |
| 8 | Social | SAIRR targeted time series | Low priority | Narrative; use as book source |
Completed Digitization
| Domain | Files | Method | Date |
|---|---|---|---|
| Industrial Wages | 825 CSVs | Claude Vision API | Nov 2025 |
| Production Censuses (ag/mining/mfg) | 231 CSVs | oDesk contractors + batch_convert | 2013-14, converted 2025-11 |
| Elections | 33 CSVs | oDesk + Stata conversion | 2013-14, Stata converted 2026-03 |
| Public Sector Employment | 43 CSVs | Claude Vision API | 2025 |
| Manpower Survey | 1 CSV (165K rows) | Stata conversion | 2026-03 |
| Rural Immovable Property | 5 CSVs (4,711 rows) | Claude Vision API | 2026-03 |
| 1988 + 1990 Ag Census | 8 CSVs (3,338 rows) | Claude Vision API | 2026-03 |
| TEBA WNLA 1957-1969 | 9 CSVs (3,226 rows) | Claude Vision API | 2026-03 |
| I-O Tables | 48 CSVs | Excel conversion | 2026-03 |
| CPS 1980 | 2 CSVs (84K rows) | Stata conversion | 2026-03 |
| AMP Firm Management | 4 CSVs (19K rows) | Promoted | 2026-03 |
Total Archive
- ~1,600 source PDFs across all domains
- 1,900+ analysis-ready CSVs in shared_resources/data/
- 100% provenance coverage: every CSV traceable to source
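The totals above could be recomputed along the following lines. This is a sketch under stated assumptions rather than the project's actual tooling: this page does not say how provenance is recorded, so the manifest used here (a CSV with `csv_path` and `source_path` columns) is hypothetical, as are the helper names `count_files` and `provenance_coverage`.

```python
import csv
from pathlib import Path


def count_files(root, suffix):
    """Count files under root with the given extension (case-insensitive)."""
    return sum(
        1
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )


def provenance_coverage(data_root, manifest_path):
    """Return (coverage fraction, sorted CSV paths with no recorded source).

    Assumption: the manifest is a CSV whose `csv_path` column holds paths
    relative to data_root and whose `source_path` column names the source.
    """
    data_root = Path(data_root)
    manifest_path = Path(manifest_path)

    with open(manifest_path, newline="", encoding="utf-8") as fh:
        traced = {
            row["csv_path"]
            for row in csv.DictReader(fh)
            if (row.get("source_path") or "").strip()
        }

    # Every CSV under data_root, excluding the manifest itself if it lives there.
    outputs = {
        p.relative_to(data_root).as_posix()
        for p in data_root.rglob("*.csv")
        if p.resolve() != manifest_path.resolve()
    }

    untraced = sorted(outputs - traced)
    coverage = 1.0 if not outputs else 1 - len(untraced) / len(outputs)
    return coverage, untraced
```

With this layout, `count_files("shared_resources/raw_archives", ".pdf")` and `count_files("shared_resources/data", ".csv")` would give the two headline counts, and a coverage of `1.0` with an empty untraced list would correspond to the 100% provenance claim.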