How Datasets Connect

The datasets in this archive use different geographic coding schemes, time periods, and units of observation. This page explains how they fit together.

The Master District Panel

The core geographic unit is the magisterial district — 279 administrative areas that were stable from roughly 1960-1990. The panel-building scripts in shared_resources/scripts/panel_builders/ assemble a master district-year panel by joining:

Master Roster (279 districts, time-invariant)
  ├── Agricultural Census (employment, wages, mechanization by year)
  ├── TEBA Recruiting (mine recruit counts by year)
  ├── Population (race composition by census year)
  ├── Forced Removals (cumulative removal counts)
  └── Geography Controls (ag suitability, policy implementation dates)
      ↓
Master District Panel (district × year)

Run the pipeline:

cd shared_resources/scripts/panel_builders
make all    # builds everything in dependency order + validates
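
The join structure in the diagram above can be sketched as a left join of each source onto the full district × year grid. This is only an illustration, not the code in panel_builders/: the field names, the `build_panel` helper, and every value below are assumptions made up for the example.

```python
from itertools import product

# Hypothetical miniature inputs; real column names and values may differ.
roster = [
    {"district_id": 1, "name": "Namaqualand"},
    {"district_id": 2, "name": "Calvinia"},
]
years = [1960, 1961]

# Time-varying source keyed by (district_id, year); counts are made up.
teba_recruits = {(1, 1960): 120, (2, 1961): 45}
# Time-invariant source keyed by district_id only; values are made up.
ag_suitability = {1: 0.31, 2: 0.27}


def build_panel(roster, years, yearly, invariant):
    """Left-join sources onto the full district x year grid."""
    panel = []
    for district, year in product(roster, years):
        did = district["district_id"]
        panel.append({
            "district_id": did,
            "year": year,
            # Missing district-years stay None rather than being dropped.
            "recruits": yearly.get((did, year)),
            # Time-invariant values repeat across every year.
            "ag_suitability": invariant.get(did),
        })
    return panel


panel = build_panel(roster, years, teba_recruits, ag_suitability)
```

Starting from the roster rather than from any one source keeps all district-years in the panel, so gaps in a source show up as missing values instead of silently shrinking the sample.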

District ID Systems

Different datasets use different numbering for the same 279 districts:

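System      | Ordering                                    | Used By                              | Example
------------|---------------------------------------------|--------------------------------------|--------
district_id | Geographic (1=Namaqualand, 2=Calvinia, ...) | Agricultural census, geography panel | 1-279
distid      | Alphabetical (1=Aberdeen, 2=Adelaide, ...)  | TEBA panel                           | 1-279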

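Both systems run from 1 to 279 but number the districts differently, so joining district_id directly to distid would pair up the wrong districts without raising any error. The crosswalk file crosswalk_teba_geography_panel.csv maps between them, and the panel-building scripts apply it automatically.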
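
For work outside the pipeline, recoding by hand might look like the sketch below. The crosswalk's column names (`distid`, `district_id`) and all ID pairs and counts are assumptions for illustration; check the actual file header before adapting it.

```python
import csv
import io

# Stand-in for crosswalk_teba_geography_panel.csv. The column names and
# the ID pairs below are invented for illustration.
crosswalk_csv = io.StringIO(
    "distid,district_id\n"
    "1,57\n"
    "2,61\n"
)
distid_to_district_id = {
    int(row["distid"]): int(row["district_id"])
    for row in csv.DictReader(crosswalk_csv)
}

# TEBA rows arrive keyed by the alphabetical distid (values are made up).
teba_rows = [
    {"distid": 1, "year": 1970, "recruits": 80},
    {"distid": 3, "year": 1970, "recruits": 15},  # not in the crosswalk
]

recoded, unmatched = [], []
for row in teba_rows:
    district_id = distid_to_district_id.get(row["distid"])
    if district_id is None:
        unmatched.append(row)  # surface gaps instead of guessing
        continue
    recoded.append({
        "district_id": district_id,
        "year": row["year"],
        "recruits": row["recruits"],
    })
```

Collecting unmatched rows separately makes it easy to check that every record found a district before merging into the panel.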

Linking Elections to Districts

Electoral constituencies don't map 1:1 to magisterial districts (urban areas had multiple constituencies per district; large rural districts sometimes shared a constituency). Three crosswalk files handle different delimitation periods:

Crosswalk                                                | Elections Covered
---------------------------------------------------------|--------------------
crosswalk_electoral_magisterial_districts_1966.csv       | Pre-1974 elections
crosswalk_electoral_magisterial_1974.csv                 | 1974-1979 elections
crosswalk_electoral_magisterial_districts_1980_clean.csv | 1980-1989 elections
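
Picking the right crosswalk for a given election year can be expressed as a small lookup. The helper name `crosswalk_for_election` is hypothetical, and the cutoffs simply restate the table above; elections after 1989 are treated as uncovered.

```python
def crosswalk_for_election(year: int) -> str:
    """Return the crosswalk filename covering an election year."""
    if year < 1974:
        return "crosswalk_electoral_magisterial_districts_1966.csv"
    if year <= 1979:
        return "crosswalk_electoral_magisterial_1974.csv"
    if year <= 1989:
        return "crosswalk_electoral_magisterial_districts_1980_clean.csv"
    # No delimitation period listed for later elections.
    raise ValueError(f"no crosswalk listed for election year {year}")
```

Because constituencies and districts are not 1:1, a joined election record may map to several districts (or several constituencies to one district), so any aggregation rule needs to be chosen explicitly after the lookup.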

Linking Strikes to Districts

Strike incidents have location strings (e.g., "Dunlop factory, Durban, Natal") but not standardized district codes. The strike panel builder (08_strike_panel.py) attempts to match location names to districts, but coverage is partial. Manual geocoding would improve this.
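
To show why coverage is partial, here is one simple way such matching could work: split the location string on commas and look for a part that exactly equals a district name. This is not necessarily what 08_strike_panel.py does; the lookup table, its placeholder IDs, and the helper names are all assumptions.

```python
import re

# Hypothetical name -> district_id lookup; the IDs are placeholders.
district_ids = {"durban": 201, "pietermaritzburg": 202}


def normalize(text):
    """Lowercase and drop everything except letters and spaces."""
    return re.sub(r"[^a-z ]", "", text.lower()).strip()


def match_district(location):
    """Return the district_id of the first comma-separated part that
    exactly matches a district name, or None if nothing matches."""
    for part in location.split(","):
        district_id = district_ids.get(normalize(part))
        if district_id is not None:
            return district_id
    return None


matched = match_district("Dunlop factory, Durban, Natal")
missed = match_district("Unnamed works, Natal")
```

Exact matching of this kind misses suburbs, townships, and variant spellings that fall inside a district but do not share its name, which is where manual geocoding would help.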

What Cannot Be Joined (Yet)

Dataset              | Geographic Level     | Why Not District-Level
---------------------|----------------------|-----------------------
Mining Census        | Individual mine      | Mines can be mapped to districts via nrc_member_mines.csv (17 districts with gold mines), but most mining data uses broad regions (OFS, Witwatersrand, etc.)
Manufacturing Census | Industrial region    | Regions don't correspond to magisterial districts; no crosswalk exists
Manpower Survey      | National (by sector) | No geographic disaggregation; sector × occupation × race only
Industrial Wages     | Industry × area      | "Areas" are bargaining council jurisdictions, not magisterial districts

Temporal Coverage

Not all datasets cover the same years. Here's where they overlap:

         1950  1955  1960  1965  1970  1975  1980  1985  1990  1995
Ag Census  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
TEBA            ■■■■■■■■■■■■■■■■■■■■■■■■■
Population ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Elections  ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Strikes                          ■■■■■■■■■■■■■■■■■■■■■■■■■■
Manpower               ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Mining               ■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Mfg Census ■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
Sanctions                                    ■■■■■■■■■■■■■■■

The densest overlap is 1965-1985, when most datasets are available simultaneously.