Pandemics are political, as Alexandra Cirone highlighted in an earlier post. Motivated by the numerous examples of official corruption during the COVID-19 crisis, Francisco Garfias and I decided to investigate how epidemics shaped the potential for rent extraction and the value of holding office in colonial Mexico. This required us to construct a dataset of epidemics across the colony during the 18th century using primary and secondary sources, which was a challenge for several reasons. The historical record of many epidemics is sparse, and preserved documents only present incomplete snapshots of the people and locations affected. Official reports or newspaper accounts of specific epidemics, such as an outbreak of measles in the mining center of Pachuca in 1728 or widespread smallpox in Mexico City in 1734, often left us in the dark about the extent of these crises. Were the outbreaks confined to major settlements like Pachuca and Mexico City? How should we interpret the absence of evidence that surrounding rural districts were affected as well?
This is an example of a general problem. We tend to know a lot more about what happened in economic and political centers like Pachuca or Mexico City than in smaller, rural, or frontier areas. This bias is driven by an even more fundamental issue, highlighted by Adam Slez in an earlier post. Historical documents are not recorded, preserved, or accessed at random. For us to find a report in an archive, someone had to decide to (and be able to) write it, others had to decide to preserve it over a long period of time, and still others had to decide to catalog it and make it accessible to us. Certain types of documents are more likely to survive this process than others. Official reports of major events affecting elites or residents of the capital city, for example, may be easy to find, while the personal stories of peasants or socially oppressed groups may be entirely absent. As historians and archivists have long recognized, these systematic “absences” or “silences” in the archive can distort how we think about and learn about the past in important ways.
In this post, I discuss three problems that arise in quantitative HPE research because of systematic silences in the archive—measurement error, endogenous sample selection, and bias in substantive focus—and some ways that my co-authors and I have struggled with these issues in our work. There are a few important things to stress at the outset:
- There is no silver bullet to eliminate these problems. We are usually left triangulating between several imperfect solutions with different costs and benefits. I share examples from my own work not because they are particularly clever or well executed, but because they illustrate how challenging these issues can be.
- These methodological issues also arise in papers relying on secondary sources or existing datasets (where did those data come from?).
- There is no substitute for thinking carefully about your context, how your data were generated, and how observations ended up in your dataset.
- Measurement Error
The absence of evidence is not evidence of absence, as the old aphorism tells us. One of the challenges with archival silences is that it can be difficult to tell whether we are unable to find documented evidence of something because it did not occur, because it was not recorded, or because the evidence was not preserved or made accessible. This complicates measurement in various ways.
To return to the example of constructing a dataset on 18th-century epidemics in central Mexico, Francisco and I noticed that certain districts, generally those containing larger settlements and the area around Mexico City, were disproportionately represented in our record of epidemics. It is possible that missing rural, outlying areas were protected from disease due to their lower population density or their remote location away from trade or transit routes. Another possibility is that any epidemics affecting these rural areas were seen as less noteworthy by elites or officials and were therefore not recorded. It is also possible that a written account of an epidemic in “Pachuca” presented to officials in Mexico City should be read to include the smaller, neighboring districts as well. Yet another possibility is that officials simply lacked information about what was going on in remote areas.
Given what we know about the context, all of these explanations are possible. Our epidemics variable is almost certainly measured with some amount of error, and this error is almost certainly related to observable and unobservable characteristics of the districts. As is well known, this introduces bias in our estimates. What is the solution?
Short of getting better data, the ideal options would be to either model the error directly in the unlikely event that it is known and driven by observables (certainly not the case here) or to address the endogeneity using instrumental-variables or similar approaches. Francisco and I had done the latter in other work, but without an obvious instrument in this case, we ended up triangulating between several imperfect solutions. We replicate our results using two different coding rules to determine which districts were affected by epidemics: a strict rule where only districts with direct evidence of disease were coded as “affected” and a coarser one where a reported epidemic would be assumed to affect all districts in a state. This provides some reassurance that the substantive finding is not driven by strong assumptions about geographic coverage of epidemics. We re-estimate all of our models using just the subset of districts that were ever affected by an epidemic in our data to show that our results hold for just this subsample as well. We closely examine one particularly well-documented epidemic where there is better information about which districts were affected, and we take steps to address spatial dependence in this variable in various ways. (See the paper and online appendix for more information.)
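The two coding rules can be made concrete with a minimal sketch. This is an illustration, not the code used in the paper: apart from Pachuca and Mexico City, the district names, the region labels, and the function names are all hypothetical, and a real version would need a gazetteer linking each district to the larger unit that contains it.

```python
from collections import defaultdict

# Hypothetical lookup from district to the larger unit that contains it.
# Only Pachuca and Mexico City come from the text; the rest is made up.
DISTRICT_TO_REGION = {
    "Pachuca": "region_1",
    "Neighbor of Pachuca": "region_1",
    "Mexico City": "region_2",
    "Neighbor of Mexico City": "region_2",
    "Remote district": "region_3",
}

# Each report is (district named in the source, year of the outbreak).
REPORTS = [("Pachuca", 1728), ("Mexico City", 1734)]


def code_strict(reports):
    """Strict rule: only districts named directly in a report are affected."""
    affected = defaultdict(set)
    for district, year in reports:
        affected[year].add(district)
    return dict(affected)


def code_coarse(reports, district_to_region):
    """Coarse rule: a report affects every district in the same larger unit."""
    affected = defaultdict(set)
    for district, year in reports:
        region = district_to_region[district]
        for other, other_region in district_to_region.items():
            if other_region == region:
                affected[year].add(other)
    return dict(affected)


strict = code_strict(REPORTS)
coarse = code_coarse(REPORTS, DISTRICT_TO_REGION)
```

By construction, the strict coding is always a subset of the coarse one, so the two rules bracket a range of assumptions about geographic coverage. If the estimates are similar under both, the finding is less likely to hinge on how a report naming a single town is read.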
These are not perfect solutions, but together they help to build confidence in the general substantive finding. Perhaps the most important thing that we do in our paper, however, is to be clear about the limitations of our data and careful about how we discuss our results in light of these shortcomings.
- Endogenous Sample Selection
Archival silences can introduce endogeneity in other ways as well. Few documents may have been recorded or may survive from contexts where state capacity was low, where there was considerable conflict, or where elites or officials did not live. This means that we may be unable to recover even basic information about these areas, including measures of our outcome variable, explanatory variable, or necessary covariates.
The standard way that statistical software handles observations with incomplete data is to simply drop these observations from the analysis. This is fine if data are missing completely at random (MCAR) or if selection into the dataset is driven solely by observable explanatory variables (exogenous sample selection). Unfortunately, as Adam noted in his post, the availability of historical data is typically not random or exogenous. The factors that determine whether an observation will end up in our estimating sample may include things that are difficult or impossible to measure (state capacity or culture, for example) and are directly related to most outcomes that we care about. If we simply drop these missing areas from our dataset, we introduce bias through endogenous sample selection.
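A small simulation may help show the difference between exogenous and endogenous selection. Everything here is an assumption made for illustration: the data are synthetic, the true slope is set to 1, `u` stands in for an unmeasured factor such as state capacity, and the selection rules are deliberately stylized.

```python
import random


def ols_slope(xs, ys):
    """Slope from a bivariate least-squares regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def simulate(n=20000, beta=1.0, seed=0):
    """Estimate the slope on x under three rules for entering the sample."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0, 1)   # observed explanatory variable
        u = rng.gauss(0, 1)   # unobserved factor (e.g. state capacity)
        y = beta * x + u      # outcome; the true slope on x is beta
        rows.append((x, u, y))

    def slope(keep):
        kept = [(x, y) for x, u, y in rows if keep(x, u)]
        return ols_slope([x for x, _ in kept], [y for _, y in kept])

    return {
        # No selection: every unit is in the estimating sample.
        "full": slope(lambda x, u: True),
        # Exogenous selection: survival depends only on the observed x.
        "exogenous": slope(lambda x, u: x > 0),
        # Endogenous selection: survival also depends on the unobserved u.
        "endogenous": slope(lambda x, u: x + u > 0),
    }


slopes = simulate()
```

In this setup the full-sample and exogenously selected estimates should land close to the true slope, while the endogenously selected estimate is pulled well below it. Within the surviving sample, units with low `x` made it in only because their `u` was high, which induces a negative correlation between the regressor and the error.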
Again, there is no easy solution to this problem. The standard ways of addressing non-random sample selection, including selection models, censored regression models, weighting methods, and various methods of imputation, present us with some options, but these often require strong (and arguably implausible) assumptions about how the data were generated as a function of observable and unobservable factors. Probably the most common strategy used in HPE research, and the one I’ve employed the most in my own work, is to simply restrict attention to some region or subsample where complete data are available, such as areas under solid control of an empire or places not at war. However, this changes the population under study and what can be learned from the analysis, which has its own costs (see the discussion of bias in substantive focus below).
In a recent paper, Jennifer Alix-Garcia and I were interested in how urbanization patterns and the nature of geographic advantage had changed in Mexico over the last 450 years. How did the distribution of population change in response to major economic or political shocks, such as the demographic collapse of Mexico’s indigenous population, the Mexican Revolution, or the post-1980 economic liberalization? To investigate, we had to trace population over a long period of time, starting in the 16th century. As one might expect, information is somewhat spotty for the early colonial era, especially in areas that were not under solid Spanish control.
The figure above shows population density in Mexico around 1570 using data compiled by Gerhard (1993). Jen and I address reliability and measurement issues with these data in other work, but here we focus on the problem of geographic coverage. We know very little about the 1570 population of the northeast region or Baja California, and this “missingness” is clearly non-random. These frontier areas were not under solid Spanish control in the 16th century. They have a distinct geography and history compared to the center and south (they did not pay tribute to the Triple Alliance/Aztec Empire, for example). These and other factors almost certainly affected the regions’ population dynamics and economic structure.
As in the prior example, there was no ideal solution. Simply dropping these areas from our analysis would risk distorting our understanding of how geography and historical settlement patterns shaped the contemporary urban environment. However, the various methods of accounting for the missing population data all had disadvantages. Again, we decided to triangulate between several imperfect options. We estimated our models in three ways: using just the set of areas with 1570 population, imputing the missing data using observable geographic and historical covariates, and then using the “indicator method” (replace missing values with 0 and add a missing-data indicator to account for differences between observed and unobserved data). We also experimented with using a later date as the starting period of our analysis (we know a lot more about these regions in 1750 or 1800).
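For readers unfamiliar with the indicator method, here is a minimal sketch of the data preparation it involves, alongside the complete-case alternative. The records, population figures, and function names are hypothetical and are not taken from Gerhard (1993) or from our replication files.

```python
def complete_cases(rows, key):
    """Keep only rows where `key` is observed (listwise deletion)."""
    return [row for row in rows if row[key] is not None]


def indicator_method(rows, key):
    """Fill missing `key` with 0 and add a 0/1 flag marking filled rows."""
    out = []
    for row in rows:
        new_row = dict(row)  # copy so the input records are left untouched
        is_missing = row[key] is None
        new_row[key] = 0.0 if is_missing else row[key]
        new_row[key + "_missing"] = int(is_missing)
        out.append(new_row)
    return out


# Hypothetical records; None marks a region with no 1570 population figure.
REGIONS = [
    {"region": "center", "pop_1570": 120.0},
    {"region": "south", "pop_1570": 80.0},
    {"region": "northeast", "pop_1570": None},
]

observed_only = complete_cases(REGIONS, "pop_1570")
with_indicator = indicator_method(REGIONS, "pop_1570")
```

In a regression, both the filled variable and the `_missing` flag would enter as regressors, so every region stays in the sample and the flag absorbs the average difference for regions without data. The indicator method can itself produce biased coefficients in some settings, which is one reason to compare it against the other approaches rather than rely on it alone.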
In the end, point estimates using the three methods of addressing the missing data were almost identical, and there was little substantive difference with alternately using 1570, 1650, or 1800 as the starting point for analysis. This builds confidence in our descriptive findings, but it doesn’t entirely solve the problem. As before, we tried to be transparent about these shortcomings and cautious about how we interpreted the data in light of them (avoiding strong causal claims, for example).
- Bias in Substantive Focus
This problem is possibly the most fundamental and difficult to address. Because of bias in which types of materials or documents are preserved over time, certain types of events or accounts are privileged while others are systematically excluded. When information on underrepresented groups is available, it is often through the accounts of elites or public officials. This shifts what types of data are available, which research projects are feasible, and (in the end) how we view historical events and institutions.
For example, in a working paper, Francisco and I were interested in the relationship between elite politics and popular grievances in determining patterns of rebellion in late colonial Mexico and during Mexico’s War of Independence. As you might expect, it is much easier to find direct, written evidence on the motivations of the elites in this context than the commoners and communities who participated in peasant rebellions. Our data on localized uprisings, from the work of Taylor (1979), are derived from the judicial investigations of the communities that rebelled. These descriptive accounts provide considerable insight into the scope of what occurred, who rebelled, and what the stated motivations of the peasants were, but in the end these criminal records were compiled by officials, not the participants themselves.
It is difficult to overcome this bias entirely. Archivists and librarians now pay more attention to acquiring and organizing material from underrepresented populations, and researchers have done some incredible work to find glimpses of these populations through creative sources. Unfortunately, it is difficult to recover information that was never recorded or material that was destroyed or lost. It is worth reflecting on how this continues to shape HPE research.
The unsatisfying truth is that there is no easy solution to the problem of systematic silence or absence in the archive. As always, researchers should think carefully about how observations ended up in their datasets, how the data were recorded, and what might be missing or obscured. It may be easiest to see these shortcomings when putting together a new dataset, but they are equally relevant when downloading existing data from places like Harvard Dataverse or ICPSR. It is worth thinking about how these issues could affect our inference and whether there are imperfect solutions or complementary data sources that can be used to alleviate the problem. Above all, and as always, researchers should be transparent about these issues and discuss their empirical results with care.
For further reading on how librarians and archivists think about the problem of archival silences, see this recent book by Thomas et al. For more on methodological issues and best practices for archival research in the quantitative social sciences, I recommend this working paper by Alex Lee at the University of Rochester.