Adventures with an (Almost) Amazing Dataset

Apr 21, 2021

This post is about an amazing dataset that looks reliable at first glance but has some serious issues when you look closer.[1] The individual errors are small, subtle, and hard to spot. When added together, though, they can cause big problems for HPE researchers.

Before I get into the details about the specific dataset—why so many social scientists have used it, why the data can be misleading, and what the implications of this are—you might have a question: why should you care about idiosyncratic issues with a single dataset that most Broadstreet readers will never use and may never even encounter?

The specific problems are perhaps only of direct interest to people studying early 20th-century Mexico, but this example provides general lessons about how small omissions can lead to big differences in inference and what we might miss through the typical ways of assessing the reliability of a dataset. For me, this case also highlights some shortcomings in how we tend to discuss and share information with other scholars about specific data issues that might be boring and not of general interest but can be important for applied work.

Tacubaya: a separate municipality in the early 20th century, which became part of Mexico City and later part of the borough of Miguel Hidalgo.

Background

Some HPE scholars of Mexico[2] may have already guessed which dataset I’ll be discussing: the Historical Archive of Geostatistical Localities (AHLG, formerly AHL), an incredible resource created and maintained by the government’s National Institute of Statistics and Geography (INEGI). This resource is described as including “all of the country’s localities with their respective historical evolution and data from the various censuses.” It is a database with four linked datasets listing the name and geographic position of each locality, the population of the locality at each census going back to 1900, the history of its name changes and political categorizations over time, and a historical summary of major events in the municipality (similar to a county in the US context).

A particularly useful feature of these data is that they include geographic identifiers that enable researchers to merge information from other sources and aggregate data to different levels of analysis: locality, municipality, or state. The official system of numeric codes, developed by INEGI in 1978, specifies how a locality nests within higher-level units. For example, the code for the city of Aguascalientes, where INEGI is based, is 010010001, from its respective state (01), municipality (001), and locality (0001) identifiers.
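To make the nesting concrete, here is a minimal sketch of how such a code could be split into its parts, assuming the 2 + 3 + 4 digit layout described above. The names `GeoCode`, `parse_geocode`, and `municipality_key` are my own illustrations, not part of any INEGI tool.

```python
from typing import NamedTuple


class GeoCode(NamedTuple):
    """Illustrative container for the three nested identifiers."""
    state: str
    municipality: str
    locality: str


def parse_geocode(code: str) -> GeoCode:
    """Split a 9-digit code into state (2), municipality (3), locality (4).

    Codes are kept as strings so that leading zeros are not lost.
    """
    if len(code) != 9 or not (code.isascii() and code.isdigit()):
        raise ValueError(f"expected a 9-digit code, got {code!r}")
    return GeoCode(code[:2], code[2:5], code[5:])


def municipality_key(code: str) -> str:
    """Return the state + municipality prefix, useful for aggregating localities."""
    parsed = parse_geocode(code)
    return parsed.state + parsed.municipality


# The city of Aguascalientes, using the code quoted in the text.
aguascalientes = parse_geocode("010010001")
```

Storing the identifiers as strings rather than integers matters in practice: reading `010010001` as a number would silently drop the leading zero and break merges on the state prefix.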

For quantitative HPE researchers, this is amazing. As Adam Slez discusses in a series of earlier Broadstreet posts, it can be tough to trace territorial units over time for many reasons. By using the numeric codes in the AHLG, researchers can in principle link a locality or municipality across censuses to examine population and related outcomes at a fine-grained level without having to worry about whether the name of a town has changed or whether municipal boundaries may have shifted slightly. I suspect that a lot of existing work using locality- or municipality-level “census data” from early 20th-century Mexico has used this database to obtain the information for exactly this reason.

What’s the problem?

The problem is closely related to the exact feature that makes the AHLG data so attractive to social scientists: the inclusion of geographic identifiers for merging and aggregating across datasets. Because INEGI’s system of numeric codes was created in 1978, many localities listed on earlier censuses do not have a clear numeric identifier. Some had disappeared before the 1970s. Others had shifted across municipality (i.e., county) boundaries or were merged with other localities. Unfortunately, many of these localities seem to have been entirely omitted from the AHLG database. The vast majority of these omitted localities are small villages of fewer than 1,000 people. However, when aggregated up, these omissions cause the AHLG to significantly undercount population relative to either state-level data available through INEGI or the published municipal- and locality-level data from earlier censuses.[3]

Population undercounts in the AHLG relative to the state-level census at the national (left) and regional (right) levels

The figure above shows the aggregated undercount, as a percentage of total population, of the AHLG relative to INEGI’s state-level census data over time for the whole country (left) and by region[4] (right). The undercounting is significant for the first few censuses. In 1900, for example, the AHLG undercounts the population of the country by over 2.4 million people, about 17% of the total. The undercount also varies across space. As shown in the right side of the figure, the discrepancy is considerable in all regions up until the middle of the century, but the central region around Mexico City (the green line) is a notable outlier. The AHLG vastly undercounts the population of this region up until the 1980 census (which, as I mentioned in an earlier post, is itself unreliable for other reasons). In 1950, the undercount of the central region reaches almost 30% of the total population as reported by the aggregated census.
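The comparison behind a figure like this can be sketched in a few lines. This is an illustration of the general idea rather than the exact procedure used for the figure: sum the locality-level populations by census year and express the shortfall as a percentage of a reference total. The function names and the toy numbers are made up for the example and are not real census figures.

```python
def undercount_pct(locality_total, reference_total):
    """Shortfall of the aggregated locality data, as a % of the reference total."""
    if reference_total <= 0:
        raise ValueError("reference total must be positive")
    return 100.0 * (reference_total - locality_total) / reference_total


def undercount_by_year(locality_rows, reference_totals):
    """Aggregate (year, population) rows and compare them to reference totals.

    locality_rows: iterable of (census_year, population or None)
    reference_totals: dict mapping census_year -> total population
    """
    aggregated = {}
    for year, population in locality_rows:
        if population is None:
            # A missing value contributes nothing, which is how omissions
            # silently turn into an undercount once the data are summed.
            continue
        aggregated[year] = aggregated.get(year, 0) + population
    return {
        year: undercount_pct(aggregated.get(year, 0), reference)
        for year, reference in reference_totals.items()
    }


# Toy numbers for illustration only.
toy_rows = [(1900, 600), (1900, 200), (1910, 900), (1910, None)]
toy_reference = {1900: 1000, 1910: 1000}
toy_result = undercount_by_year(toy_rows, toy_reference)
```

Running the same comparison separately by state or region is what reveals that the shortfall is not spread evenly across space.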

What is going on? The majority of this discrepancy can be traced to a single big omission. In December 1970, the government redrew the territorial boundaries within the Federal District (now Mexico City), creating the four central delegaciones (territorial demarcations) that comprise the city center. The AHLG simply omits population data from earlier years, which were recorded under the previous territorial divisions that lack geographic identifiers, implicitly recording the entire population of central Mexico City as missing or “0” for all censuses up to 1980.

Central Mexico City is probably the most glaring omission, but there are many others. Zihuatanejo, now a major tourist center on the Pacific coast, does not appear at all in the dataset until it becomes the seat of a new municipality in the 1950s, though it is listed in the hard copy of the earlier censuses and in documents going back to the late colonial period. Many towns in Oaxaca are omitted in the 1900 data because of confusion over later changes to the state’s municipal divisions. Every state in the country has a significant number of omitted localities in 1900 and 1910. The AHLG lists over 500 fewer rural localities in Guerrero as of 1900 relative to the published census information from that time. In the northern state of Coahuila, much of which was sparsely settled, the AHLG contains over 700 fewer rural localities than the published list of localities in the 1900 census. Most of the omitted localities are very small, have ambiguous names, and are consequently not easy to trace.

How bad is this really?

As is often the case, this depends on the research design and the question under study. It is clear that these data present a distorted picture of demographic trends over the 20th century when aggregated.

Regional population growth from 1940 to 1990 as captured by state-level census data (left) and the AHLG (right)

The figure above compares regional population growth between 1940 and 1990 using aggregated state-level census data (left) and aggregated information from the locality-level AHLG (right), standardizing reported 1940 regional population to 100 in both cases. This was a period of rapid urbanization and demographic expansion in Mexico, and scholars have explored how government policies regarding trade or the degree of political competition may have influenced the spatial pattern of urban development. One might come to somewhat different conclusions about some of these questions looking at the two sides of the chart. Both graphs capture how quickly Mexico’s population grew over this period, but they present a different picture of the relative growth of different regions. This is especially evident when comparing the growth of the central region around Mexico City (green) and the northwest region, which contains the now giant border cities of Tijuana and Ciudad Juarez (purple). Because the AHLG omits much of Mexico City’s population until after the 1970s, the central region looks like it grows much more rapidly from 1970 to 1990 than it actually does.
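The mechanics of this distortion are easy to reproduce with a toy example. The sketch below indexes a population series to 1940 = 100 and then shows what happens when a fixed block of population is missing from the early years but present later. The function name and all numbers are invented for illustration and do not correspond to any real region.

```python
def index_to_base(series, base_year=1940):
    """Rescale a {year: population} series so that the base year equals 100."""
    base = series[base_year]
    if base <= 0:
        raise ValueError("base-year population must be positive")
    return {year: 100.0 * pop / base for year, pop in sorted(series.items())}


# Toy "reference" series for a hypothetical region.
reference = {1940: 200, 1970: 600, 1990: 800}

# The same region with 100 people omitted through 1970 and counted afterwards,
# mimicking an omission that is only corrected in later censuses.
with_omission = {1940: 100, 1970: 500, 1990: 800}

reference_index = index_to_base(reference)
omission_index = index_to_base(with_omission)

# Growth between 1970 and 1990 implied by each series.
reference_growth = reference[1990] / reference[1970]
omission_growth = with_omission[1990] / with_omission[1970]
```

In this toy case the indexed series with the omission reaches 800 by 1990 instead of 400, and growth from 1970 to 1990 looks like 60% rather than about 33%. An undercounted base year inflates every later value of the index.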

A researcher solely interested in broad macro-level trends like these probably wouldn’t use the AHLG dataset given how easy it is to get state-level census data for Mexico. The real benefit is in using the geographic codes to follow lower-level units—municipalities or localities—over time. Unfortunately, some problems are evident on this level, too.

The proportion of municipal population living in rural localities in 1900 according to the División Territorial (x-axis) and the AHLG (y-axis). The labels are municipality identifiers.

The figure above compares municipality-level data on the proportion of the population living in rural localities (< 2500 people) in 1900 for two states: Coahuila and Guerrero. (Jennifer Alix-Garcia and I digitized detailed population data for these states in part because of obvious omissions in the AHLG.) The x- and y-axes represent the proportion of the population that is rural, using data from the published División Territorial (DT) from the 1900 census and from the AHLG respectively. The AHLG undercounts rural population in almost every municipality. The measures are highly correlated, but there are differences in how severe this undercounting is across space that can lead to problems in applied research.
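A rough sketch of the measure being compared: for each municipality, sum the population of localities below the 2,500 threshold and divide by the municipal total. The function name, the municipality label, and the toy populations are hypothetical. The point is only to show why dropping small localities pushes the rural share down.

```python
RURAL_THRESHOLD = 2500  # localities with fewer people than this count as rural


def rural_share(locality_rows):
    """Return {municipality: share of population living in rural localities}.

    locality_rows: iterable of (municipality_id, locality_population)
    """
    total = {}
    rural = {}
    for municipality, population in locality_rows:
        total[municipality] = total.get(municipality, 0) + population
        if population < RURAL_THRESHOLD:
            rural[municipality] = rural.get(municipality, 0) + population
    return {
        municipality: rural.get(municipality, 0) / municipal_total
        for municipality, municipal_total in total.items()
        if municipal_total > 0
    }


# Toy municipality "A" as it might appear in a complete published list.
complete_list = [("A", 5000), ("A", 1500), ("A", 1000), ("A", 500)]

# The same municipality with its two smallest localities omitted.
with_omissions = [("A", 5000), ("A", 1500)]

share_complete = rural_share(complete_list)["A"]
share_with_omissions = rural_share(with_omissions)["A"]
```

Here the rural share falls from 0.375 to roughly 0.23 once the two small localities are dropped. Because the number of omitted localities differs from one municipality to the next, the size of this bias differs too.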

Why do people use these data?

One big reason is that researchers might not know about these issues. The problems with this dataset only become evident when looking at very detailed or very aggregated levels of analysis (Why is central Mexico City missing in 1940? Why is the 1900 national population off by over 2 million people?). Most applied researchers, I suspect, are working with municipality- or locality-level population data, where the discrepancies with other datasets can be obscured and where there are enough observations that strange outlier cases like Mexico City might be easily missed. Another complication is that the discrepancies are generally due to omissions rather than to incorrectly recorded figures, which can also make it harder to notice mistakes. I worked with these data for a while before I noticed the problems.

It is also hard to find an alternative source of locality- or municipality-level data going back this far. The numerous boundary revisions and name changes in Mexican municipalities and localities make it difficult to manually trace units over time. “Fixing” the AHLG would be an arduous task, perhaps even an impossible one, given that many of the missing localities never directly appear in later censuses. For all of its flaws, there is a lot that can be learned from the AHLG with some adjustments. For our paper on urban development in Mexico, for example, we manually added missing cities and selected our outcome variable to account for the fact that population in general, and especially rural population, is highly undercounted in the early part of the century.

The problems are not relevant for all empirical work using the AHLG. Someone working with post-1970s data, for example, or using the AHLG to trace a specific locality over time would not need to worry about this. Scholars might also be able to address some of the problems through their research design or by drawing on supplemental data. It is, however, important to know that these problems exist so that researchers can consider whether or not they might matter in a given application and how best to address the issue.

What general lessons can we learn from this example?

Those working with the AHLG should obviously be careful with their analysis for the reasons outlined above, but there are some general lessons that I’ve learned from this experience as well.

Examine your data at multiple levels. Some of the issues with the AHLG only became apparent when I aggregated the data up to examine broad population trends over time or looked carefully at specific localities that looked suspicious. Did Mexico City really quadruple in size between 1960 and 1980? No. Did Tijuana more than double its population between 1950 and 1960? Actually…yes. Mapping the data also helped to highlight big discrepancies in specific states, especially with the 1900 census. These problems were a lot harder to spot in my initial municipal-level analysis.
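One simple screening device along these lines is to flag any unit whose population changes implausibly between consecutive censuses and then check those cases by hand. The sketch below is only an illustration with an arbitrary threshold and made-up numbers. As the Tijuana example shows, a flag is a prompt to look closer, not evidence of an error.

```python
def flag_jumps(series, max_ratio=2.0):
    """Flag consecutive census pairs with a suspiciously large population change.

    series: dict mapping census_year -> population (None or 0 if unrecorded)
    Returns a list of (earlier_year, later_year, ratio); ratio is None when
    the earlier value is missing or zero and the later value is positive.
    """
    years = sorted(series)
    flags = []
    for earlier, later in zip(years, years[1:]):
        before, after = series[earlier], series[later]
        if not before:
            # Appearing "from nothing" is worth checking against the
            # published census: it may be an omission, not a new town.
            if after:
                flags.append((earlier, later, None))
            continue
        if after is None:
            continue
        ratio = after / before
        if ratio > max_ratio or ratio < 1 / max_ratio:
            flags.append((earlier, later, ratio))
    return flags


# Toy series for illustration only.
steady_then_jump = {1940: 100, 1950: 120, 1960: 130, 1980: 520}
appears_late = {1940: None, 1950: 300, 1960: 330}

jump_flags = flag_jumps(steady_then_jump)
late_flags = flag_jumps(appears_late)
```

Running a check like this at the locality, municipality, and state levels surfaces different problems, which is the main reason to look at more than one level.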

We should find better ways of sharing information like this. One not-so-secret goal of this post is just to publicize the issues with this dataset so that other people in my field can know about them. Some of us have been aware of these problems for a while, but it is hard to find the right venue to disseminate this information. Idiosyncratic issues with a single dataset aren’t of general interest and consequently don’t end up in disciplinary journals. We understandably spend most of our time in specialized workshops discussing substantive work rather than “minor” details about data generation or processing. As I noted in my earlier post on the problematic 1921 census, the way this information tends to get passed along is by word of mouth through informal conversations with colleagues. This network-dependent process excludes a lot of people, and it leaves others confused about what the problems are given the absence of physical documentation. As a result, researchers may inadvertently continue to misunderstand or misuse the data, to all of our detriment.

This boring data work isn’t always valued, but it can be important. Scholars in our field place a premium on research that makes a theoretical or substantive contribution to a topic of general interest, and perhaps rightfully so. Without reliable data, however, it is hard to know whether to trust the substantive conclusions coming out of this work. Assessing idiosyncratic problems with a single dataset is about as far from a topic of general interest as one could get, but these boring errors can unfortunately make a real difference in our research. It might be especially important to look carefully at datasets like this one that are easy to access and look reliable but may have significant problems just below the surface.

[1] A big thank you to Rohan Alexander and participants of the Toronto Data Workshop for a great discussion on this dataset, the inference issues, and possible solutions last week.

[2] I am grateful to Luz Marina Arias, Alberto Diaz-Cayeros, Francisco Garfias, and especially Jennifer Alix-Garcia for earlier conversations on this topic.

[3] There are some slight discrepancies between these two sources, but these are extremely small relative to the large discrepancy with the AHLG.

[4] I use the regional definitions from the Plan Nacional de Desarrollo, 2002-2006 (p. 10), which are also used by the Mexican Family Life Survey.

Broadstreet
