Following up on the discussion in my previous entries, this week’s post focuses on what to do when working with spatial data where the boundaries of the units change over time. This is a familiar problem for those of us doing quantitative history. In the absence of the type of individual-level data that has become commonplace in the social sciences, we are often forced to make do with aggregate data composed of information on areal units such as counties and precincts. One nice thing about this type of data is that, to the extent that it is a product of official record keeping on the part of the state, it is often available at regular intervals, allowing for the construction of time-series cross-section (TSCS) data. This is not as easy as it sounds. Whether you are looking at printed tables side by side or trying to make sense of a seemingly botched merge in your favorite statistical software, it is not uncommon to see cases come and go as a result of processes such as division and merger. More troubling are the harder-to-spot cases where the name of the observation stays the same, but its boundaries change, which is only apparent if you are also looking at the corresponding maps or happen to have tabular data on land area. When boundaries change, it becomes difficult to compare values over time because the cases being observed are, in effect, no longer the same.
I first dealt with this problem while working with county-level data from Minnesota, North Dakota, and South Dakota in the period between 1890 and 1900. This involved looking at both interdecennial and biennial change (i.e., looking at more than just a single pair of years). At the time, I am not sure that I was aware that there were different ways of handling the problem, so I began with what came naturally: aggregation. The basic idea is to combine unstable units to create a common geography that can be compared over time. While aggregation is often thought of as an alternative to areal interpolation, which I will discuss at length in my next post, the two approaches can be understood in terms of the same basic framework. Generally speaking, the way to deal with changes in boundaries is to project values from a source geography onto a stable target. (I’m about to do some math, but bear with me because it will pay off with some real data analysis.) The relationship between the n_s × 1 vector of observed source values y_s and the n_t × 1 vector of estimated target values y_t can be expressed as y_t = W y_s, with the n_t × n_s weights matrix W depicting the share of each source value received by each target observation. Whereas interpolation involves estimating the weights that make up W, aggregation involves constructing the target geography that maximizes the number of target observations while ensuring that each source value is allocated to a single target. The two approaches are directly connected, in the sense that if we treat the weights matrix used for interpolation as a network, we can derive the target geography used for aggregation.
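To make the notation concrete, here is a toy version of the projection in R. All of the numbers are made up for illustration; nothing here comes from the county data discussed below.

```r
# A toy version of y_t = W y_s: three source units feeding two targets
y_s <- c(100, 40, 60)          # observed source values (n_s = 3)
W <- rbind(c(1.0, 0.3, 0.0),   # n_t x n_s matrix of shares: each column
           c(0.0, 0.7, 1.0))   # shows how a source is split across targets
y_t <- W %*% y_s               # estimated target values (n_t = 2)

# Under aggregation, every column of W contains a single 1, so each source
# value is allocated to exactly one target; under interpolation, the shares
# in W have to be estimated.
```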
This was not obvious to me when I started working on this problem. The area I was looking at was small enough that I could go through by hand and identify the clusters of counties that were changing. This wasn’t fun, but it was doable. At the time, I was working for Katherine Curtis as part of a team of research assistants that included Heather O’Connell and Jack DeWaard. While we each had our separate projects, we all kept bumping into the boundary problem in one way or another. This led to the idea that we could think about boundary changes as a form of geographic exchange, which opened the door to a network-analytic solution. As described in my paper with Heather and Katherine, the key to this approach is representing the set of geographic exchanges as a hypergraph (i.e., a network in which edges are allowed to connect more than two nodes), which dramatically simplifies the task of identifying common geographies when working with more than two time points. Once you calculate the intersection of the geographies of interest, identifying common geographies is equivalent to identifying the connected components of the resulting hypergraph. This may sound like a lot, but it can actually be done on the fly in R relatively easily. I also wrote a package that is available on GitHub, though it is in dire need of an update. In fact, if you go to the page, you will see a big warning about allowing the package to calculate geographic intersections internally due to issues with the rgeos package. When I get the time, I would like to rewrite the code to not only use the sf and tidygraph packages, but also to provide better support for messy geographic data. As of right now, the program doesn’t do anything to simplify features and/or fix invalid topologies.
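Since the package is due for that rewrite, here is a minimal sketch of what the on-the-fly version might look like using sf and igraph. The layer names (counties_1860, counties_2010), the ID columns (id_1860, id_2010), and the 5 percent overlap threshold (which I come back to below) are all assumptions for the sake of illustration, and I am ignoring the feature-simplification and topology issues just mentioned.

```r
library(sf)
library(igraph)

# Intersect the two county layers (assumed to share a projected CRS)
overlap <- st_intersection(counties_1860, counties_2010)
overlap$area <- as.numeric(st_area(overlap))

# Areas of the original counties, matched to each piece of overlap
a_1860 <- as.numeric(st_area(counties_1860))[match(overlap$id_1860, counties_1860$id_1860)]
a_2010 <- as.numeric(st_area(counties_2010))[match(overlap$id_2010, counties_2010$id_2010)]

# Keep overlaps large enough to count as a geographic exchange
# (here, more than 5 percent of the area of either county)
overlap <- overlap[overlap$area / a_1860 > 0.05 | overlap$area / a_2010 > 0.05, ]

# Treat each retained overlap as a tie in a bipartite exchange network
edges <- data.frame(from = paste0("s", overlap$id_1860),
                    to   = paste0("t", overlap$id_2010))
g <- graph_from_data_frame(edges, directed = FALSE)

# Connected components of the network correspond to common geographies
membership <- components(g)$membership
```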
One of the advantages of this approach is that it doesn’t require a pre-existing crosswalk, which means that it can be readily applied to new cases. To get a feel for how this works, let’s consider a real example using data from 1860 to examine the relationship between electoral decision-making in 2016 and the legacy of slavery in the American South. I should be clear up front that this is not my primary area of work; I am simply using this case to provide a real-world example of how this method might be used. The reason I decided to focus on the problem of projecting data from 1860 onto contemporary county boundaries is that this became a test case when Heather O’Connell began using early versions of our method to build a county-level dataset that would allow her to examine the connection between the percentage of the county population that was enslaved in 1860 and the Black-White poverty ratio in 2000. The legacy-of-slavery literature has since expanded to include work on political attitudes and behavior, as highlighted by the work of Acharya, Blackwell, and Sen (2015), who estimate the effect of slavery on partisanship, among other things. My plan for the rest of this post is to extend this example to the case of voting during the 2016 election, using the network-analytic approach just described to construct a stable geography that can be readily linked to both historical and contemporary data.
The first step is to construct the network representing the changes in county boundaries between 1860 and 2010. While there have been changes to county geography since 2010, they are relatively minor and do not affect the identification of common geographies in the study area. The data in question come from the National Historical Geographic Information System (NHGIS) project, which provides geographic boundary files together with census data. I simplified the boundary files prior to analysis in order to speed up the calculation of the geographic intersection. After using these simplified boundary files to calculate the area of overlap between counties in 1860 and counties in 2010, I constructed the exchange network shown in the figure below. A pair of counties is treated as tied if the area of overlap between them is more than 5 percent of the area of either county in the pair. The resulting graph is composed of 531 components, each of which represents a common geography. Roughly 92 percent of these clusters consist of a single dyad. In this context, an isolated dyad represents a geographically stable county: an 1860 county tied only to its 2010 counterpart. We also observe a number of small star networks in which a county from 1860 is connected to two or more counties from 2010, suggesting a process of division.
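For what it’s worth, the component structure described here can be pulled straight from the igraph object created in the earlier sketch:

```r
comp <- components(g)

comp$no               # number of components (i.e., common geographies)
table(comp$csize)     # distribution of component sizes
mean(comp$csize == 2) # share of components that are isolated dyads
                      # (a stable county tied only to its counterpart)
```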
Perhaps the most notable feature in the graph, however, is the pair of large components at the top, each of which contains more than 100 nodes. The largest component, which appears in the top right corner of the figure, stands out in particular because of the giant green blob of 2010 counties surrounding a single 1860 county. This unusual pattern is the result of a massive unorganized county in West Texas (not to be confused with West, Texas) that provided land to 72 counties in 2010! The effects of aggregation are apparent in the map below, which shows the percentage of the population in 1860 that was enslaved in each county/county cluster. Looking at the figure, we see large swaths of undifferentiated space in just about every state except Virginia, North Carolina, and, to a lesser extent, Tennessee. The amount of information lost due to aggregation is considerable. While not all counties have complete data, it is telling that we go from 887 counties in 1860 and 1,143 counties in 2010 to 531 counties/county clusters when we aggregate. This is the cost of using aggregation to create a stable geography. The benefit, however, is that by constructing a geography in which each source value is assigned to a single target, we can be sure of the target values (or at least as sure as we can be of the original tabular data). When we use areal interpolation, on the other hand, we generally preserve the geography of interest while estimating the weights used to allocate values from sources to targets. This introduces the possibility of spatially autocorrelated measurement error, which often goes unmodeled (an idea that I would like to come back to in the future).
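Continuing the sketch, projecting the 1860 values onto the common geography amounts to a grouped sum once each county is assigned to a component. The population columns (pop_total, pop_enslaved) are hypothetical stand-ins for the NHGIS tabular data:

```r
library(dplyr)

# Assign each 1860 county to its component (i.e., its common geography)
counties_1860$cluster <- membership[paste0("s", counties_1860$id_1860)]

# Because each source is allocated to a single target, the target values
# are simple sums rather than estimates
clusters_1860 <- counties_1860 |>
  st_drop_geometry() |>
  group_by(cluster) |>
  summarise(pop_total    = sum(pop_total),
            pop_enslaved = sum(pop_enslaved)) |>
  mutate(pct_enslaved = 100 * pop_enslaved / pop_total)
```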
Now that we have projected the distribution of the enslaved population onto a stable geography, we can see how it corresponds to patterns of electoral decision-making. Whereas Acharya, Blackwell, and Sen aggregated individual-level data from the 2006, 2008, 2009, 2010, and 2011 iterations of the Cooperative Congressional Election Study (CCES) to estimate the proportion of White voters identifying as Democrat in each county, I use county-level voting returns provided by the MIT Election Data and Science Lab. The figure below presents the results from four different models, each of which is designed to estimate the relationship between the percentage of the population that was enslaved in 1860 and the percentage of the county-level vote received by Democratic candidate Hillary Clinton in 2016. If we take these two variables exactly as they are and produce an estimate that is both unadjusted and unweighted, we find a strong positive relationship between the concentration of slavery in 1860 and Democratic vote share in 2016. This is not surprising given that the percentage of the population that was enslaved in 1860 is negatively correlated with the size of the White population in the present day, which is in turn negatively correlated with the percentage of votes received by Hillary Clinton. Simply put, counties/county clusters that had more slavery in the past tend to have smaller White populations today, and thus tended to have a higher Democratic vote share in 2016.
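In case it helps to see them spelled out, the four models are variations along two dimensions: adjustment for racial composition and weighting by the number of voters. A sketch, assuming a merged data frame dat with hypothetical columns pct_dem_2016, pct_enslaved_1860, pct_white_2010, and n_voters:

```r
# Four variations: adjusted or not (for percent White in 2010),
# weighted or not (by the number of voters)
m1 <- lm(pct_dem_2016 ~ pct_enslaved_1860, data = dat)
m2 <- lm(pct_dem_2016 ~ pct_enslaved_1860 + pct_white_2010, data = dat)
m3 <- lm(pct_dem_2016 ~ pct_enslaved_1860, data = dat, weights = n_voters)
m4 <- lm(pct_dem_2016 ~ pct_enslaved_1860 + pct_white_2010,
         data = dat, weights = n_voters)
```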
Wait. Isn’t this exactly the opposite of what Acharya, Blackwell, and Sen found? Not exactly. Their sample was composed entirely of White respondents, which meant that they were effectively controlling for race. In this case, if we control for the percentage of the population that is White (according to the 2010 census), the relationship between the size of the enslaved population in 1860 and Democratic vote share in 2016 switches direction, as can be seen in the figure above (the panel depicting the adjusted relationship between Democratic vote share and the concentration of slavery is just a partial regression plot). The estimated effect is now negative, but weaker than we might expect based on the bivariate results reported by Acharya, Blackwell, and Sen, which indicate that a ten-percentage-point increase in the size of the enslaved population is associated with a 2.17-point decrease in the expected percentage of Whites identifying as Democrat. This is roughly four times the size of the effect that we observe when looking at Democratic vote share in 2016 using a model in which we adjust for racial composition but treat counties/county clusters the same regardless of the number of voters. If we weight each county/county cluster by the number of voters, however, the estimated effect of the size of the enslaved population in 1860 is nearly identical to that reported by Acharya, Blackwell, and Sen, who weighted their observations by sample size! Keep in mind, of course, that we still aren’t estimating a White-specific effect the way they are…
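If you want to build something like the partial regression (i.e., added-variable) plot just mentioned, the recipe for the unweighted adjusted model is straightforward, again using the hypothetical column names from the sketch above:

```r
# Residualize the outcome and the focal predictor on the control, then
# plot the residuals; by the Frisch-Waugh-Lovell theorem, the slope of
# the fitted line equals the coefficient on pct_enslaved_1860 in the
# adjusted model
res_y <- resid(lm(pct_dem_2016 ~ pct_white_2010, data = dat))
res_x <- resid(lm(pct_enslaved_1860 ~ pct_white_2010, data = dat))

plot(res_x, res_y,
     xlab = "Percent enslaved, 1860 (residualized)",
     ylab = "Democratic vote share, 2016 (residualized)")
abline(lm(res_y ~ res_x))
```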
At this point, you should have a ton of questions. If so, I’m right there with you. I will work through this example some more in subsequent posts. There is a lot to talk about here. For example, what do the results look like if we use areal interpolation instead of aggregation? Even better, what are we really estimating when we use county-level data? How is this affected by the use of weights? The answers to these questions might surprise you. Indeed, when we model county-level vote share and weight counties by the number of voters, we can think of this as an individual-level model in which individual outcomes are expressed as a function of county-level covariates, much like a multilevel model (minus the correction for grouping effects). What are the implications of omitting individual-level covariates such as race, and is there a way to make up for it using county-level counterparts such as the percentage of the county population that is White? The answer is yes, but only if we are willing to make some pretty strong assumptions. This gets us way beyond the current post, which is really about boundary changes in the types of spatial data commonly used in quantitative history. The reason for mentioning all of these issues is that, to the extent that historical researchers are forced to work with what is available, we need to be conscious of the limitations inherent in our data. These problems are by no means unsolvable, nor should we let them grind the practice of quantitative history to a halt. How we solve these problems may depend on how we think about the question at hand.