Spatial auto-correlation is not causation

There’s a strong tendency in human nature to draw distinctions along dichotomous lines. Good and evil, black and white, ugly and pretty. We all know that these distinctions only really work in children’s fiction, and even then tend to fall flat, but we try anyway. In teaching, particularly of a new subject, those dichotomies can be useful, but they can also be the downfall of a lesson.

In that vein, the instructor in my spatial econometrics workshop last week presented two significant data issues that a researcher might encounter in using spatial data: spatial heterogeneity and spatial dependence.

By way of definition: spatial heterogeneity is simply that there is something about an area or a piece of space that is different than the spaces around it. My dichotomizing, learning mind went immediately to the idea of observables. Clearly, if we are trying to include spatial information–location–in a regression, we know that the area has certain characteristics. As long as we explicitly control for these in our regression (and believe they are accurately measured), it doesn’t present much of a problem.

However, this is not always the case, due to the level-of-analysis problem. In a general econometric specification, we control for the unit of spatial analysis that is relevant–county, Metropolitan Statistical Area (MSA), state, whatever it may be. By choosing the level and assigning a dummy variable, perhaps, we assume that all those characteristics are captured uniquely, but also that they are assigned independently to the spatial unit. Take for instance the distribution of the African-American population in the United States. Regression analysis that uses that variable as a covariate assumes that the number of African-Americans in Georgia is independent of the number of African-Americans in South Carolina, which makes little intuitive sense. Both were states with large plantation economies that relied on enslaved Black people from Africa to produce goods. It makes sense that these two states, spatially proximate, would also have similar factors leading to their demographic makeup. Thus, spatial heterogeneity: areas in the South have higher Black populations than areas in the North.

The counterpart to spatial heterogeneity is spatial dependence. As with spatial heterogeneity, we see patterns occur in certain variables, but rather than an outside, perhaps observable and easily measurable factor that accounts for the clustering, there’s something inherent about the place itself that causes proximate areas to change their realization of some variable. Think of housing prices. Housing prices are higher in places with certain amenities (close to transportation, mountains, whatever), but housing prices are also higher in areas with higher housing prices. Perhaps homeowners see their neighbors selling their houses for more and thus put their own on the market for more. Or buyers see houses in the area with higher values and thus are willing to spend more. This spills over county and other lines, too.

Both of these problems, regardless of how strict the line between the two is, manifest as spatial auto-correlation. The variation we see in each variable for two spatially proximate observations is less than the variation for two independent observations because the information comes from the same place. Some of this we can control for, some of it we can’t, and some of it we can try to control for with the tools I’ll discuss in coming days.

Regardless, it’s important to remember that the realization of spatial heterogeneity and spatial dependence is the same mathematically. Statistically, we cannot differentiate between whether some unobservable variable caused everything to be higher, or whether each observation is exerting an effect on its neighbors (a butterfly flaps its wings…). So, even with acknowledgement of these problems, we have not established causation.
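To make that point concrete, here is a minimal simulated sketch (not something from the workshop). It assumes a hypothetical row of 200 areas where each area's neighbors are the areas on either side, and it uses Moran's I as the summary of spatial auto-correlation. The step size of 3, the spillover strength `rho = 0.7`, and the spatial-lag form `y = rho * W y + e` are all illustrative choices of mine.

```python
import numpy as np

def chain_weights(n):
    """Row-standardized neighbor weights for n areas laid out along a line.

    The line layout is a hypothetical simplification: each area's
    neighbors are just the areas immediately to its left and right.
    """
    W = np.zeros((n, n))
    idx = np.arange(n - 1)
    W[idx, idx + 1] = 1.0
    W[idx + 1, idx] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def morans_i(y, W):
    """Moran's I: a spatial analogue of a correlation coefficient."""
    z = y - y.mean()
    return (len(y) / W.sum()) * (z @ W @ z) / (z @ z)

rng = np.random.default_rng(0)
n = 200
W = chain_weights(n)

# Heterogeneity: an unobserved regional factor shifts one half of the
# map upward; beyond that, areas are independent of one another.
region = (np.arange(n) < n // 2).astype(float)
y_het = 3.0 * region + rng.normal(size=n)

# Dependence: each area's value responds to its neighbors' values,
# y = rho * W y + e, solved here as y = (I - rho W)^{-1} e.
rho = 0.7
y_dep = np.linalg.solve(np.eye(n) - rho * W, rng.normal(size=n))

# Benchmark: no spatial structure at all.
y_iid = rng.normal(size=n)

i_het = morans_i(y_het, W)
i_dep = morans_i(y_dep, W)
i_iid = morans_i(y_iid, W)

print(f"heterogeneity: {i_het:.2f}")
print(f"dependence:    {i_dep:.2f}")
print(f"independent:   {i_iid:.2f}")
```

Both of the first two series should show clearly positive Moran's I while the benchmark sits near zero, and nothing in the statistic itself reveals which mechanism generated which series.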

A familiar refrain is, thus, minimally modified: spatial auto-correlation is not causation.

A note on correlation and causation: (see Marc Bellemare’s primer for a more detailed explanation)

Anyone who has ever taken a statistics course is familiar with the refrain that correlation is not causation. It’s a common refrain because it’s something that is often ignored when statistics are cited in news articles and personal anecdotes. My favorite example of this is that ice cream sales and murder rates are highly correlated. Only the biggest of scrooges would believe that ice cream sales caused murder rates to increase. In the abridged words of Elle Woods, happy people don’t kill people. And in my words, ice cream makes people happy.

They do move together, though, which is essentially the definition of correlation. When ice cream sales go up, murder rates go up; when murder rates go down, ice cream sales go down. Not because one causes the other, but rather because of the seasonality of both variables. More homicides occur in the summertime, and more ice cream is sold in the summertime.
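The seasonality story can be sketched with simulated numbers. Everything below is made up for illustration: the 20 years of monthly data, the sine-wave stand-in for "summer," and the coefficients are assumptions, not estimates from real ice cream or homicide data. Neither series affects the other in the simulation; both simply respond to the season.

```python
import numpy as np

rng = np.random.default_rng(1)

# 20 years of hypothetical monthly observations.
months = np.arange(240)
season = np.sin(2 * np.pi * months / 12)  # crude stand-in for summer

# Both series depend on the season, and on nothing else in common.
ice_cream = 10 + 4 * season + rng.normal(size=months.size)
homicides = 5 + 2 * season + rng.normal(size=months.size)

def residualize(y, x):
    """Return the part of y left over after a linear fit on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Raw correlation: the two series move together.
raw_corr = np.corrcoef(ice_cream, homicides)[0, 1]

# Correlation after removing the seasonal component from each series.
partial_corr = np.corrcoef(
    residualize(ice_cream, season),
    residualize(homicides, season),
)[0, 1]

print(f"raw correlation:               {raw_corr:.2f}")
print(f"correlation net of the season: {partial_corr:.2f}")
```

The raw correlation should come out strongly positive, and it should largely vanish once the shared seasonal driver is accounted for.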

Spatial Econometrics: The Miniseries

Last week, I spent three days in a workshop (or short course) on spatial econometrics at the University of Colorado’s interdisciplinary population center, the Institute of Behavioral Science. At the beginning of last semester, many of my methods students expressed interest in doing their research papers on a topic with a significant spatial component. I would have loved for them to incorporate spatial analysis, but it was a topic I had touched only tangentially, and I didn’t feel equipped to learn it while teaching that (incredibly demanding) course for the first(ish) time. In addition, having just attended the PAA meetings in San Francisco, I’ve been looking for ways to expand my econometric skills and incorporate spatial data into my work. The workshop was really fantastic. I don’t know whether they’ll be hosting the event again next summer, but do keep a lookout if you’re interested. I thought it was extremely helpful. And fun (see nerdy tweets from last week about loving matrix algebra). Paul Voss, of the University of North Carolina’s Population Center, Elisabeth Root, and Seth Spielman were all great.

I posted a short introduction to spatial econometrics last week based on my readings for the first class and am now excited to share some of the things I learned, so over the next few weeks, I’ll post some of my thoughts in a mini-series on spatial econometrics. This post will be updated with a list of posts in the series, so do follow along.

Experts, please keep me honest! This stuff is very cool, but I’m still a newbie.

Preliminary outline (subject to change):

  1. An introduction to Spatial Econometrics
  2. Spatial Autocorrelation is Not Causation
  3. The Weights Matrix for Spatial Analysis
  4. Some Notes on Terminology in Spatial Econometrics

An introduction to spatial analysis

After my first, rather disastrous, year of graduate school in Boulder, I almost transferred to Geography. Or at least, I thought a lot about it. While the math in Economics was kind of kicking my butt, everyone working with graphs and maps seemed so blissfully happy. Ultimately, I stuck it out in Economics, and am extremely glad that I did, but I haven’t lost my love of maps and have always been curious about spatial research.

Next week, I’ll be doing a three-day workshop at the University of Colorado’s Institute of Behavioral Science. Many of my economics professors were associated with IBS, but none really did spatial analysis, so I was left to find out some of it on my own. A few years ago, I helped design a survey on handwashing and other hygiene behaviors for a group building latrines and protecting water sources in Nepal. The data are fascinating, and though we started analyzing them, everyone had limited use of one of the two tools necessary to do spatial regression. I had the Stata skills and my coauthors had limited GIS skills, but combining them wasn’t going to happen. This short course is hopefully the next step in getting those papers off the ground and into journals, but also, more importantly, back to the community where we did the research. Though we’ve presented some findings to them, I’m sure there are many more insights to be had with these data.

With that, I’ll be reading a lot of spatial analysis papers over the next week. The syllabus has hundreds of pages of reading, much of which I’ve printed out and am planning to read on my long trip back to Colorado next week, but I’m willing to share the “lite” version with you all.

For definitional purposes, spatial analysis is “the formal quantitative study of phenomena that manifest themselves in space,” according to Luc Anselin. More informatively, I think, spatial analysis allows us to “interpret what ‘near’ and ‘distant’ mean in a particular context” and showcase whether and how proximity or location have an effect on an outcome we’re interested in.

Anselin divides spatial analysis into two categories–data-driven analysis and model-driven analysis, and highlights the challenges of each, which I imagine will get plenty of air time next week and are a little bit daunting to a student and devotee of econometrics:

Indeed, the characteristics of spatial data (dependence and heterogeneity) often void the attractive properties of standard statistical techniques. Since most EDA techniques are based on an assumption of independence, they cannot be implemented uncritically for spatial data…As a result, many results from the analysis of time series data will not apply to spatial data.

Model-driven analysis seems much more up my alley and suited to regression, but the main problem, which I encountered in my own research, “is how to formalize the role of ‘space.’”

Just like this, the basic ideas and tools used in spatial regression seem fairly consistent with my view of econometrics in general. There are tradeoffs to employing different models and assumptions, and measurement error is alive and well. Notably, although this could be out of date by now: “Spatial effects in models with limited dependent variables, censored and truncated distributions, or in models that have count data have been largely ignored…multivariate dependent distributions other than the normal are highly complex.” More to come, I’m sure. My colleague has already told me I have to teach him in the Fall, and I’m hoping to be able to incorporate some of this into my Methods class, so get ready for some spatial econometrics here.

As an aside, if you happen to be in Colorado, check out these cool solar events that are happening, including a world-record-breaking attempt at the most people in one place to watch a solar eclipse together at CU’s Folsom Stadium. Or, well, you could just go look at it where you are, too.

Referenced: Anselin, Luc. 1989. “What Is Special about Spatial Data? Alternative Perspectives on Spatial Data Analysis.” Conference Proceedings, Spatial Statistics: Past, Present, and Future. Institute of Mathematical Geography, Syracuse University.