Correlation, Prediction, and Causation


In an article published in Sunday’s Capital-Journal, Celia Llopis-Jepsen reports that three well-known researchers in the field of educational outcomes reviewed a KASB report on the relationship between funding and outcomes, and concluded that the report commits a common fallacy in social research: it claims causation while only demonstrating correlation.

When Celia and I spoke last week and she let me know what the researchers’ reaction to the report had been, I was disappointed. I had tried to be careful and avoid language that implied causation, because I knew that the statistics I had used for the analysis (correlation and simple linear regression) cannot demonstrate a causal relationship between one variable and another. At most, I was trying to demonstrate that higher education funding predicts higher student outcomes. Looking at the report now, I suspect that phrases such as “has an impact on” and “accounts for” are what led to this criticism.

Based on this feedback, I thought it was worthwhile to describe the differences, as I see them, between correlation, prediction, and causation.

Correlation analysis yields statistics that indicate the degree to which two variables move together.

Picture I-70 between Topeka and Lawrence at about 4:00 p.m. on a weekday. There are several cars headed eastbound, all at approximately the same speed and in the same direction. You can say that the cars’ motion is correlated; they are moving together. But there is no prediction or causation between them. You cannot assert that one car’s exiting the highway predicts or causes the exiting of any other. They are all moving in the same direction for different causes: each driver has an independent and unrelated reason for traveling on that highway in that direction at that time.

Now picture a small fishing stream. We see two fish swimming in the same direction downstream. You can say that the fish’s movement is correlated because they are moving together, but you cannot say that the movement of either predicts or causes the other’s. You could, however, say that they are both traveling downstream because of the same cause: the current of the stream itself. It is also possible that they are both traveling downstream for the same reason, such as seeking food or heading to some other common location. But you cannot, based on your observations, predict or assert anything about one’s movements from the movements of the other, except to say that they seem to be moving in the same direction.

Regression analysis yields statistics that indicate the degree to which the value of one variable predicts the value of another.

The distinction between correlation and regression is difficult to describe. Despite having a master’s degree in educational psychology and having spent many an hour knee-deep in far more complex statistical analyses, I find it hard to explain the difference in ways that make sense to those who do not spend much time with statistics.

In short, correlation indicates the degree to which two variables move together. Regression indicates the extent to which changes in one variable (the independent variable) can predict changes in another (the dependent variable). The very language used in naming the variables is confusing because it implies causation, even though the statistics themselves offer no proof of causation.
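The distinction can be made concrete with a small sketch. The funding and outcome numbers below are invented purely for illustration (they are not actual Kansas figures): the correlation coefficient only measures how tightly the two variables move together, while the regression line goes one step further and lets you plug in a funding value to predict an outcome, still without saying anything about causation.

```python
from statistics import mean

# Hypothetical data, invented for illustration only.
funding = [4.0, 4.5, 5.0, 5.5, 6.0]   # per-pupil funding (thousands of $)
outcomes = [62, 66, 65, 71, 74]       # some outcome score

def pearson_r(x, y):
    """Correlation: how tightly two variables move together (-1 to 1)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def least_squares(x, y):
    """Simple linear regression: slope and intercept for predicting y from x."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sum((a - mx) ** 2 for a in x)
    slope = num / den
    return slope, my - slope * mx

r = pearson_r(funding, outcomes)
slope, intercept = least_squares(funding, outcomes)
print(f"correlation r = {r:.3f}")
# The regression line makes a *prediction* for a new funding level;
# nothing in the arithmetic establishes that funding *causes* the outcome.
print(f"predicted outcome at 5.2: {slope * 5.2 + intercept:.1f}")
```

Note that both functions start from the same sums of deviations; regression simply rescales them into units of the dependent variable so a prediction can be read off.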

In one of my statistics classes years ago, the instructor tried to explain the difference between prediction and causation with this example: some insurance companies charge higher premiums for red cars than for cars of any other color. They do so based on analyses showing that red cars are more likely to be involved in motor vehicle accidents than non-red cars.

Does this mean they think the color of the car actually causes more accidents? Of course not. But they acknowledge the research shows red cars are in more accidents, and that is enough for them to justify charging higher premiums for red cars.

Psychologists would probably argue that people who tend to be more adventurous and reckless are more likely to buy red cars, and that it is this tendency that causes both the car color choice and the higher incidence of accidents. But regardless of the underlying causation, the insurers can argue, based on the research, that car color (the independent variable) predicts the likelihood of accidents (the dependent variable).

Taking an example from my own life, I remember driving through Kansas on the way home from a weekend trip with friends. One of my friends looked out into the cow pasture and asserted, “It is going to rain.” The rest of us in the car asked her what the heck she was talking about. She pointed at a group of cows, all lying on the ground, and said, “It’s going to rain. The cows are lying down.” I looked up, saw that there wasn’t a cloud in the sky, and was certain she was either making this up or crazy.

A couple of hours later, after we had all made it to our respective homes, we experienced torrential downpours that none of us could have predicted based on the beautiful blue sky. But the one friend (whom the group later labeled “The Great Cow Oracle”) had been able to accurately predict the rain from the behavior of the cows. I’ve told this story to others in the years since, and they have explained that cows sense barometric pressure changes, or something else that tells them rain is coming, and lie down in groups to rest up before the storm, during which they will need to remain standing so as not to get too cold on the wet ground. Again, the cows lying down (the independent variable) does not cause the rain (the dependent variable), but it is a reliable predictor of it.

Causation is causation

Causation should be a pretty clear concept to most people. In the Capital-Journal article I mentioned at the beginning of this post, I am quoted as saying, “At the level of analysis we’ve done, it definitely does not show causation… In social research, the best we can hope for are indicators.”

I went on to discuss with Celia what most psychology and sociology classes will teach you: that in social research it is very difficult to control or manipulate variables in order to find evidence of causation, and that to do so would in many cases be unethical. We are unable to treat the real world as a controlled experiment, in which we take two seemingly equal groups, manipulate a certain variable differently for each group, and see what the outcomes are.

Instead, we are limited to looking at the data as it exists and applying statistical methods that allow us to determine which variables are significant predictors of which others. And though researchers have developed complex and sophisticated statistical means of doing this, in the end I believe the most we can do is identify the things that most effectively predict the outcomes we seek, and focus our energies on making improvements based on the relationships that seem to exist between them.

In the case of the article in question, the Kansas Association of School Boards and I have a vested interest in determining what will increase positive outcomes for students and schools in Kansas, and in letting folks know what we find and what steps might be effective in making improvements. That was my goal with this report, and it is unfortunate that others doubt the sincerity of that goal. I will strive to ensure that future articles much more explicitly indicate the absence of causal evidence and illustrate why it is still pertinent to pay attention to the predictors.

For more discussion on cause and effect in social research, take a look at the following links: