The statement “correlation does not imply causation” is well known and has become part of everyday language. The expression should more accurately be “correlation does not necessarily imply causation”. In this post we will unpack what the expression means in more detail.

**The main idea**

Correlation is necessary for concluding causation, but not sufficient. The absence of correlation means there is no causation. The presence of correlation means that there may be causation, depending on how rigorous the analysis is.

**What is correlation?**

Let’s start by understanding what correlation is. Correlation is a numerical value that describes how two sets of data change together. It is a number between -1 and 1, called the correlation coefficient. The correlation coefficient describes two characteristics of the relationship between the two sets of data:

- **Strength of relationship**: 1 and -1 represent the strongest possible positive and negative relationships. 0 represents no (linear) relationship. Numbers such as 0.2 or -0.3 represent weak relationships.
- **Direction of relationship**: a negative correlation coefficient represents a negative relationship, such that when one data set increases, the other decreases. A positive coefficient indicates two data sets that move in the same direction: when one increases the other increases, and when one decreases the other decreases.
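To make the two characteristics concrete, here is a minimal sketch that computes the correlation coefficient by hand. It assumes the Pearson coefficient is the one meant here; the function name `pearson_r` and the sample data are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How the two data sets vary together, and how much each varies alone.
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined for constant data")
    return co_variation / math.sqrt(variation_x * variation_y)

# Hypothetical data, chosen so the relationships are perfectly linear.
base = [1, 2, 3, 4, 5]
same_direction = [2, 4, 6, 8, 10]      # rises whenever base rises
opposite_direction = [10, 8, 6, 4, 2]  # falls whenever base rises

print(pearson_r(base, same_direction))      # strongest positive relationship
print(pearson_r(base, opposite_direction))  # strongest negative relationship
```

The sign of the result gives the direction, and its distance from 0 gives the strength. Real data rarely sits exactly on a line, so values strictly between -1 and 1 are the norm.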

In the 1950s, studies showed a strong relationship (correlation) between smoking and the incidence of lung cancer, and in 1964 a US Surgeon General’s report concluded that smoking caused cancer. A famous statistician named Ronald Fisher had challenged this kind of claim. He argued that there may be an unexplored hidden relationship: a genetic element might predispose people both to smoke and to develop cancer, or the early stages of cancer might themselves cause people to want to smoke.

**Correlation is a necessary but not sufficient condition for causation**

The bar for making a statement about causation is high. Here is what Fisher brought to light. Let’s assume B is a data set of dollars spent on advertising and C is a data set of sales for the product. Analysis of the data shows a strong positive correlation; let’s assume 0.85. It is tempting to conclude that the increase in advertising caused the increase in product sales. This reasoning is flawed for several reasons:

- It could be that something different caused C (the increase in sales). For example, the competitor product may have been recalled due to a safety issue resulting in increased sales of the product, or there may have been price discounts during the same period increasing customer demand for the product.
- It is possible that a third factor A caused both B and C. For example, a new product feature was added and launched via an advertising campaign. Did the advertising cause sales to increase, or did the new product feature? Such a factor is called a confounding factor.
- It is possible that A together with B caused C. This is called an interaction effect.
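The confounding case can be illustrated with a small simulation. This is a sketch under invented assumptions: the hidden factor A, the noise levels, and the sample size are all made up, and neither B nor C appears in the other’s formula, so any correlation between them comes from A alone.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    return co_variation / math.sqrt(variation_x * variation_y)

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# A: hypothetical hidden factor (for example, the appeal of a new feature).
hidden_a = [rng.gauss(0, 1) for _ in range(n)]
# B (ad spend) and C (sales) each depend on A plus their own independent noise.
ad_spend_b = [a + rng.gauss(0, 0.5) for a in hidden_a]
sales_c = [a + rng.gauss(0, 0.5) for a in hidden_a]

r_raw = pearson_r(ad_spend_b, sales_c)

# Subtracting A's contribution leaves only the independent noise terms.
b_without_a = [b - a for b, a in zip(ad_spend_b, hidden_a)]
c_without_a = [c - a for c, a in zip(sales_c, hidden_a)]
r_adjusted = pearson_r(b_without_a, c_without_a)

print(f"correlation of B and C:       {r_raw:.2f}")
print(f"correlation after removing A: {r_adjusted:.2f}")
```

With these settings the first correlation should come out strong (around 0.8 in theory) even though B has no effect on C, and the second should be close to 0. In real observational data A is usually unmeasured, so it cannot simply be subtracted out like this.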

It is for these reasons that we cannot conclude causation from observational data alone. It is necessary to have a designed experiment that controls for these potential hidden and confounding factors.

Correlation is required for making a claim about causation... but it is not enough.

Correlation that is measured through a well-designed randomized controlled experiment is necessary for making a claim of causation.
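The difference randomization makes can be sketched with another simulation. Everything here is an assumption for illustration: the true advertising effect (`TRUE_AD_EFFECT`), the effect of the hidden feature, and the probabilities. In the observational setting ads mostly run alongside the new feature; in the randomized setting a coin flip decides who sees the ad.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
TRUE_AD_EFFECT = 2.0     # assumed causal lift in sales from running the ad
FEATURE_EFFECT = 5.0     # assumed lift in sales from the hidden new feature
N = 20000

def mean(values):
    return sum(values) / len(values)

def estimated_ad_effect(randomized):
    """Difference in mean sales between the ad group and the no-ad group."""
    with_ad, without_ad = [], []
    for _ in range(N):
        new_feature = rng.random() < 0.5  # hidden factor A
        if randomized:
            # Designed experiment: a coin flip, independent of A.
            ad = rng.random() < 0.5
        else:
            # Observational data: ads mostly accompany the new feature.
            ad = rng.random() < (0.9 if new_feature else 0.1)
        sales = (10.0 + TRUE_AD_EFFECT * ad
                 + FEATURE_EFFECT * new_feature + rng.gauss(0, 1))
        (with_ad if ad else without_ad).append(sales)
    return mean(with_ad) - mean(without_ad)

observational = estimated_ad_effect(randomized=False)
experimental = estimated_ad_effect(randomized=True)

print(f"true effect:            {TRUE_AD_EFFECT:.2f}")
print(f"observational estimate: {observational:.2f}")
print(f"randomized estimate:    {experimental:.2f}")
```

Under these assumptions the observational estimate is inflated because it mixes in the feature’s effect, while the randomized estimate lands near the true value. Randomization works because it breaks the link between the hidden factor and who receives the ad.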

**What can we conclude with correlation?**

If analysis of a set of observational data shows correlation, there **may** be causation (because correlation is a required condition for causation). But we can’t conclude causation, because that requires a designed experiment.

If there is no correlation, we can conclude that there is no causation.

It is important to be aware of these concepts when drawing conclusions and planning actions based on correlation analysis.