Measuring Linear Relationships With Correlation
- Two variables form bivariate data when each individual provides a pair of values, written as $(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)$.
- A scatter diagram (scatter plot) is usually the first representation because it lets you see whether an increase in $x$ tends to be associated with an increase or decrease in $y$.
Correlation
A measure of the association between two variables, describing whether they tend to increase together, decrease together, or show no consistent pattern.
Correlation is useful because it supports generalization: if a relationship is strong and stable, you can often make cautious predictions using a line of best fit.
Pearson's Correlation Coefficient Standardizes Covariance
- From the previous article, recall that covariance measures whether $x$ and $y$ tend to deviate from their means in the same direction; its common "population-style" formula is $$s_{xy}=\frac{\sum (x-\bar{x})(y-\bar{y})}{n}$$
- To make the strength of the relationship meaningful, we divide by the spread (standard deviation) of each variable.
Pearson’s product-moment correlation coefficient (PMCC)
A standardized measure of linear correlation given by
$$r=\frac{\frac{1}{n}\sum xy-\bar{x}\bar{y}}{\sigma_x\sigma_y},$$
where $\sigma_x$ and $\sigma_y$ are the standard deviations of $x$ and $y$.
Because we divide by $\sigma_x\sigma_y$, the coefficient $r$ has two key properties:
- It is unit-free.
- It always lies in the interval $-1\le r\le 1$.
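The first property can be checked numerically: rescaling either variable by a positive constant (for example, converting units) multiplies the covariance and the corresponding standard deviation by the same factor, so $r$ is unchanged. A minimal sketch in Python, using made-up measurements for illustration:

```python
import numpy as np

def pearson_r(x, y):
    """Population-style PMCC: (mean(xy) - mean(x)*mean(y)) / (sigma_x * sigma_y)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.mean(x * y) - x.mean() * y.mean()) / (x.std() * y.std())

# Hypothetical data; any positive rescaling (kg -> g, cm -> m) cancels out of r.
mass_kg = np.array([1.2, 1.5, 1.9, 2.4, 3.0])
length_cm = np.array([10.0, 12.5, 14.0, 17.5, 21.0])

r_original = pearson_r(mass_kg, length_cm)
r_rescaled = pearson_r(mass_kg * 1000, length_cm / 100)  # grams vs metres
print(abs(r_original - r_rescaled) < 1e-9)  # True: r is unit-free
```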
Interpretation (linear patterns only):
- $r=1$: points lie exactly on an upward-sloping straight line (perfect positive linear correlation).
- $r=-1$: points lie exactly on a downward-sloping straight line (perfect negative linear correlation).
- $r=0$: no linear relationship (there could still be a curved relationship).
A value like $r\approx 0.8$ suggests a strong positive linear trend, but it does not mean the points lie on a line, and it does not prove that $x$ causes $y$.
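These interpretations can be verified directly from the definition. The sketch below (Python with NumPy, artificial data) produces $r=1$ for an exact upward line, $r=-1$ for an exact downward line, and $r=0$ for a perfectly symmetric curve, illustrating that $r=0$ rules out only a *linear* pattern:

```python
import numpy as np

def pearson_r(x, y):
    """Population-style PMCC from the definition above."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return (np.mean(x * y) - x.mean() * y.mean()) / (x.std() * y.std())

x = np.arange(-5.0, 6.0)           # -5, -4, ..., 5

print(pearson_r(x, 2 * x + 3))     # exact upward line:   r = 1
print(pearson_r(x, -x + 7))        # exact downward line: r = -1
print(pearson_r(x, x ** 2))        # symmetric parabola:  r = 0, yet clearly related
```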
How To Compute r From a Table of Values
For a set of $n$ pairs $(x,y)$, a practical workflow is:
- Compute $\bar{x}$ and $\bar{y}$.
- Compute $\sigma_x$ and $\sigma_y$ using $$\sigma_x=\sqrt{\frac{\sum (x-\bar{x})^2}{n}},\qquad \sigma_y=\sqrt{\frac{\sum (y-\bar{y})^2}{n}}$$
- Compute $\frac{1}{n}\sum xy$.
- Substitute into $$r=\frac{\frac{1}{n}\sum xy-\bar{x}\bar{y}}{\sigma_x\sigma_y}$$
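The four steps above can be followed literally in code. A minimal sketch in plain Python (no libraries; the data set here is hypothetical, chosen so the points lie exactly on $y=2x+1$):

```python
import math

def pmcc(pairs):
    """Follow the workflow: means, standard deviations, mean of xy, substitute."""
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    sigma_x = math.sqrt(sum((x - xbar) ** 2 for x, _ in pairs) / n)
    sigma_y = math.sqrt(sum((y - ybar) ** 2 for _, y in pairs) / n)
    mean_xy = sum(x * y for x, y in pairs) / n
    return (mean_xy - xbar * ybar) / (sigma_x * sigma_y)

data = [(1, 3), (2, 5), (3, 7), (4, 9)]   # y = 2x + 1, so r should be exactly 1
print(round(pmcc(data), 3))               # -> 1.0
```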
On most calculators, $r$ is provided in the linear regression statistics menu.
- Even when you use technology, always write an interpretation in context: direction (positive/negative) and strength (weak/moderate/strong).
- Assessment often rewards communication, not just calculating $r$.
Calculating $r$ and Linking It to a Scatter Plot
Consider the bivariate data:
| $x$ | 12 | 14 | 15 | 17 | 19 |
|---|---|---|---|---|---|
| $y$ | 19 | 20 | 22 | 23 | 25 |
First calculate key statistics (rounded where needed):
- $n=5$
- $\sum x=77$, so $\bar{x}=77/5=15.4$
- $\sum y=109$, so $\bar{y}=109/5=21.8$
- $\sum xy=1704$, so $\frac{1}{n}\sum xy=340.8$
Covariance: $$s_{xy}=\frac{1}{n}\sum xy-\bar{x}\bar{y}=340.8-(15.4)(21.8)=5.08$$
Standard deviations: $$\sigma_x\approx 2.417,\qquad \sigma_y\approx 2.135$$
So the correlation coefficient is $$r=\frac{5.08}{(2.417)(2.135)}\approx 0.984$$
The scatter plot shows points clustered close to an upward-sloping line, consistent with a strong positive linear correlation.
If your computed $r$ does not match the visual pattern, check for:
- copying errors in the table
- mixing up $n$ and $n-1$ (different courses/software may use different conventions)
- or forgetting to divide $\sum xy$ by $n$.
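The $n$ versus $n-1$ point is worth checking numerically: as long as the covariance and both standard deviations use the *same* convention, the extra factors cancel and $r$ is identical, whereas mixing conventions gives a meaningless value. A sketch using the data from the example above (NumPy's `ddof` parameter switches between the two conventions):

```python
import numpy as np

x = np.array([12, 14, 15, 17, 19])
y = np.array([19, 20, 22, 23, 25])

# Consistent divide-by-n ("population") convention:
r_pop = (np.mean(x * y) - x.mean() * y.mean()) / (x.std() * y.std())

# Consistent divide-by-(n-1) ("sample") convention -- the factors cancel:
r_sample = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# Mixed conventions: an (n-1) covariance over divide-by-n deviations
r_mixed = np.cov(x, y, ddof=1)[0, 1] / (x.std() * y.std())

print(round(r_pop, 3), round(r_sample, 3))  # both 0.984
print(r_mixed)                              # > 1, impossible for a genuine r
```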
Correlation Helps Describe Trends But Does Not Prove Causes
- A strong correlation can justify drawing a line of best fit (by eye or using technology) and using it for interpolation, meaning prediction within the range of your data.
- But there are important limitations.
- Most importantly, correlation is not causation.
- Two variables can be correlated because:
- one causes the other,
- they both depend on a third variable,
- or the apparent relationship is coincidence in a small sample.
- Also be careful with extrapolation (predicting outside the observed $x$ values), because the relationship may change beyond the data range.
- In health data, a measure such as a "blood marker level" might correlate with a "taste score" for a group of patients.
- Even if $r$ is fairly high, it would be unsafe to claim the blood marker causes taste changes without controlled experiments.
- Correlation can suggest hypotheses, not confirm mechanisms.
Correlation In Human Contexts: Attitudes and Scores
- Correlation is not only for physical measurements (mass, length, temperature).
- It is also used with scores and survey data.
- For example, students may have a social media news satisfaction score and a television news satisfaction score.
- A scatter plot of the paired scores can show whether students who rate one medium highly also rate the other highly.
- If the computed $r$ is close to 0 (or negative), you might conclude there is little (or an inverse) linear relationship between the two attitudes, which can lead to discussion about preferences and identity.
Check Your Understanding
- Explain in words what $r=-0.7$ would look like on a scatter plot.
- Why is $r$ unit-free while covariance is not?
- Give one reason why $r\approx 0$ does not guarantee "no relationship".
- In your own words, explain why extrapolation from a correlated data set can be risky.