Measuring Relationships Between Two Variables
- In many real investigations you record paired (bivariate) data: two values measured on the same individual or trial, such as (height, basketball shots) or (blood marker score, taste score).
- A scatter diagram lets you see whether large values of one variable tend to go with large (or small) values of the other.
- To measure that tendency numerically, we use covariance.
Bivariate data
Data consisting of pairs of values $(x, y)$ where two quantitative variables are measured for each case.
Covariance
A measure of how two variables vary together, based on the products of their deviations from their means.
Covariance Extends the Idea of Variance
- You already know that variance and standard deviation describe how a single variable spreads out around its mean.
- For a data set $x_1, x_2, \dots, x_n$ with mean $\bar{x}$: $$\sigma^2=\frac{\sum (x-\bar{x})^2}{n},\qquad \sigma=\sqrt{\frac{\sum (x-\bar{x})^2}{n}}$$
- Variance squares deviations so that positive and negative deviations do not cancel.
- For paired data $(x_i, y_i)$, we look at two deviations at once: $(x_i-\bar{x})$ and $(y_i-\bar{y})$.
- Multiplying them gives information about whether the deviations tend to have the same sign or opposite signs.
The Covariance Formula and What Its Sign Means
For $n$ paired observations $(x_1,y_1),\dots,(x_n,y_n)$, with means $\bar{x}$ and $\bar{y}$, the covariance is $$s_{xy}=\frac{\sum (x-\bar{x})(y-\bar{y})}{n}$$
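As a minimal sketch of the definition above, the function below averages the products of deviations, dividing by $n$ as in the formula. The function name `covariance` and the five data pairs are made up for illustration and do not come from the text.

```python
def covariance(xs, ys):
    """Covariance as defined above: mean of the products of deviations (divide by n)."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum (x - x_bar)(y - y_bar) over all pairs, then divide by n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


# Made-up illustrative data (not from the text)
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(covariance(xs, ys))  # 1.2
```

Library routines may use a different divisor. For example, Python's `statistics.covariance` (3.10+) divides by $n-1$, so it would not reproduce the value given by the formula in this section.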
Why Multiplying Deviations Works
Each point on a scatter plot lies in one of four quadrants relative to the mean point $(\bar{x},\bar{y})$.
- Top Right Quadrant (TRQ): $x>\bar{x}$ and $y>\bar{y}$, so both deviations are positive and the product is positive.
- Bottom Left Quadrant (BLQ): $x<\bar{x}$ and $y<\bar{y}$, so both deviations are negative and the product is positive.
- Top Left Quadrant (TLQ): $x<\bar{x}$ and $y>\bar{y}$, so the deviations have opposite signs and the product is negative.
- Bottom Right Quadrant (BRQ): $x>\bar{x}$ and $y<\bar{y}$, so the deviations have opposite signs and the product is negative.
So, when you add all the products $(x-\bar{x})(y-\bar{y})$:
- If most points lie in TRQ and BLQ, the sum tends to be positive, so $s_{xy}>0$ and the trend has a positive slope.
- If most points lie in TLQ and BRQ, the sum tends to be negative, so $s_{xy}<0$ and the trend has a negative slope.
- If there is no clear tendency, positive and negative products cancel, so $s_{xy}\approx 0$.
Covariance describes direction (positive, negative, or near zero) more reliably than it describes strength, because its size depends on the units and scale of the variables.
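The quadrant argument can be sketched in code: count how many points fall in each quadrant around the mean point and compare the tally with the sign of the covariance. The helper name `quadrant_summary` and the six data pairs are hypothetical, chosen only to illustrate the idea.

```python
def quadrant_summary(xs, ys):
    """Count points per quadrant around (x_bar, y_bar) and return the covariance."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    counts = {"TRQ": 0, "BLQ": 0, "TLQ": 0, "BRQ": 0, "on a mean line": 0}
    total = 0.0
    for x, y in zip(xs, ys):
        dx, dy = x - x_bar, y - y_bar
        total += dx * dy
        if dx == 0 or dy == 0:
            counts["on a mean line"] += 1  # product is zero, no contribution
        elif dx > 0 and dy > 0:
            counts["TRQ"] += 1
        elif dx < 0 and dy < 0:
            counts["BLQ"] += 1
        elif dx < 0:
            counts["TLQ"] += 1  # dx < 0 and dy > 0
        else:
            counts["BRQ"] += 1  # dx > 0 and dy < 0
    return counts, total / n


# Made-up illustrative data (not from the text)
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 1, 4, 3, 6, 5]
counts, s_xy = quadrant_summary(xs, ys)
print(counts)  # 2 points each in TRQ and BLQ, 1 each in TLQ and BRQ
print(s_xy)    # positive, since TRQ and BLQ dominate
```

Here four of the six points lie in TRQ or BLQ, and the covariance comes out positive, in line with the reasoning above.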
An Equivalent Computation Formula (Often Faster)
Expanding the product and simplifying leads to a commonly used alternative form: $$s_{xy}=\frac{1}{n}\sum(xy)-\bar{x}\bar{y}.$$
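For reference, the expansion can be written out step by step, using only $\sum x = n\bar{x}$ and $\sum y = n\bar{y}$:

$$\begin{aligned}
s_{xy} &= \frac{1}{n}\sum (x-\bar{x})(y-\bar{y}) \\
&= \frac{1}{n}\sum \left(xy-\bar{y}\,x-\bar{x}\,y+\bar{x}\bar{y}\right) \\
&= \frac{1}{n}\sum xy-\bar{y}\cdot\frac{\sum x}{n}-\bar{x}\cdot\frac{\sum y}{n}+\frac{n\,\bar{x}\bar{y}}{n} \\
&= \frac{1}{n}\sum xy-\bar{x}\bar{y}-\bar{x}\bar{y}+\bar{x}\bar{y} \\
&= \frac{1}{n}\sum xy-\bar{x}\bar{y}.
\end{aligned}$$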
This formula can be more convenient because you can compute $\sum xy$, $\bar{x}$, and $\bar{y}$ directly from a table.
- If your data are in a table, create three columns: $x$, $y$, and $xy$.
- Then compute $\sum x$, $\sum y$, and $\sum xy$ once.
- This makes it easy to find $\bar{x}$, $\bar{y}$, and then $s_{xy}=\frac{1}{n}\sum(xy)-\bar{x}\bar{y}$.
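The table approach above might look like this in code. It is a sketch on made-up pairs (not data from the text), and the use of `fractions.Fraction` is an assumption made here so that the means stay exact rather than being rounded.

```python
from fractions import Fraction

# Made-up illustrative pairs (x, y); not data from the text
pairs = [(1, 2), (2, 4), (3, 5), (4, 4), (5, 5)]

# Build the three columns: x, y, and xy
table = [(x, y, x * y) for x, y in pairs]

# Compute each column total once
n = len(table)
sum_x = sum(row[0] for row in table)
sum_y = sum(row[1] for row in table)
sum_xy = sum(row[2] for row in table)

# Means and the computational formula, kept as exact fractions
x_bar = Fraction(sum_x, n)
y_bar = Fraction(sum_y, n)
s_xy = Fraction(sum_xy, n) - x_bar * y_bar

print(sum_x, sum_y, sum_xy)  # 15 20 66
print(s_xy, float(s_xy))     # 6/5 1.2
```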
Height and Basketball Shots
The table shows heights (cm) of nine students and the number of basketballs shot in five minutes.
- $n=9$
- $\sum x = 1515$ so $\bar{x}=\frac{1515}{9}$
- $\sum y = 624$ so $\bar{y}=\frac{624}{9}$
- $\sum xy = 106530$
Using the computational formula:
$$s_{xy}=\frac{1}{9}(106530)-\left(\frac{1515}{9}\right)\left(\frac{624}{9}\right)$$
$$\approx 166\;\text{(3 s.f.)}$$
Because $s_{xy}$ is positive, the data tend to lie more in TRQ and BLQ, suggesting that taller students tend to make more shots in the same time period.
- A positive covariance does not prove that height causes better shooting.
- Covariance measures association, not causation.
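The arithmetic in this example can be checked from the totals quoted above. The sketch below uses exact fractions (a choice made here, not part of the original working) and then rounds to 3 significant figures.

```python
from fractions import Fraction

# Totals quoted in the height and basketball example
n, sum_x, sum_y, sum_xy = 9, 1515, 624, 106530

x_bar = Fraction(sum_x, n)
y_bar = Fraction(sum_y, n)
s_xy = Fraction(sum_xy, n) - x_bar * y_bar

print(s_xy)                        # 1490/9
print(format(float(s_xy), ".3g"))  # 166
```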
Interpreting Magnitude: Why Scale Matters
- Unlike a correlation coefficient, covariance has units.
- If $x$ is measured in cm and $y$ is a count, then $s_{xy}$ has units "cm·shots".
- If you change units (cm to m) or rescale one variable, the covariance changes even though the scatter plot looks identical up to scaling.
- If you replace $x$ by $x' = 0.01x$ (cm to m), then $$s_{x'y}=0.01\,s_{xy}$$
- That is why the raw value of covariance "conveys little information" about strength on its own.
- Think of covariance like the tilt of a collection of points, while the units act like a zoom on the axes.
- Zooming in or out changes numerical differences, so the covariance number changes, even though the overall tilt looks the same.
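The effect of a unit change can be seen numerically. The sketch below uses made-up heights and shot counts (not the nine students from the example) and a hypothetical helper `covariance` that follows the divide-by-$n$ formula from this section.

```python
import math


def covariance(xs, ys):
    """Covariance with divisor n, as in this section's formula."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n


# Made-up illustrative data (not from the text)
heights_cm = [150, 160, 170, 180, 190]
shots = [3, 5, 4, 8, 10]

# Same heights expressed in metres: x' = 0.01 x
heights_m = [0.01 * h for h in heights_cm]

cov_cm = covariance(heights_cm, shots)
cov_m = covariance(heights_m, shots)

print(cov_cm)                              # 34.0
print(math.isclose(cov_m, 0.01 * cov_cm))  # True
```

The points and their pattern are unchanged, yet the covariance shrinks by the same factor of 0.01 as the rescaled variable.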
A Structured Method for Finding Covariance from a Table
- List the $n$ pairs $(x,y)$.
- Compute $\sum x$ and $\sum y$, then find $\bar{x}=\frac{\sum x}{n}$ and $\bar{y}=\frac{\sum y}{n}$.
- Either:
  - compute $\sum (x-\bar{x})(y-\bar{y})$ and divide by $n$, or
  - compute $\sum xy$ and use $s_{xy}=\frac{1}{n}\sum(xy)-\bar{x}\bar{y}$.
- Interpret the sign (positive, negative, near zero) in context.
If the means are awkward fractions, the computational formula usually avoids repeated subtraction and reduces arithmetic errors.
Points that correct common misconceptions about covariance:
- Covariance can be negative, positive, or zero; it is not forced to be positive like variance.
- A covariance near 0 does not mean the variables are unrelated; it can still occur with a curved relationship (for example, $y=x^2$ with symmetric $x$ values).
- Always check your scatter plot first; if the plot shows a clear upward trend but your covariance is negative, you likely swapped values or made an arithmetic sign error.
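The curved-relationship case can be checked directly. In this sketch the $x$ values are chosen symmetric about 0 purely for illustration, and $y=x^2$ is computed from them.

```python
# Illustrative symmetric x values and y = x^2
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n  # 0.0
y_bar = sum(ys) / n  # 4.0

# Covariance from the definition (divide by n)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
print(s_xy)  # 0.0
```

Although $y$ is completely determined by $x$ here, the positive and negative products cancel exactly, so the covariance is zero.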
- What does the sign of $s_{xy}$ tell you about the overall slope of the trend?
- Why does changing units (for example, cm to m) change covariance?
- For a point in the TLQ, is $(x-\bar{x})(y-\bar{y})$ positive or negative?