Bivariate Data Describes Relationships Between Two Variables
- In many real situations you measure two numerical variables for each individual or item.
- For instance, you might record a student's music exam score and the same student's art exam score, or the number of laps a student can run and the hours of TV they watch each week.
Bivariate data
Data consisting of pairs of values $(x, y)$ where two quantitative variables are measured for each case.
- Usually, one variable is treated as the independent variable (the input) and the other as the dependent variable (the output).
- On graphs, the independent variable is plotted on the $x$-axis and the dependent variable is plotted on the $y$-axis.
- Choosing which variable is "independent" depends on the context.
- If you are investigating whether TV time affects fitness, then TV hours could be the independent variable.
- If you are predicting art scores from music scores, then music is the independent variable.
Scatter Diagrams Reveal Patterns And Correlation
- The main graph used for bivariate data is a scatter diagram (or scatter plot).
- Each pair $(x, y)$ is plotted as a single point.
- A scatter diagram helps you judge:
- the direction of the relationship (upward or downward trend),
- the strength (how tightly points cluster around a line),
- any outliers (points that do not fit the overall pattern).
Correlation
A measure of the association between two variables, describing whether they tend to increase together, decrease together, or show no consistent pattern.
Positive, Negative, And No Correlation
- Positive correlation: as $x$ increases, $y$ tends to increase (points slope upward).
- Negative correlation: as $x$ increases, $y$ tends to decrease (points slope downward).
- No correlation: there is no clear trend (points look scattered with no direction).
The closer the points lie to a straight line, the stronger the correlation.
- Correlation does not prove causation.
- Even if two variables are strongly correlated, that does not automatically mean one causes the other.
The Mean Point Is A Key Reference: $(\bar{x}, \bar{y})$
- For a set of bivariate data, calculate the mean of the $x$-values, called $\bar{x}$, and the mean of the $y$-values, called $\bar{y}$.
- The point $M(\bar{x}, \bar{y})$ is called the mean point.
- A useful fact is that the least-squares regression line always passes through the mean point $(\bar{x}, \bar{y})$, so a line of best fit drawn by eye should pass through it too.
- This is why many "by eye" methods start by plotting the mean point.
Line of best fit
A straight line drawn through the middle of a scatter plot so that points are (roughly) evenly distributed above and below it, used to model and predict relationships.
Drawing A Line Of Best Fit By Eye
A common method is:
- Calculate $\bar{x}$ and $\bar{y}$.
- Plot $M(\bar{x}, \bar{y})$ on the scatter diagram.
- Draw a straight line through $M$ that goes through the "middle" of the data so that points are reasonably balanced above and below the line.
- When drawing a line of best fit by eye, do not join dots and do not try to pass through every point.
- Aim for a representative trend.
- A good check is whether the line goes through the mean point.
Outliers Can Distort Interpretation
Outlier
A data value that lies an unusually large distance from the rest of the data set.
- In bivariate data, you can spot outliers by looking for points that are a large distance from the line of best fit.
- Outliers can happen due to measurement error, unusual circumstances, or because the relationship is not truly linear.
- If a point looks unusual, ask: is it an error, or is it a genuine special case?
- In extended investigations you might compare results with and without the outlier.
Laps Run Versus TV Hours Shows Negative Correlation
Data from 7 students:
- $x$: number of laps a student could run: $2, 15, 17, 3, 20, 3, 6$
- $y$: TV hours per week: $13, 7, 5, 12, 4, 13, 11$
A scatter diagram for this data shows a downward trend, meaning negative correlation (students who run more laps tend to watch fewer hours of TV).
The drawn line here is a computed straight-line fit, but the key interpretation is the same as a by-eye line: the overall trend is decreasing, so the correlation is negative.
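As a rough check of this example, the sketch below computes the mean point and one possible "computed" line for the seven students. It assumes the line is a least-squares fit, which the notes do not state explicitly, and the variable names (`laps`, `tv_hours`, `slope`, `intercept`) are chosen here for illustration.

```python
from statistics import mean

laps = [2, 15, 17, 3, 20, 3, 6]        # x: laps each student could run
tv_hours = [13, 7, 5, 12, 4, 13, 11]   # y: TV hours per week

# Mean point M(x_bar, y_bar)
x_bar = mean(laps)
y_bar = mean(tv_hours)

# Least-squares slope and intercept (assumed fitting method, not from the notes)
sum_dev_products = sum((x - x_bar) * (y - y_bar) for x, y in zip(laps, tv_hours))
sum_sq_dev_x = sum((x - x_bar) ** 2 for x in laps)
slope = sum_dev_products / sum_sq_dev_x
intercept = y_bar - slope * x_bar

print(f"Mean point M = ({x_bar:.2f}, {y_bar:.2f})")
print(f"Fitted line: y = {slope:.3f}x + {intercept:.3f}")

# The fitted line passes through the mean point (up to rounding error)
print(abs(slope * x_bar + intercept - y_bar) < 1e-9)
```

The slope comes out negative (about $-0.5$), which matches the downward trend seen on the scatter diagram, and the final line confirms that the fitted line goes through $M$.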
Covariance Measures The Direction Of A Linear Relationship
- Correlation can be described visually, but there are also numerical measures.
- One important measure related to correlation is covariance.
- To understand covariance, it helps to recall variance and standard deviation for one variable: $$\sigma^{2}=\frac{\sum (x-\bar{x})^{2}}{n}$$ $$\sigma=\sqrt{\frac{\sum (x-\bar{x})^{2}}{n}}$$
- Variance measures how spread out one-variable data is from the mean.
- Covariance extends this idea to two variables.
Covariance
A measure of how two variables vary together, based on the products of their deviations from their means.
A common formula is: $$s_{xy}=\frac{\sum (x-\bar{x})(y-\bar{y})}{n}$$
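The formula can be applied directly to the laps and TV-hours data from the earlier example. This is a minimal sketch; the function name `covariance` is a label chosen here, and it divides by $n$ to match the formula above.

```python
def covariance(xs, ys):
    """Covariance as the mean product of deviations (divides by n)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n

laps = [2, 15, 17, 3, 20, 3, 6]        # x values
tv_hours = [13, 7, 5, 12, 4, 13, 11]   # y values

s_xy = covariance(laps, tv_hours)
print(round(s_xy, 2))  # -25.12
```

The result is negative, which agrees with the negative correlation seen on the scatter diagram.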
Why The Sign Of Covariance Matters
- Consider the deviations from the mean point $(\bar{x},\bar{y})$.
- Draw a vertical line through $\bar{x}$ and a horizontal line through $\bar{y}$, creating four quadrants:
- TRQ (top-right): $x>\bar{x}$ and $y>\bar{y}$, so $(x-\bar{x})>0$ and $(y-\bar{y})>0$, product is positive.
- BLQ (bottom-left): $x<\bar{x}$ and $y<\bar{y}$, so both deviations are negative, product is positive.
- TLQ (top-left): $x<\bar{x}$ and $y>\bar{y}$, so $(x-\bar{x})<0$ and $(y-\bar{y})>0$, product is negative.
- BRQ (bottom-right): $x>\bar{x}$ and $y<\bar{y}$, so $(x-\bar{x})>0$ and $(y-\bar{y})<0$, product is negative.
- As a result:
- if most points lie in TRQ and BLQ, the sum of products tends to be positive, so $s_{xy}>0$ (the points trend upward).
- if most points lie in TLQ and BRQ, the sum of products tends to be negative, so $s_{xy}<0$ (the points trend downward).
- Think of covariance as "agreement" in direction.
- If $x$ and $y$ are usually both above their means or both below, they agree, giving a positive covariance.
- If one is above while the other is below, they disagree, giving a negative covariance.
- Another useful form of the covariance formula is: $$s_{xy}=\frac{1}{n}\sum(xy)-\bar{x}\bar{y}$$
- This can be faster because it avoids calculating every deviation first.
- The numerical value of covariance depends on the scale of the variables (for example, changing from cm to m changes the covariance).
- For that reason, covariance is most useful for its sign (positive or negative), not its size.
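The quadrant argument, the shortcut formula, and the effect of scale can all be checked on the laps and TV-hours data. The sketch below is illustrative only: the helper names are chosen here, and converting TV hours to minutes is an assumed stand-in for the cm-to-m unit change mentioned above.

```python
laps = [2, 15, 17, 3, 20, 3, 6]        # x values
tv_hours = [13, 7, 5, 12, 4, 13, 11]   # y values

def mean_point(xs, ys):
    return sum(xs) / len(xs), sum(ys) / len(ys)

def quadrant_counts(xs, ys):
    """Count points in each quadrant around the mean point."""
    x_bar, y_bar = mean_point(xs, ys)
    counts = {"TRQ": 0, "BLQ": 0, "TLQ": 0, "BRQ": 0}
    for x, y in zip(xs, ys):
        if x == x_bar or y == y_bar:
            continue  # on a dividing line: the product of deviations is zero
        if x > x_bar:
            counts["TRQ" if y > y_bar else "BRQ"] += 1
        else:
            counts["TLQ" if y > y_bar else "BLQ"] += 1
    return counts

def covariance_shortcut(xs, ys):
    """Shortcut form: (1/n) * sum(xy) - x_bar * y_bar."""
    x_bar, y_bar = mean_point(xs, ys)
    return sum(x * y for x, y in zip(xs, ys)) / len(xs) - x_bar * y_bar

counts = quadrant_counts(laps, tv_hours)
print(counts)  # {'TRQ': 0, 'BLQ': 0, 'TLQ': 4, 'BRQ': 3}

s_hours = covariance_shortcut(laps, tv_hours)
tv_minutes = [60 * h for h in tv_hours]  # same data, different unit
s_minutes = covariance_shortcut(laps, tv_minutes)
print(round(s_hours, 2), round(s_minutes, 2))
```

Every point falls in TLQ or BRQ, so the covariance is negative. Changing the unit multiplies its size by 60 but leaves the sign unchanged, which is why the sign is the part worth interpreting.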
Using Bivariate Data For Prediction Requires Interpolation, Not Extrapolation
- A line of best fit is often used to predict values.
- For example, if you have a relationship between a music score ($x$) and an art score ($y$), you might use the line to estimate an art score for a given music score.
- However, a prediction is generally reliable only when the input value lies within the range of the observed data.
Interpolation
Estimating a value within the range of the collected data.
Extrapolation
Estimating a value outside the range of the collected data.
Why Predicting At A Music Score Of 10 Is Not Valid In The Given Data
- In the provided music-art dataset, the music scores ($x$ values) are: $$15, 36, 36, 22, 23, 27, 43, 22, 43, 40, 26$$
- The smallest $x$ value is 15.
- Predicting the art score when the music score is 10 requires extrapolation because 10 is outside the data range.
- The relationship might not continue in the same way below 15, so the prediction is not considered valid.
- When asked whether a prediction is valid, immediately compare the given $x$ value to the minimum and maximum $x$ in the data.
- Outside the range means extrapolation, and you should state it is unreliable.
- What kind of correlation would you expect if most points are in TRQ and BLQ relative to $(\bar{x},\bar{y})$?
- What does a negative covariance tell you about the overall trend?
- Why is extrapolation usually less reliable than interpolation?
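The range check described for the music scores can be sketched in a few lines. The function name `prediction_type` and the second query value (30) are illustrative choices made here, not part of the original dataset question.

```python
music_scores = [15, 36, 36, 22, 23, 27, 43, 22, 43, 40, 26]  # observed x values

def prediction_type(x, observed_xs):
    """Classify a prediction at x by comparing it with the observed x range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    return "interpolation" if lowest <= x <= highest else "extrapolation"

print(min(music_scores), max(music_scores))   # 15 43
print(prediction_type(10, music_scores))      # extrapolation
print(prediction_type(30, music_scores))      # interpolation
```

A music score of 10 lies below the minimum of 15, so any art score predicted from it is an extrapolation and should be reported as unreliable.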