Why Spread Matters In Data
- Measures of central tendency (the mean, median, and mode) describe a "typical" value, but they do not tell you how variable the data are.
- Two data sets can share the same mean (or median) and still be very different because one is tightly clustered while the other is scattered.
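As a quick illustration, the sketch below uses two made-up data sets (the names `tight` and `scattered` and their values are hypothetical) that share the same mean but differ greatly in how far the values stretch.

```python
from statistics import mean

# Hypothetical data, chosen only so that both sets have the same mean.
tight = [48, 49, 50, 51, 52]
scattered = [10, 30, 50, 70, 90]

for name, data in (("tight", tight), ("scattered", scattered)):
    # Same center, very different width (largest value minus smallest).
    print(name, "mean =", mean(data), "width =", max(data) - min(data))
```

Both means come out as 50, yet one set spans 4 units and the other spans 80.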
Measure of spread
A numerical value that describes how far data values are dispersed (spread out) within a distribution.
In MYP Extended Mathematics, measures of spread help you:
- compare variability between groups,
- judge how representative the mean or median is,
- identify possible outliers,
- support generalizations and predictions (carefully, and in context).
Range Gives The Full Width Of The Data
The range is the most direct measure of spread.
Range
The difference between the largest and smallest values in a data set: $\text{Range}=\text{maximum}-\text{minimum}$.
- Range uses only two values, so it summarizes the overall width of the data very quickly.
- It is easy to compute and explain, but it is highly sensitive to outliers because it depends entirely on the extremes.
- Range is most meaningful when the extremes are genuinely important and reliable, for example:
  - weather (the minimum temperature can matter for frost),
  - manufacturing tolerance (maximum deviation allowed),
  - performance analysis (best and worst results).
A single unusual value can make the range look large even if almost all other values are close together.
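A minimal sketch of this sensitivity, assuming the usual statistical definition of range (largest value minus smallest value). The helper `data_range` and the scores are hypothetical, chosen only for illustration.

```python
def data_range(values):
    """Return the largest value minus the smallest value."""
    return max(values) - min(values)

# Hypothetical test scores; the second list adds one unusual value.
scores = [71, 72, 73, 74, 75]
scores_with_outlier = scores + [20]

print(data_range(scores))               # 4
print(data_range(scores_with_outlier))  # 55
```

Five of the six values are still within 4 marks of each other, but the range alone no longer shows that.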
Interquartile Range Describes The Middle 50%
- Many real data sets contain extreme values that you may not want to dominate your summary.
- The interquartile range focuses on the middle half of the data.
Quartiles
Values that split an ordered data set into four equal parts: $Q_1$ (lower quartile), $Q_2$ (the median), and $Q_3$ (upper quartile).
Interquartile range (IQR)
The spread of the middle 50% of the data: $\text{IQR}=Q_3-Q_1$.
How To Find The IQR
- Order the data from smallest to largest.
- Find the median $Q_2$.
- Find $Q_1$ as the median of the lower half and $Q_3$ as the median of the upper half.
- Compute $\text{IQR}=Q_3-Q_1$.
IQR is naturally displayed on a box-and-whisker plot, because the box stretches from $Q_1$ to $Q_3$, and the box length equals the IQR.
Your class may follow a specific convention for odd sample sizes (for example, whether the median itself is included in each half). The key is to be consistent.
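The four steps can be sketched in Python as follows. This is one possible implementation, not the only valid one: the function name `quartiles` is hypothetical, and it assumes the convention that, for an odd number of values, the median is left out of both halves.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumed convention: for an odd number of values, the middle value
    is left out of both halves. Other conventions can give slightly
    different quartiles.
    """
    data = sorted(values)        # step 1: order the data
    n = len(data)
    half = n // 2
    lower = data[:half]          # values before the median position
    upper = data[n - half:]      # values after the median position
    return median(lower), median(data), median(upper)

# Hypothetical data set with seven values.
q1, q2, q3 = quartiles([7, 1, 9, 3, 5, 11, 13])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 3 7 11 8
```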
Why IQR Is Often More Reliable Than Range
IQR is robust (resistant) to outliers because it ignores the lowest 25% and highest 25% of values.
- Consider: $$2,3,3,4,4,5,5,6,30$$
- Range $=30-2=28$ (looks extremely spread out).
- Most values sit between about 3 and 6, so the IQR is much smaller and better reflects the "typical" spread of the main cluster.
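Working the numbers under one common convention (an assumption here: with 9 values, the median is left out of both halves, so the lower half is $2,3,3,4$ and the upper half is $5,5,6,30$) makes the contrast concrete:
$$Q_2=4,\qquad Q_1=\frac{3+3}{2}=3,\qquad Q_3=\frac{5+6}{2}=5.5,\qquad \text{IQR}=5.5-3=2.5$$
The single value 30 pushes the range to 28 but leaves the IQR at 2.5. A convention that includes the median in both halves gives $Q_3=5$ and $\text{IQR}=2$, so the conclusion is the same.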
Variance And Standard Deviation Measure Spread Around The Mean
- Range and IQR use ordered positions.
- Another important approach is to measure how far values typically lie from the mean, using deviations.
Deviations And Why We Square Them
- For a value $x$, its deviation from the mean is:
  - $(x-\mu)$ for a population, or
  - $(x-\bar{x})$ for a sample.
- If you add all the deviations, the positives and negatives cancel and the total is always 0, so the mean of the signed deviations cannot be used as a measure of spread.
- To fix this, we square each deviation before averaging.
Variance
The mean of the squared deviations from the mean. For a population,
$$\sigma^{2}=\frac{\sum(x-\mu)^{2}}{n}$$
Here $n$ is the number of values. Variance has squared units (for example, cm$^2$ if the data are in cm), which makes it harder to interpret directly.
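A short sketch of the population formula, using hypothetical lengths in centimetres (the variable names are illustrative only). It also prints the sum of the signed deviations to show the cancellation that squaring avoids.

```python
# Hypothetical lengths in cm.
lengths_cm = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(lengths_cm)
mu = sum(lengths_cm) / n                        # mean: 5.0
deviations = [x - mu for x in lengths_cm]       # signed deviations from the mean

print(sum(deviations))                          # 0.0: the signed deviations cancel
variance = sum(d ** 2 for d in deviations) / n  # mean of the squared deviations
print(variance)                                 # 4.0, in cm squared
```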
Standard Deviation Returns To The Original Units
To get back to the original units, we take the square root of the variance.
Standard deviation
The square root of the variance, that is, the square root of the mean of the squared deviations from the mean. For a population,
$$
\sigma=\sqrt{\frac{\sum(x-\mu)^2}{n}}
$$
The formula corresponds to a four-step procedure:
- find differences from the mean,
- square the differences,
- find the mean of the squared differences,
- take the square root.
Squaring does two important things: it prevents cancellation (a squared value is never negative) and it gives extra weight to larger deviations, which can be useful when extreme differences matter.
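The four steps translate almost line for line into code. The sketch below is a minimal illustration with hypothetical data; `population_sd` is a made-up helper name, and the result is compared with `statistics.pstdev` from Python's standard library, which computes the population standard deviation.

```python
from math import isclose, sqrt
from statistics import pstdev

def population_sd(values):
    """Population standard deviation (divides by n), step by step."""
    n = len(values)
    mu = sum(values) / n
    differences = [x - mu for x in values]    # 1. differences from the mean
    squares = [d ** 2 for d in differences]   # 2. square the differences
    mean_square = sum(squares) / n            # 3. mean of the squared differences
    return sqrt(mean_square)                  # 4. square root

data = [3, 7, 7, 19]  # hypothetical values with mean 9
print(population_sd(data))                         # 6.0
print(isclose(population_sd(data), pstdev(data)))  # True
```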
Population Versus Sample Notation
You will often switch between describing a whole population and describing a sample from a population.
- Population notation (Greek): mean $\mu$, standard deviation $\sigma$.
- Sample notation (Roman): mean $\bar{x}$, standard deviation commonly $s$ (some texts write $s_n$).
Do not mix symbols: writing $\sigma$ for a sample standard deviation can confuse the reader about what is being measured and which formula is intended.
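Software draws the same distinction, which is one reason the notation matters. As a hedged illustration (the data are hypothetical), Python's `statistics` module offers `pstdev`, which divides by $n$ as in the population formula above, and `stdev`, which divides by $n-1$, a sample convention that this section does not derive.

```python
from statistics import pstdev, stdev

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

# Population formula: divide the sum of squared deviations by n.
print(pstdev(data))

# Sample convention used by stdev(): divide by n - 1 instead,
# so the result is slightly larger for the same data.
print(stdev(data))
```

Check which formula your course expects before reading a value off a calculator or program.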
Interpreting Standard Deviation
- Small standard deviation: values cluster near the mean, so the mean is more representative.
- Large standard deviation: values are widely spread, so the mean may hide big differences in the data.
- Consider the mean as a "balance point" for the data.
- The standard deviation describes how far data points typically sit from that balance point.
Suppose temperatures (in °C) are: $10,16,14,12,18$.
- Mean: $\mu=\dfrac{10+16+14+12+18}{5}=14$
- Deviations from mean: $-4,2,0,-2,4$
- Squared deviations: $16,4,0,4,16$
- Mean of squares: $\dfrac{16+4+0+4+16}{5}=8$
- Standard deviation: $\sigma=\sqrt{8}\approx2.83$ °C
- Check: the deviations $-4,2,0,-2,4$ add up to 0, so their mean is 0 and says nothing about spread.
- Squaring prevents this cancellation.
If you add 3 °C to every temperature, what happens to the standard deviation? Explain using the idea of deviations from the mean.
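One way to test your reasoning is to let a short program list the deviations before and after the shift. This is only a checking aid with a hypothetical helper name (`deviations`); run it and compare the printed lists with your explanation.

```python
from statistics import pstdev

temps = [10, 16, 14, 12, 18]
shifted = [t + 3 for t in temps]  # add 3 °C to every value

def deviations(values):
    """Deviation of each value from the mean of its own list."""
    mu = sum(values) / len(values)
    return [x - mu for x in values]

print(deviations(temps))
print(deviations(shifted))
print(pstdev(temps), pstdev(shifted))
```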
Choosing The Right Measure Of Spread For The Situation
Different measures of spread answer different questions, so the "best" choice depends on the purpose and on the shape of the data.
Use Range When Extremes Are The Point
Range is best when minimum and maximum values are meaningful and trustworthy, and when outliers are not mistakes.
Use IQR When You Want A Typical Spread With Outliers Present
IQR is a strong choice for skewed data or data sets with outliers, and it often pairs well with the median.
Use Standard Deviation When Mean-Based Reasoning Makes Sense
Standard deviation is especially useful when:
- the distribution is roughly symmetric,
- you want a mean-centered measure of variability,
- you later want to connect spread to models such as the normal distribution.
When comparing two groups, state both a center and a spread. For example: "Group A has a higher median and a smaller IQR, so typical values are higher and more consistent."
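A comparison like this can be produced with a few lines of code. The sketch below uses made-up scores for two groups and a hypothetical helper, `center_and_spread`. It relies on `statistics.quantiles`, whose default interpolation rule is the library's own and may not match your class convention for every data set.

```python
from statistics import median, quantiles

# Hypothetical scores for two groups of seven.
group_a = [12, 14, 15, 16, 17, 18, 20]
group_b = [5, 8, 11, 13, 16, 19, 24]

def center_and_spread(values):
    """Return (median, IQR) using the library's default quartile rule."""
    q1, _, q3 = quantiles(values, n=4)
    return median(values), q3 - q1

for name, group in (("A", group_a), ("B", group_b)):
    med, iqr = center_and_spread(group)
    print(f"Group {name}: median = {med}, IQR = {iqr}")
```

With these numbers, Group A has the higher median (16 against 13) and the smaller IQR (4 against 11), which supports exactly the kind of statement quoted above.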