Univariate Statistics: Making Trends Visible In One Variable
- Univariate statistics is about collecting, organizing, representing, and interpreting data that come from one variable (for example, "heights of students" or "daily temperatures").
- In MYP and IB Mathematics, the goal is not just to calculate values, but to choose summaries and representations that reveal relationships, trends, and unusual values in a population.
Univariate data
Data consisting of observations of a single variable for each individual or item (for example, one height measurement per student).
Distribution
The overall pattern of a data set, including its center, spread, and shape (for example, symmetry, skewness, clusters, and gaps).
Classifying Data Helps You Choose The Right Tools
Before calculating anything, identify the type of data, because it affects which graphs and measures are meaningful.
Categorical data
Data grouped into categories or labels (for example, eye color, type of transport). Arithmetic operations like averaging are not meaningful.
Numerical data
Data measured or counted as numbers (for example, height, time, number of siblings). Numerical data can be summarized using measures such as mean and quartiles.
Numerical data is often divided into:
- Discrete data (counts, usually whole numbers, such as "number of goals")
- Continuous data (measurements that can take any value in an interval, such as "mass")
Most of the univariate techniques in this topic (stem-and-leaf diagrams, quartiles, five-number summary, box plots, outliers) are designed for numerical data.
Organizing Data Makes Patterns Easier To See
- Raw data can be hard to interpret.
- You often begin by ordering the values, or by creating a frequency table (especially when values repeat).
- Another powerful organizer is the stem-and-leaf diagram.
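The ordering and frequency-table steps can be sketched in a few lines of Python. This is a minimal illustration only: the list `siblings` is hypothetical data made up for the example, and `Counter` from the standard library is one convenient way to count repeats.

```python
from collections import Counter

# Hypothetical data (made up for illustration): number of siblings per student.
siblings = [2, 0, 1, 1, 3, 2, 1, 0, 1, 2]

# Step 1: order the values from smallest to largest.
ordered = sorted(siblings)

# Step 2: count how often each value occurs.
frequency = Counter(ordered)

# Print the frequency table, one row per distinct value.
for value in sorted(frequency):
    print(value, frequency[value])
```

Each printed row pairs a value with its frequency, which makes repeated values and the most common value easy to spot.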
Stem-And-Leaf Diagrams Keep The Original Values
- A stem-and-leaf diagram splits each number into a stem (leading digit(s)) and a leaf (final digit).
- It is useful because it shows the shape of the distribution while still preserving the exact data values.
Stem-and-leaf diagram
A display that organizes numerical data by separating each value into a stem and a leaf, allowing the original data values to be reconstructed.
- Always include a key (for example, $1|2 = 12$).
- Without a key, the place value is unclear.
- Decide your stems first and keep the place value consistent.
- Mixing tens and hundreds (or different decimal places) in the same diagram makes it misleading.
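As a rough sketch of how such a diagram could be built, the Python function below handles the simplest case only: non-negative whole numbers, with the tens (and higher) digits as the stem and the units digit as the leaf. The function name `stem_and_leaf` and the data in `scores` are hypothetical and chosen for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf diagram for non-negative integers.

    Assumption: stem = tens (and higher) digits, leaf = units digit.
    """
    groups = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 23 -> stem 2, leaf 3
        groups[stem].append(leaf)

    lines = []
    # Include stems with no leaves so that gaps in the data stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())

    # The key fixes the place value used above.
    lines.append("Key: 1 | 2 = 12")
    return lines

# Hypothetical data, made up for illustration.
scores = [12, 15, 21, 23, 23, 28, 34, 41, 47]
for line in stem_and_leaf(scores):
    print(line)
```

Because the leaves are stored in sorted order, every original value can be read back from the diagram by combining a stem with each of its leaves.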
Measures Of Central Tendency Describe A Typical Value
To describe a "typical" or "central" value, the most common measures are the mean and the median.
Mean
The arithmetic average of a data set, found by adding all values and dividing by the number of values.
Median
The middle value when the data set is ordered. If there are two middle values, the median is their average.
- The mean uses every value, so it can be affected strongly by outliers.
- The median depends on order rather than size, so it is more resistant to extreme values.
- Think of the mean as a "balancing point" of a set of weights, while the median is the "person in the middle of a line."
- A single extreme value can pull the balancing point, but it often does not change who is in the middle.
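A small numerical check can make the contrast concrete. In the sketch below, both lists are hypothetical data made up for illustration; they differ only in their largest value.

```python
from statistics import mean, median

# Hypothetical data, made up for illustration.
typical = [10, 12, 13, 14, 16]
with_outlier = [10, 12, 13, 14, 61]  # same data, but the last value is extreme

# The mean uses every value, so the extreme value pulls it upward.
print(mean(typical), mean(with_outlier))

# The median only depends on the middle position, so it does not move.
print(median(typical), median(with_outlier))
```

Here the mean rises from 13 to 22 when the extreme value is introduced, while the median stays at 13 in both lists.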
Measures Of Dispersion Describe How Spread Out The Data Is
- Two sets of data can have the same center but very different variability.
- Measures of dispersion quantify how spread out the data values are.
Range And Interquartile Range Summarize Spread In Two Ways
Range
The difference between the largest and smallest values in a data set: $\text{range}=\text{maximum}-\text{minimum}$.
- The range is easy to compute, but it depends only on two values, so it is sensitive to extremes.
- To get a spread measure that focuses on the "middle" of the data, we use quartiles.
Quartiles
Values that split an ordered data set into four equal parts: $Q_1$ (lower quartile), $Q_2$ (the median), and $Q_3$ (upper quartile).
Interquartile range (IQR)
The spread of the middle 50% of the data: $\text{IQR}=Q_3-Q_1$.
The IQR is resistant to outliers because it ignores the lowest and highest 25% of the data.
When comparing variability between two groups, the IQR is often more informative than the range because it reflects the spread of the "typical" values.
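As a quick worked illustration (the data are hypothetical, the range is taken as maximum minus minimum, and the quartiles are taken as the medians of the lower and upper halves), consider the ordered set $2, 4, 5, 7, 8, 10, 11, 30$:

$$Q_1=\frac{4+5}{2}=4.5,\qquad Q_3=\frac{10+11}{2}=10.5,\qquad \text{IQR}=10.5-4.5=6,\qquad \text{range}=30-2=28$$

If the largest value were 12 instead of 30, the range would fall to $12-2=10$, while $Q_1$, $Q_3$, and the IQR would all stay the same.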
Standard Deviation Measures Typical Distance From The Mean
- A more advanced measure of spread is the standard deviation, which is based on how far values typically lie from the mean.
- The variance (here in its population form, dividing by $n$) is: $$\sigma^{2}=\frac{\sum (x-\bar{x})^{2}}{n}$$ and the standard deviation is its square root: $$\sigma=\sqrt{\frac{\sum (x-\bar{x})^{2}}{n}}$$
- Here, $\bar{x}$ is the mean, $n$ is the number of values, and $x$ represents each data value.
- The deviations $(x-\bar{x})$ can be positive or negative.
- Squaring ensures they do not cancel out when added, so the variance is always non-negative.
- Do not confuse variance and standard deviation.
- Variance is measured in squared units (for example, cm$^2$).
- Standard deviation takes the square root, returning to the original units (cm).
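The formulas above can be followed step by step in Python. This is a minimal sketch: the list `heights` is hypothetical data, and the calculation divides by $n$ as in the formulas given here. The standard library functions `pvariance` and `pstdev` use the same divisor and serve as a cross-check, whereas `statistics.variance` and `statistics.stdev` divide by $n-1$ and would give different values.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

# Hypothetical heights in cm, made up for illustration.
heights = [150, 160, 170, 180, 190]

x_bar = mean(heights)  # the mean of the data

# Squared deviations from the mean: squaring stops positives and negatives cancelling.
squared_deviations = [(x - x_bar) ** 2 for x in heights]

variance = sum(squared_deviations) / len(heights)  # units: cm^2
std_dev = sqrt(variance)                           # units: cm

print(variance, std_dev)

# Cross-check against the standard library's population versions.
print(pvariance(heights), pstdev(heights))
```

For this data the mean is 170 cm, the variance is 200 cm$^2$, and the standard deviation is about 14.1 cm.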
The Five-Number Summary Gives A Compact Description Of A Distribution
- A widely used univariate summary is the five-number summary.
- It is a set of five values that describe a distribution: minimum, $Q_1$, median ($Q_2$), $Q_3$, and maximum.
How To Find A Five-Number Summary
- Order the data from smallest to largest.
- Identify the minimum and maximum.
- Find the median.
- Find $Q_1$ as the median of the lower half of the data.
- Find $Q_3$ as the median of the upper half of the data.
- There are slightly different conventions for quartiles when the number of data points is odd (whether the overall median is included in each half).
- Use one method consistently and follow any method specified by the question.
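The steps above can be sketched in Python as follows. This is one possible implementation under a stated assumption: when the number of values is odd, the overall median is excluded from both halves. Other conventions (and software defaults) can give slightly different quartiles. The function name `five_number_summary` and the data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for at least two values.

    Assumed convention: for an odd number of values, the overall median
    is left out of both the lower and the upper half.
    """
    data = sorted(values)          # order the data
    n = len(data)
    half = n // 2
    lower = data[:half]            # values below the middle position
    upper = data[half + 1:] if n % 2 else data[half:]  # values above it
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Hypothetical data, made up for illustration.
data = [7, 3, 9, 4, 12, 5, 8, 10, 6]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)
print("IQR =", q3 - q1)
```

For this data the ordered list is 3, 4, 5, 6, 7, 8, 9, 10, 12, so the summary is 3, 4.5, 7, 9.5, 12 and the IQR is 5.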
Box-And-Whisker Plots Turn The Five-Number Summary Into A Picture
- A box-and-whisker plot (often shortened to box plot) is a visual representation of the five-number summary.
- It is especially useful when you want to compare distributions.
Box-and-whisker plot
A diagram that displays the five-number summary using a box from $Q_1$ to $Q_3$, a median line, and whiskers extending to the minimum and maximum (or to the most extreme non-outlier values, depending on convention).
What A Box Plot Tells You Quickly
- Center: the median line shows a typical value.
- Typical spread: the box length is the IQR.
- Overall spread: whiskers show spread beyond the middle 50%.
- Skewness: if the median is not centered in the box, or one whisker is much longer, the distribution may be skewed.
When comparing two box plots, comment in this order: median (center), IQR (typical spread), then overall range and any skewness.
Identifying Outliers Helps You Interpret Data Responsibly
- Some values sit unusually far from the rest.
- These outliers might be errors, or they might represent real individuals who "stand out in a crowd."
Outlier
A data value that lies an unusually large distance from the rest of the data set.
- A common rule uses the IQR:
- Lower fence: $Q_1-1.5\,\text{IQR}$
- Upper fence: $Q_3+1.5\,\text{IQR}$
- Values outside these fences are flagged as potential outliers.
- Do not automatically delete outliers.
- First check for recording or measurement errors, then decide whether the value represents a genuine but unusual case.
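The fence rule can be sketched in Python as below. The function names `iqr_fences` and `flag_outliers` and the data are hypothetical. The quartiles are passed in rather than computed, and the values used here assume the "median of each half" convention with the overall median excluded.

```python
def iqr_fences(q1, q3):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def flag_outliers(values, q1, q3):
    """Return the values lying outside the fences (potential outliers only)."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]

# Hypothetical data, made up for illustration.
data = [3, 4, 5, 6, 7, 8, 9, 10, 30]
q1, q3 = 4.5, 9.5  # medians of [3, 4, 5, 6] and [8, 9, 10, 30]

print(iqr_fences(q1, q3))
print(flag_outliers(data, q1, q3))
```

Here the IQR is 5, so the fences are at $-3$ and $17$, and only the value 30 is flagged. The function reports candidates; deciding whether to keep or investigate them remains a judgment about the data.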
Comparing Distributions Requires Center, Spread, And Shape
When you compare two univariate data sets (for example, two classes' test scores), your comparison should include:
- Center: mean or median (often median for skewed data)
- Spread: IQR, range, and sometimes standard deviation
- Shape: symmetry vs skewness, clusters, gaps
- Outliers: whether there are unusual values and how they affect interpretation
For example, suppose two classes take the same quiz.
- Class A: median 72, IQR 18
- Class B: median 72, IQR 10
Both classes have the same "typical score," but Class B is more consistent because its middle 50% is less spread out.
- Use comparative language: "higher/lower median," "larger/smaller IQR," "more/less spread," "more/less skewed."
- Avoid vague conclusions like "Class A is better" unless you define what "better" means.
- Can you classify a variable as categorical or numerical (and discrete/continuous)?
- Can you construct and interpret a stem-and-leaf diagram with a key?
- Can you find $Q_1$, median, and $Q_3$ from ordered data and compute the IQR?
- Can you create a five-number summary and sketch a box-and-whisker plot?
- Can you identify potential outliers using the $1.5\times\text{IQR}$ rule?
- Can you compare two distributions using center, spread, and shape?