This article recaps the statistics concepts covered in the previous article and extends them to grouped data and cumulative frequency.
Why We Describe And Quantify Data
- When you collect data from a class survey, a science experiment, or a sustainability study (for example, energy use in different homes), the raw list of values is rarely useful on its own.
- Describing and quantifying data means turning those values into clear summaries and representations so you can spot relationships, trends, and unusual results.
- Two big ideas run through this topic:
- Where the data tends to be (typical values), and
- How spread out the data is (variation).
Distribution
The overall pattern of a data set, including its center, spread, and shape (for example, symmetry, skewness, clusters, and gaps).
Data Type Determines What You Can Calculate And Draw
Before calculating any statistics, identify the type of data, because this affects which summaries and graphs are appropriate.
Qualitative data
Non-numerical information, such as categories or descriptions, including people's thoughts, feelings, and perceptions; it is often gathered through interviews or observations.
Quantitative data
Numerical information that can be measured and recorded, such as height, weight, shoe size, or the depth of a kitchen counter.
Quantitative data can be split into:
Discrete data
Quantitative data that can be counted or can only take specific separated values (for example, number of goals scored, number of people, shoe size).
Continuous data
Quantitative data that is measured and can take any value within a range (for example, height, mass, temperature).
- A common mistake is treating continuous data as if it were discrete just because it has been rounded (for example, recording height to the nearest cm).
- The underlying variable is still continuous.
Organizing Raw Data So Patterns Become Visible
Frequency Tables And Grouped Frequency Tables
- A frequency table records how often each value (or category) occurs.
- For discrete quantitative or qualitative data, you can usually list each value/category.
- For continuous data, or when there are many possible values, you often use grouped data, where values are placed into class intervals (for example, 150 to <160 cm).
Grouped frequency table
A frequency table where values are combined into class intervals, and each interval has a frequency.
- Grouped data is efficient, but it loses information.
- Once values are grouped, you cannot know the original individual measurements exactly.
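As a minimal sketch of how raw values become a grouped frequency table, the Python snippet below sorts measurements into intervals of the form "lower to <upper". The heights, the class width of 10, and the starting boundary of 150 are all hypothetical choices made for illustration.

```python
from collections import Counter

# Hypothetical heights in cm (illustrative values, not real survey data).
heights = [151.2, 158.9, 160.0, 163.4, 167.8, 171.5, 172.0, 179.9, 154.6, 165.1]

CLASS_WIDTH = 10  # assumed class width
START = 150       # assumed lower boundary of the first class


def class_interval(value, start=START, width=CLASS_WIDTH):
    """Return (lower, upper) such that lower <= value < upper."""
    index = int((value - start) // width)
    lower = start + index * width
    return (lower, lower + width)


# Count how many values fall into each class interval.
grouped = Counter(class_interval(h) for h in heights)

for (lower, upper), frequency in sorted(grouped.items()):
    print(f"{lower} to <{upper}: {frequency}")
```

Note that 160.0 lands in "160 to <170", not "150 to <160", and that the table alone no longer tells you the individual heights.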
Stem-And-Leaf Diagrams Preserve Individual Values
- A stem-and-leaf diagram is a neat way to organize numerical data while still showing every value.
- The stem represents the leading digit(s).
- The leaf represents the final digit(s).
- A key explains how to read the numbers.
- This makes it easy to see the shape of the distribution and quickly identify the median and quartiles.
For example, the key "$1\,|\,0$ represents 10" tells you that a stem of 1 and a leaf of 0 combine to make 10.
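The construction can be sketched in Python for two-digit whole numbers, taking the tens digit as the stem and the units digit as the leaf. The scores are hypothetical, and the function name `stem_and_leaf` is just an illustrative choice.

```python
from collections import defaultdict

# Hypothetical two-digit test scores (illustrative only).
scores = [10, 12, 17, 21, 21, 25, 30, 34, 38, 38, 41]


def stem_and_leaf(values):
    """Map each stem (tens digit) to its ordered list of leaves (units digit)."""
    diagram = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        diagram[stem].append(leaf)
    return dict(diagram)


diagram = stem_and_leaf(scores)
for stem in sorted(diagram):
    leaves = " ".join(str(leaf) for leaf in diagram[stem])
    print(f"{stem} | {leaves}")
print("Key: 1 | 0 represents 10")
```

Because every leaf is kept, repeated values such as 21 and 38 remain visible, which a grouped table would hide.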
Measures Of Central Tendency Describe A Typical Value
The three most common measures are mean, median, and mode.
Mean
- The mean is the arithmetic average.
- For ungrouped data with $n$ values $x_1, x_2, \dots, x_n$: $$\bar{x}=\frac{x_1+x_2+\cdots+x_n}{n}$$
- For a frequency table (ungrouped): $$\bar{x}=\frac{\sum (x\times f)}{\sum f}$$
- For a grouped frequency table, the exact mean cannot be found because the individual values are unknown.
- Instead, you estimate it using the class midpoint $m$ for each interval: $$\bar{x}\approx\frac{\sum (m\times f)}{\sum f}$$
To find a class midpoint, average the class boundaries. For 150 to <160, the midpoint is $\frac{150+160}{2}=155$.
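The two frequency-table formulas can be checked with a short Python sketch. Both tables below are hypothetical, and the function names are illustrative only. The first calculation gives an exact mean because each value is known. The second gives only an estimate because every value in a class is treated as if it sat at the midpoint.

```python
# Hypothetical ungrouped frequency table: value -> frequency (goals per match).
goals = {0: 4, 1: 6, 2: 7, 3: 3}

# Hypothetical grouped frequency table: (lower, upper, frequency), heights in cm.
height_classes = [(150, 160, 3), (160, 170, 4), (170, 180, 3)]


def mean_from_frequency_table(table):
    """Exact mean: sum of (x * f) divided by the sum of f."""
    total_f = sum(table.values())
    return sum(x * f for x, f in table.items()) / total_f


def estimated_mean_from_grouped(rows):
    """Estimated mean: sum of (midpoint * f) divided by the sum of f."""
    total_f = sum(f for _, _, f in rows)
    return sum((lower + upper) / 2 * f for lower, upper, f in rows) / total_f


print(mean_from_frequency_table(goals))             # 29 / 20
print(estimated_mean_from_grouped(height_classes))  # 1650 / 10
```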
Median
- The median is the middle value when the data is ordered.
- If $n$ is odd, the median is the $\frac{n+1}{2}$th value.
- If $n$ is even, the median is the mean of the $\frac{n}{2}$th and $\left(\frac{n}{2}+1\right)$th values.
- The median is also called the second quartile, $\mathbf{Q_2}$.
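The odd and even cases can be sketched in Python as follows. The function name and sample lists are illustrative. Python lists are indexed from 0, so the $\frac{n+1}{2}$th value sits at index `n // 2` when $n$ is odd.

```python
def median_value(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th value, which is index mid (0-based).
        return ordered[mid]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_value([7, 3, 9, 1, 5]))  # ordered: 1 3 5 7 9
print(median_value([7, 3, 9, 1]))     # ordered: 1 3 7 9
```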
Mode
- The mode is the most frequent value.
- A data set can have one mode, more than one mode, or no mode.
- For grouped data, you may refer to the modal class (the class interval with the highest frequency).
- The mean is sensitive to extreme values, while the median is more resistant.
- If there are outliers, the median often represents a "typical" value better.
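A small Python sketch shows the three mode cases (one, several, none). It assumes one common convention: if every value occurs exactly once, the data set is said to have no mode. Other courses may state this differently, so treat the rule in the code as an assumption.

```python
from collections import Counter


def modes(values):
    """Return a sorted list of modes; an empty list means 'no mode'."""
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    # Assumed convention: if nothing repeats, report no mode.
    if highest == 1 and len(counts) > 1:
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([4, 4, 6]))           # one mode
print(modes([2, 3, 3, 5, 5, 7]))  # two modes
print(modes([1, 2, 3]))           # no mode
```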
Measures Of Dispersion Describe Variation In The Data
Range
- The range measures the total spread: $$\text{range}=\text{max}-\text{min}$$
- It is quick to calculate but is heavily affected by extreme values.
Quartiles And The Interquartile Range
- Quartiles split ordered data into four equal parts.
- The lower quartile $\mathbf{Q_1}$ is the median of the observations below the median in the ordered list.
- The upper quartile $\mathbf{Q_3}$ is the median of the observations above the median in the ordered list.
- The interquartile range (IQR) measures the spread of the middle 50% of data: $$\text{IQR}=Q_3-Q_1$$
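The "median of each half" method described above can be sketched in Python. When $n$ is odd, this sketch leaves the median itself out of both halves. Calculators and software sometimes use other quartile conventions and may give slightly different answers, so this is one possible method, not the only one. The data values are hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 2:
        raise ValueError("need at least two values to find quartiles")
    half = n // 2
    lower_half = ordered[:half]
    # For odd n, skip the middle value so it belongs to neither half.
    upper_half = ordered[half + n % 2:]
    return median(lower_half), median(ordered), median(upper_half)


data = [3, 5, 7, 8, 12, 13, 14, 18, 21]  # hypothetical data set
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```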
Outliers Using The 1.5×IQR Rule
- An outlier is a data value that does not fit the general pattern of the rest.
- A point is an outlier if it is:
- less than $Q_1-1.5\times\text{IQR}$, or
- greater than $Q_3+1.5\times\text{IQR}$.
When asked to "identify outliers", show your working clearly:
- find $Q_1$ and $Q_3$,
- compute IQR,
- compute the two fences $Q_1-1.5\,\text{IQR}$ and $Q_3+1.5\,\text{IQR}$,
- compare each data value to the fences.
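These four steps can be sketched in Python. The data set is hypothetical, and the quartiles are found with the median-of-halves method, which is an assumption about the convention in use.

```python
from statistics import median


def outlier_fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical ordered data with one suspiciously large value.
data = sorted([3, 5, 7, 8, 12, 13, 14, 18, 21, 40])

# Step 1: find Q1 and Q3 as the medians of the lower and upper halves.
n = len(data)
half = n // 2
q1 = median(data[:half])
q3 = median(data[half + n % 2:])

# Steps 2 and 3: compute the IQR and the two fences.
lower_fence, upper_fence = outlier_fences(q1, q3)

# Step 4: compare each data value to the fences.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(q1, q3, lower_fence, upper_fence, outliers)
```

Here $Q_1=7$ and $Q_3=18$, so the IQR is 11, the fences are $-9.5$ and $34.5$, and only 40 lies outside them.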
Box-And-Whisker Plots Summarize And Compare Distributions
A box-and-whisker plot (or box plot) is a diagram based on the five-point summary: the minimum, $Q_1$, the median, $Q_3$, and the maximum.
- The box runs from $Q_1$ to $Q_3$.
- The line inside the box is the median ($Q_2$).
- The whiskers extend to the minimum and maximum (or sometimes to the last non-outlier values, depending on convention).
- To compare two box plots, focus on:
- medians (typical values),
- IQRs (middle spread),
- overall ranges, and
- outliers.
- This gives a much deeper comparison than only comparing means.
Grouped Data And Cumulative Frequency Curves Let You Estimate Quartiles And Percentiles
When data is grouped, you often use a cumulative frequency curve (also called an ogive) to estimate the median, quartiles, and other percentiles.
Cumulative frequency
The running total of frequencies up to a given value or class boundary.
Building A Cumulative Frequency Table (Idea)
- From a grouped frequency table, you add frequencies as you go to create cumulative totals.
- You then plot:
- the upper class boundary on the horizontal axis, and
- the cumulative frequency on the vertical axis.
- Join the points with a smooth increasing curve.
- A cumulative frequency curve is only meaningful for numerical data with an order (quantitative data).
- You cannot make a valid ogive from categories such as eye color.
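A minimal Python sketch of the running totals and the points to plot is shown below. The grouped table is hypothetical. The sketch also adds a starting point at the lower boundary of the first class with cumulative frequency 0, which is a common plotting convention assumed here.

```python
from itertools import accumulate

# Hypothetical grouped frequency table: (lower, upper, frequency).
table = [(150, 160, 8), (160, 170, 16), (170, 180, 12), (180, 190, 4)]

upper_boundaries = [upper for _, upper, _ in table]
cumulative = list(accumulate(f for _, _, f in table))

# Points to plot: (upper class boundary, cumulative frequency).
# Assumed convention: start the curve at (first lower boundary, 0).
points = [(table[0][0], 0)] + list(zip(upper_boundaries, cumulative))

for boundary, cf in points:
    print(f"up to {boundary}: {cf}")
```

The final cumulative frequency should always equal the total number of observations, which is a quick check on the arithmetic.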
Reading The Five-Point Summary From A Cumulative Frequency Curve
- Suppose there are $N$ total observations.
- Minimum: estimated from where the curve starts (near cumulative frequency 0).
- Maximum: estimated from where the curve ends (near cumulative frequency $N$).
- Lower quartile $Q_1$: the value at cumulative frequency $0.25N$.
- Median $Q_2$: the value at cumulative frequency $0.50N$.
- Upper quartile $Q_3$: the value at cumulative frequency $0.75N$.
- To find each quartile, draw a horizontal line from the chosen cumulative frequency level to the curve, then drop a vertical line to the value axis.
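Reading across and down can be imitated in Python. This sketch joins the plotted points with straight line segments and interpolates along them, which only approximates a hand-drawn smooth curve. The plotted points are hypothetical, and `read_value` is an illustrative name.

```python
def read_value(points, target_cf):
    """Estimate the data value at a given cumulative frequency.

    points: (value, cumulative frequency) pairs in increasing order.
    Straight segments between points stand in for the smooth curve.
    """
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target_cf <= c1:
            if c1 == c0:
                return x0
            return x0 + (target_cf - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("target cumulative frequency is outside the curve")


# Hypothetical ogive points: (upper class boundary, cumulative frequency).
points = [(150, 0), (160, 8), (170, 24), (180, 36), (190, 40)]
N = points[-1][1]

q1 = read_value(points, 0.25 * N)
q2 = read_value(points, 0.50 * N)
q3 = read_value(points, 0.75 * N)
print(q1, q2, q3)
```

With $N=40$, the quartiles are read at cumulative frequencies 10, 20, and 30.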
Percentiles
- A percentile is a value below which a given percentage of data falls.
- The $k$th percentile corresponds to cumulative frequency $\frac{k}{100}N$.
- For example, the 90th percentile is read at $0.90N$.
- If $N=200$ and you want the 90th percentile, use cumulative frequency $0.9\times 200=180$.
- Read across at 180 to the curve, then down to estimate the value.
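The same idea can be sketched for any percentile in Python. The ogive points below are hypothetical (test marks with $N=200$), and straight-line interpolation between plotted points is an assumed stand-in for reading a smooth curve by eye.

```python
def percentile_from_ogive(points, k):
    """Estimate the k-th percentile from (value, cumulative frequency) points."""
    total = points[-1][1]
    target = k * total / 100  # cumulative frequency to read across at
    for (x0, c0), (x1, c1) in zip(points, points[1:]):
        if c0 <= target <= c1 and c1 > c0:
            return x0 + (target - c0) / (c1 - c0) * (x1 - x0)
    raise ValueError("percentile must be between 0 and 100")


# Hypothetical ogive for N = 200 test marks.
marks = [(0, 0), (20, 30), (40, 90), (60, 150), (80, 190), (100, 200)]

p90 = percentile_from_ogive(marks, 90)  # read across at 0.9 * 200 = 180
print(p90)
```

In this made-up data, 180 lies between the plotted cumulative frequencies 150 and 190, so the estimate falls between the marks 60 and 80.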
When given a data set in an investigation or exam, a clear approach helps you avoid errors.
- Identify the data type (qualitative, discrete, continuous).
- Choose an appropriate representation (frequency table, grouped table, stem-and-leaf, box plot, cumulative frequency curve).
- Calculate or estimate key statistics (mean/median/mode; range/IQR; five-point summary).
- Interpret in context: comment on typical values, spread, skew, and possible outliers.
- Explain why the mean from grouped data is only an estimate.
- What does the IQR tell you that the range does not?
- A data set has $Q_1=12$, $Q_3=20$. Find the outlier fences.
- On an ogive with $N=80$, at what cumulative frequencies do you read $Q_1$, $Q_2$, and $Q_3$?
How Real Data Can Be Misleading
Even when calculations are correct, conclusions can still be unreliable if the data collection or representation is poor.
Common issues include:
- Biased sampling (surveying only one group)
- Inappropriate grouping (class widths that hide variation)
- Cherry-picking measures (reporting the mean when the median would be more representative)
- Ignoring outliers without justification
- A city reports "average household energy use" using the mean.
- A few very large homes with high consumption can pull the mean upward, making most households appear less efficient than they really are.
- Reporting the median and a box plot would communicate the typical household more fairly.
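The effect is easy to reproduce with made-up numbers. The monthly energy figures below are hypothetical and chosen only to show how two very large homes shift the mean while leaving the median almost untouched.

```python
from statistics import mean, median

# Hypothetical monthly energy use in kWh for nine households.
# The last two values represent very large homes.
usage_kwh = [280, 300, 310, 320, 330, 340, 350, 2500, 3000]

mean_usage = mean(usage_kwh)
median_usage = median(usage_kwh)
above_mean = sum(1 for u in usage_kwh if u > mean_usage)

print(f"mean:   {mean_usage:.1f} kWh")
print(f"median: {median_usage} kWh")
print(f"households above the mean: {above_mean} of {len(usage_kwh)}")
```

Only the two large homes sit above the mean here, so the mean describes almost none of the households well, while the median matches a typical one.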