- In the previous article, we talked about the importance of spread.
- This article recaps standard deviation and discusses the calculations that go with it.
Sigma Notation And Summary Statistics Make Calculations Efficient
In statistics we often use sigma notation to write sums compactly.
Sigma notation (Σ)
A notation meaning “sum of.” For example, $\sum x$ means add all the data values $x$ together.
- Two summary statistics appear frequently:
- $\sum x$: the sum of all data values
- $\sum x^2$: the sum of the squares of the data values
- Many calculators (GDCs) can provide these directly on a summary statistics screen, which is helpful when the dataset is large.
- Be careful: $\sum x^2$ means "square each value, then add," not "square the sum."
- In symbols, $\sum x^2 \neq (\sum x)^2$.
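The difference between the two quantities is easy to check numerically. The short sketch below uses a made-up three-value dataset (not from the article) purely for illustration.

```python
# Hypothetical data, chosen only to illustrate the notation.
data = [1, 2, 3]

sum_x = sum(data)                          # sum of the values: 1 + 2 + 3
sum_x_squared = sum(x**2 for x in data)    # square each value, then add: 1 + 4 + 9
square_of_sum = sum_x**2                   # add first, then square: 6**2

print(sum_x_squared)   # 14
print(square_of_sum)   # 36
```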
Mean And Mean Of Squares Are Not The Same
- The mean of a dataset is $$\mu = \frac{\sum x}{n}$$ where $n$ is the number of values.
- Students sometimes assume that squaring the mean is the same as the mean of the squared values, but they are generally different:
- Square of the mean: $\mu^2 = \left(\frac{\sum x}{n}\right)^2$
- Mean of squares: $\frac{\sum x^2}{n}$
- This distinction matters because the efficient standard deviation formula uses both $\frac{\sum x^2}{n}$ and $\left(\frac{\sum x}{n}\right)^2$.
- Do not replace $\frac{\sum x^2}{n}$ with $\left(\frac{\sum x}{n}\right)^2$.
- They only match in special cases (for example, when all values are equal).
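As a quick check of this distinction, the sketch below compares the two quantities on two made-up datasets. The function names `square_of_mean` and `mean_of_squares` are illustrative, not standard library functions.

```python
def square_of_mean(values):
    """Find the mean first, then square it."""
    return (sum(values) / len(values)) ** 2

def mean_of_squares(values):
    """Square each value first, then find the mean of those squares."""
    return sum(x**2 for x in values) / len(values)

varied = [2, 4, 6]       # hypothetical data with some spread
constant = [5, 5, 5]     # hypothetical data with no spread

print(square_of_mean(varied), mean_of_squares(varied))      # 16.0 and about 18.67
print(square_of_mean(constant), mean_of_squares(constant))  # 25.0 and 25.0
```

The gap between the two results for the varied data is exactly what the efficient standard deviation formula measures.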
Population Standard Deviation From Raw Data
If your data represents the entire population you care about (not just a sample), use the population standard deviation.
Population standard deviation (σ)
For a full population of size $n$ with mean $\mu$, the standard deviation is
$$\sigma = \sqrt{\frac{\sum (x-\mu)^2}{n}}$$
Step-By-Step Method (Raw Data)
- Find the mean $\mu$.
- Compute each deviation $(x-\mu)$.
- Square each deviation $(x-\mu)^2$.
- Add the squared deviations, then divide by $n$.
- Take the square root.
- This method highlights meaning: you are finding a typical distance from the mean (after squaring to avoid negative and positive deviations cancelling out).
- When working by hand, keep $\mu$ as an exact fraction until the end to reduce rounding error.
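The five steps above can be sketched in Python as follows. The dataset is hypothetical, and the function name `population_sd` is chosen here for illustration; Python's built-in `statistics.pstdev` is used only as a cross-check.

```python
import math
import statistics

def population_sd(values):
    """Population standard deviation, following the step-by-step method."""
    n = len(values)
    mu = sum(values) / n                         # Step 1: find the mean
    deviations = [x - mu for x in values]        # Step 2: each deviation (x - mu)
    squared = [d**2 for d in deviations]         # Step 3: square each deviation
    mean_squared_deviation = sum(squared) / n    # Step 4: add, then divide by n
    return math.sqrt(mean_squared_deviation)     # Step 5: take the square root

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5

print(population_sd(data))        # 2.0
print(statistics.pstdev(data))    # library cross-check, also 2.0
```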
A Faster Population Formula Using Σx And Σx²
- Starting from $$\sigma = \sqrt{\frac{\sum (x-\mu)^2}{n}},$$ expanding and simplifying leads to a very practical form: $$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$$
- This formula shows that to compute $\sigma$ you only need:
- $n$
- $\sum x$
- $\sum x^2$
This "summary statistics" formula is especially useful when a calculator provides $\sum x$ and $\sum x^2$ directly.
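For readers who want to see where the faster formula comes from, here is one way to write out the expansion. It uses only the definition $\mu = \frac{\sum x}{n}$, which means $\sum x = n\mu$.

$$\sum (x-\mu)^2 = \sum x^2 - 2\mu \sum x + n\mu^2$$

$$= \sum x^2 - 2\mu(n\mu) + n\mu^2 = \sum x^2 - n\mu^2$$

Dividing by $n$ gives

$$\frac{\sum (x-\mu)^2}{n} = \frac{\sum x^2}{n} - \mu^2 = \frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2,$$

and taking the square root of both sides gives the formula for $\sigma$.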
- A skier's run times (in minutes) are recorded on $n=15$ runs, with summary statistics $\sum x=129$ and $\sum x^2=1181$.
- Mean: $$\mu = \frac{\sum x}{n} = \frac{129}{15} = 8.6$$
- Population standard deviation: $$\sigma = \sqrt{\frac{\sum x^2}{n} - \left(\frac{\sum x}{n}\right)^2}$$ $$= \sqrt{\frac{1181}{15} - (8.6)^2}$$
- Compute the parts:
- $\frac{1181}{15} \approx 78.7333$
- $(8.6)^2 = 73.96$
- So $$\sigma \approx \sqrt{78.7333 - 73.96} = \sqrt{4.7733} \approx 2.18$$
- Interpretation (if the distribution is approximately normal): about 68% of runs are within $\pm 1\sigma$ of the mean: $$8.6 \pm 2.18 \Rightarrow [6.42,\ 10.78]$$
- A common calculator mistake is to confuse $\frac{\sum x^2}{n}$ with $\frac{(\sum x)^2}{n}$, or to type $1181^2$ instead of $1181$.
- The formula requires $\sum x^2$ as given, not squared again.
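The skier calculation can be reproduced with a short sketch that takes only $n$, $\sum x$ and $\sum x^2$ as inputs. The function name `population_sd_from_summary` is an illustrative choice, not a library function.

```python
import math

def population_sd_from_summary(n, sum_x, sum_x2):
    """Population standard deviation from n, the sum of x, and the sum of x squared."""
    mean = sum_x / n
    mean_of_squares = sum_x2 / n          # use the sum of squares as given, not squared again
    variance = mean_of_squares - mean**2  # mean of squares minus square of the mean
    return mean, math.sqrt(variance)

# Summary statistics from the skier example.
mean, sigma = population_sd_from_summary(n=15, sum_x=129, sum_x2=1181)

print(round(mean, 2))    # 8.6
print(round(sigma, 2))   # 2.18
```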
Standard Deviation For Frequency Tables (Discrete Data)
- Sometimes data are given as a frequency table, where each value $x$ occurs $f$ times.
- The mean becomes $$\mu = \frac{\sum fx}{n}$$ where $n=\sum f$.
- The population standard deviation for a discrete frequency table is $$\sigma = \sqrt{\frac{\sum f(x-\mu)^2}{n}}$$
- This is the same idea as before, but each squared deviation is counted according to how often the value occurs.
- In exams, always compute $n$ from the table first: $n=\sum f$.
- Many errors come from dividing by the number of rows instead of the total frequency.
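A minimal sketch of the frequency-table calculation is shown below. The table is made up for illustration, and the function name `frequency_table_sd` is an assumption of this example. Note that $n$ is the total frequency (10 here), not the number of rows (4).

```python
import math

def frequency_table_sd(values, frequencies):
    """Mean and population standard deviation for a discrete frequency table."""
    n = sum(frequencies)                                         # n is the total frequency
    mu = sum(f * x for x, f in zip(values, frequencies)) / n     # mean = (sum of f*x) / n
    weighted_squares = sum(f * (x - mu) ** 2 for x, f in zip(values, frequencies))
    return n, mu, math.sqrt(weighted_squares / n)

# Hypothetical table: value x occurs f times.
x_values = [1, 2, 3, 4]
freqs = [2, 3, 4, 1]

n, mu, sigma = frequency_table_sd(x_values, freqs)
print(n)                 # 10
print(round(mu, 2))      # 2.4
print(round(sigma, 3))   # 0.917
```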
Sample Standard Deviation Estimates A Population Spread
- Often you only have a sample from a larger population (for example, 10 days of bus lateness to estimate lateness for the whole town).
- In that case, you typically use a slightly different denominator.
Sample standard deviation ($s_{n-1}$)
An estimate of a population’s standard deviation based on a sample of size $n$:
$$s_{n-1} = \sqrt{\frac{\sum (x-\bar{x})^2}{n-1}},$$ where $\bar{x}$ is the sample mean.
- The key change is dividing by $n-1$ instead of $n$.
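- Dividing by $n$ tends to underestimate the population's spread, because the deviations are measured from the sample mean rather than the true population mean. Dividing by $n-1$ makes the result slightly larger, which helps correct for this bias.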
Deciding between $\sigma$ (divide by $n$) and $s_{n-1}$ (divide by $n-1$) depends on context: are you describing the entire dataset you have, or using it to estimate a larger population?
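Python's standard `statistics` module offers both versions: `pstdev` divides by $n$ and `stdev` divides by $n-1$. The sketch below applies both to a made-up set of 10 lateness values (in minutes); the data are hypothetical and serve only to show the size of the difference.

```python
import statistics

# Hypothetical bus lateness (minutes) on 10 days; mean is 5.
lateness = [3, 5, 2, 8, 4, 6, 5, 7, 3, 7]

sigma = statistics.pstdev(lateness)   # divides by n: describes these 10 days only
s = statistics.stdev(lateness)        # divides by n - 1: estimates the wider population

print(round(sigma, 3))   # 1.897
print(round(s, 3))       # 2.0
```

The sample version is always the larger of the two for the same data, and the gap shrinks as $n$ grows.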
Combining Groups: How The Mean Updates
- When two populations (or groups) are combined, you can find the combined mean without listing every value again:
- If group 1 has $n_1$ values with sum $\sum x_1$ and group 2 has $n_2$ values with sum $\sum x_2$, then the combined mean is $$\mu_{1+2} = \frac{\sum x_1 + \sum x_2}{n_1 + n_2}$$
- This idea is an example of why sigma notation is powerful: you track totals and counts.
- A combined mean is a weighted mean.
- The larger group has more influence because it contributes more values.
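The combined mean can be sketched with totals and counts alone. The group sizes and sums below are made up, and `combined_mean` is an illustrative name.

```python
def combined_mean(n1, sum_x1, n2, sum_x2):
    """Mean of two groups combined, using only their counts and totals."""
    return (sum_x1 + sum_x2) / (n1 + n2)

# Hypothetical groups: group 1 has mean 70/10 = 7, group 2 has mean 330/30 = 11.
result = combined_mean(n1=10, sum_x1=70, n2=30, sum_x2=330)

print(result)   # 10.0
```

The result (10) sits closer to the larger group's mean (11) than to the smaller group's mean (7), which is the weighting effect described above.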
- Your dataset contains every student mark in the year group. Should you use $\sigma$ or $s_{n-1}$?
- You are given $n$, $\sum x$, and $\sum x^2$ only. Which population formula lets you find $\sigma$ directly?