<definition term="Population">The entire set of individuals or items of interest in a statistical study.</definition>

<definition term="Sample">A subset of the population that is used to make inferences about the population.</definition>

<definition term="Population Proportion">The fraction of the population that possesses a certain characteristic.</definition>

<definition term="Sample Proportion">The fraction of the sample that possesses a certain characteristic.</definition>

<definition term="Population Mean">The average of a characteristic for the entire population.</definition>

<definition term="Sample Mean">The average of a characteristic for the sample.</definition>

<definition term="Standard Error">A measure of the variability of a statistic (such as the sample mean or sample proportion) from sample to sample.</definition>

## Estimating Population Proportion

### Using One Sample

The **population proportion** is the fraction of the population that possesses a certain characteristic. The **sample proportion** is the fraction of the sample that possesses the same characteristic. The sample proportion is used to estimate the population proportion.

<callout type="example">In an auditorium with 800 students, a certain proportion are boys and the rest are girls. If 450 of those students are boys, then the proportion of all students who are boys is $\frac{450}{800} = 0.5625$. This is the **population proportion** of boys, represented by the symbol $p$, for the entire population of students.
If, for whatever reason, we can survey only a subset of 75 students, we can calculate the **sample proportion** for that sample. If 45 of those 75 students are boys, the sample proportion of boys, represented by the symbol $\hat{p}$, is $\frac{45}{75} = 0.6$. </callout>

The standard error of the sample proportion is $\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$, where $n$ is the **sample size**. The **margin of error** (ME) is taken as two standard errors:

$$\text{ME} = 2 \cdot \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

<callout type="example">In the example above, the margin of error is: $$\text{ME} = 2 \cdot \sqrt{\frac{0.60(1-0.60)}{75}} \approx 0.11$$ </callout>

The **population proportion** is estimated to be in the interval:

$$\hat{p} - \text{ME} \le p \le \hat{p} + \text{ME}$$

<callout type="example">In the example above, the population proportion is estimated to be in the interval: $$0.60 - 0.11 \le p \le 0.60 + 0.11$$ $$0.49 \le p \le 0.71$$ This means that, with roughly 95% confidence, the population proportion is between $49\%$ and $71\%$. </callout>

### Using Multiple Samples

To get a more accurate estimate of the **population proportion**, multiple samples can be used.

<callout type="example">In the auditorium example, one sample of 75 students with a sample proportion of 0.60 was used. To get a better approximation, repeat this process 50 times. The first sample of 75 had 45 boys, which is 60%. The second sample of 75 had 35 boys, which is approximately 47%. A computer repeats this process until there are 50 samples. When the proportion of boys in each of the 50 samples of 75 students is plotted on a dot plot, the plot is called the **sampling distribution of the sample proportion**. Two statistics about this plot are also provided: the mean of the plotted values is approximately 0.54, and their standard deviation is approximately 0.05.
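The repeated sampling described here can be sketched in Python using only the standard library. This is a minimal illustration, not the procedure the example actually used: the function name `simulate_sample_proportions`, the fixed seed, sampling without replacement, and the use of `statistics.stdev` are all assumptions made for this sketch.

```python
import random
import statistics


def simulate_sample_proportions(population, sample_size, num_samples, seed=0):
    """Draw repeated random samples and return the proportion of 1s in each."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    proportions = []
    for _ in range(num_samples):
        # Sample without replacement, as when surveying distinct students.
        sample = rng.sample(population, sample_size)
        proportions.append(sum(sample) / sample_size)
    return proportions


# Population from the example: 800 students, 450 boys (1) and 350 girls (0).
population = [1] * 450 + [0] * 350

proportions = simulate_sample_proportions(population, sample_size=75, num_samples=50)
mean_of_plot = statistics.mean(proportions)
sd_of_plot = statistics.stdev(proportions)

print(f"mean of the 50 sample proportions: {mean_of_plot:.3f}")
print(f"standard deviation of the 50 sample proportions: {sd_of_plot:.3f}")
```

Because the samples are random, the mean and standard deviation from a run of this sketch will differ somewhat from the 0.54 and 0.05 quoted for the plot.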
These two numbers can now be used to approximate the population proportion of the complete data set. </callout>

The **margin of error** (ME) is two standard deviations of the sampling distribution:

$$\text{ME} = 2 \cdot \sigma$$

where $\sigma$ is the **standard deviation** of the **sampling distribution**, which serves as the standard error of the estimate.

<callout type="example">In the example above, the margin of error is: $$\text{ME} = 2 \cdot 0.05 = 0.10$$ </callout>

The **population proportion** is estimated to be in the interval:

$$\text{mean of plot} - \text{ME} \le p \le \text{mean of plot} + \text{ME}$$

<callout type="example">In the example above, the population proportion is estimated to be in the interval: $$0.54 - 0.10 \le p \le 0.54 + 0.10$$ $$0.44 \le p \le 0.64$$ This means that, with roughly 95% confidence, the population proportion is between $44\%$ and $64\%$. </callout>

<callout type="note">Creating the **sampling distribution** and calculating its mean and standard deviation requires a computer. </callout>

## Estimating Population Mean

### Using One Sample

The **population mean** is the average of a characteristic for the entire population. The **sample mean** is the average of the same characteristic for the sample. The sample mean is used to estimate the population mean.

<callout type="example">At a high school graduation party there are 600 people, including parents and students. There are more parents than students at the party, as a dot plot of all the ages would show. The average age of the 600 people can be approximated by first taking a random sample of 30 people at the party. Here is one random sample of 30 people from the total population of 600 people.
51, 45, 42, 13, 40, 20, 45, 49, 41, 15, 18, 45, 41, 14, 35, 41, 17, 40, 42, 45, 51, 18, 31, 11, 19, 47, 20, 14, 17, 47

To use these 30 numbers to approximate the mean age of the 600 people, first find the mean and the standard deviation of these 30 values. For this sample, the mean, represented by $\bar{x}$, is approximately 32.4. The standard deviation of the sample, represented by $s$, is approximately 13.8. </callout>

The standard error of the sample mean is $\frac{s}{\sqrt{n}}$, where $n$ is the **sample size**. The **margin of error** (ME) is taken as two standard errors:

$$\text{ME} = 2 \cdot \frac{s}{\sqrt{n}}$$

<callout type="example">In the example above, the margin of error is: $$\text{ME} = 2 \cdot \frac{13.8}{\sqrt{30}} \approx 5.04$$ </callout>

The **population mean** is estimated to be in the interval:

$$\bar{x} - \text{ME} \le \mu \le \bar{x} + \text{ME}$$

<callout type="example">In the example above, the population mean is estimated to be in the interval: $$32.4 - 5.04 \le \mu \le 32.4 + 5.04$$ $$27.4 \le \mu \le 37.4$$ This means that, with roughly 95% confidence, the population mean age is between $27.4$ and $37.4$. </callout>

### Using Multiple Samples

To get a more accurate estimate of the **population mean**, multiple samples can be used.

<callout type="example">By using just one sample of 30 ages, it was possible to find an approximate value of the population mean. To get an even more accurate approximation, more samples are needed. Using a computer, 100 samples of 30 ages each were taken. The mean of the first sample, used in the one-sample approximation, was 32.4. The mean of the second sample was 36.5. The mean of each of these 100 samples was then plotted on a dot plot or histogram. This is called the **sampling distribution of the sample mean**. Two more statistics are needed and will likely be given on the Regents exam. The mean of the plotted values is approximately 34.5. Their standard deviation is approximately 2.5.
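As a check on the numbers in this example, the following Python sketch recomputes the statistics of the 30 listed ages with the standard `statistics` module. The variable names are illustrative. One assumption is worth flagging: the quoted value of 13.8 matches the divide-by-$n$ standard deviation (`pstdev`), while the divide-by-$(n-1)$ version (`stdev`) comes out slightly larger; the text does not say which was intended.

```python
import math
import statistics

# The 30 ages listed for the first random sample.
ages = [51, 45, 42, 13, 40, 20, 45, 49, 41, 15,
        18, 45, 41, 14, 35, 41, 17, 40, 42, 45,
        51, 18, 31, 11, 19, 47, 20, 14, 17, 47]

n = len(ages)
mean_age = statistics.mean(ages)
sd_divide_by_n = statistics.pstdev(ages)          # divides by n
sd_divide_by_n_minus_1 = statistics.stdev(ages)   # divides by n - 1

# Standard error of the sample mean, using the divide-by-n value
# (an assumption, chosen because it matches the quoted 13.8).
standard_error = sd_divide_by_n / math.sqrt(n)
margin_of_error = 2 * standard_error

print(f"n = {n}, mean = {mean_age:.2f}")
print(f"sd (divide by n) = {sd_divide_by_n:.2f}")
print(f"sd (divide by n - 1) = {sd_divide_by_n_minus_1:.2f}")
print(f"s / sqrt(n) = {standard_error:.2f}, margin of error = {margin_of_error:.2f}")
```

The value of $\frac{s}{\sqrt{n}}$ from this single sample is close to 2.5, in line with the standard deviation reported for the sampling distribution of 100 sample means.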
</callout>

The **margin of error** (ME) is two standard deviations of the sampling distribution:

$$\text{ME} = 2 \cdot \sigma$$

where $\sigma$ is the **standard deviation** of the **sampling distribution**, which serves as the standard error of the estimate.

<callout type="example">In the example above, the margin of error is: $$\text{ME} = 2 \cdot 2.5 = 5$$ </callout>

The **population mean** is estimated to be in the interval:

$$\text{mean of plot} - \text{ME} \le \mu \le \text{mean of plot} + \text{ME}$$

<callout type="example">In the example above, the population mean is estimated to be in the interval: $$34.5 - 5 \le \mu \le 34.5 + 5$$ $$29.5 \le \mu \le 39.5$$ This means that, with roughly 95% confidence, the population mean age is between $29.5$ and $39.5$. </callout>

<callout type="note">If the **standard deviation** of the sample means is not given, it can be approximated with the formula $\frac{s}{\sqrt{n}}$, where $s$ is the standard deviation of any one of the samples and $n$ is the number of values in each sample. </callout>

## Randomization Test

A **randomization test** is a way to check whether the result of a small random sample is near what you would expect (or hope) it to be.

<callout type="example">In this example, a focus group of 30 students is asked to give their opinion on a new style of backpack. The manufacturer is willing to mass-produce the backpack if 40% of students like it. Unfortunately, only 9 of the 30 students polled, or 30%, liked the new backpack. There is a way to check whether that 30% is close enough to the 40% of the entire student population the manufacturer had specified: a **randomization test**. This is similar to the process for estimating the population mean from many random samples. A computer is used to generate 100 random samples of 30 students from an imaginary group of students in which exactly 40% like the backpack.
The first of these imaginary samples of 30 students happened to have 37% liking the backpack. This process is repeated until there are 100 samples. The proportion who like the backpack in each sample is graphed on a dot plot or histogram. The mean and the standard deviation of those numbers are calculated. In this case, the mean is 0.404 and the standard deviation is 0.089.

Even without being told the mean and standard deviation of this sampling distribution for the imaginary population, it is possible to approximate them. The mean will be about 0.40, since that is the proportion that was used to create the plot. The standard deviation can be approximated with the formula $\sqrt{\frac{p(1-p)}{n}}$, which gives $\sqrt{\frac{0.40(1-0.40)}{30}} \approx 0.089$, matching the value from the dot plot.

On the dot plot, a result of 30% or less occurred 16 times out of 100, so it does not seem unusual. To determine whether 30% is **likely** to happen even though the real percent is 40% for this imaginary population, check whether 30% is within two standard deviations of the mean. For this example, $0.404 - 2 \cdot 0.089 = 0.226$ and $0.404 + 2 \cdot 0.089 = 0.582$. Since 30% is between 22.6% and 58.2%, the company can justify producing the backpacks even though fewer than 40% of the sample of 30 people liked it. </callout>

<callout type="note">The **randomization test** is a **non-parametric** test, meaning it does not assume a specific distribution for the data. </callout>
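The backpack simulation can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the procedure behind the dot plot in the example: each imaginary student is modeled as liking the backpack independently with probability 0.40 (which treats the imaginary population as very large), and the function names `simulate_like_proportions` and `two_sd_interval`, along with the fixed seed, are hypothetical.

```python
import random
import statistics


def simulate_like_proportions(true_p, sample_size, num_samples, seed=0):
    """Return the proportion of 'likes' in each simulated sample."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    results = []
    for _ in range(num_samples):
        likes = sum(1 for _ in range(sample_size) if rng.random() < true_p)
        results.append(likes / sample_size)
    return results


def two_sd_interval(center, spread):
    """Return the interval within two standard deviations of the center."""
    return center - 2 * spread, center + 2 * spread


observed = 9 / 30  # 9 of the 30 students polled liked the backpack

# Simulated version: 100 samples of 30 from a population where 40% like it.
simulated = simulate_like_proportions(true_p=0.40, sample_size=30, num_samples=100)
sim_low, sim_high = two_sd_interval(statistics.mean(simulated),
                                    statistics.stdev(simulated))
at_or_below = sum(1 for p in simulated if p <= observed)
print(f"simulated interval: {sim_low:.3f} to {sim_high:.3f}")
print(f"samples at or below 30%: {at_or_below} of {len(simulated)}")

# Same check using the mean and standard deviation quoted in the example.
text_low, text_high = two_sd_interval(0.404, 0.089)
print(f"interval from quoted values: {text_low:.3f} to {text_high:.3f}")
print(f"30% is within two standard deviations: {text_low <= observed <= text_high}")
```

Because the simulated samples are random, the interval and count from a run will not match the 0.226 to 0.582 and 16 reported for the dot plot exactly, but the check itself is the same: see whether the observed 30% falls within two standard deviations of the simulated mean.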