Hypothesis Testing with R
Sample Mean and Population Mean - I

Suppose you want to know the average height of an oak tree in your local park. On Monday, you measure `10` trees and get an average height of `32` ft. On Tuesday, you measure `12` different trees and get an average height of `35` ft. On Wednesday, you measure the remaining `11` trees in the park, whose average height is `31` ft. The average height for all `33` trees in your local park is about `32.8` ft.
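The overall figure can be checked by weighting each day's average by the number of trees measured that day. A minimal sketch in R; the variable names `n_trees`, `daily_means`, and `overall_mean` are made up for illustration:

```r
# Trees measured on Monday, Tuesday, and Wednesday
n_trees <- c(10, 12, 11)

# Average height (ft) of the trees measured on each day
daily_means <- c(32, 35, 31)

# Weight each daily average by its tree count, then divide by the total count
overall_mean <- sum(n_trees * daily_means) / sum(n_trees)

round(overall_mean, 1)  # 1081 / 33, which rounds to 32.8
```

A plain average of the three daily figures would only give the same answer if every day covered the same number of trees, which is why the counts are used as weights here.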

The collections of individual height measurements from Monday, Tuesday, and Wednesday are each called a sample. A sample is a subset of the entire population (here, all the oak trees in the park). The mean of each sample is a sample mean, and it serves as an estimate of the population mean.

Note: the sample means (`32` ft, `35` ft, and `31` ft) were all close to the population mean (`32.8` ft), but each differed slightly from the population mean and from the others.

For a population, the mean is a constant value no matter how many times it’s recalculated. A sample mean, however, depends on exactly which members of the population are selected. From a sample mean, we can then estimate the mean of the population as a whole. There are three main reasons we might use sampling:

• data on the entire population is not available
• data on the entire population is available, but it is so large that it is unfeasible to analyze
• meaningful answers to questions can be found faster with sampling
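To illustrate the trade-off, here is a rough sketch that estimates the mean of a large population from a small random sample. Everything in it is an assumption made for the example: the population is simulated with `rnorm()`, and the names `big_population`, `small_sample`, `full_mean`, and `estimate`, along with the sizes and the seed, are arbitrary.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical large population: one million simulated values (not real data)
big_population <- rnorm(1e6, mean = 50, sd = 10)

# Draw 1,000 members at random, without replacement
small_sample <- sample(big_population, size = 1000)

# Mean of the whole population versus the estimate from the sample
full_mean <- mean(big_population)
estimate <- mean(small_sample)

c(population = full_mean, sample_estimate = estimate)
```

The sample uses only a tiny fraction of the values, yet its mean should land near the population mean. It will rarely match exactly, because it depends on which members happened to be drawn.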

Instructions

1. In the workspace, we’ve generated a random population of size `300` that follows a normal distribution with a mean of `65`. Update the value of `population_mean` to store the `mean()` of `population`. Does it closely match your expectation?
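The workspace code is not reproduced here, so the following is only a sketch of what step 1 might look like. The standard deviation of `3.5` and the seed are assumptions; only the size (`300`), the mean (`65`), and the names `population` and `population_mean` come from the instructions.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible

# Population of 300 values drawn from a normal distribution with mean 65
# (sd = 3.5 is an assumption; the lesson does not state it)
population <- rnorm(300, mean = 65, sd = 3.5)

# Mean of the whole population
population_mean <- mean(population)
population_mean
```

Because the population is itself randomly generated, its mean should be close to `65` but is unlikely to equal it exactly.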

2. Let’s look at how the means of different samples can vary within the same population.

The code in the notebook generates 5 random samples from `population`. `sample_1` is displayed and `sample_1_mean` has been calculated.

Replace the `"Not calculated"` strings with calculations of the means for `sample_2`, `sample_3`, `sample_4`, and `sample_5`.

Look at the population mean and the sample means. Are they all the same? All different? Why?
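One way step 2 could be worked through is sketched below. The notebook's own code is not shown here, so the sample size of `30`, the standard deviation, and the seed are assumptions; the names `sample_1` through `sample_5` and `sample_1_mean` follow the instructions, and the remaining `*_mean` names are extended by analogy.

```r
set.seed(2)  # assumed seed, only to make the sketch reproducible

# Stand-in for the workspace population (sd = 3.5 is an assumption)
population <- rnorm(300, mean = 65, sd = 3.5)
population_mean <- mean(population)

sample_size <- 30  # assumed; the lesson does not state the sample size

# Five independent random samples drawn from the same population
sample_1 <- sample(population, size = sample_size)
sample_2 <- sample(population, size = sample_size)
sample_3 <- sample(population, size = sample_size)
sample_4 <- sample(population, size = sample_size)
sample_5 <- sample(population, size = sample_size)

# Mean of each sample
sample_1_mean <- mean(sample_1)
sample_2_mean <- mean(sample_2)
sample_3_mean <- mean(sample_3)
sample_4_mean <- mean(sample_4)
sample_5_mean <- mean(sample_5)

# Compare the population mean with the five sample means
population_mean
c(sample_1_mean, sample_2_mean, sample_3_mean, sample_4_mean, sample_5_mean)
```

Each call to `sample()` picks a different subset of `population`, so the five means should differ from one another and from `population_mean` while staying reasonably close to it.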