3 Populations and Samples

Population: the entire group we want to know about.

All the students in your university; all gut bacteria in humans.

Sample: the smaller group we actually measure.

50 students chosen for a survey; 200 stool samples collected for sequencing.

# Imagine the population: everyone's height in cm (mean=170, sd=10)
# No seed is set, so your numbers will differ slightly from the output shown
population <- rnorm(100000, mean=170, sd=10)

# Take different sample sizes
samp_5   <- sample(population, 5)
samp_30  <- sample(population, 30)
samp_200 <- sample(population, 200)

mean(samp_5)

[1] 173.7291

mean(samp_30)

[1] 168.404

mean(samp_200)

[1] 170.8165

mean(population)  # the true mean

[1] 170.0232
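
Any single sample can land close to or far from the true mean by chance, so one draw per sample size tells only part of the story. The following is a minimal sketch, assuming we repeat each draw 1000 times; the seed, the repetition count, and the helper name `sample_means` are illustrative choices, not part of the original example. It suggests how the spread of sample means shrinks as the sample grows:

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Same kind of simulated population as above: heights with mean 170, sd 10
population <- rnorm(100000, mean = 170, sd = 10)

# Hypothetical helper: draw `n_reps` samples of size `n`,
# returning the mean of each sample
sample_means <- function(n, n_reps = 1000) {
  replicate(n_reps, mean(sample(population, n)))
}

sizes <- c(5, 30, 200)

# Standard deviation of the sample means for each sample size
spread <- sapply(sizes, function(n) sd(sample_means(n)))
names(spread) <- paste0("n=", sizes)

spread
# Theory predicts roughly sd / sqrt(n): about 4.5, 1.8, and 0.7
```

Means of 5 people bounce around by several centimetres, while means of 200 people stay within about a centimetre of the truth.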


Studying the whole population is usually impossible:

It is too expensive, too time-consuming, or the population is effectively infinite.
So we measure a smaller sample and rely on it being representative of the population.

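This is where random sampling matters: when every member of the population has the same chance of being chosen, the sample is not systematically tilted toward one subgroup, so it is less likely to mislead us.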
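
To see what a misleading sample can look like, here is a small sketch that compares a random sample with a deliberately biased one. The biased scheme is a made-up assumption for illustration (taller people are more likely to be picked, as if we recruited at a basketball club), and the seed and the `weights` formula are arbitrary:

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated population of heights (cm), mean 170 and sd 10
population <- rnorm(100000, mean = 170, sd = 10)

# Random sample: every individual has the same chance of being picked
random_samp <- sample(population, 200)

# Hypothetical biased sample: the chance of being picked grows with height
weights <- exp((population - 170) / 10)
biased_samp <- sample(population, 200, prob = weights)

mean(population)   # true mean, close to 170
mean(random_samp)  # should land near the true mean
mean(biased_samp)  # systematically too high
```

Both samples have the same size; only the way they were chosen differs. Collecting more data the same biased way would not remove the gap.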

What does “shape” mean?

Normal: values cluster around a center and spread symmetrically. Center = mean ($\mu$), spread = standard deviation ($\sigma$).
Skewed: a long tail to one side. Right skew is common for raw relative abundances: many small values and a few large ones.
Multimodal: more than one peak; often signals subgroups or batch effects.
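
The three shapes can be simulated to get a feel for them. This is a rough sketch under assumed parameters: the log-normal distribution stands in for a right-skewed variable, and the two subgroup means (160 and 180) are invented for illustration.

```r
set.seed(7)  # arbitrary seed for reproducibility
n <- 10000

# Normal: symmetric around the mean (here mean 170, sd 10)
normal_vals <- rnorm(n, mean = 170, sd = 10)

# Right-skewed: log-normal values, many small and a few very large
skewed_vals <- rlnorm(n, meanlog = 0, sdlog = 1)

# Multimodal: a mix of two hypothetical subgroups with different centers
bimodal_vals <- c(rnorm(n / 2, mean = 160, sd = 5),
                  rnorm(n / 2, mean = 180, sd = 5))

# In a symmetric distribution the mean and median nearly agree;
# a long right tail pulls the mean above the median
c(mean = mean(normal_vals), median = median(normal_vals))
c(mean = mean(skewed_vals), median = median(skewed_vals))

# Histograms make the three shapes visible side by side
par(mfrow = c(1, 3))
hist(normal_vals,  breaks = 50, main = "Normal")
hist(skewed_vals,  breaks = 50, main = "Right-skewed")
hist(bimodal_vals, breaks = 50, main = "Multimodal")
```

Comparing the mean with the median is a quick numeric check for skew, while the histogram is usually the fastest way to spot a second peak.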

We’ll use these shapes constantly when judging whether a sample can represent a population. For more details, see @ref(data-distribution).