Population: the entire group we want to know about.
Examples: all the students in your university; all gut bacteria in humans.
Sample: the smaller group we actually measure.
Examples: 50 students chosen for a survey; 200 stool samples collected for sequencing.
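# Simulate the population: everyone's height (mean = 170, sd = 10)
set.seed(42)  # fix the random seed so the results are reproducible
population <- rnorm(100000, mean = 170, sd = 10)
# Take samples of different sizes
samp_5 <- sample(population, 5)
samp_30 <- sample(population, 30)
samp_200 <- sample(population, 200)
# Compare each sample mean with the population mean
mean(samp_5)
mean(samp_30)
mean(samp_200)
mean(population) # the true mean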
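A single draw at each size only hints at the pattern. The sketch below repeats the sampling many times to show how much sample means vary at each size. The helper `sample_means()`, the seed, and the choice of 1000 repetitions are illustrative assumptions, not part of the original example.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Same kind of simulated population as above: heights with mean 170, sd 10
population <- rnorm(100000, mean = 170, sd = 10)

# Hypothetical helper: draw `reps` samples of size n, return each sample's mean
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

# Standard deviation of the sample means at each sample size
spread <- sapply(c(n5 = 5, n30 = 30, n200 = 200),
                 function(n) sd(sample_means(n)))
round(spread, 2)
```

The spread should shrink as the sample size grows, roughly following sd / sqrt(n): larger samples give means that sit closer to the population mean.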
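Studying the whole population is usually impossible:
- It’s too expensive, too time-consuming, or the population is effectively infinite.
- So we measure a smaller sample and hope it’s representative of the population.
This is where random sampling matters: when every member of the population has the same chance of being selected, the sample isn’t systematically tilted toward one subgroup, so it’s less likely to mislead us.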
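To see why this matters, here is a minimal sketch comparing a random sample with a deliberately biased one. The biased scheme (sampling only from individuals taller than 175) is a made-up illustration of a convenience sample, and the seed and sample size of 200 are arbitrary choices.

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated population of heights (mean 170, sd 10)
population <- rnorm(100000, mean = 170, sd = 10)

# Random sample: every individual has the same chance of being picked
random_sample <- sample(population, 200)

# Hypothetical biased sample: only individuals taller than 175 can be picked
tall_only <- population[population > 175]
biased_sample <- sample(tall_only, 200)

# Compare the two sample means with the population mean
means <- c(population = mean(population),
           random     = mean(random_sample),
           biased     = mean(biased_sample))
round(means, 1)
```

Both samples have the same size, but only the random one should land close to the population mean. Collecting more data does not fix a biased sampling scheme.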
- Normal: values cluster around a center and spread symmetrically. Center = mean ($\mu$), spread = standard deviation ($\sigma$).
- Skewed: a long tail to one side (right-skew is common for raw relative abundances; many small values, a few large).
- Multimodal: more than one peak; often signals subgroups or batch effects.
We’ll use these shapes constantly when judging whether a sample can represent a population. For more details, see @ref(data-distribution).
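The three shapes can be simulated to see how they behave numerically. This is only a sketch: using a log-normal distribution for the skewed case and a mix of two normal distributions for the multimodal case are illustrative assumptions, and `near_mean()` is a hypothetical helper.

```r
set.seed(3)  # arbitrary seed for reproducibility
n <- 10000

normal_x  <- rnorm(n, mean = 0, sd = 1)          # symmetric around the center
skewed_x  <- rlnorm(n, meanlog = 0, sdlog = 1)   # right-skewed: long upper tail
bimodal_x <- c(rnorm(n / 2, mean = -3),          # two subgroups, two peaks
               rnorm(n / 2, mean = 3))

# Mean minus median: close to 0 for the normal shape, positive for right skew
skew_gap <- c(normal = mean(normal_x) - median(normal_x),
              skewed = mean(skewed_x) - median(skewed_x))
round(skew_gap, 2)

# Hypothetical helper: fraction of values lying within 1 unit of the mean
near_mean <- function(x) mean(abs(x - mean(x)) < 1)
near <- c(normal = near_mean(normal_x), bimodal = near_mean(bimodal_x))
round(near, 2)

# Histograms make the shapes visible (uncomment to plot)
# hist(normal_x); hist(skewed_x); hist(bimodal_x)
```

In the skewed case the long tail pulls the mean above the median. In the bimodal case the mean falls between the two peaks, where very few values actually lie, so a single center describes the data poorly.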