3  Populations and Samples

Population: the entire group we want to know about.

All the students in your university; all gut bacteria in humans.

Sample: the smaller group we actually measure.

50 students chosen for a survey; 200 stool samples collected for sequencing.

# Imagine the population: everyone's height (mean = 170, sd = 10)
population <- rnorm(100000, mean = 170, sd = 10)

# Take samples of different sizes (drawn at random, without replacement)
samp_5   <- sample(population, 5)
samp_30  <- sample(population, 30)
samp_200 <- sample(population, 200)

# No seed is set, so your numbers will differ slightly from these
mean(samp_5)
#> [1] 162.418
mean(samp_30)
#> [1] 174.1409
mean(samp_200)
#> [1] 169.5315
mean(population)  # the true mean of this simulated population
#> [1] 169.9871
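
A single draw at each size can be lucky or unlucky, so it helps to repeat the sampling many times and see how much the sample mean moves around. The sketch below is one way to do that, not the only one. The seed, the helper `sample_means()`, and the choice of 1000 repetitions are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only so this sketch is reproducible

# Same simulated population as above (heights: mean = 170, sd = 10)
population <- rnorm(100000, mean = 170, sd = 10)

# Hypothetical helper: draw `reps` samples of size n, return the mean of each
sample_means <- function(n, reps = 1000) {
  replicate(reps, mean(sample(population, n)))
}

sizes <- c(5, 30, 200)

# Standard deviation of the sample means at each sample size:
# a smaller value means the sample mean is a more stable estimate
spread_of_means <- sapply(sizes, function(n) sd(sample_means(n)))
names(spread_of_means) <- paste0("n=", sizes)
print(round(spread_of_means, 2))
```

The spread should shrink as the sample size grows, roughly following sd / sqrt(n), which is why the mean of 200 values tends to sit much closer to the true mean than the mean of 5.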


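Studying the whole population is usually impossible, so we rely on a sample to estimate what the population looks like.

This is where random sampling matters: when every member of the population has the same chance of being chosen, our sample is much less likely to mislead us.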
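
To see how a non-random sample can mislead, here is a minimal sketch comparing a random sample with a biased one of the same size. The biased scenario (recruiting only from the tallest 20% of people) and the seed are hypothetical, chosen just to make the effect obvious.

```r
set.seed(2)  # assumed seed, only so this sketch is reproducible

# Simulated population of heights, as in the example above
population <- rnorm(100000, mean = 170, sd = 10)

# Random sample: every individual has the same chance of being picked
random_samp <- sample(population, 200)

# Biased sample (hypothetical): we only recruit from the tallest 20%
tall_only   <- population[population > quantile(population, 0.8)]
biased_samp <- sample(tall_only, 200)

means <- c(population = mean(population),
           random     = mean(random_samp),
           biased     = mean(biased_samp))
print(round(means, 1))
```

Both samples contain 200 people, but only the random one lands near the population mean. Collecting more data does not fix a biased sampling scheme.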

What does the “shape” of a distribution mean?
  • Normal: values cluster around a center and spread symmetrically. Center = mean (μ), spread = standard deviation (σ).
  • Skewed: a long tail to one side (right-skew is common for raw relative abundances; many small values, a few large).
  • Multimodal: more than one peak; often signals subgroups or batch effects.
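
The three shapes can be simulated and compared side by side. This is a sketch under stated assumptions: the log-normal draw is only a stand-in for right-skewed data such as raw relative abundances, the two-group mixture is a made-up example of subgroups, and all parameter values and the seed are illustrative.

```r
set.seed(3)  # assumed seed, only so this sketch is reproducible
n <- 10000

normal_x  <- rnorm(n, mean = 170, sd = 10)         # symmetric, one peak
skewed_x  <- rlnorm(n, meanlog = 0, sdlog = 1)     # right-skewed stand-in
bimodal_x <- c(rnorm(n / 2, mean = 160, sd = 5),   # two hypothetical subgroups
               rnorm(n / 2, mean = 180, sd = 5))

# In a right-skewed distribution the long tail pulls the mean above the median
shape_summary <- rbind(
  normal  = c(mean = mean(normal_x),  median = median(normal_x)),
  skewed  = c(mean = mean(skewed_x),  median = median(skewed_x)),
  bimodal = c(mean = mean(bimodal_x), median = median(bimodal_x))
)
print(round(shape_summary, 2))

# Histograms make the shapes visible
op <- par(mfrow = c(1, 3))
hist(normal_x,  breaks = 50, main = "Normal")
hist(skewed_x,  breaks = 50, main = "Right-skewed")
hist(bimodal_x, breaks = 50, main = "Multimodal")
par(op)
```

Note that the mean and median of the two-peak data can look unremarkable even though almost no values sit near them, which is one reason to plot the data rather than rely on summary numbers alone.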

We’ll use these shapes constantly when judging whether a sample can represent a population. For more details, see @ref(data-distribution).