6 Data Distribution
A distribution describes how the values of a variable are spread out; in other words, it is the shape of the data.
It answers questions like:
- What values are most common?
- Are most values small or large?
- Are there many zeros or extreme outliers?
- Is the distribution symmetric, skewed, or multimodal?
Basically, it is the fingerprint of variability in a dataset.
6.1 Why is knowing the type of distribution so important?
- Determines the choice of statistical tests.
- Influences the interpretation of results.
- Helps understand the underlying biological processes.
- Affects the assumptions of normalization and models used in microbiome analysis.
When we say assumptions, what happens if the data does not follow the assumed distribution?
- p values of the statistical tests may not be valid.
- Confidence intervals may be misleading.
- Model predictions may be inaccurate.
- Normalization methods may not work as expected.
- Biological interpretations may be flawed.
If we plotted all the values on a graph, the distribution is the curve we would see.
For example, suppose you measured the relative abundance of a bacterium across 100 samples.
- If it is always around 0.2 —> tight, narrow distribution
- If it jumps from 0 to 0.8 wildly —> wide, scattered distribution
- If it is mostly 0, but a few have 0.9 —> zero-inflated, skewed distribution
- If it has multiple peaks —> multimodal distribution
A probability distribution is a mathematical function that describes the likelihood of different outcomes in a random variable.
It provides a complete description of the variability in the data.
For discrete variables, it gives the probability mass function (PMF) i.e., the probability of each possible value.
Probability Mass Function (PMF)
For a discrete random variable $ X $:
\[ P(X = x) = p(x), \quad \text{where} \quad \sum_{x} p(x) = 1 \]
For continuous variables, it gives the probability density function (PDF) that describes the likelihood of values within a range. This is often visualized as a curve. Total area under the curve = 1.
\[ \int_{-\infty}^{\infty} f(x)\, dx = 1 \]
Probability Density Function (PDF)
For a continuous random variable $ X $:
\[ f(x) \geq 0, \quad \text{and} \quad \int_{-\infty}^{\infty} f(x) \, dx = 1 \]
Probability that $ X $ lies in an interval \([a, b]\) is:
\[ P(a \leq X \leq b) = \int_{a}^{b} f(x) \, dx \]
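This interval probability can be computed either by integrating the PDF or by differencing the CDF. A minimal sketch in Python, assuming scipy is available, with the standard normal and the interval \([-1, 1]\) as illustrative choices:

```python
# Minimal sketch: interval probability for a continuous variable,
# using the standard normal N(0, 1) and [a, b] = [-1, 1] as illustrative choices.
from scipy.integrate import quad
from scipy.stats import norm

a, b = -1.0, 1.0
area, _ = quad(norm.pdf, a, b)          # integrate the PDF over [a, b]
prob = norm.cdf(b) - norm.cdf(a)        # same probability via the CDF
print(round(area, 4), round(prob, 4))   # both ≈ 0.6827
```

Both routes give the same number, because the CDF is the antiderivative of the PDF.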
6.1.1 Common types of distributions
- Normal Distribution (Gaussian)
- Bell-shaped curve, symmetric around the mean.
- Characterized by the mean (μ), which sets the center, and the standard deviation (σ), which sets the spread.
- Many statistical tests assume normality.
\[ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}} \]
As the value of $ \sigma $ increases, the curve becomes wider and flatter. The value of $ \mu $ shifts the curve left or right.
Properties:
- Symmetric around the mean.
- Mean, median, and mode are equal (if not, the distribution is not normal).
- Approximately 68% of the data fall within 1σ of the mean, 95% within 2σ, and 99.7% within 3σ (the Empirical Rule): a quick approximation of how much data falls within a range of the mean, assuming the data are normally distributed.
- Basis for confidence intervals and outlier detection.
- Defined by two parameters: mean (μ) and standard deviation (σ).
- Unimodal: has a single peak (a bimodal shape is not normal and might indicate subgroups or batch effects).
- Asymptotic: the tails approach the horizontal axis but never touch it.
- Skewness = 0 (no skew). Skewness measures the asymmetry of a distribution:
  - Left-skewed → longer tail on the left
  - Right-skewed → longer tail on the right (relative abundances, e.g., Prevotella, are often right-skewed: many zeros, few large values)
- Kurtosis = 3 (mesokurtic, the baseline for measuring “peakedness”), i.e., excess kurtosis = 0:
  - Kurtosis > 3 → leptokurtic (sharper peak, heavier tails)
  - Kurtosis < 3 → platykurtic (flatter peak, lighter tails)
  - High kurtosis → extreme values are more common than expected.
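The Empirical Rule can be checked numerically; a small sketch using scipy's normal CDF (the standard normal is an illustrative choice, since the fractions are the same for any μ and σ):

```python
# Sketch checking the 68-95-99.7 (Empirical) Rule with the normal CDF.
from scipy.stats import norm

# Fraction of the distribution within k standard deviations of the mean
fracs = [norm.cdf(k) - norm.cdf(-k) for k in (1, 2, 3)]
print([round(f, 4) for f in fracs])  # [0.6827, 0.9545, 0.9973]
```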
- Poisson Distribution
This models the count of events in a fixed window of time or space, provided that:
- Events occur independently.
- Events happen at a constant average rate $ \lambda $.
- The chance of two events happening at exactly the same time is negligible.

It answers: how many times will a particular event occur in a fixed time window?
Poisson is a discrete distribution, suited to values like the counts/OTUs in a microbiome sample.
So the PMF will be,
\[ P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \dots \]
- $ k $: observed count (0, 1, 2, …)
- $ \lambda $ (lambda): expected count (mean rate)
- $ e $: 2.718… (Euler’s number)
Skewness decreases as $ \lambda $ increases.
Note:
- Counts are non-negative.
- For large $ \lambda $, the distribution looks approximately normal.
- Mean = Variance = $ \lambda $.
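These properties can be checked with a short sketch (λ = 4 is an arbitrary illustrative choice):

```python
# Sketch: Poisson mean and variance both equal lambda, and the PMF sums to 1.
from scipy.stats import poisson

lam = 4.0  # illustrative rate
mean, var = poisson.stats(lam, moments="mv")
print(float(mean), float(var))  # 4.0 4.0

# PMF values sum to (essentially) 1 over a wide enough range of counts
total = sum(poisson.pmf(k, lam) for k in range(50))
print(round(total, 6))  # 1.0
```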
Although microbiome count data seems like a good fit for the Poisson distribution, that assumption faces challenges: Poisson requires mean = variance, whereas microbiome data is overdispersed (variance ≫ mean). Poisson also ignores excess zeros and compositionality. So we use Poisson only for theoretical derivations and null simulations.
- Gamma Distribution
This is a continuous distribution used to model positive-only quantities that can be skewed.
Think of it as:
“The distribution of waiting times until a certain number of events occur in a Poisson process.”
\[ f(x; k, \theta) = \frac{x^{k-1} e^{-x/\theta}}{\Gamma(k) \, \theta^k}, \quad x > 0 \]
Where:
- $ k $ = shape parameter (sometimes written $ \alpha $)
- $ \theta $ = scale parameter (sometimes the rate $ \beta = 1/\theta $ is used instead)
- $ \Gamma(k) $ = Gamma function (generalized factorial: $ \Gamma(n) = (n-1)! $ for integers)
Mean and Variance:
\[
E[X] = k\theta
\]
\[ \text{Var}(X) = k\theta^2 \]
Special cases:
- If $ k = 1 $, Gamma \(\rightarrow\) Exponential
- Larger $ k $ \(\rightarrow\) more symmetric, less skewed
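The moment formulas and the $ k = 1 $ special case can be verified numerically; a small sketch with illustrative parameter values:

```python
# Sketch: Gamma moments E[X] = k*theta, Var(X) = k*theta^2,
# and the k = 1 special case matching the Exponential (values are illustrative).
from scipy.stats import expon, gamma

k, theta = 3.0, 2.0
m, v = gamma.stats(a=k, scale=theta, moments="mv")
print(float(m), float(v))  # 6.0 12.0

# k = 1: Gamma(1, theta) has the same density as Exponential(scale=theta)
x = 1.7
print(gamma.pdf(x, a=1.0, scale=theta), expon.pdf(x, scale=theta))
```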
Gamma in the Negative Binomial
Here’s the key connection:
- Poisson: $ X \sim \text{Poisson}(\lambda) $
- $ \lambda $ itself is uncertain and follows $ \text{Gamma}(k, \theta) $
- Integrating out $ \lambda $ \(\rightarrow\) $ X $ follows a Negative Binomial
This is how we allow $ \lambda $ to vary from sample to sample, introducing overdispersion.
6.1.1.1 Derivation: Poisson + Gamma ⇒ Negative Binomial
We model counts \(X\) via a Poisson likelihood with a random rate \(\lambda\) that follows a Gamma prior.
Conditional model (Poisson):
\[ P(X=k \mid \lambda) \;=\; \frac{e^{-\lambda}\,\lambda^{k}}{k!}, \quad k = 0,1,2,\dots \]
Prior on the rate (Gamma, shape–scale):
\[ \lambda \sim \text{Gamma}(r,\theta) \quad\Longrightarrow\quad f(\lambda) \;=\; \frac{\lambda^{r-1} e^{-\lambda/\theta}}{\Gamma(r)\,\theta^{r}}, \qquad \lambda>0,\; r>0,\; \theta>0. \]
Marginal (unconditional) PMF via mixing:
\[ P(X=k) = \int_{0}^{\infty} P(X=k\mid \lambda)\, f(\lambda)\, d\lambda = \int_{0}^{\infty} \frac{e^{-\lambda}\lambda^{k}}{k!}\cdot \frac{\lambda^{r-1} e^{-\lambda/\theta}}{\Gamma(r)\,\theta^{r}} \, d\lambda. \]
Factor out constants and combine exponents:
\[ P(X=k) = \frac{1}{k!\,\Gamma(r)\,\theta^{r}} \int_{0}^{\infty} \lambda^{k+r-1} \exp\!\left\{-\lambda\Big(1+\tfrac{1}{\theta}\Big)\right\} \, d\lambda. \]
Use the Gamma–integral identity:
For \(a>0,\; b>0\),
\[ \int_{0}^{\infty} \lambda^{a-1} e^{-b\lambda}\, d\lambda = \frac{\Gamma(a)}{b^{a}}. \]
Set \(a = k+r\) and \(b = 1 + \tfrac{1}{\theta}\):
\[ \int_{0}^{\infty} \lambda^{k+r-1} \exp\!\left\{-\lambda\Big(1+\tfrac{1}{\theta}\Big)\right\} \, d\lambda = \frac{\Gamma(k+r)}{\left(1+\tfrac{1}{\theta}\right)^{k+r}}. \]
Therefore,
\[ P(X=k) = \frac{\Gamma(k+r)}{k!\,\Gamma(r)\,\theta^{r}} \cdot \frac{1}{\left(1+\tfrac{1}{\theta}\right)^{k+r}}. \]
Algebra to NB form:
Note \(\displaystyle \frac{1}{\left(1+\tfrac{1}{\theta}\right)^{k+r}} = \left(\frac{\theta}{1+\theta}\right)^{k+r}\).
Hence,
\[ P(X=k) = \frac{\Gamma(k+r)}{k!\,\Gamma(r)}\, \frac{\theta^{k}}{(1+\theta)^{k+r}}. \]
Define \(p = \frac{1}{1+\theta}\) so that \(1-p = \frac{\theta}{1+\theta}\).
Then
\[ P(X=k) = \frac{\Gamma(k+r)}{k!\,\Gamma(r)}\, (1-p)^{k}\, p^{r}, \]
which is the Negative Binomial PMF with parameters \(r>0\) and \(p\in(0,1)\):
\[ X \sim \text{NB}(r, p), \qquad P(X=k)= \binom{k+r-1}{k}(1-p)^k p^r. \]
Mean–variance mapping:
From the Poisson–Gamma mixture,
\[ \mathbb{E}[X] = \mathbb{E}[\lambda] = r\theta, \qquad \mathrm{Var}(X) = \mathbb{E}[\lambda] + \mathrm{Var}(\lambda) = r\theta + r\theta^{2} = \mu + \frac{\mu^{2}}{r}, \]
where \(\mu = r\theta\). Equivalently, in the \((r,p)\) parameterization,
\[ \mathbb{E}[X] = r\frac{1-p}{p}, \qquad \mathrm{Var}(X) = r\frac{1-p}{p^{2}} = \mathbb{E}[X] + \frac{\mathbb{E}[X]^2}{r}. \]
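The derivation can be sanity-checked by simulation: draw λ from a Gamma, then X from a Poisson with that rate, and compare the sample moments with μ = rθ and μ + μ²/r. The parameter values and seed below are illustrative:

```python
# Sketch: simulate the Poisson-Gamma mixture and compare with the exact
# Negative Binomial moments (r, theta, and the seed are illustrative choices).
import numpy as np
from scipy.stats import nbinom

rng = np.random.default_rng(42)
r, theta = 2.0, 3.0
lam = rng.gamma(shape=r, scale=theta, size=200_000)  # lambda ~ Gamma(r, theta)
x = rng.poisson(lam)                                 # X | lambda ~ Poisson(lambda)

mu = r * theta
print(round(x.mean(), 1))  # close to mu = 6.0
print(round(x.var(), 1))   # close to mu + mu^2/r = 24.0

# Exact marginal: NB(r, p) with p = 1/(1 + theta)
p = 1.0 / (1.0 + theta)
m, v = nbinom.stats(r, p, moments="mv")
print(float(m), float(v))  # 6.0 24.0
```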
- Negative Binomial Distribution
This is similar to the Poisson but with the flexibility to handle extra variance. Think of it as Poisson on steroids.
It has a dispersion parameter that allows
\[ \mathrm{Var}(X) \gg \mathrm{E}[X] \]
We toss a biased coin until we get r successes, and count the number of failures.
Or (equivalently), we are counting events (like sequencing reads) where the rate λ itself is not fixed but varies between samples (Gamma-distributed).
\[ P(X=k) = \binom{k+r-1}{k} (1-p)^k \, p^r, \quad k = 0, 1, 2, \dots \]
Where:
- $ r > 0 $ = dispersion (shape or “size” parameter)
- $ p \in (0,1) $ = probability of success
Mean and Variance (in terms of \(r, p\)):
\[ E[X] = r \frac{1-p}{p}, \qquad \text{Var}(X) = r \frac{1-p}{p^2} \]
Alternative mean–dispersion form (used in DESeq2/edgeR):
\[ P(X=k) = \frac{\Gamma(k+r)}{k!\,\Gamma(r)} \left(\frac{r}{r+\mu}\right)^r \left(\frac{\mu}{r+\mu}\right)^k \]
with
\[ E[X] = \mu, \qquad \text{Var}(X) = \mu + \frac{\mu^2}{r} \]
Limiting cases:
- If $ r \to \infty $, then $ \mathrm{Var}(X) \to \mu $ -> the Negative Binomial collapses to the Poisson.
- If $ r $ is small, the variance is inflated -> high overdispersion.
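The limiting behavior is easy to see in the mean-dispersion form: hold μ fixed and increase r (values below are illustrative):

```python
# Sketch: with mu fixed, Var(X) = mu + mu^2/r shrinks toward the Poisson
# variance (mu) as the dispersion parameter r grows (values are illustrative).
from scipy.stats import nbinom

mu = 10.0
for r in (1.0, 10.0, 1000.0):
    p = r / (r + mu)  # map (mu, r) to scipy's (n, p) parameterization
    m, v = nbinom.stats(r, p, moments="mv")
    print(r, float(m), round(float(v), 2))
# variances: 110.0 (r=1), 20.0 (r=10), 10.1 (r=1000) -> approaching mu
```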
- Binomial Distribution
- Beta Distribution
The Beta distribution is a continuous probability distribution defined on the interval [0,1].
It is flexible: depending on its parameters (two shape parameters), it can model:
- Bell-shaped distributions
- U-shaped distributions
- Skewed distributions
Because it is bounded, it is ideal for modeling probabilities and proportions.
Think of it as answering: “What is the distribution of a probability?”
A probability is usually treated as a single value, but there is randomness and uncertainty in estimating it, and we never know the true value for sure.
That uncertainty is exactly what the Beta distribution describes.
\[ f(x; \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} x^{\alpha-1} (1-x)^{\beta-1}, \quad 0 < x < 1 \]
Where:
- $ \alpha > 0 $ = shape parameter 1
- $ \beta > 0 $ = shape parameter 2
- $ \Gamma $ = Gamma function (generalized factorial)
Mean and Variance:
\[ E[X] = \frac{\alpha}{\alpha + \beta} \]
\[ \text{Var}(X) = \frac{\alpha \beta}{(\alpha+\beta)^2 (\alpha+\beta+1)} \]
Shape behavior:
- If $ \alpha = \beta = 1 $ → Uniform(0,1)
- If $ \alpha > \beta $ → mass concentrated near 1 (left-skewed)
- If $ \alpha < \beta $ → mass concentrated near 0 (right-skewed)
- If both are large → peaked near the mean (less variance)
- If both are small (\(<1\)) → U-shaped
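The mean and variance formulas, and the Uniform special case, can be checked with a brief sketch (α = 2, β = 5 are illustrative shape parameters):

```python
# Sketch: Beta moments and the alpha = beta = 1 Uniform special case
# (alpha = 2, beta = 5 are illustrative shape parameters).
from scipy.stats import beta

a, b = 2.0, 5.0
m, v = beta.stats(a, b, moments="mv")
print(round(float(m), 4))  # alpha/(alpha+beta) = 2/7 ≈ 0.2857
print(round(float(v), 4))  # 10/(49*8) ≈ 0.0255

# alpha = beta = 1 reduces to the flat Uniform(0, 1) density
print(beta.pdf(0.3, 1, 1), beta.pdf(0.9, 1, 1))  # both 1.0
```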