Diversity analysis - QIIME2 Blog

Introduction¶

Microbes have a profound impact on human life, influencing the environment, health, disease, and ecosystems. Understanding the diversity and composition of these ecological units is fundamental to answering broader ecological questions Cassol et al., 2025.

Depending on the scale of the ecological units, diversity is measured at three levels Andermann et al., 2022.

Alpha diversity: Describes the richness and evenness within a single community.
Beta diversity: Analyzes the differentiation between distinct communities.
Gamma diversity: Studies diversity at a larger landscape or geographic scale.

In the QIIME 2 tutorial, the focus is on alpha and beta diversity. The subsequent steps involve constructing a phylogenetic tree and generating core metrics matrices to facilitate these analyses.

The command in the tutorial

Phylogenetics

Generating matrices

qiime phylogeny align-to-tree-mafft-fasttree \
  --i-sequences rep-seqs.qza \
  --o-alignment aligned-rep-seqs.qza \
  --o-masked-alignment masked-aligned-rep-seqs.qza \
  --o-tree unrooted-tree.qza \
  --o-rooted-tree rooted-tree.qza

Command explanation:

Inputs:
- --i-sequences: The representative sequences (ASVs) to be aligned. Note: Output of DADA2.
Outputs:
- --o-alignment: The unmasked Multiple Sequence Alignment (MSA).
- --o-masked-alignment: The filtered (masked) MSA, used to build the tree.
- --o-tree: The resulting unrooted phylogenetic tree.
- --o-rooted-tree: The final rooted phylogenetic tree (midpoint rooted). Note: Used for Diversity analysis.

qiime diversity core-metrics-phylogenetic \
  --i-phylogeny rooted-tree.qza \
  --i-table table.qza \
  --p-sampling-depth 1103 \
  --m-metadata-file sample-metadata.tsv \
  --output-dir diversity-core-metrics-phylogenetic

Command explanation:

Inputs:
- --i-phylogeny: The rooted phylogenetic tree. Note: Output of the phylogeny step.
- --i-table: The feature table containing counts of features per sample. Note: Output of DADA2.
Parameters:
- --p-sampling-depth: The total frequency to which each sample should be rarefied.
- --m-metadata-file: The sample metadata file for PCoA visualization.
Outputs:
- --output-dir: The directory where all diversity results (alpha, beta, PCoA plots) are saved.

Workflow¶

The output from the DADA2 algorithm, “ASV sequences”, is utilized to compute a rooted phylogenetic tree at the Phylogenetics step. Meanwhile the Sequence table (BIOM count table of ASVs $\times$ samples, also generated by DADA2) is stardardized via Rarefaction. This rarefied table is then used to compute core matrices (non-phylogenetic) and integrated with the rooted tree to calculate phylogenetic matrices.

Phylogenetics¶

The ASV sequences generated by the DADA2 algorithm are aligned using MAFFT v7.526 Katoh & Standley, 2013. By default, the progressive FFT-NS-2 method is employed. In this workflow, a distance matrix is calculated based on shared 6-tuples between pairs of sequences, and a phylogenetic guide tree is constructed using the UPGMA method prior to performing the Multiple Sequence Alignment (MSA).

Following the MSA step, positions are masked to retain if satisfying the gap frequency is $\le 1.0$ and the frequency of the most prevalent character (not include gaps) is $\ge 0.4$ (default thresholds).

Using the MSA, FastTree v2.2.0 Price et al., 2010 generates an initial unrooted tree via a heuristic Neighbor-Joining (NJ) method. Nearest-Neighbor Interchanges (NNIs) are then employed to explore alternative topologies and branch lengths. These candidate trees serve as the basis for Maximum Likelihood (ML) computations, which statistically estimate the most likely topology and branch lengths to produce the final phylogenetic tree.

Finally, the workflow utilizes the Midpoint Rooting (MPR) method Farris, 1972, implemented in the scikit bio package, to restructure the tree and determine the root. The algorithmic implementation can be referenced in MPR.py.

Rarefaction¶

Before computing core matrices, the abundance table is randomly subsampled (rarefied) to ensure the total number of sequences across all ASVs in each sample is exactly 1103 (--p-sampling-depth). Samples are discarded if their total sequence count is below this threshold.

Note: The red dash line is the threshold 1103.

As the rarefaction threshold increases, the number of retained samples decreases, while the total sequence count increases due to the greater contribution from highly abundant samples.

Setting the threshold too low results in insufficient data for a robust analysis. Conversely, setting it too high leads to the elimination of too many samples, reducing statistical power. The goal is to select a threshold that retains as many samples as possible while capturing a “large enough” number of sequences to represent the community. This critical step ensures evenness and comparability across all samples in the study.

Computation of core matrices¶

The refied feature table is used to calculate core matrices:

Alpha matrices: Measure diversity within individual samples (observed feature, shannon entropy, pielou evenness)
Beta matrices: Compute dissimilarity matrices, containing distance between pairs of samples (jaccard, bray curtis)

Matrix name	Meaning	Calculation method	References
observed feature	Measures the richness (count) of unique ASVs present.	$S_{\text{obs}} = n$ (observed_features.py)
shannon entropy	Measures the complexity in each sample	$H = -\sum{p_i \times log_2(p_i) }$ , (shannon_entropy.py)	Shannon, 1948
pielou evenness	Measures the normalized complexity (evenness of abundance) in each sample	$J = -\sum p_i \times \log_n(p_i)$ , (pielou_evenness.py)	Pielou, 1966
jaccard	Computes dissimilarity of richness across pairs of samples	See the method, (jaccard.py)	Jaccard, 1908
bray curtis	Computes dissimilarity of relative abundance distribution across pairs of samples	See the method, (braycurtis.py)	Bray & Curtis, 1957

Note:

$p_i$ : The relative abundance (probability) of $\text{ASV}_i$ in a sample, where $p_i = \frac{\text{\#ASV}_i}{\text{\#total sequences}}$ . (#total_sequences is 1103 in our case)
$n$ : The total number of observed ASVs in the sample.

At the end, the beta diversity (distance) matrices, jaccard and bray curtis, are utilized to calculate principal coordinates through Principal Coordinates Analysis (PCoA) Gower, 1966. This dimensionality reduction technique projects the multi-dimensional distance data into a lower-dimensional space, typically visualized as 2D or 3D plots to reveal ecological patterns.

Computation of phylogenetic matrices¶

While alpha diversity matrices quantify the richness and abundance of a community, they fail to account for the evolutionary relationships among ASVs.

For instance, consider a scenario where Sample A contains more unique ASVs than Sample B. If all ASVs in Sample A share extremely high sequence similarity, it is difficult to argue that Sample A is truly more diverse than Sample B in a biological sense.

The problem are addressed by phylogenetic matrices, includes:

Alpha matrices: faithpd
Beta matrices: unweighted_unifrac and weighted_unifrac

Matrix name	Meaning	Calculation method	References
faithpd	Measures total phylogenetic distance among ASVs in each sample.	See the method, (faithpd.py)	Faith, 1992
unweighted_unifrac	Computes proportion of phylogenetical dissimilarity across pairs of samples	The ratio of unique branch lengths (exclusive to one sample) over the total branch lengths (union of both samples), (unweighted_unifrac.py)	Sfiligoi et al., 2022
weighted_unifrac	Computes the phylogenetic distance according to the ralative abundance across pairs of samples	The sum of branch lengths weighted by the absolute difference in abundance proportions between two samples, (weighted_unifrac.py)	Sfiligoi et al., 2022

Similar to the non-phylogenetic core matrices, the distance matrices generated via unweighted and weighted UniFrac are used as inputs for PCoA.

Summary¶

Identifying the root of the phylogenetic tree is a mechanical necessity for diversity analysis. Faith’s PD requires a directed hierarchy to accumulate branch lengths from successors up to the common ancestor (see the algorithmic code). Furthermore, a rooted tree provides the fixed reference points needed to calculate meaningful shared and unique evolutionary branch lengths between samples in UniFrac matrices.

During this deep dive, I also noticed that MAFFT suspiciously “picks out” specific conserved nucleotide positions in the ASV sequences after MSA (see my MAFFT issue).

Ultimately, while non-phylogenetic (core) matrices treat every ASV as an independent entity, phylogenetic matrices leverage the shared evolutionary history encoded within the sequences to provide a deeper biological context.

References¶

Cassol, I., Ibañez, M., & Bustamante, J. P. (2025). Key Features and Guidelines for the Application of Microbial Alpha Diversity Metrics. Scientific Reports, 15(1), 622. 10.1038/s41598-024-77864-y
Andermann, T., Antonelli, A., Barrett, R. L., & Silvestro, D. (2022). Estimating Alpha, Beta, and Gamma Diversity Through Deep Learning. Frontiers in Plant Science, 13, 839407. 10.3389/fpls.2022.839407
Katoh, K., & Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution, 30(4), 772–780. 10.1093/molbev/mst010
Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3), e9490. 10.1371/journal.pone.0009490
Farris, J. S. (1972). Estimating phylogenetic trees from distance matrices. The American Naturalist, 106(951), 645–668.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. Journal of Theoretical Biology, 13, 131–144.
Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat., 44, 223–270.
Bray, J. R., & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4), 326–349.
Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3–4), 325–338.
Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1–10.
Sfiligoi, I., Armstrong, G., Gonzalez, A., McDonald, D., & Knight, R. (2022). Optimizing UniFrac with OpenACC yields greater than one thousand times speed increase. Msystems, 7(3), e00028-22.