Diversity analysis

Introduction

Microbes have a profound impact on human life, influencing the environment, health, disease, and ecosystems. Understanding the diversity and composition of these ecological units is fundamental to answering broader ecological questions Cassol et al., 2025.

Depending on the scale of the ecological units, diversity is measured at three levels: alpha, beta, and gamma Andermann et al., 2022.

In the QIIME 2 tutorial, the focus is on alpha and beta diversity. The subsequent steps involve constructing a phylogenetic tree and generating core metrics matrices to facilitate these analyses.

Workflow

The output from the DADA2 algorithm, “ASV sequences”, is used to compute a rooted phylogenetic tree at the Phylogenetics step. Meanwhile, the sequence table (a BIOM count table of ASVs × samples, also generated by DADA2) is standardized via Rarefaction. This rarefied table is then used to compute the core (non-phylogenetic) matrices and is combined with the rooted tree to calculate the phylogenetic matrices.

Phylogenetics

The ASV sequences generated by the DADA2 algorithm are aligned using MAFFT v7.526 Katoh & Standley, 2013. By default, the progressive FFT-NS-2 method is employed. In this workflow, a distance matrix is calculated based on shared 6-tuples between pairs of sequences, and a phylogenetic guide tree is constructed using the UPGMA method prior to performing the Multiple Sequence Alignment (MSA).
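The shared 6-tuple idea can be illustrated with a small sketch. This is not MAFFT's code: the function names are hypothetical, 6-tuples are counted as a set (without multiplicity), and the normalization by the smaller 6-tuple set is an assumption made for illustration.

```python
def kmers(seq, k=6):
    """Return the set of distinct k-tuples found in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmer_distance(a, b, k=6):
    """Toy distance: 1 - shared / min(own), so identical sequences give 0."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 1.0  # sequences shorter than k share nothing by definition here
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))


# Made-up sequences, not real ASVs.
seq_a = "ACGTACGTGGA"
seq_b = "ACGTACGTTTT"   # shares the first three 6-tuples with seq_a
seq_c = "TTTTTTCCCCCC"  # shares none

print(shared_kmer_distance(seq_a, seq_a))  # 0.0
print(shared_kmer_distance(seq_a, seq_b))  # 0.5
print(shared_kmer_distance(seq_a, seq_c))  # 1.0
```

A matrix of such pairwise distances is the kind of input a clustering method like UPGMA needs to build a guide tree.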

Following the MSA step, the alignment is masked: a position is retained only if its gap frequency is $\le 1.0$ and the frequency of its most prevalent character (excluding gaps) is $\ge 0.4$ (default thresholds).
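The masking rule above can be sketched in a few lines. The function and parameter names below are illustrative, and dividing character counts by the total number of sequences (gapped ones included) is an assumption, not a statement about the QIIME 2 implementation.

```python
from collections import Counter


def mask_alignment(aligned, max_gap_frequency=1.0, min_conservation=0.4):
    """Keep only the columns that pass both thresholds."""
    n = len(aligned)
    keep = []
    for col in range(len(aligned[0])):
        column = [seq[col] for seq in aligned]
        gap_freq = column.count("-") / n
        residues = Counter(c for c in column if c != "-")
        # Frequency of the most common non-gap character in this column.
        top_freq = residues.most_common(1)[0][1] / n if residues else 0.0
        if gap_freq <= max_gap_frequency and top_freq >= min_conservation:
            keep.append(col)
    return ["".join(seq[c] for c in keep) for seq in aligned]


# Made-up alignment: the third column is mostly gaps and poorly conserved.
alignment = ["AC-GT", "ACAGT", "AG-CT", "AT-TA", "AA-AC"]
print(mask_alignment(alignment))
# ['ACGT', 'ACGT', 'AGCT', 'ATTA', 'AAAC']
```

With the default gap threshold of 1.0, no column is ever removed for gaps alone; in this sketch the third column is dropped only because its best character reaches a frequency of 0.2.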

Using the MSA, FastTree v2.2.0 Price et al., 2010 generates an initial unrooted tree via a heuristic Neighbor-Joining (NJ) method. Nearest-Neighbor Interchanges (NNIs) are then employed to explore alternative topologies and branch lengths. These candidate trees serve as the basis for Maximum Likelihood (ML) computations, which statistically estimate the most likely topology and branch lengths to produce the final phylogenetic tree.

Finally, the workflow uses the Midpoint Rooting (MPR) method Farris, 1972, implemented in the scikit-bio package, to restructure the tree and determine the root. The algorithmic implementation can be referenced in MPR.py.
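The idea behind midpoint rooting can be sketched without scikit-bio: find the two tips that are farthest apart, then place the root halfway along the path between them. The tree, node names, and helper functions below are hypothetical, and the sketch only locates the root position rather than restructuring the tree.

```python
def farthest(adj, start):
    """Return (node, distance, path) for the node farthest from start."""
    best = (start, 0.0, [start])
    stack = [(start, None, 0.0, [start])]
    while stack:
        node, parent, dist, path = stack.pop()
        if dist > best[1]:
            best = (node, dist, path)
        for nbr, length in adj[node].items():
            if nbr != parent:
                stack.append((nbr, node, dist + length, path + [nbr]))
    return best


def midpoint(adj, any_node):
    """Locate the midpoint of the longest tip-to-tip path.

    Returns (u, v, offset): the root sits `offset` away from u on edge u-v.
    """
    tip_a, _, _ = farthest(adj, any_node)
    _, diameter, path = farthest(adj, tip_a)
    half, walked = diameter / 2, 0.0
    for u, v in zip(path, path[1:]):
        if walked + adj[u][v] >= half:
            return (u, v, half - walked)
        walked += adj[u][v]
    return None  # only reached for a tree with a single node


# Unrooted toy tree: three tips joined at one internal node X.
adj = {
    "A": {"X": 1.0},
    "B": {"X": 3.0},
    "C": {"X": 2.0},
    "X": {"A": 1.0, "B": 3.0, "C": 2.0},
}
print(midpoint(adj, "A"))  # ('B', 'X', 2.5)
```

Here B and C are the most distant tips (3.0 + 2.0 = 5.0), so the root falls 2.5 from B, on the branch between B and X.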

Rarefaction

Before computing core matrices, the abundance table is randomly subsampled (rarefied) to ensure the total number of sequences across all ASVs in each sample is exactly 1103 (--p-sampling-depth). Samples are discarded if their total sequence count is below this threshold.
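A minimal sketch of rarefaction, assuming subsampling without replacement and a table stored as a plain dictionary. The function name, table layout, and the tiny depth of 4 are illustrative only; the workflow itself uses a depth of 1103.

```python
import random


def rarefy(table, depth, seed=0):
    """Subsample every sample to exactly `depth` reads.

    table: {sample: {asv: count}}. Samples with fewer than `depth`
    reads in total are discarded.
    """
    rng = random.Random(seed)
    rarefied = {}
    for sample, counts in table.items():
        if sum(counts.values()) < depth:
            continue  # too shallow: the sample is dropped
        # Expand counts into individual reads, then draw without replacement.
        reads = [asv for asv, n in counts.items() for _ in range(n)]
        picked = rng.sample(reads, depth)
        rarefied[sample] = {asv: picked.count(asv) for asv in counts if asv in picked}
    return rarefied


# Made-up counts: s2 has only 3 reads and is dropped at depth 4.
toy_table = {"s1": {"a": 5, "b": 3, "c": 2}, "s2": {"a": 1, "b": 2}}
toy_rarefied = rarefy(toy_table, depth=4)
print(toy_rarefied)
```

Because the draw is random, the exact counts depend on the seed, but every retained sample always sums to the chosen depth.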

Note: The red dashed line marks the threshold of 1103.

As the rarefaction threshold increases, the number of retained samples decreases, while the total sequence count increases due to the greater contribution from highly abundant samples.
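The trade-off can be tabulated directly from the per-sample totals. The totals and candidate depths below are invented for illustration, and the helper name is hypothetical; after rarefaction each retained sample contributes exactly `depth` reads.

```python
def depth_tradeoff(sample_totals, depths):
    """For each candidate depth, return (depth, retained samples, retained reads)."""
    rows = []
    for depth in depths:
        kept = sum(1 for total in sample_totals if total >= depth)
        rows.append((depth, kept, kept * depth))
    return rows


# Made-up per-sample read totals.
totals = [500, 1200, 3000, 8000]
for row in depth_tradeoff(totals, [400, 1103, 2500]):
    print(row)
# (400, 4, 1600)
# (1103, 3, 3309)
# (2500, 2, 5000)
```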

Setting the threshold too low results in insufficient data for a robust analysis. Conversely, setting it too high eliminates too many samples, reducing statistical power. The goal is to select a threshold that retains as many samples as possible while capturing a “large enough” number of sequences to represent the community. This critical step ensures an even sampling depth, and therefore comparability, across all samples in the study.

Computation of core matrices

The rarefied feature table is used to calculate core matrices:

| Matrix name | Meaning | Calculation method | References |
| --- | --- | --- | --- |
| observed feature | Measures the richness (count) of unique ASVs present. | $S_{\text{obs}} = n$ (observed_features.py) | |
| shannon entropy | Measures the complexity in each sample | $H = -\sum p_i \times \log_2(p_i)$ (shannon_entropy.py) | Shannon, 1948 |
| pielou evenness | Measures the normalized complexity (evenness of abundance) in each sample | $J = -\sum p_i \times \log_n(p_i)$ (pielou_evenness.py) | Pielou, 1966 |
| jaccard | Computes dissimilarity of richness across pairs of samples | See the method (jaccard.py) | Jaccard, 1908 |
| bray curtis | Computes dissimilarity of relative abundance distribution across pairs of samples | See the method (braycurtis.py) | Bray & Curtis, 1957 |
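The three alpha diversity formulas in the table can be checked on a toy count vector. This sketch takes $n$ to be the number of ASVs with a non-zero count and $p_i$ to be the relative abundance of ASV $i$; the function names are illustrative, and returning NaN for a single-feature sample is a choice made for this sketch.

```python
import math


def observed_features(counts):
    """S_obs: number of ASVs with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon_entropy(counts):
    """H = -sum(p_i * log2(p_i)) over ASVs that are present."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def pielou_evenness(counts):
    """J = H / log2(n), which equals -sum(p_i * log_n(p_i))."""
    n = observed_features(counts)
    if n < 2:
        return float("nan")  # log base 1 is undefined
    return shannon_entropy(counts) / math.log2(n)


even = [4, 4, 4, 4, 0]  # four equally abundant ASVs, one absent
skewed = [8, 4, 2, 2]   # same richness, uneven abundances

print(observed_features(even), shannon_entropy(even), pielou_evenness(even))
# 4 2.0 1.0
print(observed_features(skewed), shannon_entropy(skewed), pielou_evenness(skewed))
# 4 1.75 0.875
```

Both samples have the same richness, but the skewed one has lower entropy and evenness, which is the distinction the last two metrics are meant to capture.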

Note:

At the end, the beta diversity (distance) matrices, jaccard and bray curtis, are utilized to calculate principal coordinates through Principal Coordinates Analysis (PCoA) Gower, 1966. This dimensionality reduction technique projects the multi-dimensional distance data into a lower-dimensional space, typically visualized as 2D or 3D plots to reveal ecological patterns.
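For reference, the two distances that feed into PCoA can be written out for a single pair of samples. This is a sketch using the standard textbook definitions, with hypothetical function names and made-up counts; PCoA itself operates on the full matrix of such pairwise values.

```python
def jaccard(u, v):
    """Presence/absence dissimilarity: 1 - |shared ASVs| / |ASVs in either|."""
    a = {i for i, c in enumerate(u) if c > 0}
    b = {i for i, c in enumerate(v) if c > 0}
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def bray_curtis(u, v):
    """Abundance dissimilarity: sum|u_i - v_i| / sum(u_i + v_i)."""
    denom = sum(u) + sum(v)
    return sum(abs(x - y) for x, y in zip(u, v)) / denom if denom else 0.0


# Counts of the same four ASVs in two rarefied samples (made up).
sample_1 = [3, 0, 2, 5]
sample_2 = [1, 4, 0, 5]

print(jaccard(sample_1, sample_2))      # 0.5
print(bray_curtis(sample_1, sample_2))  # 0.4
```

Jaccard ignores how abundant each ASV is, while Bray-Curtis is driven by the differences in counts, so the two can rank sample pairs differently.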

Computation of phylogenetic matrices

While alpha diversity matrices quantify the richness and abundance of a community, they fail to account for the evolutionary relationships among ASVs.

For instance, consider a scenario where Sample A contains more unique ASVs than Sample B. If all ASVs in Sample A share extremely high sequence similarity, it is difficult to argue that Sample A is truly more diverse than Sample B in a biological sense.

This problem is addressed by the phylogenetic matrices, which include:

| Matrix name | Meaning | Calculation method | References |
| --- | --- | --- | --- |
| faithpd | Measures total phylogenetic distance among ASVs in each sample. | See the method (faithpd.py) | Faith, 1992 |
| unweighted_unifrac | Computes the proportion of phylogenetic dissimilarity across pairs of samples | The ratio of unique branch lengths (exclusive to one sample) over the total branch lengths (union of both samples) (unweighted_unifrac.py) | Sfiligoi et al., 2022 |
| weighted_unifrac | Computes the phylogenetic distance according to the relative abundance across pairs of samples | The sum of branch lengths weighted by the absolute difference in abundance proportions between two samples (weighted_unifrac.py) | Sfiligoi et al., 2022 |
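The three phylogenetic calculations can be traced by hand on a toy rooted tree. Everything below is a sketch: the tree, the sample counts, and the function names are hypothetical, the weighted variant is left unnormalized, and this is not the optimized implementation cited in the table.

```python
# Hypothetical rooted tree: child -> (parent, branch length).
TREE = {
    "asv1": ("n1", 1.0),
    "asv2": ("n1", 2.0),
    "asv3": ("root", 4.0),
    "n1": ("root", 3.0),
}


def branches(present):
    """Branches (named by their child node) on the paths from tips to the root."""
    found = set()
    for tip in present:
        node = tip
        while node in TREE:
            found.add(node)
            node = TREE[node][0]
    return found


def faith_pd(present):
    """Total branch length spanned by the ASVs present in one sample."""
    return sum(TREE[b][1] for b in branches(present))


def unweighted_unifrac(a, b):
    """Branch length unique to one sample over branch length in either sample."""
    ba, bb = branches(a), branches(b)
    return sum(TREE[x][1] for x in ba ^ bb) / sum(TREE[x][1] for x in ba | bb)


def proportions_below(counts):
    """Fraction of a sample's reads that descend from each branch."""
    total, prop = sum(counts.values()), {}
    for tip, c in counts.items():
        node = tip
        while node in TREE:
            prop[node] = prop.get(node, 0.0) + c / total
            node = TREE[node][0]
    return prop


def weighted_unifrac(a, b):
    """Sum of branch length times |difference in read proportion| per branch."""
    pa, pb = proportions_below(a), proportions_below(b)
    return sum(TREE[x][1] * abs(pa.get(x, 0.0) - pb.get(x, 0.0)) for x in TREE)


# Made-up samples listing only the ASVs that are present.
sample_a = {"asv1": 3, "asv2": 1}
sample_b = {"asv2": 2, "asv3": 2}

print(faith_pd(sample_a), faith_pd(sample_b))  # 6.0 9.0
print(unweighted_unifrac(sample_a, sample_b))  # 0.5
print(weighted_unifrac(sample_a, sample_b))    # 4.75
```

Both samples contain two ASVs, yet sample_b spans more of the tree (9.0 versus 6.0) because asv3 sits on a long, separate branch.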

Similar to the non-phylogenetic core matrices, the distance matrices generated via unweighted and weighted UniFrac are used as inputs for PCoA.

Summary

Identifying the root of the phylogenetic tree is a mechanical necessity for diversity analysis. Faith’s PD requires a directed hierarchy to accumulate branch lengths from successors up to the common ancestor (see the algorithmic code). Furthermore, a rooted tree provides the fixed reference points needed to calculate meaningful shared and unique evolutionary branch lengths between samples in UniFrac matrices.

During this deep dive, I also noticed that MAFFT suspiciously “picks out” specific conserved nucleotide positions in the ASV sequences after MSA (see my MAFFT issue).

Ultimately, while non-phylogenetic (core) matrices treat every ASV as an independent entity, phylogenetic matrices leverage the shared evolutionary history encoded within the sequences to provide a deeper biological context.

References
  1. Cassol, I., Ibañez, M., & Bustamante, J. P. (2025). Key Features and Guidelines for the Application of Microbial Alpha Diversity Metrics. Scientific Reports, 15(1), 622. 10.1038/s41598-024-77864-y
  2. Andermann, T., Antonelli, A., Barrett, R. L., & Silvestro, D. (2022). Estimating Alpha, Beta, and Gamma Diversity Through Deep Learning. Frontiers in Plant Science, 13, 839407. 10.3389/fpls.2022.839407
  3. Katoh, K., & Standley, D. M. (2013). MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution, 30(4), 772–780. 10.1093/molbev/mst010
  4. Price, M. N., Dehal, P. S., & Arkin, A. P. (2010). FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLoS ONE, 5(3), e9490. 10.1371/journal.pone.0009490
  5. Farris, J. S. (1972). Estimating phylogenetic trees from distance matrices. The American Naturalist, 106(951), 645–668.
  6. Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
  7. Pielou, E. C. (1966). The measurement of diversity in different types of biological collections. Journal of Theoretical Biology, 13, 131–144.
  8. Jaccard, P. (1908). Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat., 44, 223–270.
  9. Bray, J. R., & Curtis, J. T. (1957). An ordination of the upland forest communities of southern Wisconsin. Ecological Monographs, 27(4), 326–349.
  10. Gower, J. C. (1966). Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika, 53(3–4), 325–338.
  11. Faith, D. P. (1992). Conservation evaluation and phylogenetic diversity. Biological Conservation, 61(1), 1–10.
  12. Sfiligoi, I., Armstrong, G., Gonzalez, A., McDonald, D., & Knight, R. (2022). Optimizing UniFrac with OpenACC yields greater than one thousand times speed increase. mSystems, 7(3), e00028-22.