Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

DADA2

Introduction

In 16S rRNA analysis, the first step in characterizing the microbiome of a sample is to cluster similar sequencing reads into representative units that can be used for taxonomic assignment. However, sequencing errors can introduce false positives, leading to incorrect biological interpretations. Traditionally, this clustering is performed by grouping reads into Operational Taxonomic Units (OTUs) based on a fixed sequence dissimilarity threshold. Nevertheless, this approach often lacks sufficient resolution to capture fine-scale taxonomic variation and can result in misinterpretation of microbial diversity Callahan et al., 2016.

DADA2 address this issue by attempting to correct sequencing errors at the single-nucleotide level without constructing OTUs. Instead, they generate Amplicon Sequence Variants (ASVs) Nearing et al., 2018.

Workflow

dada_workflow

The workflow begins with demultiplexed FASTQ samples undergoing Trim & Filtering to remove low-quality data, followed by Learning Error Rates to estimate error rate between unique sequence and their partition center sequence (the most abundant in a variant group).

The process then enters the Denoise samples stage where a specialized algorithm performs constructing priors from unique sequences, and calculating pp-values to distinguish and remove sequencing noise from biological variants.

Once the samples are denoised, the pipeline handles Chimeras Removal to eliminate chimeras (unique sequence belongs two samples, originating from PCR) before reporting the final results as artifacts for downstream analysis.

Trim & Filtering

Demutiplexed files are filtered simultaneously according to following quality filtering criteria:

CriterionValueDescription
truncLen120Truncates reads at 120 bases. Reads shorter than this are discarded.
truncQ2Truncates reads at the first instance of a quality score \le 2.
maxEE2Discards reads with > 2 expected errors after truncation.
maxN0Discards sequences containing any ambiguous bases (Ns).
minLen20Minimum read length required after trimming/truncation.
trimLeft0No nucleotides removed from the start of the read.
trimRight0No nucleotides removed from the end of the read.
maxLenInfNo maximum length limit enforced on raw reads.
minQ0No specific minimum quality score threshold enforced (beyond truncQ).
rm.phixTRUEDiscards reads matching the phiX genome.

The values of trimLeft and truncLen are inherited from the command. Criteria descriptions adapted from dada2::fastqFilter documentation (function handles trim and filtering directly).

DADA2 core

dada_core

The DADA2 core algorithm (named “The divisive partitioning algorithm”) Callahan et al., 2016 begins by processing unique sequences obtained through dereplication, maintaining their respective abundance counts. To optimize the alignment step, the algorithm constructs k-mers and compute distance of pair of sequences to screen out unrelated pairs Sun et al., 2009.

During the Alignment & Compute λ\lambda phase, this k-mer distance acts as a pre-filter before the banded Needleman-Wunsch algorithm Gibrat, 2018 performs ends-free global pairwise alignments. From these alignments, the error rate (λji\lambda_{ji}) is calculated as the joint probability of the observed sequence ii originating from the partition center sequence jj over the total sequence length LL. This transition probability is determined by the specific nucleotide substitution and the associated quality score qiq_i at each position ll.

λij=l=0Lp(j(l)i(l),qi(l))\lambda_{ij} = \prod^{L}_{l=0} p(j(l) \rightarrow i(l), q_i(l))

Under the assumption that sequencing errors occur independently across reads, where any read from a true sequence jj can potentially be misread as sequence ii. The abundance pp-value is calculated as the probability of observing an abundance aia_i or greater, given the expected error count njλjin_j\lambda_{ji}. This value is computed using a Poisson distribution, normalized by the probability of observing sequence ii at least once.

pA(ji)=11ρpois(njλji,0)a=ajρpois(njλji,ai)p_A(j \rightarrow i) = \frac{1}{1-\rho_{\text{pois}}(n_j\lambda_{ji},0)} \sum_{a=a_j}^{\infty}\rho_{\text{pois}}(n_j\lambda_{ji},a_i)

A lower pp-value indicates a lower probability that the observed abundance of sequence ii can be explained by stochastic sequencing errors originating from sequence jj. Therefore, a small pp-value provides strong evidence that sequence ii is a distinct biological variant.

Once the pp-values are computed, at the Shuffle sequences and update pp-value step, any sequence with a pp-value below the threshold (ΩA\Omega_A) triggers the formation of a new partition with that sequence as its center.

The algorithm then enters an iterative refinement loop: pp-values for all unique sequences are re-calculated against the new set of centers, and sequences are reshuffled (reassigned) to the partitions for which they have the highest likelihood (most probable origin). This cycle of partitioning and reshuffling continues until the composition of the partitions remains stable and none of pp-value violates the threshold.

Learning Error Rates

The process of learning error rates begins with dereplication to collapse redundant reads into unique sequences. Next, the algorithm computes and extracts error rates by executing the DADA2 core algorithm through the “Calculate pp-value” step. Notably, the iterative refining loop is omitted during this phase.

Denoise samples

The denoising step utilizes the error rates estimated in the previous stage and performs dereplication, and the DADA2 core algorithm again. This phase focuses on detecting new center sequences (ASVs) and stabilizing partitions through the final iterative refinement loop, ensuring each sequence is assigned to its most likely biological origin.

Chimeras Removal

Output from denosing including center sequences (ASVs) and associated with their abundance values, from this a sequence table is compiled where columns represent unique sequences, and rows represent the samples (in our case is from 1 to 34), with the cells filled by abundance values.

When running in consensus mode, bimera identification is treated as a voting process across samples. A sequence is flagged within each individual sample where it appears; if the sequence is flagged in all samples, or in a sufficiently high fraction of them (determined by a threshold), it is definitively identified as a bimera and removed.

Summary

DADA2 operates on the fundamental assumption that abundance of error following Poisson distribution. A unique sequence is only promoted to a new ASV if its abundance is significantly higher than the expected error rate of its parent “center” sequence.

The DADA2 Core algorithm is applied separately among samples with two phases. Currently the algorithm run twice, maybe in future it could be restructured to perform for specific task without overlapping the each other (pp-value is computed uneccessarily by Learning Error Rates).

The accuracy of the Chimeras Removal step is directly enhanced by the number of samples processed. As the sample count increases, the statistical power of “vote” grows. More samples provide a higher probability of detecting the sequence which has real biological meaning.

References
  1. Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: High-resolution Sample Inference from Illumina Amplicon Data. Nature Methods, 13(7), 581–583. 10.1038/nmeth.3869
  2. Nearing, J. T., Douglas, G. M., Comeau, A. M., & Langille, M. G. I. (2018). Denoising the Denoisers: An Independent Evaluation of Microbiome Sequence Error-Correction Approaches. PeerJ, 6, e5364. 10.7717/peerj.5364
  3. Sun, Y., Cai, Y., Liu, L., Yu, F., Farrell, M. L., McKendree, W., & Farmerie, W. (2009). ESPRIT: Estimating Species Richness Using Large Collections of 16S rRNA Pyrosequences. Nucleic Acids Research, 37(10), e76–e76. 10.1093/nar/gkp285
  4. Gibrat, J.-F. (2018). A Short Note on Dynamic Programming in a Band. BMC Bioinformatics, 19(1), 226. 10.1186/s12859-018-2228-9