Minimac4
|
Handles writing genotype and dosage data to VCF/BCF/SAV files. More...
#include <dosage_writer.hpp>
Public Member Functions | |
dosage_writer (const std::string &file_path, const std::string &emp_file_path, const std::string &sites_file_path, savvy::file::format file_format, std::uint8_t out_compression, const std::vector< std::string > &sample_ids, const std::vector< std::string > &fmt_fields, const std::string &chromosome, float min_r2, bool is_temp) | |
Construct a new dosage_writer. | |
bool | merge_temp_files (std::list< savvy::reader > &temp_files, std::list< savvy::reader > &temp_emp_files) |
Merge temporary HDS and empirical dosage files into the final output. | |
bool | merge_temp_files (std::list< std::string > &temp_file_paths, std::list< std::string > &temp_emp_file_paths) |
bool | write_dosages (const full_dosages_results &hmm_results, const std::vector< target_variant > &tar_variants, const std::vector< target_variant > &tar_only_variants, std::pair< std::size_t, std::size_t > observed_range, const reduced_haplotypes &full_reference_data, const savvy::region &impute_region) |
Write imputed dosages and observed genotypes to output files. | |
void | print_mean_er2 (std::ostream &os) const |
Print the mean empirical R² (ER2) values to an output stream. | |
Handles writing genotype and dosage data to VCF/BCF/SAV files.
This class manages writing variant information including dosages (HDS, LDS, DS), genotype calls (GT), quality metrics (R2, ER2, call score), and optional empirical dosage data. It supports temporary file merging, FORMAT/INFO field generation, and optional output of site information.
The class also tracks imputation accuracy statistics for ER2 calculation.
dosage_writer::dosage_writer | ( | const std::string & | file_path, |
const std::string & | emp_file_path, | ||
const std::string & | sites_file_path, | ||
savvy::file::format | file_format, | ||
std::uint8_t | out_compression, | ||
const std::vector< std::string > & | sample_ids, | ||
const std::vector< std::string > & | fmt_fields, | ||
const std::string & | chromosome, | ||
float | min_r2, | ||
bool | is_temp ) |
Construct a new dosage_writer.
file_path | Path to main output file (VCF/BCF/SAV) |
emp_file_path | Optional path for empirical dosage output |
sites_file_path | Optional path for site info output |
file_format | Output file format (VCF/BCF/SAV) |
out_compression | Compression level for output file |
sample_ids | List of sample IDs for the output |
fmt_fields | List of FORMAT fields to include (GT, DS, GP, SD, HDS) |
chromosome | Chromosome identifier for VCF header |
min_r2 | Minimum R2 threshold for output variants |
is_temp | Whether the output is temporary (affects INFO/FORMAT fields) |
bool dosage_writer::merge_temp_files | ( | std::list< savvy::reader > & | temp_files, |
std::list< savvy::reader > & | temp_emp_files ) |
Merge temporary HDS and empirical dosage files into the final output.
This function reads records from multiple temporary files generated by dosage_writer
, merges the INFO and FORMAT fields (HDS, LDS, GT), calculates allele frequencies, call scores, R2, and ER2 metrics, and writes the combined data to the final output files.
The function supports optional empirical output (emp_out_file_
) and site information output (sites_out_file_
). It also updates internal accuracy statistics for ER2 calculation.
temp_files | A list of readers for temporary dosage files (HDS, GT). |
temp_emp_files | A list of readers for temporary empirical dosage files (LDS, GT). |
bool dosage_writer::merge_temp_files | ( | std::list< std::string > & | temp_file_paths, |
std::list< std::string > & | temp_emp_file_paths ) |
void dosage_writer::print_mean_er2 | ( | std::ostream & | os | ) | const |
Print the mean empirical R² (ER2) values to an output stream.
This function computes and writes the mean ER2 values across all accuracy statistic bins that have been tracked during imputation. For each bin, the mean ER2 is calculated as:
\[ \text{mean ER2} = \frac{\text{er2_sum}}{\text{n_var}} \]
where er2_sum
is the sum of empirical R² values for the bin, and n_var
is the number of variants contributing to the bin. If a bin contains no variants (n_var == 0
), a dot ("."
) is printed instead.
The output is printed as a single line, prefixed by "Mean ER2:"
, followed by space-separated values (or dots) for each bin.
Example output:
os | The output stream to which the mean ER2 values should be written. |
accuracy_stats_
has been populated during the imputation process. bool dosage_writer::write_dosages | ( | const full_dosages_results & | hmm_results, |
const std::vector< target_variant > & | tar_variants, | ||
const std::vector< target_variant > & | tar_only_variants, | ||
std::pair< std::size_t, std::size_t > | observed_range, | ||
const reduced_haplotypes & | full_reference_data, | ||
const savvy::region & | impute_region ) |
Write imputed dosages and observed genotypes to output files.
This function iterates over the reference variants and target variants within the imputation region, writing imputed dosages and observed genotype data into VCF/BCF output streams. It handles both typed (observed) target variants and imputed-only variants, attaches INFO/FORMAT annotations, and optionally writes additional site and empirical dosage outputs.
Core responsibilities include:
hmm_results | Container with HMM imputation results, including dosages and leave-one-out (LOO) dosages for each reference variant. |
tar_variants | List of target variants that overlap with the reference panel (typed genotypes available for comparison). |
tar_only_variants | List of target variants not present in the reference panel (typed-only sites that need to be preserved). |
observed_range | Pair of indices defining the [first, last) sample range for which observed genotypes should be considered (supports batch writing). |
full_reference_data | Complete reference haplotype data (reduced representation), used to iterate through all reference variants in the region. |
impute_region | Genomic region currently being imputed (chromosome, start, end). |
The algorithm proceeds in several phases:
hmm_results
.has_good_r2
) are written to the primary output stream and optionally to a site-only output stream.Variants are written in VCF/BCF format, with: