Minimac4
Loading...
Searching...
No Matches
dosage_writer Class Reference

Handles writing genotype and dosage data to VCF/BCF/SAV files. More...

#include <dosage_writer.hpp>

Public Member Functions

 dosage_writer (const std::string &file_path, const std::string &emp_file_path, const std::string &sites_file_path, savvy::file::format file_format, std::uint8_t out_compression, const std::vector< std::string > &sample_ids, const std::vector< std::string > &fmt_fields, const std::string &chromosome, float min_r2, bool is_temp)
 Construct a new dosage_writer.
 
bool merge_temp_files (std::list< savvy::reader > &temp_files, std::list< savvy::reader > &temp_emp_files)
 Merge temporary HDS and empirical dosage files into the final output.
 
bool merge_temp_files (std::list< std::string > &temp_file_paths, std::list< std::string > &temp_emp_file_paths)
 
bool write_dosages (const full_dosages_results &hmm_results, const std::vector< target_variant > &tar_variants, const std::vector< target_variant > &tar_only_variants, std::pair< std::size_t, std::size_t > observed_range, const reduced_haplotypes &full_reference_data, const savvy::region &impute_region)
 Write imputed dosages and observed genotypes to output files.
 
void print_mean_er2 (std::ostream &os) const
 Print the mean empirical R² (ER2) values to an output stream.
 

Detailed Description

Handles writing genotype and dosage data to VCF/BCF/SAV files.

This class manages writing variant information including dosages (HDS, LDS, DS), genotype calls (GT), quality metrics (R2, ER2, call score), and optional empirical dosage data. It supports temporary file merging, FORMAT/INFO field generation, and optional output of site information.

The class also tracks imputation accuracy statistics for ER2 calculation.

Constructor & Destructor Documentation

◆ dosage_writer()

dosage_writer::dosage_writer ( const std::string & file_path,
const std::string & emp_file_path,
const std::string & sites_file_path,
savvy::file::format file_format,
std::uint8_t out_compression,
const std::vector< std::string > & sample_ids,
const std::vector< std::string > & fmt_fields,
const std::string & chromosome,
float min_r2,
bool is_temp )

Construct a new dosage_writer.

Parameters
file_pathPath to main output file (VCF/BCF/SAV)
emp_file_pathOptional path for empirical dosage output
sites_file_pathOptional path for site info output
file_formatOutput file format (VCF/BCF/SAV)
out_compressionCompression level for output file
sample_idsList of sample IDs for the output
fmt_fieldsList of FORMAT fields to include (GT, DS, GP, SD, HDS)
chromosomeChromosome identifier for VCF header
min_r2Minimum R2 threshold for output variants
is_tempWhether the output is temporary (affects INFO/FORMAT fields)

Member Function Documentation

◆ merge_temp_files() [1/2]

bool dosage_writer::merge_temp_files ( std::list< savvy::reader > & temp_files,
std::list< savvy::reader > & temp_emp_files )

Merge temporary HDS and empirical dosage files into the final output.

This function reads records from multiple temporary files generated by dosage_writer, merges the INFO and FORMAT fields (HDS, LDS, GT), calculates allele frequencies, call scores, R2, and ER2 metrics, and writes the combined data to the final output files.

The function supports optional empirical output (emp_out_file_) and site information output (sites_out_file_). It also updates internal accuracy statistics for ER2 calculation.

Parameters
temp_filesA list of readers for temporary dosage files (HDS, GT).
temp_emp_filesA list of readers for temporary empirical dosage files (LDS, GT).
Returns
true if merging succeeded and all files are consistent.
false if any I/O error occurs or record/ploidy mismatches are detected.
Note
The function assumes all temporary files have the same number of records and consistent ploidy per sample. Any mismatch will cause an error and abort.
The merged HDS vector is used to generate the final FORMAT fields, including DS, HDS, GP, and SD. INFO fields are recalculated across all temporary files and may replace the intermediate S_X, S_XX, S_CS, and AN fields.
For ER2 calculation, missing LOO_S_YY values are currently approximated by LOO_S_Y. This may need adjustment if missing values are available in temp files.
Warning
Only haploid and diploid samples are fully supported for SD/GP calculations.
Any inconsistencies in record counts across temp files or mismatch between HDS, LDS, and GT vectors will abort the merge.
Here is the caller graph for this function:

◆ merge_temp_files() [2/2]

bool dosage_writer::merge_temp_files ( std::list< std::string > & temp_file_paths,
std::list< std::string > & temp_emp_file_paths )

◆ print_mean_er2()

void dosage_writer::print_mean_er2 ( std::ostream & os) const

Print the mean empirical R² (ER2) values to an output stream.

This function computes and writes the mean ER2 values across all accuracy statistic bins that have been tracked during imputation. For each bin, the mean ER2 is calculated as:

\[ \text{mean ER2} = \frac{\text{er2_sum}}{\text{n_var}} \]

where er2_sum is the sum of empirical R² values for the bin, and n_var is the number of variants contributing to the bin. If a bin contains no variants (n_var == 0), a dot (".") is printed instead.

The output is printed as a single line, prefixed by "Mean ER2:", followed by space-separated values (or dots) for each bin.

Example output:

Mean ER2: 0.85 0.72 . 0.90
Parameters
osThe output stream to which the mean ER2 values should be written.
Note
  • This function does not modify internal state.
  • Requires that accuracy_stats_ has been populated during the imputation process.
Here is the caller graph for this function:

◆ write_dosages()

bool dosage_writer::write_dosages ( const full_dosages_results & hmm_results,
const std::vector< target_variant > & tar_variants,
const std::vector< target_variant > & tar_only_variants,
std::pair< std::size_t, std::size_t > observed_range,
const reduced_haplotypes & full_reference_data,
const savvy::region & impute_region )

Write imputed dosages and observed genotypes to output files.

This function iterates over the reference variants and target variants within the imputation region, writing imputed dosages and observed genotype data into VCF/BCF output streams. It handles both typed (observed) target variants and imputed-only variants, attaches INFO/FORMAT annotations, and optionally writes additional site and empirical dosage outputs.

Core responsibilities include:

  • Aligning reference haplotype variants with target variants.
  • Writing imputed dosages for all reference variants in the region.
  • Handling target-only variants (typed but not in reference).
  • Writing empirical leave-one-out (LOO) dosages when available.
  • Applying QC filters (e.g., r² thresholds) before writing.
  • Outputting to multiple streams: primary, site-only, and empirical.
Parameters
hmm_resultsContainer with HMM imputation results, including dosages and leave-one-out (LOO) dosages for each reference variant.
tar_variantsList of target variants that overlap with the reference panel (typed genotypes available for comparison).
tar_only_variantsList of target variants not present in the reference panel (typed-only sites that need to be preserved).
observed_rangePair of indices defining the [first, last) sample range for which observed genotypes should be considered (supports batch writing).
full_reference_dataComplete reference haplotype data (reduced representation), used to iterate through all reference variants in the region.
impute_regionGenomic region currently being imputed (chromosome, start, end).
Returns
true if the primary output file stream is still in a good state after writing all sites; false otherwise.

The algorithm proceeds in several phases:

  1. Skip leading variants: Both target and target-only iterators are advanced until they reach the start of the imputation region.
  2. Iterate through reference panel:
    • For each reference variant:
      • Write any target-only variants that occur before or at this position.
      • Build an output site with imputed dosages from hmm_results.
      • If the site matches a typed target variant:
        • Add observed genotypes and leave-one-out dosages to INFO/FORMAT fields.
        • Optionally, write an empirical site record (typed + imputed).
      • If the site does not match a target variant:
        • Write only imputed dosages.
      • Sites passing QC (has_good_r2) are written to the primary output stream and optionally to a site-only output stream.
  3. Flush trailing target-only variants:
    • After finishing the reference loop, remaining target-only variants up to the end of the imputation region are written similarly.

Variants are written in VCF/BCF format, with:

  • INFO fields (imputation quality metrics, typed/imputed flags).
  • FORMAT fields (dosages, genotypes, LOO dosages).
Note
  • INFO fields written include imputation quality metrics (e.g., Rsq).
  • FORMAT fields written include dosages (DS) and optionally observed genotypes (GT).
  • Sites are only written if they pass quality filters (e.g., has_good_r2()).
  • Empirical outputs are written when both typed and imputed data are available.
Here is the call graph for this function:
Here is the caller graph for this function:

The documentation for this class was generated from the following files: