Minimac4
Loading...
Searching...
No Matches
input_prep.hpp File Reference
#include "unique_haplotype.hpp"
#include <savvy/reader.hpp>
Include dependency graph for input_prep.hpp:
This graph shows which files directly or indirectly include this file:

Go to the source code of this file.

Functions

bool stat_tar_panel (const std::string &tar_file_path, std::vector< std::string > &sample_ids)
 Extract sample IDs from a target panel file.
 
bool stat_ref_panel (const std::string &ref_file_path, std::string &chrom, std::uint64_t &end_pos)
 Inspect a reference panel file to determine chromosome and end position.
 
bool load_target_haplotypes (const std::string &file_path, const savvy::genomic_region &reg, std::vector< target_variant > &target_sites, std::vector< std::string > &sample_ids)
 Load haplotypes from a target file for a given genomic region.
 
bool load_reference_haplotypes (const std::string &file_path, const savvy::genomic_region &extended_reg, const savvy::genomic_region &impute_reg, const std::unordered_set< std::string > &subset_ids, std::vector< target_variant > &target_sites, reduced_haplotypes &typed_only_reference_data, reduced_haplotypes &full_reference_data, genetic_map_file *map_file, float min_recom, float default_match_error)
 Load and process reference haplotypes from an MVCF file.
 
bool load_reference_haplotypes_old_recom_approach (const std::string &file_path, const savvy::genomic_region &extended_reg, const savvy::genomic_region &impute_reg, const std::unordered_set< std::string > &subset_ids, std::vector< target_variant > &target_sites, reduced_haplotypes &typed_only_reference_data, reduced_haplotypes &full_reference_data, genetic_map_file *map_file)
 Loads reference haplotypes using an older recombination-based approach.
 
std::vector< target_variantseparate_target_only_variants (std::vector< target_variant > &target_sites)
 Separates target-only variants from those found in the reference panel.
 
bool load_variant_hmm_params (std::vector< target_variant > &tar_variants, reduced_haplotypes &typed_only_reference_data, float default_error_param, float recom_min, const std::string &map_file_path)
 Loads Hidden Markov Model (HMM) parameters for target variants.
 
std::vector< std::vector< std::vector< std::size_t > > > generate_reverse_maps (const reduced_haplotypes &typed_only_reference_data)
 Generates reverse mapping tables for reduced haplotype blocks.
 
bool convert_old_m3vcf (const std::string &input_path, const std::string &output_path, const std::string &map_file_path="")
 Converts an old M3VCF file (v1/v2) to a newer VCF-like format (MVCFv3.0).
 
bool compress_reference_panel (const std::string &input_path, const std::string &output_path, std::size_t min_block_size=10, std::size_t max_block_size=0xFFFF, std::size_t slope_unit=10, const std::string &map_file_path="")
 Compress a haplotype reference panel into blocks and write to an output file.
 

Function Documentation

◆ compress_reference_panel()

bool compress_reference_panel ( const std::string & input_path,
const std::string & output_path,
std::size_t min_block_size = 10,
std::size_t max_block_size = 0xFFFF,
std::size_t slope_unit = 10,
const std::string & map_file_path = "" )

Compress a haplotype reference panel into blocks and write to an output file.

This function reads a phased reference panel from an input file (VCF/BCF/SAV format), compresses haplotypes into blocks of unique haplotypes, and writes the result to an output file in SAV/BCF format. It adaptively determines when to flush blocks based on compression ratio and block size constraints.

Parameters
input_pathPath to the input reference panel file (VCF/BCF/SAV).
output_pathPath to the compressed output file (SAV/BCF).
min_block_sizeMinimum number of variants required in a block before flushing.
max_block_sizeMaximum number of variants allowed in a block before forcing flush.
slope_unitInterval of variants used to check compression ratio slope.
map_file_pathPath to a genetic map file (currently unused, reserved for CM filling).
Returns
true if compression and writing completed successfully, false otherwise.
Note
  • Input file must contain fully phased genotypes (phasing header must not be "none" or "partial").
  • INFO and FORMAT fields required for compressed blocks are automatically added to headers.
  • Compression ratio is defined as:

    \[ CR = \frac{\text{expanded haplotype size} + (\text{unique haplotype size} \times \text{variant size})} {\text{expanded haplotype size} \times \text{variant size}} \]

See also
unique_haplotype_block, reference_site_info
Here is the call graph for this function:
Here is the caller graph for this function:

◆ convert_old_m3vcf()

bool convert_old_m3vcf ( const std::string & input_path,
const std::string & output_path,
const std::string & map_file_path = "" )

Converts an old M3VCF file (v1/v2) to a newer VCF-like format (MVCFv3.0).

This function reads an input M3VCF file, updates its headers and records, and writes them to an output file in a modernized format compatible with downstream tools. Optionally, a genetic map file may be provided to annotate recombination positions.

Conversion steps include:

  • Parsing and updating VCF/M3VCF header lines.
  • Ensuring presence of required phasing and contig headers.
  • Adding INFO and FORMAT metadata fields required by MVCFv3.0.
  • Reading sample IDs from the column header line.
  • Deserializing haplotype blocks from the input file.
  • Optionally annotating variants with centimorgan (cM) positions if a map file is provided.
  • Serializing blocks to the output file in the new format.
Parameters
[in]input_pathPath to the old M3VCF file (gzipped).
[in]output_pathPath to the output file (can be .bcf or .sav).
[in]map_file_pathOptional path to a genetic map file for cM annotation.
Returns
True if the conversion completed successfully, false otherwise.
Exceptions
std::runtime_errorIf input or output files cannot be opened.
Note
  • Supports M3VCF v1 and v2 input.
  • Output headers are modified to include MVCFv3.0 metadata.
  • If map_file_path is non-empty, records will include cM positions.
  • The function ensures phasing and contig headers exist.
  • Sample IDs are extracted from the first non-header line of the input.

@complexity

  • Time: O(B × V), where B = number of blocks, V = number of variants per block.
  • Memory: O(N), proportional to the size of haplotype data buffered during conversion.
Here is the call graph for this function:
Here is the caller graph for this function:

◆ generate_reverse_maps()

std::vector< std::vector< std::vector< std::size_t > > > generate_reverse_maps ( const reduced_haplotypes & typed_only_reference_data)

Generates reverse mapping tables for reduced haplotype blocks.

This function constructs a three-level nested vector structure (reverse_maps) that provides, for each block and each allele state in that block, the list of haplotype indices that map to it.

The mapping is inverted from the unique_map() representation in each block of the typed_only_reference_data.

Structure of the returned vector:

  • reverse_maps[block_idx][allele_idx] → list of haplotype indices that correspond to this allele in the given block.

Example:

reverse_maps[b][a] = { h0, h1, h7 }

means in block b, allele a corresponds to haplotypes 0, 1, and 7.

Parameters
[in]typed_only_reference_dataReference haplotype data partitioned into blocks, each containing allele cardinalities and a unique haplotype map.
Returns
A nested vector structure containing reverse maps for each block.

@complexity O(N), where N = total number of haplotype entries across all blocks.

Note
Each block in typed_only_reference_data must have consistent cardinalities() and unique_map() values.
Here is the call graph for this function:
Here is the caller graph for this function:

◆ load_reference_haplotypes()

bool load_reference_haplotypes ( const std::string & file_path,
const savvy::genomic_region & extended_reg,
const savvy::genomic_region & impute_reg,
const std::unordered_set< std::string > & subset_ids,
std::vector< target_variant > & target_sites,
reduced_haplotypes & typed_only_reference_data,
reduced_haplotypes & full_reference_data,
genetic_map_file * map_file,
float min_recom,
float default_match_error )

Load and process reference haplotypes from an MVCF file.

This function loads reference haplotypes within an extended genomic region from an MVCF (M3VCFv3/MVCFv3) file. It extracts haplotype blocks, aligns them with the provided target variants, computes allele frequencies, recombination probabilities, and compresses the resulting haplotype data into reduced representations for imputation.

Parameters
file_pathPath to the MVCF reference haplotype file.
extended_regGenomic region to query from the reference file (with buffer for recombination).
impute_regGenomic region of interest for imputation (subset of extended_reg).
subset_idsSubset of sample IDs to extract from the reference file. If empty, all samples are included.
target_sitesVector of target variants to be aligned and updated with reference information (allele frequency, error rate, recombination probability, etc.).
typed_only_reference_dataOutput container for typed-only reference haplotypes, compressed.
full_reference_dataOutput container for full reference haplotype data across the impute region.
map_fileOptional genetic map file for interpolation of centimorgan positions. If provided, recombination probabilities are computed from map distances.
min_recomMinimum recombination probability to enforce between adjacent variants.
default_match_errorDefault genotype matching error rate used when missing in the reference file.
Returns
true if the haplotypes were successfully loaded and processed, false if an error occurred (e.g., file not found, wrong format, no overlapping samples).
Note
  • The reference file must be indexed MVCFv3.0/M3VCFv3.0 format.
  • If no overlapping subset samples are found, the function fails.
  • Variants outside the imputation region are trimmed, but recombination probabilities are still updated based on extended_reg.
  • For chromosome X, ensure PAR and non-PAR regions are imputed separately.
See also
reduced_haplotypes, target_variant, savvy::reader
Here is the call graph for this function:
Here is the caller graph for this function:

◆ load_reference_haplotypes_old_recom_approach()

bool load_reference_haplotypes_old_recom_approach ( const std::string & file_path,
const savvy::genomic_region & extended_reg,
const savvy::genomic_region & impute_reg,
const std::unordered_set< std::string > & subset_ids,
std::vector< target_variant > & target_sites,
reduced_haplotypes & typed_only_reference_data,
reduced_haplotypes & full_reference_data,
genetic_map_file * map_file )

Loads reference haplotypes using an older recombination-based approach.

This function reads reference haplotypes from an MVCF/M3VCF file and integrates them with a set of target variants. It populates two reduced haplotype structures: one containing only variants overlapping with the target sites (typed_only_reference_data), and one containing all reference haplotypes within the imputation region (full_reference_data).

Recombination probabilities between consecutive variants are estimated using either the centimorgan positions from a provided genetic map (map_file) or the reference file annotations. Allele frequencies for overlapping target variants are updated based on reference genotypes.

Parameters
file_pathPath to the reference haplotype file (MVCF/M3VCF format).
extended_regGenomic region specifying the extended window to load haplotypes from (includes buffer around imputation region).
impute_regGenomic region specifying the imputation window (used to trim full reference haplotype blocks).
subset_idsOptional subset of sample IDs to restrict reference samples. If empty, all samples are used.
target_sitesVector of target variants. This vector is updated with allele frequencies, reference overlap flags, and recombination estimates.
typed_only_reference_dataOutput reduced haplotypes structure containing only variants overlapping with target sites.
full_reference_dataOutput reduced haplotypes structure containing all haplotype blocks overlapping the imputation region.
map_fileOptional pointer to a genetic map file. If provided, used to interpolate centimorgan distances for recombination probability calculation.
Returns
True if the reference haplotypes were successfully loaded and processed, false otherwise.
Note
The function supports both newer MVCFv3 and older M3VCF file formats. When an invalid or incompatible file is provided, an error message is printed and the function returns false.
Warning
The function assumes that the reference file is properly indexed (MVCF) or formatted (M3VCF). Errors in input files will terminate early.
Here is the call graph for this function:

◆ load_target_haplotypes()

bool load_target_haplotypes ( const std::string & file_path,
const savvy::genomic_region & reg,
std::vector< target_variant > & target_sites,
std::vector< std::string > & sample_ids )

Load haplotypes from a target file for a given genomic region.

This function opens a VCF/BCF target file, extracts sample IDs, enforces ploidy consistency across all variants, and fills the list of target variants (target_sites) with genotypes encoded per allele.

Parameters
file_pathPath to the target VCF/BCF file.
Must be bgzipped and indexed (CSI/TBI).
regGenomic region to query (chromosome, start, end).
Bounds are applied using savvy::reader::reset_bounds().
If querying fails, the function returns false.
target_sitesOutput vector that will be filled with target_variant objects, each representing one ALT allele at a site.
For multi-allelic sites, one entry per ALT allele is created.
sample_idsOutput vector of sample IDs extracted from the target file header.
Returns
True if the haplotypes were successfully loaded and ploidy checks passed,
false otherwise. On failure, descriptive error messages are printed to stderr.
  • The function extracts genotypes (GT field) and converts them into binary encoding:
    • Biallelic sites: GT values are stored directly in target_variant::gt.
    • Multiallelic sites: For each ALT allele, a separate target_variant is created.
      The genotype array is recoded to indicate presence (1) or absence (0) of the allele.
  • Ploidy consistency is checked:
    • On first variant, sample ploidies are initialized.
    • On subsequent variants, check_ploidies() ensures all samples retain the same ploidy.
    • If a sample’s ploidy changes mid-file, the function errors out.
    • Special warning is printed for chromosome X (X or chrX) due to PAR/non-PAR handling.
  • For missing genotype values, quiet_NaN() is used as a placeholder for dosage fields.
Note
  • Requires that the input file is properly indexed. Otherwise, region queries will fail.
  • If the function encounters ploidy changes, the run is aborted to avoid downstream phasing/imputation errors.
  • End users should split chromosome X into PAR and non-PAR regions before imputation.
Warning
Error messages are printed directly to stderr.
This includes:
  • File access errors (cannot open target file).
  • Region query errors (file not indexed or invalid reg).
  • Ploidy inconsistency across samples.
Here is the call graph for this function:
Here is the caller graph for this function:

◆ load_variant_hmm_params()

bool load_variant_hmm_params ( std::vector< target_variant > & tar_variants,
reduced_haplotypes & typed_only_reference_data,
float default_error_param,
float recom_min,
const std::string & map_file_path )

Loads Hidden Markov Model (HMM) parameters for target variants.

This function initializes error rates and recombination probabilities for a sequence of target variants by combining:

  • Default error parameters.
  • Per-variant error estimates (if available).
  • Genetic map–based recombination rates (if a genetic map file is provided).

Specifically:

  • Each variant’s error parameter (err) is set either from the reference data or from default_error_param if missing (NaN).
  • Recombination probabilities (recom) are computed between consecutive variants using genetic map centiMorgan (cM) positions, converted to switch probabilities. If no map file is provided, previously loaded map positions in the reference data are used.
  • The last variant is always assigned recom = 0.0f, ensuring no recombination at the backward traversal boundary.
Parameters
[in,out]tar_variantsVector of target variants whose HMM parameters (err, recom) will be filled.
Must have the same size as the reference haplotype set.
[in,out]typed_only_reference_dataReduced reference haplotype data aligned to tar_variants.
Provides genetic map positions and optional error rates.
[in]default_error_paramDefault error parameter assigned if no error rate is provided.
[in]recom_minMinimum recombination probability allowed between adjacent variants.
[in]map_file_pathPath to the genetic map file. If empty, recombination rates are computed from existing positions in typed_only_reference_data.
Returns
true if parameters were successfully loaded;
false if the variant list is empty or if the genetic map file cannot be opened.
Note
The function asserts that the number of target variants matches the number of reference variants.
Input vectors are modified in place.

@complexity O(N), where N = number of variants.

Here is the call graph for this function:

◆ separate_target_only_variants()

std::vector< target_variant > separate_target_only_variants ( std::vector< target_variant > & target_sites)

Separates target-only variants from those found in the reference panel.

This function partitions the input vector of target variants (target_sites) into two groups:

  • Variants that are present in the reference (remain in target_sites).
  • Variants that are absent from the reference (in_ref == false) and are moved into a new vector (target_only_sites).

The function maintains efficient memory usage by swapping elements in place rather than performing deep copies. After execution, target_sites will only contain variants found in the reference panel, and target_only_sites will contain all others.

Parameters
[in,out]target_sitesVector of target variants to be partitioned. After the call, it will contain only reference-matching variants.
Returns
std::vector<target_variant>
A vector containing all target-only variants (not present in the reference).
Note
This function modifies the input vector target_sites by resizing it.
Order of elements may change due to the use of std::swap.

@complexity O(N), where N = number of target variants.

Here is the caller graph for this function:

◆ stat_ref_panel()

bool stat_ref_panel ( const std::string & ref_file_path,
std::string & chrom,
std::uint64_t & end_pos )

Inspect a reference panel file to determine chromosome and end position.

This function checks the index files associated with a reference panel (S1R, CSI, or TBI) to determine the contig (chromosome) and maximum position.
It supports both .s1r index statistics and VCF/BCF headers when CSI/TBI indexes are present.

Parameters
ref_file_pathPath to the reference panel file (VCF/BCF/MVCF).
chromChromosome name. If empty, it will be set automatically.
If non-empty, the function verifies that the reference file contains it.
end_posEnd position of the region. Updated to the minimum of its current value and the chromosome length / max position found in the index/header.
Returns
True if chromosome and end position were successfully determined,
false if an error occurred (e.g., multiple chromosomes present, index missing, or inconsistent contig name).
Note
  • If .s1r index statistics are available, they take priority.
  • If multiple contigs are present and chrom is empty, the function fails and suggests using --region.
  • If .csi or .tbi index is found, chromosome information is read from the VCF/BCF header.
  • For MVCF files, the reference must be indexed. Legacy M3VCF files need conversion with:
    minimac4 --update-m3vcf input.m3vcf.gz > output.msav
Warning
Prints descriptive error messages to stderr if reference panel validation fails.
Here is the caller graph for this function:

◆ stat_tar_panel()

bool stat_tar_panel ( const std::string & tar_file_path,
std::vector< std::string > & sample_ids )

Extract sample IDs from a target panel file.

Opens a target haplotype/genotype file using savvy::reader and retrieves the list of sample IDs contained in the file header.

Parameters
tar_file_pathPath to the target panel file.
sample_idsReference to a vector that will be filled with the sample IDs read from the file.
Returns
True if the file was successfully opened and sample IDs were retrieved, false if the file could not be opened.
Note
  • The function prints an error message to stderr if the file cannot be opened.
  • The existing contents of sample_ids are overwritten.
Here is the caller graph for this function: