Minimac4
Functions | |
bool | stat_tar_panel (const std::string &tar_file_path, std::vector< std::string > &sample_ids) |
Extract sample IDs from a target panel file. | |
bool | stat_ref_panel (const std::string &ref_file_path, std::string &chrom, std::uint64_t &end_pos) |
Inspect a reference panel file to determine chromosome and end position. | |
bool | load_target_haplotypes (const std::string &file_path, const savvy::genomic_region &reg, std::vector< target_variant > &target_sites, std::vector< std::string > &sample_ids) |
Load haplotypes from a target file for a given genomic region. | |
bool | load_reference_haplotypes (const std::string &file_path, const savvy::genomic_region &extended_reg, const savvy::genomic_region &impute_reg, const std::unordered_set< std::string > &subset_ids, std::vector< target_variant > &target_sites, reduced_haplotypes &typed_only_reference_data, reduced_haplotypes &full_reference_data, genetic_map_file *map_file, float min_recom, float default_match_error) |
Load and process reference haplotypes from an MVCF file. | |
bool | load_reference_haplotypes_old_recom_approach (const std::string &file_path, const savvy::genomic_region &extended_reg, const savvy::genomic_region &impute_reg, const std::unordered_set< std::string > &subset_ids, std::vector< target_variant > &target_sites, reduced_haplotypes &typed_only_reference_data, reduced_haplotypes &full_reference_data, genetic_map_file *map_file) |
Loads reference haplotypes using an older recombination-based approach. | |
std::vector< target_variant > | separate_target_only_variants (std::vector< target_variant > &target_sites) |
Separates target-only variants from those found in the reference panel. | |
bool | load_variant_hmm_params (std::vector< target_variant > &tar_variants, reduced_haplotypes &typed_only_reference_data, float default_error_param, float recom_min, const std::string &map_file_path) |
Loads Hidden Markov Model (HMM) parameters for target variants. | |
std::vector< std::vector< std::vector< std::size_t > > > | generate_reverse_maps (const reduced_haplotypes &typed_only_reference_data) |
Generates reverse mapping tables for reduced haplotype blocks. | |
bool | convert_old_m3vcf (const std::string &input_path, const std::string &output_path, const std::string &map_file_path="") |
Converts an old M3VCF file (v1/v2) to a newer VCF-like format (MVCFv3.0). | |
bool | compress_reference_panel (const std::string &input_path, const std::string &output_path, std::size_t min_block_size=10, std::size_t max_block_size=0xFFFF, std::size_t slope_unit=10, const std::string &map_file_path="") |
Compress a haplotype reference panel into blocks and write to an output file. | |
bool compress_reference_panel | ( | const std::string & | input_path, |
const std::string & | output_path, | ||
std::size_t | min_block_size = 10, | ||
std::size_t | max_block_size = 0xFFFF, | ||
std::size_t | slope_unit = 10, | ||
const std::string & | map_file_path = "" ) |
Compress a haplotype reference panel into blocks and write to an output file.
This function reads a phased reference panel from an input file (VCF/BCF/SAV format), compresses haplotypes into blocks of unique haplotypes, and writes the result to an output file in SAV/BCF format. It adaptively determines when to flush blocks based on compression ratio and block size constraints.
input_path | Path to the input reference panel file (VCF/BCF/SAV). |
output_path | Path to the compressed output file (SAV/BCF). |
min_block_size | Minimum number of variants required in a block before flushing. |
max_block_size | Maximum number of variants allowed in a block before forcing flush. |
slope_unit | Interval of variants used to check compression ratio slope. |
map_file_path | Path to a genetic map file (currently unused, reserved for cM filling). |
\[ CR = \frac{\text{expanded haplotype size} + (\text{unique haplotype size} \times \text{variant size})} {\text{expanded haplotype size} \times \text{variant size}} \]
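The ratio above can be evaluated directly for a single block. The following is a minimal sketch, assuming that "expanded haplotype size" is the number of haplotypes in the panel, "unique haplotype size" is the number of distinct haplotype patterns in the block, and "variant size" is the number of variants in the block. The helper name compression_ratio is hypothetical and is not part of Minimac4.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not a Minimac4 function): evaluates the CR formula
// for one block.
//   n_haps     : expanded haplotype size (assumed: haplotypes in the panel)
//   n_unique   : unique haplotype size (assumed: distinct patterns in block)
//   n_variants : variant size (assumed: variants in the block)
double compression_ratio(std::size_t n_haps, std::size_t n_unique, std::size_t n_variants)
{
  assert(n_haps > 0 && n_variants > 0); // avoid division by zero

  // Cost of the block form: one index per haplotype plus the unique columns.
  const double compressed = static_cast<double>(n_haps)
    + static_cast<double>(n_unique) * static_cast<double>(n_variants);

  // Cost of storing every haplotype at every variant.
  const double expanded = static_cast<double>(n_haps) * static_cast<double>(n_variants);

  return compressed / expanded;
}
```

For example, 1000 haplotypes that collapse to 20 unique patterns over 50 variants give (1000 + 20 × 50) / (1000 × 50) = 0.04, while a block in which every haplotype is unique yields a ratio above 1, meaning the block form is no smaller than the expanded form.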
bool convert_old_m3vcf | ( | const std::string & | input_path, |
const std::string & | output_path, | ||
const std::string & | map_file_path = "" ) |
Converts an old M3VCF file (v1/v2) to a newer VCF-like format (MVCFv3.0).
This function reads an input M3VCF file, updates its headers and records, and writes them to an output file in a modernized format compatible with downstream tools. Optionally, a genetic map file may be provided to annotate recombination positions.
Conversion steps include:
phasing and contig headers.
[in] | input_path | Path to the old M3VCF file (gzipped). |
[in] | output_path | Path to the output file (can be .bcf or .sav). |
[in] | map_file_path | Optional path to a genetic map file for cM annotation. |
std::runtime_error | If input or output files cannot be opened. |
If map_file_path is non-empty, records will include cM positions.
phasing and contig headers exist.
std::vector< std::vector< std::vector< std::size_t > > > generate_reverse_maps | ( | const reduced_haplotypes & | typed_only_reference_data | ) |
Generates reverse mapping tables for reduced haplotype blocks.
This function constructs a three-level nested vector structure (reverse_maps) that provides, for each block and each allele state in that block, the list of haplotype indices that map to it.
The mapping is inverted from the unique_map() representation in each block of the typed_only_reference_data.
Structure of the returned vector:
reverse_maps[block_idx][allele_idx] → list of haplotype indices that correspond to this allele in the given block.
Example:
reverse_maps[b][a] = {0, 1, 7} means that in block b, allele a corresponds to haplotypes 0, 1, and 7.
[in] | typed_only_reference_data | Reference haplotype data partitioned into blocks, each containing allele cardinalities and a unique haplotype map. |
Complexity: O(N), where N = total number of haplotype entries across all blocks.
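The inversion described for generate_reverse_maps can be sketched with plain vectors. This is only an illustration under stated assumptions: toy_block and build_reverse_maps are hypothetical names, and toy_block stands in for a block of reduced_haplotypes whose unique_map assigns each haplotype to one allele-state (unique column) index. The real class interface may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one reduced haplotype block (not the Minimac4
// type): unique_map[h] is the allele-state index that haplotype h maps to,
// and n_unique is the number of allele states in the block.
struct toy_block
{
  std::vector<std::size_t> unique_map;
  std::size_t n_unique;
};

// Inverts unique_map for every block, so that result[b][a] lists the
// haplotype indices that map to allele state a in block b.
std::vector<std::vector<std::vector<std::size_t>>>
build_reverse_maps(const std::vector<toy_block>& blocks)
{
  std::vector<std::vector<std::vector<std::size_t>>> reverse_maps(blocks.size());
  for (std::size_t b = 0; b < blocks.size(); ++b)
  {
    reverse_maps[b].resize(blocks[b].n_unique);
    for (std::size_t h = 0; h < blocks[b].unique_map.size(); ++h)
    {
      const std::size_t a = blocks[b].unique_map[h];
      assert(a < blocks[b].n_unique); // map entries must be consistent
      reverse_maps[b][a].push_back(h);
    }
  }
  return reverse_maps;
}
```

Each haplotype entry is visited once, which matches the stated O(N) cost in the total number of haplotype entries.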
typed_only_reference_data must have consistent cardinalities() and unique_map() values.
bool load_reference_haplotypes | ( | const std::string & | file_path, |
const savvy::genomic_region & | extended_reg, | ||
const savvy::genomic_region & | impute_reg, | ||
const std::unordered_set< std::string > & | subset_ids, | ||
std::vector< target_variant > & | target_sites, | ||
reduced_haplotypes & | typed_only_reference_data, | ||
reduced_haplotypes & | full_reference_data, | ||
genetic_map_file * | map_file, | ||
float | min_recom, | ||
float | default_match_error ) |
Load and process reference haplotypes from an MVCF file.
This function loads reference haplotypes within an extended genomic region from an MVCF (M3VCFv3/MVCFv3) file. It extracts haplotype blocks, aligns them with the provided target variants, computes allele frequencies, recombination probabilities, and compresses the resulting haplotype data into reduced representations for imputation.
file_path | Path to the MVCF reference haplotype file. |
extended_reg | Genomic region to query from the reference file (with buffer for recombination). |
impute_reg | Genomic region of interest for imputation (subset of extended_reg). |
subset_ids | Subset of sample IDs to extract from the reference file. If empty, all samples are included. |
target_sites | Vector of target variants to be aligned and updated with reference information (allele frequency, error rate, recombination probability, etc.). |
typed_only_reference_data | Output container for typed-only reference haplotypes, compressed. |
full_reference_data | Output container for full reference haplotype data across the impute region. |
map_file | Optional genetic map file for interpolation of centimorgan positions. If provided, recombination probabilities are computed from map distances. |
min_recom | Minimum recombination probability to enforce between adjacent variants. |
default_match_error | Default genotype matching error rate used when missing in the reference file. |
extended_reg.
bool load_reference_haplotypes_old_recom_approach | ( | const std::string & | file_path, |
const savvy::genomic_region & | extended_reg, | ||
const savvy::genomic_region & | impute_reg, | ||
const std::unordered_set< std::string > & | subset_ids, | ||
std::vector< target_variant > & | target_sites, | ||
reduced_haplotypes & | typed_only_reference_data, | ||
reduced_haplotypes & | full_reference_data, | ||
genetic_map_file * | map_file ) |
Loads reference haplotypes using an older recombination-based approach.
This function reads reference haplotypes from an MVCF/M3VCF file and integrates them with a set of target variants. It populates two reduced haplotype structures: one containing only variants overlapping with the target sites (typed_only_reference_data), and one containing all reference haplotypes within the imputation region (full_reference_data).
Recombination probabilities between consecutive variants are estimated using either the centimorgan positions from a provided genetic map (map_file) or the reference file annotations. Allele frequencies for overlapping target variants are updated based on reference genotypes.
file_path | Path to the reference haplotype file (MVCF/M3VCF format). |
extended_reg | Genomic region specifying the extended window to load haplotypes from (includes buffer around imputation region). |
impute_reg | Genomic region specifying the imputation window (used to trim full reference haplotype blocks). |
subset_ids | Optional subset of sample IDs to restrict reference samples. If empty, all samples are used. |
target_sites | Vector of target variants. This vector is updated with allele frequencies, reference overlap flags, and recombination estimates. |
typed_only_reference_data | Output reduced haplotypes structure containing only variants overlapping with target sites. |
full_reference_data | Output reduced haplotypes structure containing all haplotype blocks overlapping the imputation region. |
map_file | Optional pointer to a genetic map file. If provided, used to interpolate centimorgan distances for recombination probability calculation. |
bool load_target_haplotypes | ( | const std::string & | file_path, |
const savvy::genomic_region & | reg, | ||
std::vector< target_variant > & | target_sites, | ||
std::vector< std::string > & | sample_ids ) |
Load haplotypes from a target file for a given genomic region.
This function opens a VCF/BCF target file, extracts sample IDs, enforces ploidy consistency across all variants, and fills the list of target variants (target_sites) with genotypes encoded per allele.
file_path | Path to the target VCF/BCF file. Must be bgzipped and indexed (CSI/TBI). |
reg | Genomic region to query (chromosome, start, end). Bounds are applied using savvy::reader::reset_bounds(). If querying fails, the function returns false. |
target_sites | Output vector that will be filled with target_variant objects, each representing one ALT allele at a site. For multi-allelic sites, one entry per ALT allele is created. |
sample_ids | Output vector of sample IDs extracted from the target file header. |
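The per-ALT splitting described for target_sites can be illustrated as follows. This is a simplified sketch, not the Minimac4 implementation: the function name split_by_alt is hypothetical, genotypes are assumed to arrive as one allele index per haplotype (0 = REF, k = k-th ALT), and a negative index is assumed to mark a missing allele, which is not necessarily how savvy represents missing values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: turns one multi-allelic genotype vector into one
// binary vector per ALT allele.
// Assumptions: gt[h] is the allele index of haplotype h (0 = REF,
// k = k-th ALT); a negative value marks a missing allele.
std::vector<std::vector<std::int8_t>>
split_by_alt(const std::vector<int>& gt, std::size_t n_alt)
{
  std::vector<std::vector<std::int8_t>> per_alt(
    n_alt, std::vector<std::int8_t>(gt.size(), 0));

  for (std::size_t k = 0; k < n_alt; ++k)
  {
    for (std::size_t h = 0; h < gt.size(); ++h)
    {
      if (gt[h] < 0)
        per_alt[k][h] = -1; // keep missing alleles marked as missing
      else if (static_cast<std::size_t>(gt[h]) == k + 1)
        per_alt[k][h] = 1;  // haplotype carries this ALT allele
      // otherwise the entry stays 0 (REF or a different ALT)
    }
  }
  return per_alt;
}
```

With two ALT alleles, a site therefore produces two binary vectors, one for each entry that would be appended to target_sites.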
stderr.
GT field) and converts them into binary encoding:
target_variant::gt.
target_variant is created.
check_ploidies() ensures all samples retain the same ploidy.
X or chrX) due to PAR/non-PAR handling.
quiet_NaN() is used as a placeholder for dosage fields.
stderr.
reg).
bool load_variant_hmm_params | ( | std::vector< target_variant > & | tar_variants, |
reduced_haplotypes & | typed_only_reference_data, | ||
float | default_error_param, | ||
float | recom_min, | ||
const std::string & | map_file_path ) |
Loads Hidden Markov Model (HMM) parameters for target variants.
This function initializes error rates and recombination probabilities for a sequence of target variants by combining:
Specifically:
The error rate (err) is set either from the reference data or from default_error_param if missing (NaN).
Recombination probabilities (recom) are computed between consecutive variants using genetic map centiMorgan (cM) positions, converted to switch probabilities. If no map file is provided, previously loaded map positions in the reference data are used.
The variant at the backward traversal boundary is assigned recom = 0.0f, ensuring no recombination there.
[in,out] | tar_variants | Vector of target variants whose HMM parameters (err, recom) will be filled. Must have the same size as the reference haplotype set. |
[in,out] | typed_only_reference_data | Reduced reference haplotype data aligned to tar_variants. Provides genetic map positions and optional error rates. |
[in] | default_error_param | Default error parameter assigned if no error rate is provided. |
[in] | recom_min | Minimum recombination probability allowed between adjacent variants. |
[in] | map_file_path | Path to the genetic map file. If empty, recombination rates are computed from existing positions in typed_only_reference_data. |
Returns true if parameters were successfully loaded; false if the variant list is empty or if the genetic map file cannot be opened.
Complexity: O(N), where N = number of variants.
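The parameter initialization described for load_variant_hmm_params can be sketched as below. Several points are assumptions rather than facts about Minimac4: toy_hmm_variant and fill_hmm_params are hypothetical names, the cM-to-probability conversion uses Haldane's map function (the library's own conversion may differ), and the variant given recom = 0.0f is assumed to be the last one in the sequence.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical, simplified record (not Minimac4's target_variant).
struct toy_hmm_variant
{
  double cm;   // genetic map position in centimorgans
  float err;   // match error rate, NaN when the reference gives none
  float recom; // switch probability between this variant and the next
};

// Assumption: Haldane's map function, r = (1 - exp(-2d)) / 2 with d in
// Morgans, which is (1 - exp(-cm / 50)) / 2 for a distance in cM.
inline float cm_to_switch_prob(double cm_dist)
{
  return static_cast<float>((1.0 - std::exp(-cm_dist / 50.0)) / 2.0);
}

bool fill_hmm_params(std::vector<toy_hmm_variant>& variants,
                     float default_err, float recom_min)
{
  if (variants.empty())
    return false; // nothing to initialize

  for (std::size_t i = 0; i < variants.size(); ++i)
  {
    // Fall back to the default error rate when none was provided.
    if (std::isnan(variants[i].err))
      variants[i].err = default_err;

    // Switch probability to the next variant, floored at recom_min.
    if (i + 1 < variants.size())
    {
      const double delta = variants[i + 1].cm - variants[i].cm;
      variants[i].recom = std::max(cm_to_switch_prob(delta), recom_min);
    }
  }

  // Assumption: the boundary variant is the last one in the sequence.
  variants.back().recom = 0.0f;
  return true;
}
```

The floor at recom_min keeps adjacent variants with identical map positions from producing a zero switch probability anywhere except the boundary.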
std::vector< target_variant > separate_target_only_variants | ( | std::vector< target_variant > & | target_sites | ) |
Separates target-only variants from those found in the reference panel.
This function partitions the input vector of target variants (target_sites) into two groups:
Variants found in the reference panel, which remain in target_sites.
Variants absent from the reference panel (in_ref == false), which are moved into a new vector (target_only_sites).
The function maintains efficient memory usage by swapping elements in place rather than performing deep copies. After execution, target_sites will only contain variants found in the reference panel, and target_only_sites will contain all others.
[in,out] | target_sites | Vector of target variants to be partitioned. After the call, it will contain only reference-matching variants. |
Modifies target_sites by resizing it.
Elements are exchanged with std::swap.
Complexity: O(N), where N = number of target variants.
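The swap-based partition described for separate_target_only_variants might look like the sketch below. The names toy_site and split_target_only are hypothetical, and the record is reduced to a position and the in_ref flag. The actual function may order its steps differently.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified record (not Minimac4's target_variant).
struct toy_site
{
  int pos;
  bool in_ref;
};

// Keeps reference-matching sites in `sites` (order preserved) and returns
// the remaining sites, using swaps instead of element copies.
std::vector<toy_site> split_target_only(std::vector<toy_site>& sites)
{
  std::vector<toy_site> target_only;
  std::size_t kept = 0; // next free slot for a site that stays

  for (std::size_t i = 0; i < sites.size(); ++i)
  {
    if (sites[i].in_ref)
    {
      if (i != kept)
        std::swap(sites[kept], sites[i]); // compact toward the front
      ++kept;
    }
    else
    {
      target_only.emplace_back();              // empty slot to swap into
      std::swap(target_only.back(), sites[i]); // move the site out
    }
  }

  sites.resize(kept); // drop the leftover tail
  return target_only;
}
```

A single pass with one swap per element keeps the cost linear in the number of target variants, and the relative order within each group is preserved.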
bool stat_ref_panel | ( | const std::string & | ref_file_path, |
std::string & | chrom, | ||
std::uint64_t & | end_pos ) |
Inspect a reference panel file to determine chromosome and end position.
This function checks the index files associated with a reference panel (S1R, CSI, or TBI) to determine the contig (chromosome) and maximum position.
It supports both .s1r index statistics and VCF/BCF headers when CSI/TBI indexes are present.
ref_file_path | Path to the reference panel file (VCF/BCF/MVCF). |
chrom | Chromosome name. If empty, it will be set automatically. If non-empty, the function verifies that the reference file contains it. |
end_pos | End position of the region. Updated to the minimum of its current value and the chromosome length / max position found in the index/header. |
If .s1r index statistics are available, they take priority.
When chrom is empty and cannot be determined automatically, the function fails and suggests using --region.
If a .csi or .tbi index is found, chromosome information is read from the VCF/BCF header.
An error is written to stderr if reference panel validation fails.
bool stat_tar_panel | ( | const std::string & | tar_file_path, |
std::vector< std::string > & | sample_ids ) |
Extract sample IDs from a target panel file.
Opens a target haplotype/genotype file using savvy::reader and retrieves the list of sample IDs contained in the file header.
tar_file_path | Path to the target panel file. |
sample_ids | Reference to a vector that will be filled with the sample IDs read from the file. |
An error is written to stderr if the file cannot be opened.
Existing contents of sample_ids are overwritten.