Minimac4
Represents a block of unique haplotypes and their variants.
#include <unique_haplotype.hpp>
Public Member Functions | |
bool | compress_variant (const reference_site_info &site_info, const std::vector< std::int8_t > &alleles) |
Compress and map haplotype alleles for a new variant into the block. | |
const std::vector< reference_variant > & | variants () const |
Get the list of compressed reference variants in this block. | |
const std::vector< std::int64_t > & | unique_map () const |
Get the mapping from haplotypes to unique columns. | |
std::size_t | expanded_haplotype_size () const |
Get the number of original haplotypes after expansion. | |
std::size_t | unique_haplotype_size () const |
Get the number of unique haplotypes represented in the block. | |
std::size_t | variant_size () const |
Get the number of variants compressed in this block. | |
const std::vector< std::size_t > & | cardinalities () const |
Get the haplotype cardinalities across unique columns. | |
void | clear () |
Clears all data stored in the haplotype block. | |
void | trim (std::size_t min_pos, std::size_t max_pos) |
Trims variants outside a specified genomic range. | |
void | pop_variant () |
Removes the last variant from the haplotype block. | |
void | fill_cm (genetic_map_file &map_file) |
Fills the centimorgan (cM) values for all variants in the haplotype block. | |
void | fill_cm_from_recom (double &start_cm) |
Fills missing centimorgan (cM) values for variants using recombination probabilities. | |
bool | deserialize (std::istream &is, int m3vcf_version, std::size_t n_haplotypes) |
Deserialize a unique haplotype block from an input stream (m3vcf format). | |
int | deserialize (savvy::reader &input_file, savvy::variant &var) |
Deserializes a unique haplotype block from a SAVVY input file and variant. | |
bool | serialize (savvy::writer &output_file) |
Serializes the unique haplotype block to a SAVVY output file. | |
void | remove_eov () |
Removes "end-of-vector" markers from the unique haplotype map. | |
Represents a block of unique haplotypes and their variants.
This class stores haplotype information in a compressed form, mapping individual haplotypes to unique columns of alleles, tracking allele counts (cardinalities), and storing variant details.
const std::vector< std::size_t > & unique_haplotype_block::cardinalities | ( | ) const | inline |
Get the haplotype cardinalities across unique columns.
Each entry represents how many haplotypes map to the corresponding unique haplotype column.
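As a rough illustration of how the haplotype map and the cardinalities relate, the sketch below recounts cardinalities from a map and uses them to weight one allele per unique column into an allele count over the expanded haplotypes. The function names count_cardinalities and weighted_allele_count are hypothetical and not part of Minimac4, and treating any allele value greater than zero as non-reference is an assumption.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: count how many haplotypes map to each unique column.
// Entries outside [0, n_columns) are ignored.
std::vector<std::size_t> count_cardinalities(const std::vector<std::int64_t>& unique_map,
                                             std::size_t n_columns)
{
  std::vector<std::size_t> counts(n_columns, 0);
  for (std::int64_t col : unique_map)
  {
    if (col >= 0 && static_cast<std::size_t>(col) < n_columns)
      ++counts[static_cast<std::size_t>(col)];
  }
  return counts;
}

// Hypothetical helper: allele count over the expanded haplotypes, obtained by
// weighting each unique column's allele by that column's cardinality.
// Assumption: any allele value > 0 counts as a non-reference allele.
std::size_t weighted_allele_count(const std::vector<std::int8_t>& column_alleles,
                                  const std::vector<std::size_t>& cardinalities)
{
  std::size_t ac = 0;
  for (std::size_t i = 0; i < column_alleles.size() && i < cardinalities.size(); ++i)
  {
    if (column_alleles[i] > 0)
      ac += cardinalities[i];
  }
  return ac;
}
```

The sum of all cardinalities equals the number of mapped haplotypes, so the weighted count never needs the expanded genotype vector.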
void unique_haplotype_block::clear | ( | ) |
Clears all data stored in the haplotype block.
This method resets the internal state of the haplotype block by removing all stored variants, haplotype mappings, and cardinality counts.
Specifically:
variants_ is cleared, removing all variant records.
unique_map_ is cleared, removing all haplotype mapping indices.
cardinalities_ is cleared, resetting the frequency counts of haplotypes.
After calling this method, the haplotype block is in an empty state and must be repopulated (e.g., via deserialize()).
bool unique_haplotype_block::compress_variant | ( | const reference_site_info & | site_info, |
const std::vector< std::int8_t > & | alleles ) |
Compress and map haplotype alleles for a new variant into the block.
This function updates the block of unique haplotypes by incorporating a new variant, compressing redundant haplotypes, and maintaining a mapping from input alleles to unique haplotype indices.
The first call initializes the block with the given variant. Subsequent calls attempt to map the alleles to existing haplotype columns or create new ones if necessary.
site_info | Reference site metadata (chromosome, position, ID, alleles, error, recombination rate, cM). |
alleles | Vector of observed alleles (per haplotype). |
unique_map_ is built from haplotype indices to allele indices.
cardinalities_ tracks the number of haplotypes assigned per column.
[...] unique_map_, otherwise the function fails.
[...] unique_map_.
savvy::typed_value::is_end_of_vector() is used to mark missing/invalid alleles.
cardinalities_ is updated to reflect haplotype counts per column.
[...] assert() checks to verify internal consistency.
[...] false if: [...]
Example use:
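The compression idea described above can be sketched in isolation. The following is a minimal standalone illustration, not the Minimac4 implementation: toy_block and add_variant are hypothetical names, the column-splitting strategy is an assumption, and missing alleles and the real reference_site_info type are left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a unique haplotype block (not the Minimac4 class).
struct toy_block
{
  std::vector<std::vector<std::int8_t>> rows;  // rows[v][c]: allele of unique column c at variant v
  std::vector<std::size_t> unique_map;         // haplotype index -> unique column
  std::vector<std::size_t> cardinalities;      // unique column -> number of haplotypes

  // Adds one variant. Returns false if the allele vector is empty or its
  // size disagrees with the haplotype map built by earlier calls.
  bool add_variant(const std::vector<std::int8_t>& alleles)
  {
    if (alleles.empty())
      return false;
    if (unique_map.empty())
    {
      // First call: every haplotype starts in a single shared column.
      unique_map.assign(alleles.size(), 0);
      cardinalities.assign(1, alleles.size());
    }
    if (alleles.size() != unique_map.size())
      return false;

    const std::size_t old_cols = cardinalities.size();
    std::vector<std::int8_t> new_row(old_cols, 0);
    std::vector<bool> claimed(old_cols, false);
    // (old column, allele) -> column that carries this combination
    std::map<std::pair<std::size_t, std::int8_t>, std::size_t> target;

    for (std::size_t h = 0; h < alleles.size(); ++h)
    {
      const std::size_t c = unique_map[h];
      const auto key = std::make_pair(c, alleles[h]);
      auto it = target.find(key);
      if (it == target.end())
      {
        std::size_t dest = c;
        if (claimed[c])
        {
          // Column c already holds a different allele: split off a new
          // column that copies c's history at all earlier variants.
          dest = cardinalities.size();
          for (auto& r : rows)
          {
            const std::int8_t previous = r[c];
            r.push_back(previous);
          }
          new_row.push_back(alleles[h]);
          cardinalities.push_back(0);
        }
        else
        {
          // First allele seen for column c stays in column c.
          claimed[c] = true;
          new_row[c] = alleles[h];
        }
        it = target.emplace(key, dest).first;
      }
      if (it->second != c)
      {
        --cardinalities[c];
        ++cardinalities[it->second];
        unique_map[h] = it->second;
      }
    }
    rows.push_back(new_row);
    return true;
  }
};
```

With four haplotypes, adding alleles {0, 1, 0, 1} and then {0, 1, 1, 1} leaves this sketch with three unique columns, because the third haplotype no longer matches either existing column.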
int unique_haplotype_block::deserialize | ( | savvy::reader & | input_file, |
savvy::variant & | var ) |
Deserializes a unique haplotype block from a SAVVY input file and variant.
This function reads haplotype block data from the given input_file using the var object. It populates the variants_, unique_map_, and cardinalities_ of the unique_haplotype_block. The function stops reading when a "<BLOCK>" alt allele is encountered or the input ends.
input_file | Reference to a savvy::reader object representing the input file. |
var | Reference to a savvy::variant object used to read variant data. |
Returns -1 on I/O errors or invalid data, 0 if input is empty or not ready, and n+1 where n is the number of variants successfully deserialized.
The gt vector of each variant is expected to match the size of cardinalities_. The allele count ac is computed using the cardinalities_. The first "<BLOCK>" allele encountered marks the end of this haplotype block.
bool unique_haplotype_block::deserialize | ( | std::istream & | is, |
int | m3vcf_version, | ||
std::size_t | n_haplotypes ) |
Deserialize a unique haplotype block from an input stream (m3vcf format).
This function reads and parses a block of data from a Minimac3 VCF (m3vcf) file into the unique_haplotype_block object. It populates internal structures such as the unique haplotype map, variant information, and cardinalities.
[in,out] | is | Input stream containing m3vcf block data. Stream state is updated depending on success or error. |
[in] | m3vcf_version | Format version of the m3vcf file (1 or 2). |
[in] | n_haplotypes | Total number of haplotypes expected in this block. |
Returns true if deserialization was successful, false otherwise.
[...] std::ios::badbit, and writes an error message to std::cerr.
VARIANTS=<N> → number of variants in the block.
REPS=<M> → number of unique haplotype representatives.
[...] (unique_map_) from genotype columns:
[...] n_haplotypes.
[...] cardinalities_ vector (frequency of each unique haplotype).
[...] n_variants subsequent lines, parsing:
[...] ERR and RECOM values).
[...] cols[8]).
[...] n_reps.
[...] ac) computed using cardinalities_.
[...] n_haplotypes.
[...] n_reps.
In any of these cases, the function:
[...] clear()).
[...] is.setstate(is.rdstate() | std::ios::badbit)).
[...] std::cerr.
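To make the VARIANTS=<N> and REPS=<M> fields concrete, here is a small sketch of pulling an integer value out of a key=value list. The helper name info_field_value is hypothetical, and the assumption that the fields are separated by semicolons is mine rather than something stated above; the actual parser in Minimac4 may be organized differently.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: finds "key=<digits>" in a semicolon-separated list and
// stores the parsed number in `out`. Returns false if the key is absent or
// the value is not a short run of decimal digits.
// Assumption: `key` itself contains no ';' character.
bool info_field_value(const std::string& info, const std::string& key, std::size_t& out)
{
  const std::string needle = key + "=";
  std::size_t start = 0;
  while (start < info.size())
  {
    std::size_t end = info.find(';', start);
    if (end == std::string::npos)
      end = info.size();
    if (info.compare(start, needle.size(), needle) == 0)
    {
      const std::string digits =
        info.substr(start + needle.size(), end - start - needle.size());
      if (digits.empty() || digits.size() > 9)  // keep the value well inside size_t
        return false;
      std::size_t value = 0;
      for (char ch : digits)
      {
        if (ch < '0' || ch > '9')
          return false;
        value = value * 10 + static_cast<std::size_t>(ch - '0');
      }
      out = value;
      return true;
    }
    start = end + 1;  // move to the next field
  }
  return false;
}
```

A caller would typically look up both keys and treat a missing or malformed value as a failed block read.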
std::size_t unique_haplotype_block::expanded_haplotype_size | ( | ) const | inline |
Get the number of original haplotypes after expansion.
This corresponds to the length of the unique_map_ vector.
void unique_haplotype_block::fill_cm | ( | genetic_map_file & | map_file | ) |
Fills the centimorgan (cM) values for all variants in the haplotype block.
This function iterates over all variants in the variants_ vector and sets each variant's cm field by interpolating its genetic position using the provided genetic_map_file.
map_file | Reference to a genetic_map_file object used for interpolation. |
void unique_haplotype_block::fill_cm_from_recom | ( | double & | start_cm | ) |
Fills missing centimorgan (cM) values for variants using recombination probabilities.
This function iterates over all variants in the variants_ vector. For any variant with a NaN cm value, it sets cm to the provided start_cm. If the variant has a valid recom value, start_cm is incremented using the recombination probability converted to cM via recombination::switch_prob_to_cm().
start_cm | Reference to a double representing the starting centimorgan value. This value will be updated as variants are processed. |
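The loop described above can be sketched with stand-in types. Everything named here is hypothetical: toy_site replaces the real variant record, fill_cm_sketch is not the Minimac4 function, and the probability-to-cM conversion is passed in as a callback because the formula behind recombination::switch_prob_to_cm() is not given in this page. Treating "valid" as "not NaN", and advancing start_cm independently of whether cm was filled, are assumptions.

```cpp
#include <cmath>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical stand-in for a variant record with only the two fields used here.
struct toy_site
{
  double recom = std::numeric_limits<double>::quiet_NaN();
  double cm = std::numeric_limits<double>::quiet_NaN();
};

// Fills NaN cm values from a running position and advances that position
// by the converted recombination probability of each site.
void fill_cm_sketch(std::vector<toy_site>& sites, double& start_cm,
                    const std::function<double(double)>& prob_to_cm)
{
  for (toy_site& s : sites)
  {
    if (std::isnan(s.cm))
      s.cm = start_cm;                   // only missing values are overwritten
    if (!std::isnan(s.recom))
      start_cm += prob_to_cm(s.recom);   // caller-supplied conversion (assumption)
  }
}
```

Because start_cm is taken by reference, the caller can pass the same variable to consecutive blocks so that positions continue from one block to the next.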
void unique_haplotype_block::pop_variant | ( | ) |
Removes the last variant from the haplotype block.
This function erases the most recently added variant in the variants_ vector.
This function only modifies the variants_ container. It does not update unique_map_ or cardinalities_, so use with caution if the block has been compressed or mapped.
void unique_haplotype_block::remove_eov | ( | ) |
Removes "end-of-vector" markers from the unique haplotype map.
This function scans the unique_map_ vector and removes any elements that represent the end-of-vector (EOV) sentinel value used to indicate missing or invalid haplotypes.
After this operation, the unique_map_ will contain only valid haplotype indices, and its size is reduced by the number of EOV entries removed.
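A minimal sketch of this kind of filtering, using the standard erase-remove idiom on a plain vector. The function name remove_sentinel is hypothetical, and the sentinel is passed in as a parameter because the concrete EOV value is defined by savvy and not stated here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: drops every entry equal to `eov` and shrinks the
// vector accordingly, keeping the remaining entries in their original order.
void remove_sentinel(std::vector<std::int64_t>& unique_map, std::int64_t eov)
{
  unique_map.erase(std::remove(unique_map.begin(), unique_map.end(), eov),
                   unique_map.end());
}
```

Note that removing entries shifts the indices of all later haplotypes, so anything that refers to haplotypes by position has to account for that.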
bool unique_haplotype_block::serialize | ( | savvy::writer & | output_file | ) |
Serializes the unique haplotype block to a SAVVY output file.
This function writes the current haplotype block, including its variants, unique haplotype mapping, and cardinalities, to the specified output_file. The first variant in the block is treated as a marker with "<BLOCK>" alt allele. Each variant stores allele counts (AC/AN), error probabilities, recombination rates, centimorgan positions, and haplotype genotypes (UHA).
output_file | Reference to a savvy::writer object representing the output file. |
Returns true if serialization succeeded, false if the block is empty or on I/O errors.
[...] output_file to align zstd compression blocks with m3vcf haplotype blocks.
void unique_haplotype_block::trim | ( | std::size_t | min_pos, |
std::size_t | max_pos ) |
Trims variants outside a specified genomic range.
This function removes variants from the haplotype block that fall outside the interval [min_pos, max_pos].
Behavior:
Variants located before min_pos or after max_pos are erased, while the remaining variants are preserved.
min_pos | The minimum genomic position (inclusive). |
max_pos | The maximum genomic position (inclusive). |
This function only modifies the variants_ container. The mappings (unique_map_) and cardinalities remain unchanged, so care must be taken when trimming after compression.
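The inclusive range filter can be illustrated on a plain vector of positions. This is a sketch under assumptions: toy_record and trim_sketch are hypothetical names, only the position field is modelled, and the real function may instead rely on the variants being sorted.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a variant record; only the position is modelled.
struct toy_record
{
  std::size_t pos;
};

// Keeps records with min_pos <= pos <= max_pos and erases everything else.
// Works on unsorted input as well, at the cost of scanning every record.
void trim_sketch(std::vector<toy_record>& records, std::size_t min_pos, std::size_t max_pos)
{
  records.erase(std::remove_if(records.begin(), records.end(),
                               [min_pos, max_pos](const toy_record& r) {
                                 return r.pos < min_pos || r.pos > max_pos;
                               }),
                records.end());
}
```

Both bounds are inclusive, so a record sitting exactly at min_pos or max_pos survives the trim.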
std::size_t unique_haplotype_block::unique_haplotype_size | ( | ) const | inline |
Get the number of unique haplotypes represented in the block.
This corresponds to the number of columns in the genotype matrix. If the block is empty, returns 0.
const std::vector< std::int64_t > & unique_haplotype_block::unique_map | ( | ) const | inline |
Get the mapping from haplotypes to unique columns.
Each entry corresponds to an original haplotype index and indicates which compressed column it maps to.
std::size_t unique_haplotype_block::variant_size | ( | ) const | inline |
Get the number of variants compressed in this block.
const std::vector< reference_variant > & unique_haplotype_block::variants | ( | ) const | inline |
Get the list of compressed reference variants in this block.