Minimac4
unique_haplotype_block Class Reference

Represents a block of unique haplotypes and their variants.

#include <unique_haplotype.hpp>

Public Member Functions

bool compress_variant (const reference_site_info &site_info, const std::vector< std::int8_t > &alleles)
 Compress and map haplotype alleles for a new variant into the block.
 
const std::vector< reference_variant > & variants () const
 Get the list of compressed reference variants in this block.
 
const std::vector< std::int64_t > & unique_map () const
 Get the mapping from haplotypes to unique columns.
 
std::size_t expanded_haplotype_size () const
 Get the number of original haplotypes after expansion.
 
std::size_t unique_haplotype_size () const
 Get the number of unique haplotypes represented in the block.
 
std::size_t variant_size () const
 Get the number of variants compressed in this block.
 
const std::vector< std::size_t > & cardinalities () const
 Get the haplotype cardinalities across unique columns.
 
void clear ()
 Clears all data stored in the haplotype block.
 
void trim (std::size_t min_pos, std::size_t max_pos)
 Trims variants outside a specified genomic range.
 
void pop_variant ()
 Removes the last variant from the haplotype block.
 
void fill_cm (genetic_map_file &map_file)
 Fills the centimorgan (cM) values for all variants in the haplotype block.
 
void fill_cm_from_recom (double &start_cm)
 Fills missing centimorgan (cM) values for variants using recombination probabilities.
 
bool deserialize (std::istream &is, int m3vcf_version, std::size_t n_haplotypes)
 Deserialize a unique haplotype block from an input stream (m3vcf format).
 
int deserialize (savvy::reader &input_file, savvy::variant &var)
 Deserializes a unique haplotype block from a SAVVY input file and variant.
 
bool serialize (savvy::writer &output_file)
 Serializes the unique haplotype block to a SAVVY output file.
 
void remove_eov ()
 Removes "end-of-vector" markers from the unique haplotype map.
 

Detailed Description

Represents a block of unique haplotypes and their variants.

This class stores haplotype information in a compressed form: it maps individual haplotypes to unique columns of alleles, tracks how many haplotypes map to each unique column (cardinalities), and stores variant details.

Member Function Documentation

◆ cardinalities()

const std::vector< std::size_t > & unique_haplotype_block::cardinalities ( ) const
inline

Get the haplotype cardinalities across unique columns.

Each entry represents how many haplotypes map to the corresponding unique haplotype column.

Returns
Const reference to the cardinalities vector.
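
The relationship between the haplotype map and the cardinalities can be illustrated with a small standalone sketch. The helper name count_cardinalities is hypothetical and not part of Minimac4; it assumes every map entry is a valid column index and ignores sentinel values.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Minimac4): counts how many haplotypes
// map to each unique column of a block.
std::vector<std::size_t> count_cardinalities(const std::vector<std::int64_t>& unique_map,
                                             std::size_t n_unique)
{
  std::vector<std::size_t> cardinalities(n_unique, 0);
  for (std::int64_t column : unique_map)
  {
    // Assumption: every entry is a valid column index (no sentinel values).
    assert(column >= 0 && static_cast<std::size_t>(column) < n_unique);
    ++cardinalities[static_cast<std::size_t>(column)];
  }
  return cardinalities;
}
```

Under these assumptions, the returned counts sum to the number of entries in the map, which is the expanded haplotype count.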

◆ clear()

void unique_haplotype_block::clear ( )

Clears all data stored in the haplotype block.

This method resets the internal state of the haplotype block by removing all stored variants, haplotype mappings, and cardinality counts.

Specifically:

  • variants_ is cleared, removing all variant records.
  • unique_map_ is cleared, removing all haplotype mapping indices.
  • cardinalities_ is cleared, resetting the frequency counts of haplotypes.

After calling this method, the haplotype block is in an empty state and must be repopulated (e.g., via deserialize()).


◆ compress_variant()

bool unique_haplotype_block::compress_variant ( const reference_site_info & site_info,
const std::vector< std::int8_t > & alleles )

Compress and map haplotype alleles for a new variant into the block.

This function updates the block of unique haplotypes by incorporating a new variant, compressing redundant haplotypes, and maintaining a mapping from input alleles to unique haplotype indices.

The first call initializes the block with the given variant. Subsequent calls attempt to map the alleles to existing haplotype columns or create new ones if necessary.

Parameters
site_info: Reference site metadata (chromosome, position, ID, alleles, error, recombination rate, cM).
alleles: Vector of observed alleles (one per haplotype).
Returns
true if compression and mapping succeeded, false otherwise.
  • If this is the first variant, a unique mapping is initialized:
    • Each unique allele is assigned a haplotype column.
    • A unique_map_ is built from haplotype indices to allele indices.
    • cardinalities_ tracks the number of haplotypes assigned per column.
  • For subsequent variants:
    • The alleles vector must have the same size as unique_map_; otherwise the function fails.
    • Each haplotype is mapped back to its column via unique_map_.
    • If the allele matches the expected value, it is stored directly.
    • If it mismatches:
      • Check if it matches a newly created column.
      • If not, a new haplotype column is created and propagated across all variants.
  • Special handling:
    • savvy::typed_value::is_end_of_vector() is used to detect alleles flagged as end-of-vector (missing/invalid).
    • Ploidy mismatches are detected and reported as errors.
    • cardinalities_ is updated to reflect haplotype counts per column.
Note
  • Ensures that the total haplotype count matches across all variants.
  • Uses assert() checks to verify internal consistency.
  • Returns false if:
    • The alleles vector is empty.
    • A sample ploidy mismatch is detected.
    • The haplotype size differs from the expected mapping.

Example use:

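// Assumes `block` is a unique_haplotype_block and `site` is a populated reference_site_info.
std::vector<std::int8_t> alleles = {0, 1, 0, 1}; // one allele per haplotype
bool ok = block.compress_variant(site, alleles);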
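
The column-splitting idea described above can be sketched with a self-contained toy model. This is a simplified illustration, not the Minimac4 implementation: toy_block and its members are hypothetical names, and the sketch ignores end-of-vector handling, ploidy checks, and site metadata.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Toy model of the compression idea; all names are hypothetical.
struct toy_block
{
  std::vector<std::vector<std::int8_t>> rows; // rows[v][c]: allele of column c at variant v
  std::vector<std::int64_t> unique_map;       // haplotype index -> column index
  std::vector<std::size_t> cardinalities;     // number of haplotypes per column

  bool compress(const std::vector<std::int8_t>& alleles)
  {
    if (alleles.empty()) return false;

    if (rows.empty())
    {
      // First variant: one column per distinct allele value.
      std::map<std::int8_t, std::int64_t> column_of;
      std::vector<std::int8_t> first_row;
      for (std::int8_t a : alleles)
      {
        auto it = column_of.find(a);
        if (it == column_of.end())
        {
          it = column_of.emplace(a, static_cast<std::int64_t>(first_row.size())).first;
          first_row.push_back(a);
          cardinalities.push_back(0);
        }
        unique_map.push_back(it->second);
        ++cardinalities[static_cast<std::size_t>(it->second)];
      }
      rows.push_back(std::move(first_row));
      return true;
    }

    // Subsequent variants must cover the same haplotypes.
    if (alleles.size() != unique_map.size()) return false;

    std::vector<std::int8_t> row(cardinalities.size(), 0);
    std::vector<bool> seen(cardinalities.size(), false);
    // (old column, allele) -> column created for that combination at this variant
    std::map<std::pair<std::int64_t, std::int8_t>, std::int64_t> split_of;
    for (std::size_t i = 0; i < alleles.size(); ++i)
    {
      const std::size_t c = static_cast<std::size_t>(unique_map[i]);
      const std::int8_t a = alleles[i];
      if (!seen[c]) { seen[c] = true; row[c] = a; continue; } // first haplotype fixes the column's allele
      if (row[c] == a) continue;                              // allele matches the expected value

      // Mismatch: reuse a column already split off for (c, a), or create one.
      const auto key = std::make_pair(unique_map[i], a);
      auto it = split_of.find(key);
      if (it == split_of.end())
      {
        const auto fresh = static_cast<std::int64_t>(cardinalities.size());
        // Propagate the new column across all earlier variants by copying column c.
        for (auto& r : rows) { const std::int8_t past = r[c]; r.push_back(past); }
        row.push_back(a);
        cardinalities.push_back(0);
        it = split_of.emplace(key, fresh).first;
      }
      --cardinalities[c];
      ++cardinalities[static_cast<std::size_t>(it->second)];
      unique_map[i] = it->second;
    }
    rows.push_back(std::move(row));
    return true;
  }
};
```

In this toy model, compressing {0, 1, 0, 1} and then {0, 1, 1, 1} leaves three columns: the third haplotype no longer agrees with column 0 at the second variant, so it is moved to a new column that copies column 0's earlier alleles.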

◆ deserialize() [1/2]

int unique_haplotype_block::deserialize ( savvy::reader & input_file,
savvy::variant & var )

Deserializes a unique haplotype block from a SAVVY input file and variant.

This function reads haplotype block data from the given input_file using the var object. It populates the variants_, unique_map_, and cardinalities_ of the unique_haplotype_block. The function stops reading when a "<BLOCK>" alt allele is encountered or the input ends.

Parameters
input_file: Reference to a savvy::reader object representing the input file.
var: Reference to a savvy::variant object used to read variant data.
Returns
One of the following values:
  • -1 on I/O errors or invalid data,
  • 0 if input is empty or not ready,
  • n+1 where n is the number of variants successfully deserialized.
Note
The gt vector of each variant is expected to match the size of cardinalities_. The allele count ac is computed using the cardinalities_. The first "<BLOCK>" allele encountered marks the end of this haplotype block.
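
Two details above lend themselves to a short sketch: deriving an allele count from per-column genotypes and cardinalities, and decoding the documented return value. Both helpers are hypothetical and standalone; the allele count assumes a biallelic site where any positive genotype value denotes the alternate allele, which the source does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: sums the cardinalities of all unique columns that
// carry a non-reference allele (assumption: alt alleles are coded as > 0).
std::size_t allele_count(const std::vector<std::int8_t>& gt,
                         const std::vector<std::size_t>& cardinalities)
{
  assert(gt.size() == cardinalities.size()); // one genotype entry per unique column
  std::size_t ac = 0;
  for (std::size_t i = 0; i < gt.size(); ++i)
    if (gt[i] > 0) ac += cardinalities[i];
  return ac;
}

// Hypothetical helper: decodes the documented return value
// (-1 error, 0 no input, n + 1 otherwise). Returns true and sets
// n_variants only when variants were read.
bool decode_deserialize_result(int ret, std::size_t& n_variants)
{
  if (ret <= 0) return false;
  n_variants = static_cast<std::size_t>(ret - 1);
  return true;
}
```

For example, with genotypes {0, 1, 1} and cardinalities {3, 2, 1}, the allele count is 2 + 1 = 3 out of 6 haplotypes.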

◆ deserialize() [2/2]

bool unique_haplotype_block::deserialize ( std::istream & is,
int m3vcf_version,
std::size_t n_haplotypes )

Deserialize a unique haplotype block from an input stream (m3vcf format).

This function reads and parses a block of data from an m3vcf file into the unique_haplotype_block object. It populates internal structures such as the unique haplotype map, variant information, and cardinalities.

Parameters
[in,out] is: Input stream containing m3vcf block data. Stream state is updated depending on success or error.
[in] m3vcf_version: Format version of the m3vcf file (1 or 2).
[in] n_haplotypes: Total number of haplotypes expected in this block.
Returns
true if deserialization was successful, false otherwise.
Note
On failure, the method clears all internal data structures, marks the stream with std::ios::badbit, and writes an error message to std::cerr.

Parsing Steps

  1. Read the header line of the block, extract metadata:
    • VARIANTS=<N> → number of variants in the block.
    • REPS=<M> → number of unique haplotype representatives.
  2. Parse haplotype indices (unique_map_) from genotype columns:
    • For version 1: one haplotype index per column.
    • For version 2: paired haplotype indices separated by '|'.
  3. Validate that the number of parsed haplotypes matches n_haplotypes.
  4. Build the cardinalities_ vector (frequency of each unique haplotype).
  5. Read n_variants subsequent lines, parsing:
    • Chromosome, position, ID, reference/alternate alleles.
    • INFO fields (extracts ERR and RECOM values).
    • Genotypes:
      • For version 2: encoded as run-length style offsets (cols[8]).
      • For version 1: raw 0/1 vector of length n_reps.
    • Allele counts (ac) computed using cardinalities_.

Error Conditions

  • Invalid format (wrong number of columns, invalid integer conversions, etc.).
  • Number of haplotypes read does not match n_haplotypes.
  • Genotype column inconsistent with expected n_reps.
  • Truncated input (fewer than expected variant lines).

In any of these cases, the function:

  • Clears internal state (clear()).
  • Marks the stream as bad (is.setstate(is.rdstate() | std::ios::badbit)).
  • Prints an error message to std::cerr.
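
Step 1 and the error protocol can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual parser: parse_block_info and read_block_info are hypothetical names, and the sketch assumes the metadata arrives as a ';'-separated string such as "B1;VARIANTS=5;REPS=3" rather than a full tab-delimited m3vcf line.

```cpp
#include <cstddef>
#include <exception>
#include <iostream>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: extracts VARIANTS=<N> and REPS=<M> from a
// ';'-separated info string. Returns false if either key is missing
// or its value is not a number.
bool parse_block_info(const std::string& info, std::size_t& n_variants, std::size_t& n_reps)
{
  bool have_variants = false;
  bool have_reps = false;
  std::istringstream fields(info);
  std::string field;
  while (std::getline(fields, field, ';'))
  {
    try
    {
      if (field.compare(0, 9, "VARIANTS=") == 0)
      {
        n_variants = std::stoull(field.substr(9));
        have_variants = true;
      }
      else if (field.compare(0, 5, "REPS=") == 0)
      {
        n_reps = std::stoull(field.substr(5));
        have_reps = true;
      }
    }
    catch (const std::exception&)
    {
      return false; // invalid integer conversion
    }
  }
  return have_variants && have_reps;
}

// Hypothetical wrapper mirroring the documented failure behavior:
// mark the stream as bad and report the problem on std::cerr.
bool read_block_info(std::istream& is, std::size_t& n_variants, std::size_t& n_reps)
{
  std::string line;
  if (!std::getline(is, line) || !parse_block_info(line, n_variants, n_reps))
  {
    is.setstate(std::ios::badbit); // setstate() ORs the flag into the current state
    std::cerr << "Error: invalid block header\n";
    return false;
  }
  return true;
}
```

Note that std::istream::setstate() already combines its argument with the current state, so passing std::ios::badbit alone has the same effect as the expression shown in the list above.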

◆ expanded_haplotype_size()

std::size_t unique_haplotype_block::expanded_haplotype_size ( ) const
inline

Get the number of original haplotypes after expansion.

This corresponds to the length of the unique_map_ vector.

Returns
Expanded haplotype count.

◆ fill_cm()

void unique_haplotype_block::fill_cm ( genetic_map_file & map_file)

Fills the centimorgan (cM) values for all variants in the haplotype block.

This function iterates over all variants in the variants_ vector and sets each variant's cm field by interpolating its genetic position using the provided genetic_map_file.

Parameters
map_file: Reference to a genetic_map_file object used for interpolation.
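
As an illustration of what interpolating a genetic position can look like, the sketch below applies linear interpolation to an in-memory table. It does not use genetic_map_file; map_table and interpolate_cm are hypothetical, and clamping positions outside the table to the nearest record is an assumption rather than documented behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a genetic map: (base-pair position, cM) records
// sorted by strictly increasing position.
using map_table = std::vector<std::pair<std::size_t, double>>;

double interpolate_cm(const map_table& table, std::size_t pos)
{
  if (table.empty()) return 0.0;
  // Assumption: positions outside the table are clamped to the nearest record.
  if (pos <= table.front().first) return table.front().second;
  if (pos >= table.back().first) return table.back().second;

  // First record whose position is strictly greater than pos.
  auto hi = std::upper_bound(table.begin(), table.end(), pos,
    [](std::size_t p, const std::pair<std::size_t, double>& rec) { return p < rec.first; });
  auto lo = hi - 1;
  assert(hi->first > lo->first);

  const double frac = static_cast<double>(pos - lo->first)
                    / static_cast<double>(hi->first - lo->first);
  return lo->second + frac * (hi->second - lo->second);
}
```

With records (100, 0.0), (200, 1.0), and (400, 2.0), position 150 lies halfway between the first two records and position 300 lies halfway between the last two.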

◆ fill_cm_from_recom()

void unique_haplotype_block::fill_cm_from_recom ( double & start_cm)

Fills missing centimorgan (cM) values for variants using recombination probabilities.

This function iterates over all variants in the variants_ vector. For any variant with a NaN cm value, it sets cm to the provided start_cm. If the variant has a valid recom value, start_cm is incremented using the recombination probability converted to cM via recombination::switch_prob_to_cm().

Parameters
start_cm: Reference to a double representing the starting centimorgan value. This value will be updated as variants are processed.
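
One reading of the behavior described above is sketched below with standalone types. Everything here is hypothetical: toy_variant stands in for the real variant type, and toy_switch_prob_to_cm is an assumed conversion that may differ from the formula used by recombination::switch_prob_to_cm(). The sketch also assumes start_cm advances for every variant with a valid recom value, whether or not its cm was missing.

```cpp
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical stand-in for a variant record.
struct toy_variant
{
  double recom; // recombination probability to the next variant (NaN if unknown)
  double cm;    // genetic position in cM (NaN if missing)
};

// Assumed conversion for illustration only; the real
// recombination::switch_prob_to_cm() may use a different formula.
double toy_switch_prob_to_cm(double prob)
{
  return -100.0 * std::log(1.0 - prob);
}

void toy_fill_cm_from_recom(std::vector<toy_variant>& variants, double& start_cm)
{
  for (toy_variant& v : variants)
  {
    if (std::isnan(v.cm))
      v.cm = start_cm; // fill a missing position with the running value
    if (!std::isnan(v.recom))
      start_cm += toy_switch_prob_to_cm(v.recom); // advance the running value
  }
}
```

Because start_cm is taken by reference, the running position carries over from one block to the next when the function is called on consecutive blocks.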

◆ pop_variant()

void unique_haplotype_block::pop_variant ( )

Removes the last variant from the haplotype block.

This function erases the most recently added variant in the variants_ vector.

Note
This only modifies the variants_ container. It does not update unique_map_ or cardinalities_, so use with caution if the block has been compressed or mapped.

◆ remove_eov()

void unique_haplotype_block::remove_eov ( )

Removes "end-of-vector" markers from the unique haplotype map.

This function scans the unique_map_ vector and removes any elements that represent the end-of-vector (EOV) sentinel value used to indicate missing or invalid haplotypes.

After this operation, the unique_map_ will contain only valid haplotype indices, and its size is reduced by the number of EOV entries removed.
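
The removal described above corresponds to the standard erase-remove idiom. The sketch below is standalone; toy_eov is a hypothetical sentinel chosen for illustration, whereas the actual end-of-vector value is defined by the savvy library.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical sentinel for illustration; the real EOV value comes from savvy.
constexpr std::int64_t toy_eov = std::numeric_limits<std::int64_t>::min() + 1;

// Removes every sentinel entry while preserving the order of the others.
void toy_remove_eov(std::vector<std::int64_t>& unique_map)
{
  unique_map.erase(std::remove(unique_map.begin(), unique_map.end(), toy_eov),
                   unique_map.end());
}
```

The relative order of the remaining indices is preserved, so each surviving haplotype keeps its mapping to the same unique column.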


◆ serialize()

bool unique_haplotype_block::serialize ( savvy::writer & output_file)

Serializes the unique haplotype block to a SAVVY output file.

This function writes the current haplotype block, including its variants, unique haplotype mapping, and cardinalities, to the specified output_file. The first variant in the block is treated as a marker with "<BLOCK>" alt allele. Each variant stores allele counts (AC/AN), error probabilities, recombination rates, centimorgan positions, and haplotype genotypes (UHA).

Parameters
output_file: Reference to a savvy::writer object representing the output file.
Returns
true if serialization succeeded; false if the block is empty or an I/O error occurred.
Note
The function sets the block size in output_file to align zstd compression blocks with m3vcf haplotype blocks.

◆ trim()

void unique_haplotype_block::trim ( std::size_t min_pos,
std::size_t max_pos )

Trims variants outside a specified genomic range.

This function removes variants from the haplotype block that fall outside the interval [min_pos, max_pos].

Behavior:

  • If all variants are outside the range, the block is cleared entirely.
  • Otherwise, variants before min_pos or after max_pos are erased, while the remaining variants are preserved.
Parameters
min_pos: The minimum genomic position (inclusive).
max_pos: The maximum genomic position (inclusive).
Note
This function only modifies the variants_ container. The mappings (unique_map_) and cardinalities remain unchanged, so care must be taken when trimming after compression.
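
The inclusive-range behavior can be sketched on a plain vector. The names toy_site and toy_trim are hypothetical, and the sketch does not assume the variants are sorted by position, which the real implementation may rely on.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a variant record; only the position matters here.
struct toy_site
{
  std::size_t pos;
};

// Keeps only the sites whose position lies in [min_pos, max_pos].
void toy_trim(std::vector<toy_site>& variants, std::size_t min_pos, std::size_t max_pos)
{
  assert(min_pos <= max_pos);
  variants.erase(
    std::remove_if(variants.begin(), variants.end(),
                   [=](const toy_site& v) { return v.pos < min_pos || v.pos > max_pos; }),
    variants.end());
}
```

Trimming positions {10, 20, 30, 40} to [20, 30] keeps the two middle sites, and a range that contains no site leaves the vector empty.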

◆ unique_haplotype_size()

std::size_t unique_haplotype_block::unique_haplotype_size ( ) const
inline

Get the number of unique haplotypes represented in the block.

This corresponds to the number of columns in the genotype matrix. If the block is empty, returns 0.

Returns
Unique haplotype count.

◆ unique_map()

const std::vector< std::int64_t > & unique_haplotype_block::unique_map ( ) const
inline

Get the mapping from haplotypes to unique columns.

Each entry corresponds to an original haplotype index and indicates which compressed column it maps to.

Returns
Const reference to the unique haplotype mapping vector.
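
To show how such a mapping is typically consumed, the sketch below expands one variant's per-column alleles back to per-haplotype alleles. The helper expand_row is hypothetical and assumes every map entry is a valid column index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: looks up, for each original haplotype, the allele
// stored in the unique column it maps to.
std::vector<std::int8_t> expand_row(const std::vector<std::int8_t>& column_alleles,
                                    const std::vector<std::int64_t>& unique_map)
{
  std::vector<std::int8_t> expanded;
  expanded.reserve(unique_map.size());
  for (std::int64_t column : unique_map)
  {
    // Assumption: no sentinel entries remain in the map.
    assert(column >= 0 && static_cast<std::size_t>(column) < column_alleles.size());
    expanded.push_back(column_alleles[static_cast<std::size_t>(column)]);
  }
  return expanded;
}
```

With column alleles {0, 1, 1} and a map of {0, 1, 2, 1}, the expanded row is {0, 1, 1, 1}: four haplotypes are represented by three stored columns.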

◆ variant_size()

std::size_t unique_haplotype_block::variant_size ( ) const
inline

Get the number of variants compressed in this block.

Returns
Variant count.

◆ variants()

const std::vector< reference_variant > & unique_haplotype_block::variants ( ) const
inline

Get the list of compressed reference variants in this block.

Returns
Const reference to the vector of reference variants.

The documentation for this class was generated from the following files: