
**EXPERIMENTAL** FlowPairHMMAlignReadsToHaplotypes

Produces a read x haplotype matrix with the likelihood of each read given each haplotype

Category Flow Based Tools

Traversal ReadWalker


Overview

A tool that aligns a list of reads to a list of haplotypes. The alignment score is calculated on the assumption that each read was generated from one of the haplotypes and differs from it only by sequencing errors. The alignment score is therefore exactly the likelihood of the read given the haplotype, as calculated in HaplotypeCaller.
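
To make the "likelihood of the read given the haplotype" idea concrete, here is a deliberately simplified Python sketch. It is not the tool's algorithm: it assumes an ungapped placement, substitution errors only, and per-base Phred qualities above zero, whereas the tool uses PairHMM or flow-based models that also handle indels. The function names are hypothetical.

```python
import math
from typing import List


def phred_to_error_prob(q: int) -> float:
    """Convert a Phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)


def ungapped_log10_likelihood(read: str, quals: List[int],
                              haplotype: str, offset: int = 0) -> float:
    """log10 P(read | haplotype) for an ungapped placement at `offset`.

    Toy model (assumption): substitutions only, and an error is equally
    likely to produce any of the three other bases. Qualities must be > 0.
    """
    if len(read) != len(quals):
        raise ValueError("read and quals must have the same length")
    if offset < 0 or offset + len(read) > len(haplotype):
        raise ValueError("read does not fit on the haplotype at this offset")
    total = 0.0
    for base, q, hap_base in zip(read, quals, haplotype[offset:]):
        e = phred_to_error_prob(q)
        # A match costs log10(1 - e); a mismatch costs log10(e / 3).
        total += math.log10(1 - e) if base == hap_base else math.log10(e / 3)
    return total
```

Under this toy model a read scores higher against the haplotype it matches exactly than against one that differs by a base, which is the property the read x haplotype matrix records.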

Input

Output

Because the tool was designed for aligning flow-based reads, it currently supports two alignment engines, selected with --aligner: FlowBasedHMM (the flow pair-HMM) and FlowBased (flow-based alignment, FBA). Other engines can be added. Two output formats are available, expanded and concise, selected with the --concise-output-format flag (the default is the expanded format). The expanded format contains a read x haplotype matrix that shows the alignment score of each read versus each haplotype. The concise format contains the following columns for each processed read: the likelihood score, the best haplotype, the second-best haplotype, and the difference in alignment scores between the best and the second-best haplotype. In addition, because in many cases most reads come from the "reference" haplotype, the tool can also output the score difference from the (marked) reference haplotype.
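
The relationship between the two formats can be sketched in Python: a concise row is a summary of one row of the expanded read x haplotype matrix. This is an illustration under assumptions, not the tool's code. The dictionary keys are hypothetical and do not reproduce the tool's actual column names.

```python
from typing import Dict, Optional


def concise_row(scores: Dict[str, float],
                ref_name: Optional[str] = None) -> Dict[str, object]:
    """Summarize one read's log-likelihoods against all haplotypes.

    `scores` maps haplotype name -> log-likelihood (higher is better).
    Key names in the returned dict are hypothetical.
    """
    if len(scores) < 2:
        raise ValueError("need at least two haplotypes to rank")
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)
    best, second = ranked[0], ranked[1]
    row = {
        "likelihood": scores[best],
        "best_haplotype": best,
        "second_best_haplotype": second,
        "diff_best_second": scores[best] - scores[second],
    }
    if ref_name is not None:
        # Score difference between the best haplotype and the marked reference.
        row["diff_best_ref"] = scores[best] - scores[ref_name]
    return row
```

For example, `concise_row({"hapA": -1.0, "hapB": -3.5, "ref": -2.0}, ref_name="ref")` picks `hapA` as the best haplotype and `ref` as the second best.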

Usage examples

    gatk FlowPairHMMAlignReadsToHaplotypes \
        -H ~{haplotype_list} -O ~{base_file_name}.matches.tsv \
        -I ~{input_bam} --flow-use-t0-tag -E FlowBased \
        --flow-fill-empty-bins-value 0.00001 --flow-probability-threshold 0.00001 \
        --flow-likelihood-optimized-comp
 

Additional Information

Read filters

This Read Filter is automatically applied to the data by the Engine before processing by FlowPairHMMAlignReadsToHaplotypes.

FlowPairHMMAlignReadsToHaplotypes specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Required Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --haplotypes, -H | | Fasta file with haplotypes |
| --input, -I | | BAM/SAM/CRAM file containing reads |
| --output, -O | | Read x haplotype log-likelihood matrix |

Optional Tool Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --aligner, -E | FlowBased | Aligner: FlowBasedHMM or FlowBasedAligner (FlowBased) |
| --arguments_file | | read one or more arguments files and add them to the command line |
| --base-quality-score-threshold | 18 | Base qualities below this threshold will be reduced to the minimum (6) |
| --cloud-index-prefetch-buffer, -CIPB | -1 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset. |
| --cloud-prefetch-buffer, -CPB | 40 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). |
| --concise-output-format | false | concise or expanded output format: expanded - output full read x haplotype, concise - output for each read best haplotype and score differences from the next best and the reference haplotype, default: false (expanded format) |
| --disable-bam-index-caching, -DBIC | false | If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified. |
| --disable-sequence-dictionary-validation | false | If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk! |
| --dont-use-dragstr-pair-hmm-scores | false | disable DRAGstr pair-hmm score even when dragstr-params-path was provided |
| --dragstr-het-hom-ratio | 2 | het to hom prior ratio used when DRAGstr is on |
| --dragstr-params-path | | location of the DRAGstr model parameters for STR error correction used in the Pair HMM. When provided, it overrides other PCR error correcting mechanisms |
| --enable-dynamic-read-disqualification-for-genotyping | false | Will enable less strict read disqualification for low base quality reads |
| --gcs-max-retries, -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection |
| --gcs-project-for-requester-pays | | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed. |
| --help, -h | false | display the help message |
| --interval-merging-rule, -imr | ALL | Interval merging rule for abutting intervals |
| --intervals, -L | | One or more genomic intervals over which to operate |
| --native-pair-hmm-threads | 4 | How many threads should a native pairHMM implementation use |
| --native-pair-hmm-use-double-precision | false | use double precision in the native pairHmm. This is slower but matches the java implementation better |
| --ref-haplotype | | Fasta file with haplotypes |
| --reference, -R | | Reference sequence |
| --sites-only-vcf-output | false | If true, don't emit genotype fields when writing vcf file output. |
| --version | false | display the version number for this tool |

Optional Common Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --add-output-sam-program-record | true | If true, adds a PG tag to created SAM/BAM/CRAM files. |
| --add-output-vcf-command-line | true | If true, adds a command line header line to created VCF files. |
| --create-output-bam-index, -OBI | true | If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file. |
| --create-output-bam-md5, -OBM | false | If true, create an MD5 digest for any BAM/SAM/CRAM file created |
| --create-output-variant-index, -OVI | true | If true, create a VCF index when writing a coordinate-sorted VCF file. |
| --create-output-variant-md5, -OVM | false | If true, create an MD5 digest for any VCF file created. |
| --disable-read-filter, -DF | | Read filters to be disabled before analysis |
| --disable-tool-default-read-filters | false | Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on) |
| --exclude-intervals, -XL | | One or more genomic intervals to exclude from processing |
| --gatk-config-file | | A configuration file to use with the GATK. |
| --interval-exclusion-padding, -ixp | 0 | Amount of padding (in bp) to add to each interval you are excluding. |
| --interval-padding, -ip | 0 | Amount of padding (in bp) to add to each interval you are including. |
| --interval-set-rule, -isr | UNION | Set merging approach to use for combining interval inputs |
| --inverted-read-filter, -XRF | | Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters). |
| --lenient, -LE | false | Lenient processing of VCF files |
| --max-variants-per-shard | 0 | If non-zero, partitions VCF output into shards, each containing up to the given number of records. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --read-filter, -RF | | Read filters to be applied before analysis |
| --read-index | | Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically. |
| --read-validation-stringency, -VS | SILENT | Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. |
| --seconds-between-progress-updates | 10.0 | Output traversal statistics every time this many seconds elapse |
| --sequence-dictionary | | Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file. |
| --tmp-dir | | Temp directory to use. |
| --use-jdk-deflater, -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater) |
| --use-jdk-inflater, -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater) |
| --verbosity | INFO | Control verbosity of logging. |

Advanced Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --disable-cap-base-qualities-to-map-quality | false | If true, this disables capping of base qualities in the HMM to the mapping quality of the read |
| --disable-symmetric-hmm-normalizing | false | Toggle to revive legacy behavior of asymmetrically normalizing the arguments to the reference haplotype |
| --expected-mismatch-rate-for-read-disqualification | 0.02 | Error rate used to set expectation for post HMM read disqualification based on mismatches |
| --flow-disallow-probs-larger-than-call | false | Cap probabilities of error to 1 relative to base call |
| --flow-fill-empty-bins-value | 0.001 | Value to fill the zeros of the matrix with |
| --flow-lump-probs | false | Should all probabilities of insertion or deletion in the flow be combined together |
| --flow-matrix-mods | | Modifications instructions to the read flow matrix. Format is src,dst{,src,dst}+. Example: 10,12,11,12 - these instructions will copy element 10 into 11 and 12 |
| --flow-probability-scaling-factor | 10 | probability scaling factor (phred=10) for probability quantization |
| --flow-quantization-bins | 121 | Number of bins for probability quantization |
| --flow-remove-non-single-base-pair-indels | false | Should the probabilities of more than 1 indel be used |
| --flow-remove-one-zero-probs | false | Remove probabilities of basecall of zero from non-zero genome |
| --flow-report-insertion-or-deletion | false | Report either insertion or deletion probability, not both |
| --flow-retain-max-n-probs-base-format | false | Keep only hmer/2 probabilities (like in base format) |
| --flow-symmetric-indel-probs | false | Should indel probabilities be symmetric in flow |
| --flow-use-t0-tag | false | Use t0 tag if it exists in the read to create flow matrix |
| --keep-boundary-flows | false | prevent spreading of boundary flows. |
| --likelihood-calculation-engine | PairHMM | What likelihood calculation engine to use to calculate the relative likelihood of reads vs haplotypes |
| --pair-hmm-gap-continuation-penalty | 10 | Flat gap continuation penalty for use in the Pair HMM |
| --pair-hmm-implementation, -pairHMM | FASTEST_AVAILABLE | The PairHMM implementation to use for genotype likelihood calculations |
| --pair-hmm-results-file | | File to write exact pairHMM inputs/outputs to for debugging purposes |
| --pcr-indel-model | CONSERVATIVE | The PCR indel model to use |
| --phred-scaled-global-read-mismapping-rate | 45 | The global assumed mismapping rate for reads |
| --showHidden | false | display hidden arguments |

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments).


--add-output-sam-program-record / -add-output-sam-program-record

If true, adds a PG tag to created SAM/BAM/CRAM files.

boolean  true


--add-output-vcf-command-line / -add-output-vcf-command-line

If true, adds a command line header line to created VCF files.

boolean  true


--aligner / -E

Aligner: FlowBasedHMM or FlowBasedAligner (FlowBased)

The --aligner argument is an enumerated type (Implementation), which can have one of the following values:

PairHMM
Classic full pair-hmm all haplotypes vs all reads.
FlowBased
FlowBasedHMM

Implementation  FlowBased


--arguments_file

read one or more arguments files and add them to the command line

List[File]  []


--base-quality-score-threshold

Base qualities below this threshold will be reduced to the minimum (6)
Bases with a quality below this threshold will be reduced to the minimum usable quality score (6).

byte  18  [ [ -∞  ∞ ] ]
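
As a minimal Python sketch of the rule described above (an illustration, not GATK's implementation), qualities strictly below the threshold are replaced by the minimum usable score of 6 and all others are left unchanged. The function and constant names are hypothetical.

```python
from typing import List

MIN_USABLE_QUAL = 6  # minimum usable quality score stated in the argument description


def apply_base_quality_threshold(quals: List[int], threshold: int = 18,
                                 minimum: int = MIN_USABLE_QUAL) -> List[int]:
    """Reduce every base quality strictly below `threshold` to `minimum`."""
    return [q if q >= threshold else minimum for q in quals]
```

With the default threshold of 18, the qualities `[30, 17, 18, 2]` become `[30, 6, 18, 6]`.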


--cloud-index-prefetch-buffer / -CIPB

Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.

int  -1  [ [ -∞  ∞ ] ]


--cloud-prefetch-buffer / -CPB

Size of the cloud-only prefetch buffer (in MB; 0 to disable).

int  40  [ [ -∞  ∞ ] ]


--concise-output-format

concise or expanded output format: expanded - output full read x haplotype, concise - output for each read best haplotype and score differences from the next best and the reference haplotype, default: false (expanded format)

boolean  false


--create-output-bam-index / -OBI

If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.

boolean  true


--create-output-bam-md5 / -OBM

If true, create an MD5 digest for any BAM/SAM/CRAM file created

boolean  false


--create-output-variant-index / -OVI

If true, create a VCF index when writing a coordinate-sorted VCF file.

boolean  true


--create-output-variant-md5 / -OVM

If true, create an MD5 digest for any VCF file created.

boolean  false


--disable-bam-index-caching / -DBIC

If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.

boolean  false


--disable-cap-base-qualities-to-map-quality

If true, this disables capping of base qualities in the HMM to the mapping quality of the read

boolean  false


--disable-read-filter / -DF

Read filters to be disabled before analysis

List[String]  []


--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation

If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!

boolean  false


--disable-symmetric-hmm-normalizing

Toggle to revive legacy behavior of asymmetrically normalizing the arguments to the reference haplotype

boolean  false


--disable-tool-default-read-filters / -disable-tool-default-read-filters

Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)

boolean  false


--dont-use-dragstr-pair-hmm-scores

disable DRAGstr pair-hmm score even when dragstr-params-path was provided

boolean  false


--dragstr-het-hom-ratio

het to hom prior ratio used when DRAGstr is on

int  2  [ [ -∞  ∞ ] ]


--dragstr-params-path

location of the DRAGstr model parameters for STR error correction used in the Pair HMM. When provided, it overrides other PCR error correcting mechanisms

GATKPath  null


--enable-dynamic-read-disqualification-for-genotyping

Will enable less strict read disqualification for low base quality reads
If enabled, rather than disqualifying all reads over a threshold of minimum HMM scores, we will instead choose a less strict and less aggressive cap for disqualification based on the read length and base qualities.

boolean  false


--exclude-intervals / -XL

One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).

List[String]  []


--expected-mismatch-rate-for-read-disqualification

Error rate used to set expectation for post HMM read disqualification based on mismatches

double  0.02  [ [ -∞  ∞ ] ]


--flow-disallow-probs-larger-than-call

Cap probabilities of error to 1 relative to base call

boolean  false


--flow-fill-empty-bins-value

Value to fill the zeros of the matrix with

double  0.001  [ [ -∞  ∞ ] ]
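
A minimal Python sketch of what "filling the zeros of the matrix" could look like, assuming the matrix is a per-read table of flow probabilities stored as a list of rows. The function name is hypothetical, and whether the tool renormalizes afterwards is not stated here, so no renormalization is shown.

```python
from typing import List


def fill_empty_bins(matrix: List[List[float]],
                    fill_value: float = 0.001) -> List[List[float]]:
    """Return a copy of `matrix` with every exact zero replaced by `fill_value`."""
    return [[fill_value if p == 0 else p for p in row] for row in matrix]
```

Replacing zeros with a small positive value keeps later logarithms finite, which is a common reason for this kind of fill.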


--flow-lump-probs

Should all probabilities of insertion or deletion in the flow be combined together

boolean  false


--flow-matrix-mods

Modifications instructions to the read flow matrix. Format is src,dst{,src,dst}+. Example: 10,12,11,12 - these instructions will copy element 10 into 11 and 12

String  null


--flow-probability-scaling-factor

probability scaling factor (phred=10) for probability quantization

int  10  [ [ -∞  ∞ ] ]


--flow-quantization-bins

Number of bins for probability quantization

int  121  [ [ -∞  ∞ ] ]


--flow-remove-non-single-base-pair-indels

Should the probabilities of more than 1 indel be used

boolean  false


--flow-remove-one-zero-probs

Remove probabilities of basecall of zero from non-zero genome

boolean  false


--flow-report-insertion-or-deletion

Report either insertion or deletion probability, not both

boolean  false


--flow-retain-max-n-probs-base-format

Keep only hmer/2 probabilities (like in base format)

boolean  false


--flow-symmetric-indel-probs

Should indel probabilities be symmetric in flow

boolean  false


--flow-use-t0-tag

Use t0 tag if it exists in the read to create flow matrix

boolean  false


--gatk-config-file

A configuration file to use with the GATK.

String  null


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [ [ -∞  ∞ ] ]


--gcs-project-for-requester-pays

Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.

String  ""


--haplotypes / -H

Fasta file with haplotypes

R GATKPath  null


--help / -h

display the help message

boolean  false


--input / -I

BAM/SAM/CRAM file containing reads

R List[GATKPath]  []


--interval-exclusion-padding / -ixp

Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]


--interval-merging-rule / -imr

Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not actually overlap) into a single continuous interval. However you can change this behavior if you want them to be treated as separate intervals instead.

The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:

ALL
OVERLAPPING_ONLY

IntervalMergingRule  ALL
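
The difference between the two rules can be sketched in Python. This is an illustration under assumptions, not GATK's interval code: intervals are (contig, start, end) tuples with 1-based inclusive coordinates, contigs are sorted as plain strings, and the function name is hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def merge_intervals(intervals: List[Interval], rule: str = "ALL") -> List[Interval]:
    """Merge intervals; ALL also merges abutting ones, OVERLAPPING_ONLY does not."""
    if rule not in ("ALL", "OVERLAPPING_ONLY"):
        raise ValueError("unknown rule: " + rule)
    # Abutting intervals are exactly one base apart (e.g. 100-200 and 201-300).
    gap = 1 if rule == "ALL" else 0
    merged: List[Interval] = []
    for contig, start, end in sorted(intervals):
        if merged and merged[-1][0] == contig and start <= merged[-1][2] + gap:
            prev = merged[-1]
            merged[-1] = (contig, prev[1], max(prev[2], end))
        else:
            merged.append((contig, start, end))
    return merged
```

With `ALL`, `1:100-200` and `1:201-300` collapse into `1:100-300`; with `OVERLAPPING_ONLY` they stay separate, while truly overlapping intervals are merged under both rules.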


--interval-padding / -ip

Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]
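
The padding arithmetic in the example above can be sketched in Python. The function name is hypothetical, and clipping to the contig bounds is an assumption made for the sketch rather than something this page states.

```python
from typing import Optional, Tuple


def pad_interval(contig: str, start: int, end: int, padding: int,
                 contig_length: Optional[int] = None) -> Tuple[str, int, int]:
    """Extend a 1-based inclusive interval by `padding` bases on each side."""
    new_start = max(1, start - padding)  # assumption: never go below position 1
    new_end = end + padding
    if contig_length is not None:
        new_end = min(new_end, contig_length)  # assumption: clip to contig end
    return contig, new_start, new_end
```

For instance, `pad_interval("1", 100, 100, 20)` gives `("1", 80, 120)`, matching the `-L 1:100` example with a padding value of 20.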


--interval-set-rule / -isr

Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will always be merged using UNION). Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.

The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:

UNION
Take the union of all intervals
INTERSECTION
Take the intersection of intervals (the subset that overlaps all intervals specified)

IntervalSetRule  UNION
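
A toy Python sketch of the set logic described above, assuming a single contig and small intervals that can be expanded into sets of positions. It is only meant to show how the -L sets are combined under each rule and how the -XL set is then subtracted. The function names are hypothetical, and the case where only -XL is given is not modeled.

```python
from typing import Iterable, List, Set, Tuple

Span = Tuple[int, int]  # (start, end), 1-based inclusive, single contig


def positions(intervals: Iterable[Span]) -> Set[int]:
    """Expand intervals into the set of positions they cover."""
    covered: Set[int] = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def combine(include_args: List[List[Span]],
            exclude_args: Iterable[List[Span]] = (),
            rule: str = "UNION") -> Set[int]:
    """Combine one interval list per -L argument, then subtract all -XL lists."""
    if rule not in ("UNION", "INTERSECTION"):
        raise ValueError("unknown rule: " + rule)
    sets = [positions(arg) for arg in include_args]
    if not sets:
        return set()
    combined = set.union(*sets) if rule == "UNION" else set.intersection(*sets)
    # -XL intervals are always merged with UNION and then subtracted.
    excluded = set().union(*(positions(arg) for arg in exclude_args))
    return combined - excluded
```

With two -L arguments covering 1-10 and 5-20, `UNION` yields positions 1-20 and `INTERSECTION` yields 5-10. Adding an -XL of 8-9 removes those two positions from either result.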


--intervals / -L

One or more genomic intervals over which to operate

List[String]  []


--inverted-read-filter / -XRF

Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters).

List[String]  []


--keep-boundary-flows

prevent spreading of boundary flows.

boolean  false


--lenient / -LE

Lenient processing of VCF files

boolean  false


--likelihood-calculation-engine

What likelihood calculation engine to use to calculate the relative likelihood of reads vs haplotypes

The --likelihood-calculation-engine argument is an enumerated type (Implementation), which can have one of the following values:

PairHMM
Classic full pair-hmm all haplotypes vs all reads.
FlowBased
FlowBasedHMM

Implementation  PairHMM


--max-variants-per-shard

If non-zero, partitions VCF output into shards, each containing up to the given number of records.

int  0  [ [ 0  ∞ ] ]


--native-pair-hmm-threads

How many threads should a native pairHMM implementation use

int  4  [ [ -∞  ∞ ] ]


--native-pair-hmm-use-double-precision

use double precision in the native pairHmm. This is slower but matches the java implementation better

boolean  false


--output / -O

Read x haplotype log-likelihood matrix

R String  null


--pair-hmm-gap-continuation-penalty

Flat gap continuation penalty for use in the Pair HMM

int  10  [ [ -∞  ∞ ] ]


--pair-hmm-implementation / -pairHMM

The PairHMM implementation to use for genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.

The --pair-hmm-implementation argument is an enumerated type (Implementation), which can have one of the following values:

EXACT
ORIGINAL
LOGLESS_CACHING
AVX_LOGLESS_CACHING
AVX_LOGLESS_CACHING_OMP
FASTEST_AVAILABLE

Implementation  FASTEST_AVAILABLE


--pair-hmm-results-file

File to write exact pairHMM inputs/outputs to for debugging purposes
Argument for generating a file of all of the inputs and outputs for the pair hmm

GATKPath  null


--pcr-indel-model

The PCR indel model to use
When calculating the likelihood of variants, we can try to correct for PCR errors that cause indel artifacts. The correction is based on the reference context and acts specifically around repetitive sequences that tend to cause PCR errors. The variant likelihoods are penalized on an increasing scale as the context around a putative indel becomes more repetitive (e.g. a long homopolymer). The correction can be disabled by specifying '--pcr-indel-model NONE'; in that case the default base insertion/deletion qualities will be used (or taken from the read if generated through the BaseRecalibrator). VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE.

The --pcr-indel-model argument is an enumerated type (PCRErrorModel), which can have one of the following values:

NONE
no specialized PCR error model will be applied; if base insertion/deletion qualities are present they will be used
HOSTILE
the most aggressive model will be applied that sacrifices true positives in order to remove more false positives
AGGRESSIVE
a more aggressive model will be applied that sacrifices true positives in order to remove more false positives
CONSERVATIVE
a less aggressive model will be applied that tries to maintain a high true positive rate at the expense of allowing more false positives

PCRErrorModel  CONSERVATIVE


--phred-scaled-global-read-mismapping-rate

The global assumed mismapping rate for reads
The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5-mismatch haplotype compared to the reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against any haplotype that this (or any) read could contribute is Q30. Set this term to any negative number to turn off the global mapping rate.

int  45  [ [ -∞  ∞ ] ]
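
One way to read the capping behaviour described above is sketched below in Python. It assumes per-read log10 likelihoods and treats a Phred value Q as Q/10 log10 units, so no haplotype can fall more than that far below the read's best haplotype. This is an illustration of the idea, not GATK's code, and the function name is hypothetical.

```python
from typing import List


def cap_read_likelihoods(log10_lks: List[float],
                         phred_mismapping_rate: int = 45) -> List[float]:
    """Limit how far any haplotype's log10 likelihood may fall below the best one.

    A negative rate turns the cap off, as the argument description states.
    """
    if phred_mismapping_rate < 0 or not log10_lks:
        return list(log10_lks)
    floor = max(log10_lks) - phred_mismapping_rate / 10
    return [max(lk, floor) for lk in log10_lks]
```

With a rate of 30, a read whose likelihoods are `[-1.0, -9.0, -3.0]` is capped to `[-1.0, -4.0, -3.0]`, so it contributes at most Q30 of evidence against any single haplotype.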


--QUIET

Whether to suppress job-summary info on System.err.

Boolean  false


--read-filter / -RF

Read filters to be applied before analysis

List[String]  []


--read-index / -read-index

Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.

List[GATKPath]  []


--read-validation-stringency / -VS

Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  SILENT


--ref-haplotype

Fasta file with haplotypes

String  null


--reference / -R

Reference sequence

GATKPath  null


--seconds-between-progress-updates / -seconds-between-progress-updates

Output traversal statistics every time this many seconds elapse

double  10.0  [ [ -∞  ∞ ] ]


--sequence-dictionary / -sequence-dictionary

Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.

GATKPath  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--sites-only-vcf-output

If true, don't emit genotype fields when writing vcf file output.

boolean  false


--tmp-dir

Temp directory to use.

GATKPath  null


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version

display the version number for this tool

boolean  false



GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.