Showing tool doc from version 4.6.2.0 | The latest version is 4.6.2.0

**BETA** HaplotypeCallerSpark

HaplotypeCaller on Spark

Category Short Variant Discovery


Overview

WARNING: This tool DOES NOT match the output of HaplotypeCaller. It is still under development and should not be used for production work. It is for evaluation only; use the non-Spark HaplotypeCaller if you care about the results.

Call germline SNPs and indels via local re-assembly of haplotypes.

This is an implementation of HaplotypeCaller that uses Spark to distribute the computation. It is still at an early stage of development and does not yet support all the options of the non-Spark version. Specifically, it does not support the --dbsnp, --comp, and --bam-output options.
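Because --dbsnp, --comp, and --bam-output are unsupported here, runs that need them should fall back to the non-Spark HaplotypeCaller. A minimal sketch that only composes and prints the fallback command for inspection (file names are placeholders):

```shell
# Non-Spark fallback: HaplotypeCaller supports --dbsnp, unlike the
# Spark version. File names below are placeholders; the command is
# only printed here, not executed.
cmd="gatk HaplotypeCaller \
  -R Homo_sapiens_assembly38.fasta \
  -I input.bam \
  --dbsnp dbsnp.vcf.gz \
  -O output.vcf.gz"
echo "$cmd"
```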

Usage Example

 gatk HaplotypeCallerSpark \
 -R Homo_sapiens_assembly38.fasta \
 -I input.bam \
 -O output.vcf.gz
 
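Like other GATK Spark tools, this tool accepts Spark submission arguments after a `--` separator. The sketch below only composes and prints such a command for inspection; the master URL and executor memory are placeholder assumptions, not values from this document:

```shell
# Compose a cluster-submission command as a string so it can be
# inspected before running; nothing is submitted to a cluster here.
# spark://host:7077 and spark.executor.memory=4g are placeholders.
cmd="gatk HaplotypeCallerSpark \
  -R Homo_sapiens_assembly38.fasta \
  -I input.bam \
  -O output.vcf.gz \
  -- \
  --spark-runner SPARK \
  --spark-master spark://host:7077 \
  --conf spark.executor.memory=4g"
echo "$cmd"
```

The `--conf` entries after the separator are passed through as Spark properties on the Spark context.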

Additional Information

Read filters

These Read Filters are automatically applied to the data by the Engine before processing by HaplotypeCallerSpark.

HaplotypeCallerSpark specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) Default value Summary
Required Arguments
--input
 -I
BAM/SAM/CRAM file containing reads
--output
 -O
Single file to which variants should be written
--reference
 -R
Reference sequence file
Optional Tool Arguments
--alleles
The set of alleles to force-call regardless of evidence
--annotate-with-num-discovered-alleles
false If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site
--annotation
 -A
One or more specific annotations to add to variant calls
--annotation-group
 -G
One or more groups of annotations to apply to variant calls
--annotations-to-exclude
 -AX
One or more specific annotations to exclude from variant calls
--arguments_file
read one or more arguments files and add them to the command line
--assembly-region-padding
100 Number of additional bases of context to include around each assembly region
--bam-partition-size
0 maximum number of bytes to read from a file into each partition of reads. Setting this higher will result in fewer partitions. Note that this will not be equal to the size of the partition in memory. Defaults to 0, which uses the default split size (determined by the Hadoop input format, typically the size of one HDFS block).
--base-quality-score-threshold
18 Base qualities below this threshold will be reduced to the minimum (6)
--conf
Spark properties to set on the Spark context in the format <property>=<value>
--contamination-fraction-to-filter
 -contamination
0.0 Fraction of contamination in sequencing data (for all samples) to aggressively remove
--dbsnp
 -D
dbSNP file
--disable-sequence-dictionary-validation
false If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
--dont-use-dragstr-pair-hmm-scores
false disable DRAGstr pair-hmm score even when dragstr-params-path was provided
--dont-use-soft-clipped-bases
false Do not analyze soft clipped bases in the reads
--dragen-mode
false Single argument for enabling the bulk of DRAGEN-GATK features. NOTE: THIS WILL OVERWRITE PROVIDED ARGUMENTS (CHECK TOOL INFO TO SEE WHICH ARGUMENTS ARE SET).
--dragstr-het-hom-ratio
2 het to hom prior ratio to use with DRAGstr on
--dragstr-params-path
location of the DRAGstr model parameters for STR error correction used in the Pair HMM. When provided, it overrides other PCR error correcting mechanisms
--enable-dynamic-read-disqualification-for-genotyping
false Will enable less strict read disqualification for low base quality reads
--flow-order-for-annotations
flow order used for these annotations. [readGroup:]flowOrder
--founder-id
Samples representing the population "founders"
--gcs-max-retries
 -gcs-retries
20 If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
--genotype-assignment-method
 -gam
USE_PLS_TO_ASSIGN How we assign genotypes
--graph-output
 -graph
Write debug assembly graph information to this file
--help
 -h
false display the help message
--heterozygosity
0.001 Heterozygosity value used to compute prior probabilities for any locus. See the GATKDocs for full details on the meaning of this population genetics concept
--heterozygosity-stdev
0.01 Standard deviation of heterozygosity for SNP and indel calling.
--indel-heterozygosity
1.25E-4 Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept
--interval-merging-rule
 -imr
ALL Interval merging rule for abutting intervals
--intervals
 -L
One or more genomic intervals over which to operate
--max-assembly-region-size
300 Maximum size of an assembly region
--max-reads-per-alignment-start
50 Maximum number of reads to retain per alignment start position. Reads above this threshold will be downsampled. Set to 0 to disable.
--min-assembly-region-size
50 Minimum size of an assembly region
--min-base-quality-score
 -mbq
10 Minimum base quality required to consider a base for calling
--native-pair-hmm-threads
4 How many threads should a native pairHMM implementation use
--native-pair-hmm-use-double-precision
false use double precision in the native pairHmm. This is slower but matches the java implementation better
--num-reducers
0 For tools that shuffle data or write an output, sets the number of reducers. Defaults to 0, which gives one partition per 10MB of input.
--num-reference-samples-if-no-call
0 Number of hom-ref genotypes to infer at sites not present in a panel
--output-mode
EMIT_VARIANTS_ONLY Specifies which type of calls we should output
--output-shard-tmp-dir
when writing a bam in single-sharded mode, the directory to which temporary intermediate output shards are written; if not specified, .parts/ will be used
--pedigree
 -ped
Pedigree file for determining the population "founders"
--ploidy-regions
Interval file with column specifying desired ploidy for genotyping models. Overrides default ploidy and user-provided --ploidy argument in specific regions.
--population-callset
 -population
Callset to use in calculating genotype priors
--program-name
Name of the program running
--read-shard-padding
100 Each read shard has this many bases of extra context on each side. Read shards must have as much or more padding than assembly regions.
--read-shard-size
5000 Maximum size of each read shard, in bases. For good performance, this should be much larger than the maximum assembly region size.
--recover-dangling-heads
false (Deprecated) This argument is deprecated since version 3.3
--sample-name
 -ALIAS
Name of single sample to use from a multi-sample bam
--sample-ploidy
 -ploidy
2 Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
--sharded-output
false For tools that write an output, write the output in multiple pieces (shards)
--shuffle
false whether to use the shuffle implementation or not
--spark-master
local[*] URL of the Spark Master to submit jobs to when using the Spark pipeline runner.
--spark-verbosity
Spark verbosity. Overrides --verbosity for Spark-generated logs only. Possible values: {ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF, TRACE}
--standard-min-confidence-threshold-for-calling
 -stand-call-conf
30.0 The minimum phred-scaled confidence threshold at which variants should be called
--strict
false whether to use the strict implementation or not (defaults to the faster implementation that doesn't strictly match the walker version)
--use-new-qual-calculator
 -new-qual
true (Deprecated) Use the new AF model instead of the so-called exact model
--use-nio
false Whether to use NIO or the Hadoop filesystem (default) for reading files. (Note that the Hadoop filesystem is always used for writing files.)
--use-pdhmm
false Partially Determined HMM: an alternative to the regular assembly haplotypes, where we instead construct artificial haplotypes out of the union of the assembly and pileup alleles.
--use-posteriors-to-calculate-qual
 -gp-qual
false if available, use the genotype posterior probabilities to calculate the site QUAL
--version
false display the version number for this tool
Optional Common Arguments
--add-output-vcf-command-line
true If true, adds a command line header line to created VCF files.
--create-output-bam-index
 -OBI
true If true, create a BAM index when writing a coordinate-sorted BAM file.
--create-output-bam-splitting-index
true If true, create a BAM splitting index (SBI) when writing a coordinate-sorted BAM file.
--create-output-variant-index
 -OVI
true If true, create a VCF index when writing a coordinate-sorted VCF file.
--disable-read-filter
 -DF
Read filters to be disabled before analysis
--disable-tool-default-read-filters
false Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
--exclude-intervals
 -XL
One or more genomic intervals to exclude from processing
--gatk-config-file
A configuration file to use with the GATK.
--interval-exclusion-padding
 -ixp
0 Amount of padding (in bp) to add to each interval you are excluding.
--interval-padding
 -ip
0 Amount of padding (in bp) to add to each interval you are including.
--interval-set-rule
 -isr
UNION Set merging approach to use for combining interval inputs
--inverted-read-filter
 -XRF
Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters).
--QUIET
false Whether to suppress job-summary info on System.err.
--read-filter
 -RF
Read filters to be applied before analysis
--read-index
Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
--read-validation-stringency
 -VS
SILENT Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--splitting-index-granularity
4096 Granularity to use when writing a splitting index, one entry will be put into the index every n reads where n is this granularity value. Smaller granularity results in a larger index with more available split points.
--tmp-dir
Temp directory to use.
--use-jdk-deflater
 -jdk-deflater
false Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater
 -jdk-inflater
false Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity
INFO Control verbosity of logging.
Advanced Arguments
--active-probability-threshold
0.002 Minimum probability for a locus to be considered active.
--adaptive-pruning
false Use Mutect2's adaptive graph pruning algorithm
--adaptive-pruning-initial-error-rate
0.001 Initial base error rate estimate for adaptive pruning
--all-site-pls
false Annotate all sites with PLs
--allele-informative-reads-overlap-margin
2 Likelihood and read-based annotations will only take into consideration reads that overlap the variant or any base no further than this distance expressed in base pairs
--allow-non-unique-kmers-in-ref
false Allow graphs that have non-unique kmers in the reference
--apply-bqd
false If enabled this argument will apply the DRAGEN-GATK BaseQualityDropout model to the genotyping model for filtering sites due to Linked Error mode.
--apply-frd
false If enabled this argument will apply the DRAGEN-GATK ForeignReadDetection model to the genotyping model for filtering sites.
--bam-output
 -bamout
File to which assembled haplotypes should be written
--bam-writer-type
CALLED_HAPLOTYPES Which haplotypes should be written to the BAM
--comparison
 -comp
Comparison VCF file(s)
--contamination-fraction-per-sample-file
 -contamination-file
Tab-separated file containing fraction of contamination in sequencing data (per sample) to aggressively remove. Format should be "<sample><TAB><contamination>" (contamination is a double) per line; no header.
--debug-assembly
 -debug
false Print out verbose debug information about each assembly region
--disable-cap-base-qualities-to-map-quality
false If set, this disables capping of base qualities in the HMM to the mapping quality of the read
--disable-optimizations
false Don't skip calculations in ActiveRegions with no variants
--disable-spanning-event-genotyping
false If enabled this argument will disable inclusion of the '*' spanning event when genotyping events that overlap deletions
--disable-symmetric-hmm-normalizing
false Toggle to revive legacy behavior of asymmetrically normalizing the arguments to the reference haplotype
--disable-tool-default-annotations
false Disable all tool default annotations
--do-not-correct-overlapping-quality
false Disable overlapping base quality correction
--do-not-run-physical-phasing
false Disable physical phasing
--dont-increase-kmer-sizes-for-cycles
false Disable iterating over kmer sizes when graph cycles are detected
--dont-use-dragstr-priors
false Forfeit the use of the DRAGstr model to calculate genotype priors. This argument does not have any effect in the absence of DRAGstr model parameters (--dragstr-model-params)
--dragen-378-concordance-mode
false Single argument for enabling the bulk of DRAGEN-GATK features including new developments for concordance against DRAGEN 3.7.8. NOTE: THIS WILL OVERWRITE PROVIDED ARGUMENTS (CHECK TOOL INFO TO SEE WHICH ARGUMENTS ARE SET).
--emit-ref-confidence
 -ERC
NONE Mode for emitting reference confidence scores (For Mutect2, this is a BETA feature)
--enable-all-annotations
false Use all possible annotations (not for the faint of heart)
--expected-mismatch-rate-for-read-disqualification
0.02 Error rate used to set expectation for post HMM read disqualification based on mismatches
--floor-blocks
false Output the band lower bound for each GQ block regardless of the data it represents
--flow-assembly-collapse-partial-mode
false Collapse long flow-based hmers only up to difference in reference
--flow-disallow-probs-larger-than-call
false Cap probabilities of error to 1 relative to base call
--flow-fill-empty-bins-value
0.001 Value to fill the zeros of the matrix with
--flow-filter-alleles
false pre-filter alleles before genotyping
--flow-filter-alleles-qual-threshold
30.0 Threshold for prefiltering alleles on quality
--flow-filter-alleles-sor-threshold
3.0 Threshold for prefiltering alleles on SOR
--flow-filter-lone-alleles
false Remove also lone alleles during allele filtering
--flow-lump-probs
false Should all probabilities of insertion or deletion in the flow be combined together
--flow-matrix-mods
Modifications instructions to the read flow matrix. Format is src,dst{,src,dst}+. Example: 10,12,11,12 - these instructions will copy element 10 into 11 and 12
--flow-mode
NONE Single argument for enabling the bulk of Flow Based features. NOTE: THIS WILL OVERWRITE PROVIDED ARGUMENTS (CHECK TOOL INFO TO SEE WHICH ARGUMENTS ARE SET).
--flow-probability-scaling-factor
10 probability scaling factor (phred=10) for probability quantization
--flow-quantization-bins
121 Number of bins for probability quantization
--flow-remove-non-single-base-pair-indels
false Should the probabilities of more than 1 indel be used
--flow-remove-one-zero-probs
false Remove probabilities of basecall of zero from non-zero genome
--flow-report-insertion-or-deletion
false Report either insertion or deletion probability, not both
--flow-retain-max-n-probs-base-format
false Keep only hmer/2 probabilities (like in base format)
--flow-symmetric-indel-probs
false Should indel probabilities be symmetric in flow
--flow-use-t0-tag
false Use t0 tag if exists in the read to create flow matrix
--force-active
false If provided, all regions will be marked as active
--force-call-filtered-alleles
 -genotype-filtered-alleles
false Force-call filtered alleles included in the resource specified by --alleles
--gvcf-gq-bands
 -GQB
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99] Exclusive upper bounds for reference confidence GQ bands (must be in [1, 100] and specified in increasing order)
--indel-size-to-eliminate-in-ref-model
10 The size of an indel to check for in the reference model
--keep-boundary-flows
false prevent spreading of boundary flows.
--kmer-size
[10, 25] Kmer size to use in the read threading assembler
--likelihood-calculation-engine
PairHMM What likelihood calculation engine to use to calculate the relative likelihood of reads vs haplotypes
--linked-de-bruijn-graph
false If enabled, the Assembly Engine will construct a Linked De Bruijn graph to recover better haplotypes
--mapping-quality-threshold-for-genotyping
20 Control the threshold for discounting reads from the genotyper due to mapping quality after the active region detection and assembly steps but before genotyping. NOTE: this is in contrast to the --minimum-mapping-quality argument which filters reads from all parts of the HaplotypeCaller. If you would like to call genotypes with a different threshold both arguments must be set.
--max-alternate-alleles
6 Maximum number of alternate alleles to genotype
--max-effective-depth-adjustment-for-frd
0 Set the maximum depth to modify FRD adjustment to in the event of high depth sites (0 to disable)
--max-genotype-count
1024 Maximum number of genotypes to consider at any site
--max-mnp-distance
 -mnp-dist
0 Two or more phased substitutions separated by this distance or less are merged into MNPs.
--max-num-haplotypes-in-population
128 Maximum number of haplotypes to consider for your population
--max-prob-propagation-distance
50 Upper limit on how many bases away probability mass can be moved around when calculating the boundaries between active and inactive assembly regions
--max-unpruned-variants
100 Maximum number of variants in graph the adaptive pruner will allow
--min-dangling-branch-length
4 Minimum length of a dangling branch to attempt recovery
--min-pruning
2 Minimum support to not prune paths in the graph
--num-pruning-samples
1 Number of samples that must pass the minPruning threshold
--pair-hmm-gap-continuation-penalty
10 Flat gap continuation penalty for use in the Pair HMM
--pair-hmm-implementation
 -pairHMM
FASTEST_AVAILABLE The PairHMM implementation to use for genotype likelihood calculations
--pair-hmm-results-file
File to write exact pairHMM inputs/outputs to for debugging purposes
--pcr-indel-model
CONSERVATIVE The PCR indel model to use
--phred-scaled-global-read-mismapping-rate
45 The global assumed mismapping rate for reads
--pileup-detection
false If enabled, the variant caller will create pileup-based haplotypes in addition to the assembly-based haplotype generation.
--pruning-lod-threshold
2.302585092994046 Ln likelihood ratio threshold for adaptive pruning algorithm
--pruning-seeding-lod-threshold
9.210340371976184 Ln likelihood ratio threshold for seeding subgraph of good variation in adaptive pruning algorithm
--recover-all-dangling-branches
false Recover all dangling branches
--reference-model-deletion-quality
30 The quality of deletion in the reference model
--showHidden
false display hidden arguments
--smith-waterman
FASTEST_AVAILABLE Which Smith-Waterman implementation to use, generally FASTEST_AVAILABLE is the right choice
--smith-waterman-dangling-end-gap-extend-penalty
-6 Smith-Waterman gap-extend penalty for dangling-end recovery.
--smith-waterman-dangling-end-gap-open-penalty
-110 Smith-Waterman gap-open penalty for dangling-end recovery.
--smith-waterman-dangling-end-match-value
25 Smith-Waterman match value for dangling-end recovery.
--smith-waterman-dangling-end-mismatch-penalty
-50 Smith-Waterman mismatch penalty for dangling-end recovery.
--smith-waterman-haplotype-to-reference-gap-extend-penalty
-11 Smith-Waterman gap-extend penalty for haplotype-to-reference alignment.
--smith-waterman-haplotype-to-reference-gap-open-penalty
-260 Smith-Waterman gap-open penalty for haplotype-to-reference alignment.
--smith-waterman-haplotype-to-reference-match-value
200 Smith-Waterman match value for haplotype-to-reference alignment.
--smith-waterman-haplotype-to-reference-mismatch-penalty
-150 Smith-Waterman mismatch penalty for haplotype-to-reference alignment.
--smith-waterman-read-to-haplotype-gap-extend-penalty
-5 Smith-Waterman gap-extend penalty for read-to-haplotype alignment.
--smith-waterman-read-to-haplotype-gap-open-penalty
-30 Smith-Waterman gap-open penalty for read-to-haplotype alignment.
--smith-waterman-read-to-haplotype-match-value
10 Smith-Waterman match value for read-to-haplotype alignment.
--smith-waterman-read-to-haplotype-mismatch-penalty
-15 Smith-Waterman mismatch penalty for read-to-haplotype alignment.
--soft-clip-low-quality-ends
false If enabled will preserve low-quality read ends as softclips (used for DRAGEN-GATK BQD genotyper model)
--transform-dragen-mapping-quality
false If enabled this argument will map DRAGEN aligner aligned reads with mapping quality <=250 to scale up to MQ 50
--use-filtered-reads-for-annotations
false Use the contamination-filtered read maps for the purposes of annotating variants
--use-pdhmm-overlap-optimization
false PDHMM: An optimization to PDHMM; if set, this will skip running PDHMM haplotype determination on reads that don't overlap (within a few bases) the determined allele in each haplotype. This substantially reduces the number of read-haplotype comparisons at the expense of ignoring read realignment mapping artifacts. (Requires the '--use-pdhmm' argument)
Deprecated Arguments
--recover-dangling-heads
false (Deprecated) This argument is deprecated since version 3.3
--use-new-qual-calculator
 -new-qual
true (Deprecated) Use the new AF model instead of the so-called exact model

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--active-probability-threshold

Minimum probability for a locus to be considered active.

double  0.002  [ [ -∞  ∞ ] ]


--adaptive-pruning

Use Mutect2's adaptive graph pruning algorithm
A single edge multiplicity cutoff for pruning doesn't work in samples with variable depths, for example exomes and RNA. This parameter enables the probabilistic algorithm for pruning the assembly graph that considers the likelihood that each chain in the graph comes from real variation.

boolean  false


--adaptive-pruning-initial-error-rate

Initial base error rate estimate for adaptive pruning
Initial base error rate guess for the probabilistic adaptive pruning model. Results are not very sensitive to this parameter because it is only a starting point from which the algorithm discovers the true error rate.

double  0.001  [ [ -∞  ∞ ] ]


--add-output-vcf-command-line / -add-output-vcf-command-line

If true, adds a command line header line to created VCF files.

boolean  true


--all-site-pls

Annotate all sites with PLs
Advanced, experimental argument: if the SNP likelihood model is specified and the EMIT_ALL_ACTIVE_SITES output mode is set, setting this argument will also emit PLs at all sites. This gives a measure of reference confidence and a measure of which alt alleles are more plausible (if any).
WARNINGS:
- This feature will inflate VCF file size considerably.
- All SNP ALT alleles will be emitted with corresponding 10 PL values.
- An error will be emitted if EMIT_ALL_ACTIVE_SITES is not set, or if anything other than the diploid SNP model is used.

boolean  false


--allele-informative-reads-overlap-margin

Likelihood and read-based annotations will only take into consideration reads that overlap the variant or any base no further than this distance expressed in base pairs

int  2  [ [ -∞  ∞ ] ]


--alleles

The set of alleles to force-call regardless of evidence

FeatureInput[VariantContext]  null


--allow-non-unique-kmers-in-ref

Allow graphs that have non-unique kmers in the reference
By default, the program does not allow processing of reference sections that contain non-unique kmers. Disabling this check may cause problems in the assembly graph.

boolean  false


--annotate-with-num-discovered-alleles

If provided, we will annotate records with the number of alternate alleles that were discovered (but not necessarily genotyped) at a given site
Depending on the value of the --max-alternate-alleles argument, we may genotype only a fraction of the alleles being sent on for genotyping. Using this argument instructs the genotyper to annotate (in the INFO field) the number of alternate alleles that were originally discovered at the site.

boolean  false


--annotation / -A

One or more specific annotations to add to variant calls
Which annotations to include in variant calls in the output. These supplement annotations provided by annotation groups.

List[String]  []


--annotation-group / -G

One or more groups of annotations to apply to variant calls
Which groups of annotations to add to the output variant calls. Any requirements that are not met (e.g. failing to provide a pedigree file for a pedigree-based annotation) may cause the run to fail.

List[String]  []


--annotations-to-exclude / -AX

One or more specific annotations to exclude from variant calls
Which annotations to exclude from output in the variant calls. Note that this argument has higher priority than the -A or -G arguments, so these annotations will be excluded even if they are explicitly included with the other options.

List[String]  []


--apply-bqd

If enabled this argument will apply the DRAGEN-GATK BaseQualityDropout model to the genotyping model for filtering sites due to Linked Error mode.

boolean  false


--apply-frd

If enabled this argument will apply the DRAGEN-GATK ForeignReadDetection model to the genotyping model for filtering sites.

boolean  false


--arguments_file

read one or more arguments files and add them to the command line

List[File]  []


--assembly-region-padding

Number of additional bases of context to include around each assembly region
Parameters that control assembly regions

int  100  [ [ -∞  ∞ ] ]


--bam-output / -bamout

File to which assembled haplotypes should be written
The assembled haplotypes and locally realigned reads will be written as BAM to this file if requested. This is really for debugging purposes only: the output does not include uninformative reads, so not every input read is emitted to the BAM, and turning on this mode may carry a serious performance cost for the caller. It is only appropriate in specific regions where you want to better understand why the caller is making specific calls.

The reads are written with an "HC" tag (integer) that encodes which haplotype each read best matches according to the haplotype caller's likelihood calculation. This tag is primarily intended to allow good coloring of reads in IGV: go to "Color Alignments By > Tag" and enter "HC" to more easily see which reads go with each haplotype.

Note that the haplotypes (called or all, depending on mode) are emitted as single reads covering the entire active region, coming from sample "HC" and a special read group called "ArtificialHaplotype". This will increase the pileup depth compared to what would be expected from the reads alone, especially in complex regions. Note also that only reads that are actually informative about the haplotypes are emitted; by informative we mean that there is a meaningful difference in the likelihood of the read coming from one haplotype compared to its next best haplotype.

If multiple BAMs are passed as input to the tool (as is common for M2), they will be combined in the bamout output and tagged with the appropriate sample names.

The best way to visualize the output of this mode is with IGV. Tell IGV to color the alignments by tag, using the "HC" tag, so you can see which reads support each haplotype; you can also tell IGV to group by sample, which will separate the potential haplotypes from the reads. All of this can be seen in this screenshot: https://www.dropbox.com/s/xvy7sbxpf13x5bp/haplotypecaller%20bamout%20for%20docs.png

String  null


--bam-partition-size

maximum number of bytes to read from a file into each partition of reads. Setting this higher will result in fewer partitions. Note that this will not be equal to the size of the partition in memory. Defaults to 0, which uses the default split size (determined by the Hadoop input format, typically the size of one HDFS block).

long  0  [ [ -∞  ∞ ] ]


--bam-writer-type

Which haplotypes should be written to the BAM
The type of BAM output we want to see. This determines whether HC will write out all of the haplotypes it considered (top 128 max) or just the ones that were selected as alleles and assigned to samples.

The --bam-writer-type argument is an enumerated type (WriterType), which can have one of the following values:

ALL_POSSIBLE_HAPLOTYPES
A mode that's for method developers. Writes out all of the possible haplotypes considered, as well as reads aligned to each
CALLED_HAPLOTYPES
A mode for users. Writes out the reads aligned only to the called haplotypes. Useful to understand why the caller is calling what it is
NO_HAPLOTYPES
With this option, haplotypes will not be included in the output bam.
CALLED_HAPLOTYPES_NO_READS
Same as CALLED_HAPLOTYPES, but without reads

WriterType  CALLED_HAPLOTYPES


--base-quality-score-threshold

Base qualities below this threshold will be reduced to the minimum (6)
Bases with a quality below this threshold will be reduced to the minimum usable quality score (6).

byte  18  [ [ -∞  ∞ ] ]


--comparison / -comp

Comparison VCF file(s)
If a call overlaps with a record from the provided comp track, the INFO field will be annotated as such in the output with the track name (e.g. -comp:FOO will have 'FOO' in the INFO field). Records that are filtered in the comp track will be ignored. Note that 'dbSNP' has been special-cased (see the --dbsnp argument).

List[FeatureInput[VariantContext]]  []


--conf

Spark properties to set on the Spark context in the format <property>=<value>

List[String]  []


--contamination-fraction-per-sample-file / -contamination-file

Tab-separated file containing fraction of contamination in sequencing data (per sample) to aggressively remove. Format should be "<sample><TAB><contamination>" (contamination is a double) per line; no header.
This argument specifies a file with two columns "sample" and "contamination" specifying the contamination level for those samples. Samples that do not appear in this file will be processed with CONTAMINATION_FRACTION.

File  null
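The table format described above can be sketched as follows; the sample names and fractions are made-up placeholder values, and the exact column layout (sample name, a tab, then a contamination fraction, with no header) follows the description of the two "sample" and "contamination" columns:

```shell
# Sketch of a per-sample contamination table for
# --contamination-fraction-per-sample-file. SAMPLE_A/SAMPLE_B and
# the fractions are placeholders; tab-separated, no header line.
printf 'SAMPLE_A\t0.02\nSAMPLE_B\t0.01\n' > contamination.tsv
cat contamination.tsv
```

Samples absent from this file fall back to the global --contamination-fraction-to-filter value, as noted above.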


--contamination-fraction-to-filter / -contamination

Fraction of contamination in sequencing data (for all samples) to aggressively remove
If this fraction is greater than zero, the caller will aggressively attempt to remove contamination through biased down-sampling of reads. Basically, it will ignore the contamination fraction of reads for each alternate allele. So if the pileup contains N total bases, then we will try to remove (N * contamination fraction) bases for each alternate allele.

double  0.0  [ [ -∞  ∞ ] ]


--create-output-bam-index / -OBI

If true, create a BAM index when writing a coordinate-sorted BAM file.

boolean  true


--create-output-bam-splitting-index

If true, create a BAM splitting index (SBI) when writing a coordinate-sorted BAM file.

boolean  true


--create-output-variant-index / -OVI

If true, create a VCF index when writing a coordinate-sorted VCF file.

boolean  true


--dbsnp / -D

dbSNP file
A dbSNP VCF file.

FeatureInput[VariantContext]  null


--debug-assembly / -debug

Print out verbose debug information about each assembly region

boolean  false


--disable-cap-base-qualities-to-map-quality

If set, disables capping of base qualities in the HMM to the mapping quality of the read

boolean  false


--disable-optimizations

Don't skip calculations in ActiveRegions with no variants
If set, certain "early exit" optimizations in HaplotypeCaller, which aim to save compute and time by skipping calculations if an ActiveRegion is determined to contain no variants, will be disabled. This is most likely to be useful if you're using the -bamout argument to examine the placement of reads following reassembly and are interested in seeing the mapping of reads in regions with no variations. Setting the --force-active flag may also be necessary.

boolean  false


--disable-read-filter / -DF

Read filters to be disabled before analysis

List[String]  []


--disable-sequence-dictionary-validation / -disable-sequence-dictionary-validation

If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!

boolean  false


--disable-spanning-event-genotyping

If enabled this argument will disable inclusion of the '*' spanning event when genotyping events that overlap deletions

boolean  false


--disable-symmetric-hmm-normalizing

Toggle to revive legacy behavior of asymmetrically normalizing the arguments to the reference haplotype

boolean  false


--disable-tool-default-annotations / -disable-tool-default-annotations

Disable all tool default annotations
Hook allowing for the user to remove default annotations from the tool

boolean  false


--disable-tool-default-read-filters / -disable-tool-default-read-filters

Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)

boolean  false


--do-not-correct-overlapping-quality

Disable overlapping base quality correction
Base quality is capped at half of PCR error rate for bases where read and mate overlap, to account for full correlation of PCR errors at these bases. This argument disables that correction.

boolean  false


--do-not-run-physical-phasing

Disable physical phasing
As of GATK 3.3, HaplotypeCaller outputs physical (read-based) information (see version 3.3 release notes and documentation for details). This argument disables that behavior.

boolean  false


--dont-increase-kmer-sizes-for-cycles

Disable iterating over kmer sizes when graph cycles are detected
When graph cycles are detected, the normal behavior is to increase kmer sizes iteratively until the cycles are resolved. Disabling this behavior may cause the program to give up on assembling the ActiveRegion.

boolean  false


--dont-use-dragstr-pair-hmm-scores

Disable DRAGstr pair-hmm scores even when --dragstr-params-path is provided

boolean  false


--dont-use-dragstr-priors

Forfeit the use of the DRAGstr model to calculate genotype priors. This argument does not have any effect in the absence of DRAGstr model parameters (--dragstr-params-path)

boolean  false


--dont-use-soft-clipped-bases

Do not analyze soft clipped bases in the reads

boolean  false


--dragen-378-concordance-mode

Single argument for enabling the bulk of DRAGEN-GATK features, including new developments for concordance against DRAGEN 3.7.8. NOTE: THIS WILL OVERWRITE PROVIDED ARGUMENTS (CHECK TOOL INFO TO SEE WHICH ARGUMENTS ARE SET).
DRAGEN-GATK version 2: This includes PDHMM and Columnwise detection (with hopes to add Joint Detection and new STRE as well in the future)

Exclusion: This argument cannot be used at the same time as dragen-mode.

Boolean  false


--dragen-mode

Single argument for enabling the bulk of DRAGEN-GATK features. NOTE: THIS WILL OVERWRITE PROVIDED ARGUMENTS (CHECK TOOL INFO TO SEE WHICH ARGUMENTS ARE SET).
DRAGEN-GATK mode changes a long list of arguments to support running DRAGEN-GATK with FRD + BQD + STRE (with or without a provided STRE table):

Exclusion: This argument cannot be used at the same time as dragen-378-concordance-mode.

Boolean  false


--dragstr-het-hom-ratio

Het-to-hom prior ratio to use with DRAGstr on

int  2  [ [ -∞  ∞ ] ]


--dragstr-params-path

location of the DRAGstr model parameters for STR error correction used in the Pair HMM. When provided, it overrides other PCR error correcting mechanisms

GATKPath  null


--emit-ref-confidence / -ERC

Mode for emitting reference confidence scores (For Mutect2, this is a BETA feature)
The reference confidence mode makes it possible to emit a per-bp or summarized confidence estimate for a site being strictly homozygous-reference. See https://software.broadinstitute.org/gatk/documentation/article.php?id=4017 for information about GVCFs. For Mutect2, this is a BETA feature that functions similarly to the HaplotypeCaller reference confidence/GVCF mode.

The --emit-ref-confidence argument is an enumerated type (ReferenceConfidenceMode), which can have one of the following values:

NONE
Regular calling without emitting reference confidence calls.
BP_RESOLUTION
Reference model emitted site by site.
GVCF
Reference model emitted with condensed non-variant blocks, i.e. the GVCF format.

ReferenceConfidenceMode  NONE


--enable-all-annotations

Use all possible annotations (not for the faint of heart)
You can use the -AX argument in combination with this one to exclude specific annotations. Note that some annotations may not be actually applied if they are not applicable to the data provided or if they are unavailable to the tool (e.g. there are several annotations that are currently not hooked up to HaplotypeCaller). At present no error or warning message will be provided, the annotation will simply be skipped silently. You can check the output VCF header to see which annotations were activated and thus might be applied (although this does not guarantee that the annotation was applied to all records in the VCF, since some annotations have additional requirements, e.g. minimum number of samples or heterozygous sites only -- see the documentation for individual annotations' requirements).

boolean  false


--enable-dynamic-read-disqualification-for-genotyping

Will enable less strict read disqualification for low base quality reads
If enabled, rather than disqualifying all reads over a threshold of minimum hmm scores we will instead choose a less strict and less aggressive cap for disqualification based on the read length and base qualities.

boolean  false


--exclude-intervals / -XL

One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).

List[String]  []


--expected-mismatch-rate-for-read-disqualification

Error rate used to set expectation for post HMM read disqualification based on mismatches

double  0.02  [ [ -∞  ∞ ] ]


--floor-blocks

Output the band lower bound for each GQ block regardless of the data it represents
Output the band lower bound for each GQ block instead of the min GQ -- for better compression

boolean  false


--flow-assembly-collapse-partial-mode

Collapse long flow-based hmers only up to difference in reference

boolean  false


--flow-disallow-probs-larger-than-call

Cap probabilities of error to 1 relative to base call

boolean  false


--flow-fill-empty-bins-value

Value to fill the zeros of the matrix with

double  0.001  [ [ -∞  ∞ ] ]


--flow-filter-alleles

pre-filter alleles before genotyping

boolean  false


--flow-filter-alleles-qual-threshold

Threshold for prefiltering alleles on quality

float  30.0  [ [ -∞  ∞ ] ]


--flow-filter-alleles-sor-threshold

Threshold for prefiltering alleles on SOR

float  3.0  [ [ -∞  ∞ ] ]


--flow-filter-lone-alleles

Remove also lone alleles during allele filtering

boolean  false


--flow-lump-probs

Should all probabilities of insertion or deletion in the flow be combined together

boolean  false


--flow-matrix-mods

Modification instructions to the read flow matrix. Format is src,dst{,src,dst}+. Example: 10,12,11,12 - these instructions will copy element 10 into 11 and 12

String  null


--flow-mode

Single argument for enabling the bulk of Flow Based features. NOTE: THIS WILL OVERWRITE PROVIDED ARGUMENTS (CHECK TOOL INFO TO SEE WHICH ARGUMENTS ARE SET).

The --flow-mode argument is an enumerated type (FlowMode), which can have one of the following values:

NONE
STANDARD
ADVANCED

FlowMode  NONE


--flow-order-for-annotations

Flow order used for these annotations. Format: [readGroup:]flowOrder

List[String]  []


--flow-probability-scaling-factor

Probability scaling factor (phred = 10) for probability quantization

int  10  [ [ -∞  ∞ ] ]


--flow-quantization-bins

Number of bins for probability quantization

int  121  [ [ -∞  ∞ ] ]


--flow-remove-non-single-base-pair-indels

Should the probabilities of more than 1 indel be used

boolean  false


--flow-remove-one-zero-probs

Remove probabilities of basecall of zero from non-zero genome

boolean  false


--flow-report-insertion-or-deletion

Report either insertion or deletion probability, not both

boolean  false


--flow-retain-max-n-probs-base-format

Keep only hmer/2 probabilities (like in base format)

boolean  false


--flow-symmetric-indel-probs

Should indel probabilities be symmetric in flow

boolean  false


--flow-use-t0-tag

Use the t0 tag, if it exists in the read, to create the flow matrix

boolean  false


--force-active

If provided, all regions will be marked as active

boolean  false


--force-call-filtered-alleles / -genotype-filtered-alleles

Force-call filtered alleles included in the resource specified by --alleles

boolean  false


--founder-id / -founder-id

Samples representing the population "founders"

List[String]  []


--gatk-config-file

A configuration file to use with the GATK.

String  null


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [ [ -∞  ∞ ] ]


--gcs-project-for-requester-pays

Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.

String  ""


--genotype-assignment-method / -gam

How we assign genotypes

The --genotype-assignment-method argument is an enumerated type (GenotypeAssignmentMethod), which can have one of the following values:

SET_TO_NO_CALL
set all of the genotype GT values to NO_CALL
USE_PLS_TO_ASSIGN
Use the subsetted PLs to greedily assign genotypes
USE_POSTERIORS_ANNOTATION
Use the existing, subsetted posteriors array to assign genotypes
SET_TO_NO_CALL_NO_ANNOTATIONS
set all of the genotype GT values to NO_CALL and remove annotations
BEST_MATCH_TO_ORIGINAL
Try to match the original GT calls, if at all possible. Suppose there are 3 alleles (A/B/C) and the following samples:

original_GT   best match to A/B   best match to A/C
S1 => A/A     A/A                 A/A
S2 => A/B     A/B                 A/A
S3 => B/B     B/B                 A/A
S4 => B/C     A/B                 A/C
S5 => C/C     A/A                 C/C

Basically, all alleles not in the subset map to ref. It means that het-alt genotypes, when split into 2 bi-allelic variants, will be het in each, which is good in some cases, rather than the undetermined behavior when using the PLs to assign, which could result in hom-var or hom-ref for each, depending on the exact PL values.
DO_NOT_ASSIGN_GENOTYPES
do not even bother changing the GTs
USE_POSTERIOR_PROBABILITIES
Calculate posterior probabilities and use those to assign genotypes
PREFER_PLS
Use PLs unless they are unavailable, in which case use best match to original. GQ0 hom-refs will be converted to no-calls.

GenotypeAssignmentMethod  USE_PLS_TO_ASSIGN


--graph-output / -graph

Write debug assembly graph information to this file
This argument is meant for debugging and is not immediately useful for normal analysis use.

String  null


--gvcf-gq-bands / -GQB

Exclusive upper bounds for reference confidence GQ bands (must be in [1, 100] and specified in increasing order)
When HC is run in reference confidence mode with banding compression enabled (-ERC GVCF), homozygous-reference sites are compressed into bands of similar genotype quality (GQ) that are emitted as a single VCF record. See the FAQ documentation for more details about the GVCF format. This argument allows you to set the GQ bands. HC expects a list of strictly increasing GQ values that will act as exclusive upper bounds for the GQ bands. To pass multiple values, you provide them one by one with the argument, as in `-GQB 10 -GQB 20 -GQB 30` and so on (this would set the GQ bands to be `[0, 10), [10, 20), [20, 30)` and so on, for example). Note that GQ values are capped at 99 in the GATK, so values must be integers in [1, 100]. If the last value is strictly less than 100, the last GQ band will start at that value (inclusive) and end at 100 (exclusive).

List[Integer]  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 70, 80, 90, 99]
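The exclusive-upper-bound banding described above can be sketched with `bisect` (an illustration of the band semantics, not GATK code):

```python
import bisect

# GQ bands are exclusive upper bounds: bands [10, 20, 30] mean
# [0, 10), [10, 20), [20, 30), and a final band [30, 100).
def band_index(gq, bands):
    """Return the index of the band containing this GQ value."""
    return bisect.bisect_right(bands, gq)

bands = [10, 20, 30]
print(band_index(9, bands))   # first band, [0, 10)
print(band_index(10, bands))  # second band, [10, 20)
```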


--help / -h

display the help message

boolean  false


--heterozygosity

Heterozygosity value used to compute prior probabilities for any locus. See the GATKDocs for full details on the meaning of this population genetics concept
The expected heterozygosity value used to compute prior probability that a locus is non-reference. The default priors provided are for humans: het = 1e-3, which means that the probability of N samples being hom-ref at a site is: 1 - sum_i_2N (het / i) Note that heterozygosity as used here is the population genetics concept: http://en.wikipedia.org/wiki/Zygosity#Heterozygosity_in_population_genetics That is, a hets value of 0.01 implies that two randomly chosen chromosomes from the population of organisms would differ from each other (one being A and the other B) at a rate of 1 in 100 bp. Note that this quantity has nothing to do with the likelihood of any given sample having a heterozygous genotype, which in the GATK is purely determined by the probability of the observed data P(D | AB) under the model that there may be an AB het genotype. The posterior probability of this AB genotype would use the het prior, but the GATK only uses this posterior probability in determining the probability that a site is polymorphic. So changing the het parameter only increases the chance that a site will be called non-reference across all samples, but doesn't actually change the output genotype likelihoods at all, as these aren't posterior probabilities at all. The quantity that changes whether the GATK considers the possibility of a het genotype at all is the ploidy, which determines how many chromosomes each individual in the species carries.

Double  0.001  [ [ -∞  ∞ ] ]


--heterozygosity-stdev

Standard deviation of heterozygosity for SNP and indel calling.
The standard deviation of the distribution of alt allele fractions. The above heterozygosity parameters give the *mean* of this distribution; this parameter gives its spread.

double  0.01  [ [ -∞  ∞ ] ]


--indel-heterozygosity

Heterozygosity for indel calling. See the GATKDocs for heterozygosity for full details on the meaning of this population genetics concept
This argument informs the prior probability of having an indel at a site.

double  1.25E-4  [ [ -∞  ∞ ] ]


--indel-size-to-eliminate-in-ref-model

The size of an indel to check for in the reference model
This parameter determines the maximum size of an indel considered as potentially segregating in the reference model. It is used to eliminate reads from being indel informative at a site, and determines by that mechanism the certainty in the reference base. Conceptually, setting this parameter to X means that each informative read is consistent with any indel of size X being present at a specific position in the genome, given its alignment to the reference.

int  10  [ [ -∞  ∞ ] ]


--input / -I

BAM/SAM/CRAM file containing reads

R List[GATKPath]  []


--interval-exclusion-padding / -ixp

Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]
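The padding example above ('-XL 1:100' with a padding of 20 becomes '-XL 1:80-120') works out as follows (a simplified sketch; real GATK interval parsing handles more formats):

```python
def pad_interval(interval, padding):
    """Pad a 'contig:start-end' or 'contig:pos' interval by `padding` bp,
    clamping the start at position 1."""
    contig, _, pos = interval.partition(":")
    start, _, end = pos.partition("-")
    start, end = int(start), int(end or start)
    return f"{contig}:{max(1, start - padding)}-{end + padding}"

print(pad_interval("1:100", 20))  # 1:80-120, matching the example above
```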


--interval-merging-rule / -imr

Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not actually overlap) into a single continuous interval. However you can change this behavior if you want them to be treated as separate intervals instead.

The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:

ALL
OVERLAPPING_ONLY

IntervalMergingRule  ALL


--interval-padding / -ip

Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when analyzing exomes.

int  0  [ [ -∞  ∞ ] ]


--interval-set-rule / -isr

Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will always be merged using UNION). Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.

The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:

UNION
Take the union of all intervals
INTERSECTION
Take the intersection of intervals (the subset that overlaps all intervals specified)

IntervalSetRule  UNION


--intervals / -L

One or more genomic intervals over which to operate

List[String]  []


--inverted-read-filter / -XRF

Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters).

List[String]  []


--keep-boundary-flows

prevent spreading of boundary flows.

boolean  false


--kmer-size

Kmer size to use in the read threading assembler
Multiple kmer sizes can be specified, using e.g. `--kmer-size 10 --kmer-size 25`.

List[Integer]  [10, 25]


--likelihood-calculation-engine

What likelihood calculation engine to use to calculate the relative likelihood of reads vs haplotypes

The --likelihood-calculation-engine argument is an enumerated type (Implementation), which can have one of the following values:

PairHMM
Classic full pair-hmm all haplotypes vs all reads.
FlowBased
FlowBasedHMM

Implementation  PairHMM


--linked-de-bruijn-graph

If enabled, the Assembly Engine will construct a Linked De Bruijn graph to recover better haplotypes
Disables graph simplification into a seq graph and instead constructs a proper De Bruijn graph with potential loops. NOTE: --linked-de-bruijn-graph is currently an experimental feature that does not directly match with the regular HaplotypeCaller. Specifically the haplotype finding code does not perform correctly at complicated sites. Use this mode at your own risk.

boolean  false


--mapping-quality-threshold-for-genotyping

Control the threshold for discounting reads from the genotyper due to mapping quality after the active region detection and assembly steps but before genotyping. NOTE: this is in contrast to the --minimum-mapping-quality argument which filters reads from all parts of the HaplotypeCaller. If you would like to call genotypes with a different threshold both arguments must be set.

int  20  [ [ -∞  ∞ ] ]


--max-alternate-alleles

Maximum number of alternate alleles to genotype
If there are more than this number of alternate alleles presented to the genotyper (either through discovery or GENOTYPE_GIVEN_ALLELES), then only this many alleles will be used. Note that genotyping sites with many alternate alleles is both CPU and memory intensive and it scales exponentially based on the number of alternate alleles. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. This value can be no greater than one less than the corresponding GenomicsDB argument. Sites that exceed the GenomicsDB alt allele max will not be output with likelihoods and will be dropped by GenotypeGVCFs.

int  6  [ [ -∞  ∞ ] ]


--max-assembly-region-size

Maximum size of an assembly region

int  300  [ [ -∞  ∞ ] ]


--max-effective-depth-adjustment-for-frd

Set the maximum depth to modify FRD adjustment to in the event of high depth sites (0 to disable)

int  0  [ [ -∞  ∞ ] ]


--max-genotype-count

Maximum number of genotypes to consider at any site
If there are more than this number of genotypes at a locus presented to the genotyper, then only this many genotypes will be used. The possible genotypes are simply different ways of partitioning alleles given a specific ploidy assumption. Therefore, we remove genotypes from consideration by removing alternate alleles that are the least well supported. The estimate of allele support is based on the ranking of the candidate haplotypes coming out of the graph building step. Note that the reference allele is always kept. Note that genotyping sites with large genotype counts is both CPU and memory intensive. Unless there is a good reason to change the default value, we highly recommend that you not play around with this parameter. The maximum number of alternative alleles used in the genotyping step will be the lesser of the two: (1) the largest number of alt alleles, given ploidy, that yields a genotype count no higher than this value, and (2) the value of --max-alternate-alleles.

int  1024  [ [ -∞  ∞ ] ]
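The genotype count for a given ploidy and allele count, the quantity this cap limits, follows the standard multiset formula C(A + P - 1, P) (a sketch; the helper is hypothetical, not a GATK API):

```python
from math import comb

def genotype_count(num_alleles, ploidy=2):
    """Number of unordered genotypes for `num_alleles` alleles at `ploidy`."""
    return comb(num_alleles + ploidy - 1, ploidy)

print(genotype_count(2))  # diploid, ref + 1 alt: 3 genotypes
print(genotype_count(7))  # diploid, ref + 6 alts (--max-alternate-alleles default): 28
```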


--max-mnp-distance / -mnp-dist

Two or more phased substitutions separated by this distance or less are merged into MNPs.
Two or more phased substitutions separated by this distance or less are merged into MNPs.

int  0  [ [ -∞  ∞ ] ]


--max-num-haplotypes-in-population

Maximum number of haplotypes to consider for your population
The assembly graph can be quite complex, and could imply a very large number of possible haplotypes. Each haplotype considered requires N PairHMM evaluations if there are N reads across all samples. In order to control the run of the haplotype caller we only take maxNumHaplotypesInPopulation paths from the graph, in order of their weights, no matter how many paths are possible to generate from the graph. Putting this number too low will result in dropping true variation because paths that include the real variant are not even considered. You can consider increasing this number when calling organisms with high heterozygosity.

int  128  [ [ -∞  ∞ ] ]


--max-prob-propagation-distance

Upper limit on how many bases away probability mass can be moved around when calculating the boundaries between active and inactive assembly regions

int  50  [ [ -∞  ∞ ] ]


--max-reads-per-alignment-start

Maximum number of reads to retain per alignment start position. Reads above this threshold will be downsampled. Set to 0 to disable.

int  50  [ [ -∞  ∞ ] ]


--max-unpruned-variants

Maximum number of variants in graph the adaptive pruner will allow
The maximum number of variants in graph the adaptive pruner will allow

int  100  [ [ -∞  ∞ ] ]


--min-assembly-region-size

Minimum size of an assembly region

int  50  [ [ -∞  ∞ ] ]


--min-base-quality-score / -mbq

Minimum base quality required to consider a base for calling
Bases with a quality below this threshold will not be used for calling.

byte  10  [ [ -∞  ∞ ] ]


--min-dangling-branch-length

Minimum length of a dangling branch to attempt recovery
When constructing the assembly graph we are often left with "dangling" branches. The assembly engine attempts to rescue these branches by merging them back into the main graph. This argument describes the minimum length of a dangling branch needed for the engine to try to rescue it. A smaller number here will lead to higher sensitivity to real variation but also to a higher number of false positives.

int  4  [ [ -∞  ∞ ] ]


--min-pruning

Minimum support to not prune paths in the graph
Paths with fewer supporting kmers than the specified threshold will be pruned from the graph. Be aware that this argument can dramatically affect the results of variant calling and should only be used with great caution. Using a prune factor of 1 (or below) will prevent any pruning from the graph, which is generally not ideal; it can make the calling much slower and even less accurate (because it can prevent effective merging of "tails" in the graph). Higher values tend to make the calling much faster, but also lowers the sensitivity of the results (because it ultimately requires higher depth to produce calls).

int  2  [ [ -∞  ∞ ] ]


--native-pair-hmm-threads

How many threads should a native pairHMM implementation use

int  4  [ [ -∞  ∞ ] ]


--native-pair-hmm-use-double-precision

use double precision in the native pairHmm. This is slower but matches the java implementation better

boolean  false


--num-pruning-samples

Number of samples that must pass the minPruning threshold
If fewer samples than the specified number pass the minPruning threshold for a given path, that path will be eliminated from the graph.

int  1  [ [ -∞  ∞ ] ]


--num-reducers

For tools that shuffle data or write an output, sets the number of reducers. Defaults to 0, which gives one partition per 10MB of input.

int  0  [ [ -∞  ∞ ] ]


--num-reference-samples-if-no-call

Number of hom-ref genotypes to infer at sites not present in a panel
When a variant is not seen in any panel, this argument controls whether to infer (and with what effective strength) that only reference alleles were observed at that site. E.g. "If not seen in 1000Genomes, treat it as AC=0, AN=2000".

int  0  [ [ -∞  ∞ ] ]


--output / -O

Single file to which variants should be written

R String  null


--output-mode

Specifies which type of calls we should output

The --output-mode argument is an enumerated type (OutputMode), which can have one of the following values:

EMIT_VARIANTS_ONLY
produces calls only at variant sites
EMIT_ALL_CONFIDENT_SITES
produces calls at variant sites and confident reference sites
EMIT_ALL_ACTIVE_SITES
Produces calls at any region over the activity threshold regardless of confidence. On occasion, this will output HOM_REF records where no call could be confidently made. This does not necessarily output calls for all sites in a region. This argument is intended only for point mutations (SNPs); it will not produce a comprehensive set of indels.

OutputMode  EMIT_VARIANTS_ONLY


--output-shard-tmp-dir

When writing a BAM in single-sharded mode, the directory in which to write the temporary intermediate output shards; if not specified, .parts/ will be used

Exclusion: This argument cannot be used at the same time as sharded-output.

String  null


--pair-hmm-gap-continuation-penalty

Flat gap continuation penalty for use in the Pair HMM

int  10  [ [ -∞  ∞ ] ]


--pair-hmm-implementation / -pairHMM

The PairHMM implementation to use for genotype likelihood calculations
The PairHMM implementation to use for genotype likelihood calculations. The various implementations balance a tradeoff of accuracy and runtime.

The --pair-hmm-implementation argument is an enumerated type (Implementation), which can have one of the following values:

EXACT
ORIGINAL
LOGLESS_CACHING
AVX_LOGLESS_CACHING
AVX_LOGLESS_CACHING_OMP
FASTEST_AVAILABLE

Implementation  FASTEST_AVAILABLE


--pair-hmm-results-file

File to write exact pairHMM inputs/outputs to for debugging purposes
Argument for generating a file of all of the inputs and outputs for the pair hmm

GATKPath  null


--pcr-indel-model

The PCR indel model to use
When calculating the likelihood of variants, we can try to correct for PCR errors that cause indel artifacts. The correction is based on the reference context, and acts specifically around repetitive sequences that tend to cause PCR errors. The variant likelihoods are penalized in increasing scale as the context around a putative indel is more repetitive (e.g. long homopolymer). The correction can be disabled by specifying '-pcrModel NONE'; in that case the default base insertion/deletion qualities will be used (or taken from the read if generated through the BaseRecalibrator). VERY IMPORTANT: when using PCR-free sequencing data we definitely recommend setting this argument to NONE.

The --pcr-indel-model argument is an enumerated type (PCRErrorModel), which can have one of the following values:

NONE
no specialized PCR error model will be applied; if base insertion/deletion qualities are present they will be used
HOSTILE
a most aggressive model will be applied that sacrifices true positives in order to remove more false positives
AGGRESSIVE
a more aggressive model will be applied that sacrifices true positives in order to remove more false positives
CONSERVATIVE
a less aggressive model will be applied that tries to maintain a high true positive rate at the expense of allowing more false positives

PCRErrorModel  CONSERVATIVE


--pedigree / -ped

Pedigree file for determining the population "founders"

GATKPath  null


--phred-scaled-global-read-mismapping-rate

The global assumed mismapping rate for reads
The phredScaledGlobalReadMismappingRate reflects the average global mismapping rate of all reads, regardless of their mapping quality. This term affects the probability that a read originated from the reference haplotype, regardless of its edit distance from the reference, in that the read could have originated from the reference haplotype but from another location in the genome. Suppose a read has many mismatches from the reference, say 5, but has a very high mapping quality of 60. Without this parameter, the read would contribute 5 * Q30 evidence in favor of its 5 mismatch haplotype compared to reference, potentially enough to make a call off that single read for all of these events. With this parameter set to Q30, though, the maximum evidence against any haplotype that this (and any) read could contribute is Q30. Set this term to any negative number to turn off the global mapping rate.

int  45  [ [ -∞  ∞ ] ]
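The Q30 example in the description can be reproduced arithmetically (a sketch with a hypothetical helper, not HaplotypeCaller's likelihood code):

```python
def capped_mismatch_evidence(mismatch_quals, global_cap):
    """Evidence against the reference haplotype is capped at the
    phred-scaled global read mismapping rate (negative cap disables it)."""
    total = sum(mismatch_quals)
    return total if global_cap < 0 else min(total, global_cap)

# 5 mismatches at Q30: 150 phred of raw evidence, capped at Q30
print(capped_mismatch_evidence([30] * 5, 30))   # 30
print(capped_mismatch_evidence([30] * 5, -1))   # 150, cap disabled
```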


--pileup-detection

If enabled, the variant caller will create pileup-based haplotypes in addition to the assembly-based haplotype generation.
Enables pileup-based haplotype creation and variant detection. NOTE: --pileup-detection is a beta feature. Use this mode at your own risk.

boolean  false


--ploidy-regions / -ploidy-regions

Interval file with column specifying desired ploidy for genotyping models. Overrides default ploidy and user-provided --ploidy argument in specific regions.

FeatureInput[NamedFeature]  null


--population-callset / -population

Callset to use in calculating genotype priors
Supporting external panel. Allele counts from this panel (taken from AC,AN or MLEAC,AN or raw genotypes) will be used to inform the frequency distribution underlying the genotype priors. These files must be VCF 4.2 spec or later. Note that unlike CalculateGenotypePosteriors, HaplotypeCaller only allows one supporting callset.

FeatureInput[VariantContext]  null


--program-name

Name of the program running

String  null


--pruning-lod-threshold

Ln likelihood ratio threshold for adaptive pruning algorithm
Natural-log likelihood ratio threshold for the adaptive pruning algorithm. The default, ln(10) ≈ 2.302585, corresponds to a log-10 likelihood ratio of 1.

double  2.302585092994046  [ [ -∞  ∞ ] ]


--pruning-seeding-lod-threshold

Ln likelihood ratio threshold for seeding subgraph of good variation in adaptive pruning algorithm
Natural-log likelihood ratio threshold for seeding the subgraph of good variation in the adaptive pruning algorithm. The default, 4 ln(10) ≈ 9.210340, corresponds to a log-10 likelihood ratio of 4.

double  9.210340371976184  [ [ -∞  ∞ ] ]
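Both pruning defaults above are natural-log values that correspond to round log-10 LOD scores, which a quick check confirms:

```python
import math

# Natural-log and log-10 likelihood-ratio thresholds differ by a
# factor of ln(10).
ln10 = math.log(10)

# --pruning-lod-threshold default: 1 * ln(10), i.e. a log-10 LOD of 1
print(ln10)      # 2.302585092994046
# --pruning-seeding-lod-threshold default: 4 * ln(10), a log-10 LOD of 4
print(4 * ln10)  # 9.210340371976184
```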


--QUIET

Whether to suppress job-summary info on System.err.

Boolean  false


--read-filter / -RF

Read filters to be applied before analysis

List[String]  []


--read-index / -read-index

Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.

List[GATKPath]  []


--read-shard-padding / -read-shard-padding

Each read shard has this many bases of extra context on each side. Read shards must have as much or more padding than assembly regions.

int  100  [ [ -∞  ∞ ] ]


--read-shard-size / -read-shard-size

Maximum size of each read shard, in bases. For good performance, this should be much larger than the maximum assembly region size.

int  5000  [ [ -∞  ∞ ] ]
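To make the interaction between shard size and padding concrete, here is a sketch of cutting a contig into fixed-size shards and adding padded context on each side. This is my own simplification for illustration, not GATK's actual sharding code.

```python
def make_shards(contig_len, shard_size=5000, padding=100):
    """Cut [0, contig_len) into shards of at most shard_size bases, then
    extend each shard by `padding` bases of context on both sides,
    clipped to the contig boundaries."""
    shards = []
    for start in range(0, contig_len, shard_size):
        end = min(start + shard_size, contig_len)
        shards.append((max(0, start - padding), min(contig_len, end + padding)))
    return shards

print(make_shards(12345))
# -> [(0, 5100), (4900, 10100), (9900, 12345)]
```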


--read-validation-stringency / -VS

Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  SILENT


--recover-all-dangling-branches

Recover all dangling branches
By default, the read threading assembler does not recover dangling branches that fork after splitting from the reference. This argument tells the assembly engine to recover all dangling branches.

boolean  false


--recover-dangling-heads

(Deprecated) This argument is deprecated since version 3.3
As of version 3.3, this argument is no longer needed because dangling end recovery is now the default behavior. See GATK 3.3 release notes for more details.

boolean  false


--reference / -R

Reference sequence file

GATKPath  null


--reference-model-deletion-quality

The quality of deletion in the reference model
This parameter determines the deletion quality used in the reference confidence model.

byte  30  [ [ -∞  ∞ ] ]


--sample-name / -ALIAS

Name of single sample to use from a multi-sample bam
You can use this argument to specify that HC should process a single sample from a multi-sample BAM file. This is especially useful if your samples are all in the same file but you need to run them individually through HC in -ERC GVCF mode (which is the recommended usage). Note that the name is case-sensitive.

String  null


--sample-ploidy / -ploidy

Ploidy (number of chromosomes) per sample. For pooled data, set to (Number of samples in each pool * Sample Ploidy).
Sample ploidy, equivalent to the number of chromosomes per pool. In pooled experiments this should be set to (# of samples in pool * individual sample ploidy).

int  2  [ [ -∞  ∞ ] ]
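The pooled-data arithmetic above is simple but easy to get wrong; for example, a pool of 10 diploid samples should be genotyped at ploidy 20. The helper name below is hypothetical, purely for illustration.

```python
def pooled_ploidy(samples_per_pool, sample_ploidy=2):
    # --sample-ploidy for pooled data = samples per pool * individual ploidy
    return samples_per_pool * sample_ploidy

print(pooled_ploidy(10))  # -> 20, i.e. pass `-ploidy 20`
```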


--sharded-output

For tools that write an output, write the output in multiple pieces (shards)

Exclusion: This argument cannot be used at the same time as output-shard-tmp-dir.

boolean  false


--showHidden / -showHidden

display hidden arguments

boolean  false


--shuffle / -shuffle

whether to use the shuffle implementation or not

boolean  false


--smith-waterman

Which Smith-Waterman implementation to use, generally FASTEST_AVAILABLE is the right choice

The --smith-waterman argument is an enumerated type (Implementation), which can have one of the following values:

FASTEST_AVAILABLE
use the fastest available Smith-Waterman aligner that runs on your hardware
AVX_ENABLED
use the AVX enabled Smith-Waterman aligner
JAVA
use the pure java implementation of Smith-Waterman, works on all hardware

Implementation  FASTEST_AVAILABLE


--smith-waterman-dangling-end-gap-extend-penalty

Smith-Waterman gap-extend penalty for dangling-end recovery.

int  -6  [ [ -∞  0 ] ]


--smith-waterman-dangling-end-gap-open-penalty

Smith-Waterman gap-open penalty for dangling-end recovery.

int  -110  [ [ -∞  0 ] ]


--smith-waterman-dangling-end-match-value

Smith-Waterman match value for dangling-end recovery.

int  25  [ [ 0  ∞ ] ]


--smith-waterman-dangling-end-mismatch-penalty

Smith-Waterman mismatch penalty for dangling-end recovery.

int  -50  [ [ -∞  0 ] ]


--smith-waterman-haplotype-to-reference-gap-extend-penalty

Smith-Waterman gap-extend penalty for haplotype-to-reference alignment.

int  -11  [ [ -∞  0 ] ]


--smith-waterman-haplotype-to-reference-gap-open-penalty

Smith-Waterman gap-open penalty for haplotype-to-reference alignment.

int  -260  [ [ -∞  0 ] ]


--smith-waterman-haplotype-to-reference-match-value

Smith-Waterman match value for haplotype-to-reference alignment.

int  200  [ [ 0  ∞ ] ]


--smith-waterman-haplotype-to-reference-mismatch-penalty

Smith-Waterman mismatch penalty for haplotype-to-reference alignment.

int  -150  [ [ -∞  0 ] ]


--smith-waterman-read-to-haplotype-gap-extend-penalty

Smith-Waterman gap-extend penalty for read-to-haplotype alignment.

int  -5  [ [ -∞  0 ] ]


--smith-waterman-read-to-haplotype-gap-open-penalty

Smith-Waterman gap-open penalty for read-to-haplotype alignment.

int  -30  [ [ -∞  0 ] ]


--smith-waterman-read-to-haplotype-match-value

Smith-Waterman match value for read-to-haplotype alignment.

int  10  [ [ 0  ∞ ] ]


--smith-waterman-read-to-haplotype-mismatch-penalty

Smith-Waterman mismatch penalty for read-to-haplotype alignment.

int  -15  [ [ -∞  0 ] ]
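As a rough illustration of how the parameter sets above are used, here is a minimal local (Smith-Waterman/Gotoh) aligner with affine gaps, using the read-to-haplotype values (match 10, mismatch -15, gap open -30, gap extend -5). This is a sketch of the scoring scheme, not GATK's implementation, and the exact gap-cost convention (opening and extension applied per additional base) is an assumption.

```python
def sw_score(a, b, match=10, mismatch=-15, gap_open=-30, gap_extend=-5):
    """Best local alignment score with affine gap penalties.
    Opening a gap costs gap_open; each further gap base costs gap_extend."""
    neg_inf = float("-inf")
    n, m = len(a), len(b)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]      # best alignment ending at (i, j)
    E = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in a
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]  # alignments ending in a gap in b
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] + gap_open, E[i][j - 1] + gap_extend)
            F[i][j] = max(H[i - 1][j] + gap_open, F[i - 1][j] + gap_extend)
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0.0, H[i - 1][j - 1] + s, E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

print(sw_score("ACGT", "ACGT"))  # 4 matches -> 40.0
print(sw_score("ACGT", "AGGT"))  # best local run "GT" -> 20.0
```

Note how heavily the haplotype-to-reference set (match 200, gap open -260) penalizes gaps relative to the read-to-haplotype set; the three parameter groups tune the same algorithm for three different alignment contexts.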


--soft-clip-low-quality-ends

If enabled will preserve low-quality read ends as softclips (used for DRAGEN-GATK BQD genotyper model)

boolean  false


--spark-master

URL of the Spark Master to submit jobs to when using the Spark pipeline runner.

String  local[*]


--spark-verbosity

Spark verbosity. Overrides --verbosity for Spark-generated logs only. Possible values: {ALL, DEBUG, INFO, WARN, ERROR, FATAL, OFF, TRACE}

String  null


--splitting-index-granularity

Granularity to use when writing a splitting index, one entry will be put into the index every n reads where n is this granularity value. Smaller granularity results in a larger index with more available split points.

long  4096  [ [ 1  ∞ ] ]
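The size trade-off can be made concrete: with one index entry per `granularity` reads, the entry count grows as granularity shrinks. This is a back-of-the-envelope sketch, not the actual index format.

```python
def index_entries(total_reads, granularity):
    # one index entry is written for every `granularity` reads
    return total_reads // granularity

# For a BAM with 1,000,000 reads:
print(index_entries(1_000_000, 4096))  # default granularity -> 244 entries
print(index_entries(1_000_000, 1024))  # finer granularity   -> 976 entries
```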


--standard-min-confidence-threshold-for-calling / -stand-call-conf

The minimum phred-scaled confidence threshold at which variants should be called
The minimum phred-scaled confidence threshold at which variants should be called. Only variant sites with QUAL equal or greater than this threshold will be called. Note that since version 3.7, we no longer differentiate high confidence from low confidence calls at the calling step. The default call confidence threshold is set low intentionally to achieve high sensitivity, which will allow false positive calls as a side effect. Be sure to perform some kind of filtering after calling to reduce the amount of false positives in your final callset. Note that when HaplotypeCaller is used in GVCF mode (using either -ERC GVCF or -ERC BP_RESOLUTION) the call threshold is automatically set to zero. Call confidence thresholding will then be performed in the subsequent GenotypeGVCFs command. Note that the default was changed from 10.0 to 30.0 in version 4.1.0.0 to accompany the switch to use the new quality score by default.

double  30.0  [ [ -∞  ∞ ] ]
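Phred-scaled confidence relates to error probability as p = 10^(-QUAL/10), so the default threshold of 30 admits only sites whose estimated probability of being a false call is at most 1 in 1000:

```python
def phred_to_error_prob(qual):
    # QUAL is phred-scaled: QUAL = -10 * log10(p_error)
    return 10 ** (-qual / 10)

print(phred_to_error_prob(30))  # -> 0.001
```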


--strict

whether to use the strict implementation or not (defaults to the faster implementation that doesn't strictly match the walker version)

boolean  false


--tmp-dir

Temp directory to use.

GATKPath  null


--transform-dragen-mapping-quality

If enabled, this argument will rescale the mapping qualities of DRAGEN-aligned reads (MQ ≤ 250) onto a scale with maximum MQ 50

boolean  false


--use-filtered-reads-for-annotations

Use the contamination-filtered read maps for the purposes of annotating variants

boolean  false


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--use-new-qual-calculator / -new-qual

(Deprecated) Use the new AF model instead of the so-called exact model
As of version 4.1.0.0, this argument is no longer needed because the new qual score is now on by default. See the GATK 4.1.0.0 release notes for more details.

boolean  true


--use-nio

Whether to use NIO or the Hadoop filesystem (default) for reading files. (Note that the Hadoop filesystem is always used for writing files.)

boolean  false


--use-pdhmm

Partially Determined HMM, an alternative to the regular assembly haplotypes where we instead construct artificial haplotypes out of the union of the assembly and pileup alleles.
This argument enables the PartiallyDeterminedHMM. Enabling it triggers HaplotypeCaller (not currently supported in Mutect2) to use the PDHMM instead of the PairHMM to compute likelihoods. This means that variants found by both pileup detection and the assembly engine are treated as equivalent and merged/filtered together to produce "PartiallyDetermined" haplotype objects, and a merged likelihood score is produced from multiple haplotypes being run together. This code is intended for DRAGEN 3.7.8 concordance and is not recommended outside of that context without being optimized.

boolean  false


--use-pdhmm-overlap-optimization

PDHMM: An optimization to the PDHMM. If set, this will skip running PDHMM haplotype determination on reads that don't overlap (within a few bases of) the determined allele in each haplotype. This substantially reduces the number of read-haplotype comparisons at the expense of ignoring read-realignment mapping artifacts. (Requires the '--use-pdhmm' argument.)

boolean  false


--use-posteriors-to-calculate-qual / -gp-qual

if available, use the genotype posterior probabilities to calculate the site QUAL

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version

display the version number for this tool

boolean  false




See also General Documentation | Tool Documentation Index | Support Forum

GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.