Functional Annotator
Funcotator is a functional annotation tool that adds annotations to called variants based on a set of data sources, each with its own matching criteria.
Detailed information and a tutorial can be found here:
Data sources are expected to be in folders that are specified as input arguments. While multiple data source folders can be specified, no two data sources can have the same name.
In each main data source folder, there should be sub-directories for each individual data source, with further sub-directories for a specific reference (i.e. hg19 or hg38). In the reference-specific data source directory, there is a configuration file detailing information about the data source and how to match it to a variant. This configuration file is required.
An example of a data source directory is the following:
dataSourcesFolder/
    Data_Source_1/
        hg19
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
        hg38
            data_source_1.config
            data_source_1.data.file.one
            data_source_1.data.file.two
            data_source_1.data.file.three
            ...
    Data_Source_2/
        hg19
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
        hg38
            data_source_2.config
            data_source_2.data.file.one
            data_source_2.data.file.two
            data_source_2.data.file.three
            ...
    ...
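The layout above can be checked programmatically. The following Python sketch is not part of GATK; the helper name `find_config_files` is hypothetical. It walks a data sources folder and reports, for one reference version, which data source directories contain a `.config` file, assuming exactly the directory convention described above.

```python
from pathlib import Path


def find_config_files(data_sources_dir, ref_version):
    """Map each data source name to the .config files found for ref_version.

    Hypothetical helper: assumes the layout
    <data_sources_dir>/<data source>/<ref_version>/*.config
    """
    found = {}
    for source_dir in sorted(Path(data_sources_dir).iterdir()):
        if not source_dir.is_dir():
            continue  # skip stray files such as archives
        ref_dir = source_dir / ref_version
        configs = sorted(ref_dir.glob("*.config")) if ref_dir.is_dir() else []
        found[source_dir.name] = [p.name for p in configs]
    return found
```

An empty list for a data source means no configuration file was found for that reference, which, given that the configuration file is required, suggests the source is not set up for that reference.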
The GATK includes two sets of pre-packaged data sources, allowing Funcotator to be used with little additional configuration.
These data source packages correspond to the germline and somatic use cases.
Broadly speaking, if you have a germline VCF, the germline data sources are what you want to use to start with.
Conversely, if you have a somatic VCF, the somatic data sources are what you want to use to start with.
Versioned gzip archives of data source files are provided here:
The pre-packaged data sources include gnomAD, a large database of known variants. gnomAD is split into two parts: one based on exome data, one based on whole genome data.
Due to the size of gnomAD, it cannot be included in the data sources package directly. Instead, the configuration data are present and point to a Google bucket in which the gnomAD data reside. This will cause Funcotator to actively connect to that bucket when it is run.
For this reason, gnomAD is disabled by default.
To enable gnomAD, simply change directories to your data sources directory and untar the gnomAD tar.gz files:
cd DATA_SOURCES_DIR
tar -zxf gnomAD_exome.tar.gz
tar -zxf gnomAD_genome.tar.gz
Because Funcotator will query the Internet when gnomAD is enabled, performance will be impacted by the machine's Internet connection speed. If this degradation is significant, you can localize gnomAD to the machine running Funcotator to improve performance (however, due to the size of gnomAD, this may be impractical).
To improve ease of use, Funcotator comes with a tool that downloads the pre-packaged data sources to the user's machine.
This tool is FuncotatorDataSourceDownloader; it retrieves the pre-packaged data sources from the Google bucket and localizes them to the machine on which it is run.
Briefly:
./gatk FuncotatorDataSourceDownloader --somatic --validate-integrity --extract-after-download
./gatk FuncotatorDataSourceDownloader --germline --validate-integrity --extract-after-download
A data source can be disabled by removing the folder containing the configuration file for that source. This can be done on a per-reference basis. If the entire data source should be disabled, the entire top-level data source folder can be removed.
If it is possible that the data source will be re-enabled in the future, then we recommend zipping the data source folder and removing the folder itself, leaving only the zip file in its place. When the time comes to enable the data source again, simply unzip the file and the data source will be ready to go the next time Funcotator is run.
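The zip-and-remove recommendation can be scripted. The sketch below uses only the Python standard library; the function names `disable_data_source` and `enable_data_source` are hypothetical and not part of GATK.

```python
import shutil
from pathlib import Path


def disable_data_source(source_dir):
    """Zip source_dir next to itself, then remove the folder. Returns the zip path."""
    source_dir = Path(source_dir).resolve()
    archive = shutil.make_archive(
        str(source_dir),              # archive is written as <source_dir>.zip
        "zip",
        root_dir=source_dir.parent,   # paths inside the zip are relative to the parent
        base_dir=source_dir.name,     # so the zip contains the folder itself
    )
    shutil.rmtree(source_dir)
    return Path(archive)


def enable_data_source(zip_path):
    """Unpack a zip created by disable_data_source back into its parent folder."""
    zip_path = Path(zip_path)
    shutil.unpack_archive(str(zip_path), extract_dir=zip_path.parent)
```

The same functions work for a single reference sub-folder (for example only the hg19 folder of a data source), since disabling can be done on a per-reference basis.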
Users can define their own data sources by creating a new correctly-formatted data source sub-directory in the main data sources folder. In this sub-directory, the user must create an additional folder for the reference for which the data source is valid. If the data source is valid for multiple references, then multiple reference folders should be created. Inside each reference folder, the user should place the file(s) containing the data for the data source. Additionally the user must create a configuration file containing metadata about the data source.
Funcotator allows for data sources with source files that live on the cloud, enabling users to annotate with data sources that are not physically present on the machines running Funcotator.
To create a cloud-based data source, create a configuration file for that data source and set the src_file property to the cloud URL (see Configuration File Format for details).
E.g.:
...
src_file = gs://broad-references/hg19/v0/1000G_phase1.snps.high_confidence.b37.vcf.gz
...
Several formats are allowed for data sources; the most useful are arbitrarily separated value (XSV) files, such as comma-separated value (CSV) and tab-separated value (TSV) files. These files contain a table of data that can be matched to a variant by gene name, transcript ID, or genome position. In the case of gene name and transcript ID, one column must contain the gene name or transcript ID for each row's data.
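To illustrate matching by gene name, here is a minimal Python sketch. It is not Funcotator's implementation; it only shows the idea of indexing a delimited table by the value in a key column. The function name, the column layout, and the sample rows are invented for illustration.

```python
import csv
import io


def index_xsv_by_key(text, delimiter, key_column):
    """Return (header, {key: [rows]}) for a delimited table with one header row.

    key_column is the 0-based index of the column holding the gene name
    or transcript ID.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    index = {}
    for row in reader:
        index.setdefault(row[key_column], []).append(row)
    return header, index


# Invented sample data: column 0 holds the gene name.
sample = "gene\tnote\nSDF4\tfirst\nTNFRSF4\tsecond\n"
header, by_gene = index_xsv_by_key(sample, "\t", 0)
```

Looking up `by_gene["SDF4"]` then returns every row whose key column matches that gene name.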
The configuration file is a standard Java properties-style configuration file with key-value pairs. This file name must end in .config.
The following is an example of a genome position XSV configuration file (for the ORegAnno data source):
name = Oreganno
version = 20160119
src_file = oreganno.tsv
origin_location = http://www.oreganno.org/dump/ORegAnno_Combined_2016.01.19.tsv
preprocessing_script = getOreganno.py
# Supported types:
# simpleXSV -- Arbitrary separated value table (e.g. CSV), keyed off Gene Name OR Transcript ID
# locatableXSV -- Arbitrary separated value table (e.g. CSV), keyed off a genome location
# gencode -- Custom datasource class for GENCODE
# cosmic -- Custom datasource class for COSMIC
# vcf -- Custom datasource class for Variant Call Format (VCF) files
type = locatableXSV
# Required field for GENCODE files.
# Path to the FASTA file from which to load the sequences for GENCODE transcripts:
gencode_fasta_path =
# Required field for GENCODE files.
# NCBI build version (either hg19 or hg38):
ncbi_build_version =
# Required field for simpleXSV files.
# Valid values:
# GENE_NAME
# TRANSCRIPT_ID
xsv_key =
# Required field for simpleXSV files.
# The 0-based index of the column containing the key on which to match
xsv_key_column =
# Required field for simpleXSV AND locatableXSV files.
# The delimiter by which to split the XSV file into columns.
xsv_delimiter = \t
# Required field for simpleXSV files.
# Whether to permissively match the number of columns in the header and data rows
# Valid values:
# true
# false
xsv_permissive_cols = true
# Required field for locatableXSV files.
# The 0-based index of the column containing the contig for each row
contig_column = 1
# Required field for locatableXSV files.
# The 0-based index of the column containing the start position for each row
start_column = 2
# Required field for locatableXSV files.
# The 0-based index of the column containing the end position for each row
end_column = 3
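As a rough aid for writing such a file, the sketch below parses simple `key = value` lines and checks the fields that the comments in the example above mark as required for each type. It is an assumption-laden illustration, not GATK code: it is not a full Java properties parser (no escape handling or line continuations), and the names `REQUIRED_BY_TYPE`, `parse_config`, and `missing_required` are hypothetical.

```python
REQUIRED_BY_TYPE = {
    # Taken from the "Required field for ..." comments in the example above.
    "simpleXSV": ["xsv_key", "xsv_key_column", "xsv_delimiter", "xsv_permissive_cols"],
    "locatableXSV": ["xsv_delimiter", "contig_column", "start_column", "end_column"],
    "gencode": ["gencode_fasta_path", "ncbi_build_version"],
}


def parse_config(text):
    """Parse simple key = value lines, ignoring blank lines and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def missing_required(props):
    """List required keys that are absent or empty for the configured type."""
    required = REQUIRED_BY_TYPE.get(props.get("type", ""), [])
    return [key for key in required if not props.get(key)]
```

For the ORegAnno example, `missing_required(parse_config(text))` would be expected to return an empty list, because every locatableXSV field is filled in.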
The basic output of Funcotator is:
The pre-packaged data sources will create a set of baseline, or default, annotations for an input data set. Most of these data sources copy values from their source files directly into the output of Funcotator to create annotations. In this sense they are trivial data sources.
Funcotator performs some processing on the input data to create the Gencode annotations. Gencode is currently required, so Funcotator will create these annotations for all input variants. The order and specification of the Gencode annotations that Funcotator creates are as follows:
COULD_NOT_DETERMINE
Variant classification could not be determined.
INTRON
Variant lies between exons within the bounds of the chosen transcript.
Only valid for Introns.
FIVE_PRIME_UTR
Variant is on the 5'UTR for the chosen transcript.
Only valid for UTRs.
THREE_PRIME_UTR
Variant is on the 3'UTR for the chosen transcript.
Only valid for UTRs.
IGR
Intergenic region. Does not overlap any transcript.
Only valid for IGRs.
FIVE_PRIME_FLANK
The variant is upstream of the chosen transcript.
Only valid for IGRs.
THREE_PRIME_FLANK
The variant is downstream of the chosen transcript.
Only valid for IGRs.
MISSENSE
The point mutation alters the protein structure by one amino acid.
Can occur in Coding regions or Introns.
NONSENSE
A premature stop codon is created by the variant.
Can occur in Coding regions or Introns.
NONSTOP
Variant removes stop codon.
Can occur in Coding regions or Introns.
SILENT
Variant is in coding region of the chosen transcript, but protein structure is identical.
Can occur in Coding regions or Introns.
SPLICE_SITE
The variant is within a configurable number of bases of a splice site. See the secondary classification to determine if it lies on the exon or intron side.
Can occur in Coding regions or Introns.
IN_FRAME_DEL
Deletion that keeps the sequence in frame.
Can occur in Coding regions or Introns.
IN_FRAME_INS
Insertion that keeps the sequence in frame.
Can occur in Coding regions or Introns.
FRAME_SHIFT_INS
Insertion that moves the coding sequence out of frame.
Can occur in Coding regions or Introns.
FRAME_SHIFT_DEL
Deletion that moves the sequence out of frame.
Can occur in Coding regions or Introns.
START_CODON_SNP
Point mutation that overlaps the start codon.
Can occur in Coding regions or Introns.
START_CODON_INS
Insertion that overlaps the start codon.
Can occur in Coding regions or Introns.
START_CODON_DEL
Deletion that overlaps the start codon.
Can occur in Coding regions or Introns.
DE_NOVO_START_IN_FRAME
New start codon is created by the given variant using the chosen transcript.
However, it is in frame relative to the coded protein, meaning that if the coding sequence were extended then the new start codon would be in frame with the existing start and stop codons.
This can only occur in a 5' UTR.
DE_NOVO_START_OUT_FRAME
New start codon is created by the given variant using the chosen transcript.
However, it is out of frame relative to the coded protein, meaning that if the coding sequence were extended then the new start codon would NOT be in frame with the existing start and stop codons.
This can only occur in a 5' UTR.
RNA
Variant lies on one of the RNA transcripts.
(special catch-all case)
LINCRNA
Variant lies on one of the lincRNAs.
(special catch-all case)
g.[CONTIG]:[POSITION][BASES CHANGED]
The format of this field slightly varies based on VariantType:
g.[CONTIG]:[POSITION OF BASE PRIOR TO INSERTION]_[POSITION OF BASE AFTER INSERTION]ins[BASES INSERTED]
g.chr19:2018023_2018024insAATCG
g.[CONTIG]:[POSITION OF BASE DELETED]del[BASE DELETED]
g.chr19:2018023delT
g.[CONTIG]:[POSITION OF FIRST BASE DELETED]_[POSITION OF LAST BASE DELETED]del[BASES DELETED]
g.chr19:2018023_2018025delTTG
g.[CONTIG]:[POSITION OF BASE ALTERED][REFERENCE BASE]>[ALTERNATE BASE]
g.chr19:2018023T>G
g.[CONTIG]:[POSITION OF FIRST BASE ALTERED]_[POSITION OF LAST BASE ALTERED][REFERENCE BASES]>[ALTERNATE BASES]
g.chr19:2018023_2018025TTG>GAT
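The genome change formats above can be sketched as a small Python function. This is an illustration of the string formats only, not Funcotator's code; the function name and its argument conventions are assumptions stated in the docstring (in particular, the alleles are assumed to be already trimmed of any shared anchor base).

```python
def genome_change(contig, pos, ref, alt):
    """Format a genome change string.

    Assumptions: pos is the 1-based position of the first altered or deleted
    base, or, for an insertion, of the base immediately before the inserted
    bases. ref is empty for an insertion; alt is empty for a deletion.
    """
    if ref and not alt:  # deletion
        end = pos + len(ref) - 1
        span = str(pos) if end == pos else f"{pos}_{end}"
        return f"g.{contig}:{span}del{ref}"
    if alt and not ref:  # insertion between pos and pos + 1
        return f"g.{contig}:{pos}_{pos + 1}ins{alt}"
    if len(ref) == 1 and len(alt) == 1:  # single-base substitution
        return f"g.{contig}:{pos}{ref}>{alt}"
    # multi-base substitution
    return f"g.{contig}:{pos}_{pos + len(ref) - 1}{ref}>{alt}"
```

Called with the values from the examples above, for instance `genome_change("chr19", 2018023, "T", "G")`, the function reproduces the corresponding example string.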
[START]_[END]
E.g.: 1236_1237
c.[POSITION][BASES CHANGED]
The format of this field slightly varies based on VariantType, the number of affected bases, and whether the variant allele is a SPLICE_SITE:
c.[POSITION OF BASE PRIOR TO INSERTION]_[POSITION OF BASE AFTER INSERTION]ins[BASES INSERTED]
c.2018_2019insAA
c.[POSITION OF BASE DELETED]del[BASE DELETED]
c.2018delT
c.[POSITION OF FIRST BASE DELETED]_[POSITION OF LAST BASE DELETED]del[BASES DELETED]
c.2018_2022delTTCAG
c.[POSITION OF BASE CHANGED][REFERENCE BASE]>[NEW BASE]
c.1507T>G
c.[POSITION OF FIRST BASE CHANGED]_[POSITION OF LAST BASE CHANGED][REFERENCE BASES]>[NEW BASES]
c.12899_12900AG>TA
c.e[EXON NUMBER][+|-][BASES FROM EXON][REF ALLELE]>[ALT ALLELE]
c.e81-4TAA>A
c.[POSITION][BASES CHANGED]
The format of this field slightly varies based on VariantType, the number of affected bases, and whether the variant allele occurs in an Intron:
c.([POSITION OF FIRST BASE IN FIRST CODON IN THE REFERENCE AFFECTED BY THIS VARIANT]-[POSITION OF LAST BASE IN LAST CODON IN THE REFERENCE AFFECTED BY THIS VARIANT])[REFERENCE CODONS]>[EXPRESSED CODONS]
c.(19-21)ctt>ctCGTt
c.([POSITION OF FIRST BASE IN FIRST CODON DELETED]-[POSITION OF LAST BASE IN LAST CODON DELETED])[REFERENCE CODONS]del
c.(997-999)gcadel
c.([POSITION OF FIRST BASE IN FIRST CODON DELETED]-[POSITION OF LAST BASE IN LAST CODON DELETED])[REFERENCE CODONS]>[EXPRESSED CODONS]
c.(997-1002)gcactc>gtc
c.([POSITION OF FIRST BASE IN LAST CORRECTLY EXPRESSED/REFERENCE CODON]-[POSITION OF LAST BASE IN LAST CORRECTLY EXPRESSED/REFERENCE CODON])[REFERENCE CODONS]fs
c.(997-999)gcafs
c.([POSITION OF FIRST BASE IN FIRST CODON IN THE REFERENCE AFFECTED BY THIS VARIANT]-[POSITION OF LAST BASE IN LAST CODON IN THE REFERENCE AFFECTED BY THIS VARIANT])[REFERENCE CODONS]>[EXPRESSED CODONS]
c.(39871-39873)cCC>cTT
c.(4-9)ctAAgc>ctGCgc
p.[REFERENCE AMINO ACID][POSITION][PREDICTED EXPRESSED AMINO ACID]
p.V5T
p.R2R
p.[FIRST AFFECTED AMINO ACID POSITION]_[LAST AFFECTED AMINO ACID POSITION][REFERENCE AMINO ACIDS]>[PREDICTED EXPRESSED AMINO ACIDS]
p.100_101Q*>FL
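The two protein change formats above can be expressed as one short helper. This Python sketch only reproduces the string layout; the function name and parameters are hypothetical, and it does not predict amino acids.

```python
def protein_change(start, ref_aa, alt_aa, end=None):
    """Format a protein change string.

    With a single affected position the layout is p.<REF><POS><ALT>;
    with a range it is p.<FIRST>_<LAST><REF>><ALT>.
    """
    if end is None or end == start:
        return f"p.{ref_aa}{start}{alt_aa}"
    return f"p.{start}_{end}{ref_aa}>{alt_aa}"
```

For example, `protein_change(5, "V", "T")` gives the single-position form, and passing `end=101` with `start=100` gives the range form.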
        [REF ALLELE]
              |
              v
    GAACCCACGTCGGTGAGGGCC
    |________| |________|
        v          v
     10 bases   10 bases
  (window size) (window size)
Strand-correct specifically means that if the strand of this transcript is determined to be '-' then the sequence is reverse complemented.
        [REF ALLELE]
              |
              v
    CACGAAAGTCTTGCGGATCT
    |________| |________|
        v          v
     10 bases   10 bases
  (window size) (window size)
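The windowing and strand correction described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than Funcotator's implementation: the function names are hypothetical, the reference is passed as a plain string, and `start` is a 0-based index into that string.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def reference_context(reference, start, ref_allele_len, window, strand="+"):
    """Return the ref allele with up to `window` bases on each side.

    start is the 0-based index of the first ref allele base in `reference`.
    For strand '-', the whole context is reverse complemented.
    """
    lo = max(0, start - window)
    hi = min(len(reference), start + ref_allele_len + window)
    context = reference[lo:hi]
    return reverse_complement(context) if strand == "-" else context
```

With the first 21-base sequence shown above, a one-base ref allele at index 10, and a window of 10, the '+' strand context is the sequence itself and the '-' strand context is its reverse complement.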
[HUGO SYMBOL]_[TRANSCRIPT ID]_[VARIANT CLASSIFICATION]_[PROTEIN CHANGE STRING]
SDF4_ENST00000263741.7_MISSENSE_p.R243Q
If another transcript were to be an IGR, the other transcript field would be populated with 'IGR_ANNOTATON'.
SDF4_ENST00000263741.7_MISSENSE_p.R243Q/TNFRSF4_ENST00000379236.3_FIVE_PRIME_FLANK
If this variant alternate allele occurs in only one transcript, this field will be empty.
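The other-transcripts format above amounts to joining per-transcript summaries with '/'. The Python sketch below reproduces that layout; the function names are hypothetical, and the assumption that the protein change is simply omitted when absent is inferred from the second example above.

```python
def transcript_summary(hugo_symbol, transcript_id, classification, protein_change=""):
    """Join the fields with '_', leaving out the protein change when it is empty."""
    parts = [hugo_symbol, transcript_id, classification]
    if protein_change:
        parts.append(protein_change)
    return "_".join(parts)


def other_transcripts_field(summaries):
    """Join per-transcript summaries with '/'; an empty list gives an empty field."""
    return "/".join(summaries)
```

Passing the two transcripts from the example above yields the same slash-separated string, and an empty list yields the empty field described for variants that occur in only one transcript.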
Other annotations will follow the Gencode annotations and will be based on the data sources included in the data sources directory.
./gatk Funcotator \
    -R reference.fasta \
    -V input.vcf \
    -O outputFile \
    --output-file-format MAF \
    --data-sources-path dataSourcesFolder/ \
    --ref-version hg19
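When Funcotator is launched from a script, it can help to build the argument list in code. The Python sketch below assembles the same command as the example above using only the flags shown there; the helper name `funcotator_command` is hypothetical, and the sketch only builds and prints the command rather than running it.

```python
import shlex


def funcotator_command(reference, variants, output, output_format,
                       data_sources, ref_version):
    """Build the Funcotator argument list; data_sources may hold several folders."""
    args = ["./gatk", "Funcotator",
            "-R", reference,
            "-V", variants,
            "-O", output,
            "--output-file-format", output_format]
    for path in data_sources:
        # --data-sources-path may be specified more than once.
        args += ["--data-sources-path", path]
    args += ["--ref-version", ref_version]
    return args


cmd = funcotator_command("reference.fasta", "input.vcf", "outputFile",
                         "MAF", ["dataSourcesFolder/"], "hg19")
print(shlex.join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where `./gatk` is available.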
A complete list of known open issues can be found in the GATK GitHub entry for Funcotator.
This Read Filter is automatically applied to the data by the Engine before processing by Funcotator.
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list further down below the table or click on an argument name to jump directly to that entry in the list.
| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Arguments | | |
| --data-sources-path | | The path to a data source folder for Funcotator. May be specified more than once to handle multiple data source folders. |
| --output -O | | Output file to which annotated variants should be written. |
| --output-file-format | | The output file format. Either VCF, MAF, or SEG. Please note that MAF output for germline use case VCFs is unsupported. SEG will generate two output files: a simple tsv and a gene list. |
| --ref-version | | The version of the Human Genome reference to use (e.g. hg19, hg38, etc.). This will correspond to a sub-folder of each data source corresponding to that data source for the given reference. |
| --reference -R | | Reference sequence file |
| --variant -V | | A VCF file containing variants |
| Optional Tool Arguments | | |
| --annotation-default | | Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format |
| --annotation-override | | Override values for annotations (in the format |
| --arguments_file | | read one or more arguments files and add them to the command line |
| --cloud-index-prefetch-buffer -CIPB | -1 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset. |
| --cloud-prefetch-buffer -CPB | 40 | Size of the cloud-only prefetch buffer (in MB; 0 to disable). |
| --custom-variant-classification-order | | TSV File containing custom Variant Classification severity map of the form: VARIANT_CLASSIFICATION SEV. VARIANT_CLASSIFICATION must match one of the VariantClassification names (COULD_NOT_DETERMINE, INTRON, FIVE_PRIME_UTR, THREE_PRIME_UTR, IGR, FIVE_PRIME_FLANK, THREE_PRIME_FLANK, MISSENSE, NONSENSE, NONSTOP, SILENT, SPLICE_SITE, IN_FRAME_DEL, IN_FRAME_INS, FRAME_SHIFT_INS, FRAME_SHIFT_DEL, START_CODON_SNP, START_CODON_INS, START_CODON_DEL, DE_NOVO_START_IN_FRAME, DE_NOVO_START_OUT_FRAME, RNA, LINCRNA). SEV is an unsigned integer, where lower is sorted first. When using this option it is HIGHLY recommended you also use the `BEST_EFFECT` transcript selection mode. |
| --disable-bam-index-caching -DBIC | false | If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified. |
| --disable-sequence-dictionary-validation | false | If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk! |
| --exclude-field | | Fields that should not be rendered in the final output. Only exact name matches will be excluded. |
| --five-prime-flank-size | 5000 | Variants within this many bases of the 5' end of a transcript (and not overlapping any part of the transcript itself) will be annotated as being in the 5' flanking region of that transcript |
| --gcs-max-retries -gcs-retries | 20 | If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection |
| --gcs-project-for-requester-pays | | Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed. |
| --help -h | false | display the help message |
| --interval-merging-rule -imr | ALL | Interval merging rule for abutting intervals |
| --intervals -L | | One or more genomic intervals over which to operate |
| --lookahead-cache-bp | 100000 | Number of base-pairs to cache when querying variants. Can be overridden in individual data source configuration files. |
| --reannotate-vcf | false | When input VCF has already been annotated, still annotate again. |
| --remove-filtered-variants | false | Ignore/drop variants that have been filtered in the input. These variants will not appear in the output file. |
| --sites-only-vcf-output | false | If true, don't emit genotype fields when writing vcf file output. |
| --splice-site-window-size | 2 | Number of bases on either side of a splice site for a variant to be classified as a SPLICE_SITE variant (default: 2). |
| --three-prime-flank-size | 0 | Variants within this many bases of the 3' end of a transcript (and not overlapping any part of the transcript itself) will be annotated as being in the 3' flanking region of that transcript |
| --transcript-list | | File to use as a list of transcripts (one transcript ID per line, version numbers are ignored) OR A set of transcript IDs to use for annotation to override selected transcript. |
| --transcript-selection-mode | CANONICAL | Method of detailed transcript selection. This will select the transcript for detailed annotation (CANONICAL, ALL, or BEST_EFFECT). |
| --version | false | display the version number for this tool |
| Optional Common Arguments | | |
| --add-output-sam-program-record | true | If true, adds a PG tag to created SAM/BAM/CRAM files. |
| --add-output-vcf-command-line | true | If true, adds a command line header line to created VCF files. |
| --create-output-bam-index -OBI | true | If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file. |
| --create-output-bam-md5 -OBM | false | If true, create an MD5 digest for any BAM/SAM/CRAM file created |
| --create-output-variant-index -OVI | true | If true, create a VCF index when writing a coordinate-sorted VCF file. |
| --create-output-variant-md5 -OVM | false | If true, create an MD5 digest for any VCF file created. |
| --disable-read-filter -DF | | Read filters to be disabled before analysis |
| --disable-tool-default-read-filters | false | Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on) |
| --exclude-intervals -XL | | One or more genomic intervals to exclude from processing |
| --gatk-config-file | | A configuration file to use with the GATK. |
| --input -I | | BAM/SAM/CRAM file containing reads |
| --interval-exclusion-padding -ixp | 0 | Amount of padding (in bp) to add to each interval you are excluding. |
| --interval-padding -ip | 0 | Amount of padding (in bp) to add to each interval you are including. |
| --interval-set-rule -isr | UNION | Set merging approach to use for combining interval inputs |
| --inverted-read-filter -XRF | | Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters). |
| --lenient -LE | false | Lenient processing of VCF files |
| --max-variants-per-shard | 0 | If non-zero, partitions VCF output into shards, each containing up to the given number of records. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --read-filter -RF | | Read filters to be applied before analysis |
| --read-index | | Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically. |
| --read-validation-stringency -VS | SILENT | Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. |
| --seconds-between-progress-updates | 10.0 | Output traversal statistics every time this many seconds elapse |
| --sequence-dictionary | | Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file. |
| --tmp-dir | | Temp directory to use. |
| --use-jdk-deflater -jdk-deflater | false | Whether to use the JdkDeflater (as opposed to IntelDeflater) |
| --use-jdk-inflater -jdk-inflater | false | Whether to use the JdkInflater (as opposed to IntelInflater) |
| --verbosity | INFO | Control verbosity of logging. |
| Advanced Arguments | | |
| --min-num-bases-for-segment-funcotation | 150 | The minimum number of bases for a variant to be annotated as a segment. Recommended to be changed only for use with FuncotateSegments. Defaults to 150 |
| --prefer-mane-transcripts | false | If this flag is set, Funcotator will prefer 'MANE_Plus_Clinical' followed by 'MANE_select' transcripts (including those not tagged 'basic') if one is present for a given variant. If neither tag is present it uses the default behavior (only base transcripts). |
| --showHidden | false | display hidden arguments |
| --variant-output-filtering | | Restrict the output variants to ones that match the specified intervals according to the specified matching mode. |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
If true, adds a PG tag to created SAM/BAM/CRAM files.
boolean true
If true, adds a command line header line to created VCF files.
boolean true
Annotations to include in all annotated variants if the annotation is not specified in the data sources (in the format
List[String] []
Override values for annotations (in the format
List[String] []
read one or more arguments files and add them to the command line
List[File] []
Size of the cloud-only prefetch buffer (in MB; 0 to disable). Defaults to cloudPrefetchBuffer if unset.
int -1 [ [ -∞ ∞ ] ]
Size of the cloud-only prefetch buffer (in MB; 0 to disable).
int 40 [ [ -∞ ∞ ] ]
If true, create a BAM/CRAM index when writing a coordinate-sorted BAM/CRAM file.
boolean true
If true, create a MD5 digest for any BAM/SAM/CRAM file created
boolean false
If true, create a VCF index when writing a coordinate-sorted VCF file.
boolean true
If true, create an MD5 digest for any VCF file created.
boolean false
TSV File containing custom Variant Classification severity map of the form: VARIANT_CLASSIFICATION SEV. VARIANT_CLASSIFICATION must match one of the VariantClassification names (COULD_NOT_DETERMINE, INTRON, FIVE_PRIME_UTR, THREE_PRIME_UTR, IGR, FIVE_PRIME_FLANK, THREE_PRIME_FLANK, MISSENSE, NONSENSE, NONSTOP, SILENT, SPLICE_SITE, IN_FRAME_DEL, IN_FRAME_INS, FRAME_SHIFT_INS, FRAME_SHIFT_DEL, START_CODON_SNP, START_CODON_INS, START_CODON_DEL, DE_NOVO_START_IN_FRAME, DE_NOVO_START_OUT_FRAME, RNA, LINCRNA). SEV is an unsigned integer, where lower is sorted first. When using this option it is HIGHLY recommended you also use the `BEST_EFFECT` transcript selection mode.
GATKPath null
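To make the severity map concrete, the Python sketch below parses text of the form VARIANT_CLASSIFICATION, a tab, then SEV, and picks the classification that sorts first. It is only an illustration: the function names are hypothetical, and it assumes a tab separator with no header row, which the description above does not spell out.

```python
def parse_severity_map(text):
    """Parse lines of 'VARIANT_CLASSIFICATION<TAB>SEV' into a dict of ints."""
    severities = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        name, sev = line.split("\t")
        severities[name.strip()] = int(sev)
    return severities


def most_severe(classifications, severities):
    """Return the classification with the lowest SEV (lower is sorted first)."""
    return min(classifications, key=lambda name: severities[name])
```

For instance, if MISSENSE is given SEV 1 and SILENT is given SEV 5, `most_severe(["SILENT", "MISSENSE"], severities)` selects MISSENSE.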
The path to a data source folder for Funcotator. May be specified more than once to handle multiple data source folders.
R List[String] []
If true, don't cache bam indexes, this will reduce memory requirements but may harm performance if many intervals are specified. Caching is automatically disabled if there are no intervals specified.
boolean false
Read filters to be disabled before analysis
List[String] []
If specified, do not check the sequence dictionaries from our inputs for compatibility. Use at your own risk!
boolean false
Disable all tool default read filters (WARNING: many tools will not function correctly without their default read filters on)
boolean false
Fields that should not be rendered in the final output. Only exact name matches will be excluded.
Set[String] []
One or more genomic intervals to exclude from processing
Use this argument to exclude certain parts of the genome from the analysis (like -L, but the opposite). This argument can be specified multiple times. You can use samtools-style intervals either explicitly on the command line (e.g. -XL 1 or -XL 1:100-200) or by loading in a file containing a list of intervals (e.g. -XL myFile.intervals).
List[String] []
Variants within this many bases of the 5' end of a transcript (and not overlapping any part of the transcript itself) will be annotated as being in the 5' flanking region of that transcript
int 5000 [ [ -∞ ∞ ] ]
A configuration file to use with the GATK.
String null
If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
int 20 [ [ -∞ ∞ ] ]
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
String ""
display the help message
boolean false
BAM/SAM/CRAM file containing reads
List[GATKPath] []
Amount of padding (in bp) to add to each interval you are excluding.
Use this to add padding to the intervals specified using -XL. For example, '-XL 1:100' with a
padding value of 20 would turn into '-XL 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
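The padding arithmetic in the example above can be sketched as follows. This Python helper is hypothetical and only handles intervals written as 'contig:pos' or 'contig:start-end'; clamping the start at position 1 is an assumption of the sketch, not documented behavior.

```python
def pad_interval(interval, padding):
    """Pad a 'contig:pos' or 'contig:start-end' interval by `padding` bases.

    Contig-only intervals (e.g. '1') are not handled by this sketch.
    """
    contig, _, span = interval.rpartition(":")
    start_text, _, end_text = span.partition("-")
    start = int(start_text)
    end = int(end_text) if end_text else start
    # Assumption: positions are 1-based, so the padded start never drops below 1.
    return f"{contig}:{max(1, start - padding)}-{end + padding}"
```

With the values from the text, `pad_interval("1:100", 20)` produces the interval 1:80-120.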
Interval merging rule for abutting intervals
By default, the program merges abutting intervals (i.e. intervals that are directly side-by-side but do not
actually overlap) into a single continuous interval. However you can change this behavior if you want them to be
treated as separate intervals instead.
The --interval-merging-rule argument is an enumerated type (IntervalMergingRule), which can have one of the following values:
IntervalMergingRule ALL
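The difference between merging abutting intervals and keeping them separate can be illustrated with a short Python sketch. It is not GATK's implementation; the function name and the `merge_abutting` flag are hypothetical stand-ins for the merging rule, and intervals are 1-based inclusive (start, end) pairs on a single contig.

```python
def merge_intervals(intervals, merge_abutting=True):
    """Merge overlapping intervals; optionally also merge abutting ones.

    intervals: iterable of (start, end) pairs, 1-based inclusive, same contig.
    """
    # Abutting intervals leave no gap: the next start equals previous end + 1.
    reach = 1 if merge_abutting else 0
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + reach:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Given (1, 10) and (11, 20), the default call returns one interval (1, 20), while `merge_abutting=False` keeps the two intervals separate; overlapping intervals are merged in both cases.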
Amount of padding (in bp) to add to each interval you are including.
Use this to add padding to the intervals specified using -L. For example, '-L 1:100' with a
padding value of 20 would turn into '-L 1:80-120'. This is typically used to add padding around targets when
analyzing exomes.
int 0 [ [ -∞ ∞ ] ]
Set merging approach to use for combining interval inputs
By default, the program will take the UNION of all intervals specified using -L and/or -XL. However, you can
change this setting for -L, for example if you want to take the INTERSECTION of the sets instead. E.g. to
perform the analysis only on chromosome 1 exomes, you could specify -L exomes.intervals -L 1 --interval-set-rule
INTERSECTION. However, it is not possible to modify the merging approach for intervals passed using -XL (they will
always be merged using UNION).
Note that if you specify both -L and -XL, the -XL interval set will be subtracted from the -L interval set.
The --interval-set-rule argument is an enumerated type (IntervalSetRule), which can have one of the following values:
IntervalSetRule UNION
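The interaction of UNION, INTERSECTION, and -XL subtraction can be sketched with plain Python sets. This is a toy model for small intervals on one contig, not how GATK stores intervals; the function names are hypothetical.

```python
def covered_positions(intervals):
    """Expand 1-based inclusive (start, end) pairs into a set of positions."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def combine_intervals(include_lists, exclude_list=(), rule="UNION"):
    """Combine several -L style interval lists, then subtract the -XL list.

    The rule applies only to the included lists; excluded intervals are
    always treated as a union and subtracted afterwards.
    """
    position_sets = [covered_positions(lst) for lst in include_lists]
    if not position_sets:
        combined = set()
    elif rule == "INTERSECTION":
        combined = set.intersection(*position_sets)
    else:
        combined = set.union(*position_sets)
    return combined - covered_positions(exclude_list)
```

For two included lists covering 1-10 and 5-15, UNION keeps positions 1 through 15, INTERSECTION keeps 5 through 10, and excluding 8-9 removes those two positions from either result.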
One or more genomic intervals over which to operate
List[String] []
Inverted (with flipped acceptance/failure conditions) read filters applied before analysis (after regular read filters).
List[String] []
Lenient processing of VCF files
boolean false
Number of base-pairs to cache when querying variants. Can be overridden in individual data source configuration files.
int 100000 [ [ 0 ∞ ] ]
If non-zero, partitions VCF output into shards, each containing up to the given number of records.
int 0 [ [ 0 ∞ ] ]
The minimum number of bases for a variant to be annotated as a segment. Recommended to be changed only for use with FuncotateSegments. Defaults to 150
int 150 [ [ -∞ ∞ ] ]
Output file to which annotated variants should be written.
R File null
The output file format. Either VCF, MAF, or SEG. Please note that MAF output for germline use case VCFs is unsupported. SEG will generate two output files: a simple tsv and a gene list.
The --output-file-format argument is an enumerated type (OutputFormatType), which can have one of the following values:
R OutputFormatType null
If this flag is set, Funcotator will prefer 'MANE_Plus_Clinical' followed by 'MANE_select' transcripts (including those not tagged 'basic') if one is present for a given variant. If neither tag is present it uses the default behavior (only base transcripts).
boolean false
Whether to suppress job-summary info on System.err.
Boolean false
Read filters to be applied before analysis
List[String] []
Indices to use for the read inputs. If specified, an index must be provided for every read input and in the same order as the read inputs. If this argument is not specified, the path to the index for each input will be inferred automatically.
List[GATKPath] []
Validation stringency for all SAM/BAM/CRAM/SRA files read by this program. The default stringency value SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --read-validation-stringency argument is an enumerated type (ValidationStringency), which can have one of the following values:
ValidationStringency SILENT
When input VCF has already been annotated, still annotate again.
boolean false
The version of the Human Genome reference to use (e.g. hg19, hg38, etc.). This will correspond to a sub-folder of each data source corresponding to that data source for the given reference.
R String null
Reference sequence file
R GATKPath null
Ignore/drop variants that have been filtered in the input. These variants will not appear in the output file.
boolean false
Output traversal statistics every time this many seconds elapse
double 10.0 [ [ -∞ ∞ ] ]
Use the given sequence dictionary as the master/canonical sequence dictionary. Must be a .dict file.
GATKPath null
display hidden arguments
boolean false
If true, don't emit genotype fields when writing vcf file output.
boolean false
Number of bases on either side of a splice site for a variant to be classified as a SPLICE_SITE variant (default: 2).
int 2 [ [ 0 ∞ ] ]
Variants within this many bases of the 3' end of a transcript (and not overlapping any part of the transcript itself) will be annotated as being in the 3' flanking region of that transcript
int 0 [ [ -∞ ∞ ] ]
Temp directory to use.
GATKPath null
File to use as a list of transcripts (one transcript ID per line, version numbers are ignored) OR A set of transcript IDs to use for annotation to override selected transcript.
Set[String] []
Method of detailed transcript selection. This will select the transcript for detailed annotation (CANONICAL, ALL, or BEST_EFFECT).
The --transcript-selection-mode argument is an enumerated type (TranscriptSelectionMode), which can have one of the following values:
TranscriptSelectionMode CANONICAL
Whether to use the JdkDeflater (as opposed to IntelDeflater)
boolean false
Whether to use the JdkInflater (as opposed to IntelInflater)
boolean false
A VCF file containing variants
R GATKPath null
Restrict the output variants to ones that match the specified intervals according to the specified matching mode.
The --variant-output-filtering argument is an enumerated type (Mode), which can have one of the following values:
Mode null
Control verbosity of logging.
The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:
LogLevel INFO
display the version number for this tool
boolean false
GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.