
**BETA** TrainVariantAnnotationsModel

Trains a model for scoring variant calls based on site-level annotations

Category Variant Filtering


Overview

Trains a model for scoring variant calls based on site-level annotations.

This tool is primarily intended to be used as the second step in a variant-filtering workflow that supersedes the {@link VariantRecalibrator} workflow. Given training (and optionally, calibration) sets of site-level annotations produced by {@link ExtractVariantAnnotations}, this tool can be used to train a model for scoring variant calls. For each variant type (i.e., SNP or INDEL) specified using the "--mode" argument, the tool outputs files that are either: 1) serialized scorers, each of which persists to disk a function for computing scores given subsequent annotations, or 2) HDF5 files containing a set of scores, each corresponding to training, calibration, and unlabeled sets, as appropriate.

The model files produced by this tool can in turn be provided along with a VCF file to the {@link ScoreVariantAnnotations} tool, which assigns a score to each call (with a lower score indicating that a call is more likely to be an artifact and should perhaps be filtered). Each score can also be converted to a corresponding sensitivity with respect to a calibration set, if the latter is available.

Modeling approaches

This tool can perform modeling using either a positive-only approach or a positive-unlabeled approach. In a positive-only approach, the annotation-space distribution of training sites is used to learn a function for converting annotations for subsequent sites into a score; typically, higher scores correspond to regions of annotation space that are more densely populated by training sites. In contrast, a positive-unlabeled approach attempts to additionally use unlabeled sites to better learn not only these regions of annotation space populated by training sites, but also those that are populated by sites that may be drawn from a different distribution.
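The positive-only idea above can be illustrated with a toy sketch (this is not GATK code; the Gaussian density model and the Mahalanobis-distance score are illustrative assumptions): annotation vectors that fall in densely populated regions of the training distribution receive higher scores than outlying ones.

```python
# Illustrative sketch of a positive-only scorer (not GATK code): higher
# scores for annotation vectors lying in denser regions of the training
# distribution, here modeled as a single Gaussian for simplicity.
import numpy as np

rng = np.random.default_rng(0)
training = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))  # training-site annotations

mean = training.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(training, rowvar=False))

def score(annotations):
    """Higher (less negative) score = closer to the bulk of training sites."""
    d = annotations - mean
    # Negative squared Mahalanobis distance: a simple density-based score.
    return -np.einsum('ij,jk,ik->i', d, cov_inv, d)

typical = np.array([[0.0, 0.0]])   # near the training bulk
outlier = np.array([[5.0, 5.0]])   # far from the training bulk
assert score(typical)[0] > score(outlier)[0]
```

A positive-unlabeled approach would additionally use the unlabeled sites to reshape this score function, rather than relying on the training density alone.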

A positive-only approach is likely to perform well in cases where a sufficient number of reliable training sites is available. In contrast, if 1) only a small number of reliable training sites is available, and/or 2) the reliability of the training sites is questionable (e.g., the sites may be contaminated by a non-negligible number of sequencing artifacts), then a positive-unlabeled approach may be beneficial. Further note that although {@link VariantRecalibrator} (which this tool supplants) has typically been used to implement a naive positive-unlabeled approach, a positive-only approach likely suffices in many use cases.

If a positive-only approach has been specified, then if training sites of the variant type are available, the tool trains a model on those sites and outputs 1) a serialized scorer for that variant type and 2) HDF5 files of scores for the training set (and for the calibration set, if calibration sites are available).

In contrast, a positive-unlabeled approach may instead be specified by providing the "--unlabeled-annotations-hdf5" argument. Currently, this requires the use of a custom modeling backend; see below.

Modeling backends

This tool allows the use of different backends for modeling and scoring. See the instructions below for using a custom, user-provided implementation.

Python isolation-forest backend

This backend uses scikit-learn modules to train models and scoring functions using the isolation-forest method for anomaly detection. Median imputation of missing annotation values is performed before applying the method.

This backend can be selected by specifying "--model-backend PYTHON_IFOREST" and is currently the default backend. It is implemented by the script at src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py, which requires that the argparse, h5py, numpy, sklearn, and dill packages be present in the Python environment; users may wish to simply use the provided GATK conda environment to ensure that the correct versions of all packages are available. See the scikit-learn IsolationForest documentation appropriate for the version of scikit-learn used in your Python environment. The hyperparameters documented there can be specified using the "--hyperparameters-json" argument; see src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest-hyperparameters.json for an example and the default values.
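The two steps this backend performs, median imputation followed by isolation-forest fitting, can be sketched as follows (a minimal illustration mirroring, but not reproducing, the isolation-forest.py script; the toy annotation matrix and hyperparameter values are assumptions):

```python
# Sketch of the isolation-forest backend's two steps: median imputation
# of missing annotation values, then fitting scikit-learn's IsolationForest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))   # annotation matrix (sites x annotations)
X[::50, 1] = np.nan             # some sites are missing an annotation value

# Median imputation of missing annotation values, as described above.
medians = np.nanmedian(X, axis=0)
X_imputed = np.where(np.isnan(X), medians, X)

# Hyperparameters such as n_estimators could come from --hyperparameters-json.
model = IsolationForest(n_estimators=100, random_state=0).fit(X_imputed)
scores = model.score_samples(X_imputed)  # higher = more inlier-like
```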

Note that HDF5 files may be viewed using hdfview or loaded in Python using PyTables or h5py.
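For example, a minimal h5py round trip looks like the following (the file path and dataset name here are illustrative, not the actual layout of the GATK output files):

```python
# Write a small HDF5 file and read it back with h5py, as one way to
# inspect HDF5 outputs. Dataset name 'scores' is illustrative only.
import os
import tempfile

import h5py
import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'example.hdf5')
    with h5py.File(path, 'w') as f:
        f.create_dataset('scores', data=np.array([0.1, -0.5, 0.3]))
    with h5py.File(path, 'r') as f:
        scores = f['scores'][:]   # read the dataset into a numpy array
```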

Calibration sets

The choice of calibration set will determine the conversion between model scores and calibration-set sensitivities. Ideally, the calibration set should comprise an unbiased sample from the full distribution of true sites in annotation space; the score-sensitivity conversion can roughly be thought of as a mapping from sensitivities in [0, 1] to a contour of this annotation-space distribution. In practice, any biases in the calibration set (e.g., if it consists of high-quality, previously filtered calls, which may be biased towards the high-density regions of the full distribution) will be reflected in the conversion and should be taken into consideration when interpreting calibration-set sensitivities.
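One plausible definition of this conversion (an illustrative sketch, not the tool's exact implementation) takes the calibration-set sensitivity at a score threshold to be the fraction of calibration sites scoring at or above that threshold:

```python
# Sketch of a score-to-sensitivity conversion: sensitivity at a score
# threshold = fraction of calibration sites retained by that threshold.
import numpy as np

rng = np.random.default_rng(0)
calibration_scores = np.sort(rng.normal(size=1000))  # scores of calibration sites

def calibration_sensitivity(score_threshold):
    return np.mean(calibration_scores >= score_threshold)

# Thresholding at the minimum score retains the whole calibration set...
assert calibration_sensitivity(calibration_scores.min()) == 1.0
# ...while raising the threshold monotonically lowers the sensitivity.
assert calibration_sensitivity(1.0) <= calibration_sensitivity(0.0)
```

Biases in the calibration set shift this curve, which is why they carry over into the reported sensitivities.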

Inputs

Labeled-annotations HDF5 file produced by ExtractVariantAnnotations ("--annotations-hdf5"), containing the training (and optionally, calibration) sets.
(Optional) Unlabeled-annotations HDF5 file produced by ExtractVariantAnnotations ("--unlabeled-annotations-hdf5"); if provided, a positive-unlabeled modeling approach will be used.

Outputs

The following outputs are produced for each variant type specified by the "--mode" argument and are delineated by type-specific tags in the filename of each output, which take the form of {output-prefix}.{variant-type}.{file-suffix}. For example, scores for the SNP calibration set will be output to the {output-prefix}.snp.calibrationScores.hdf5 file.
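Spelled out in Python, the naming scheme is simply (the lowercasing of the variant type matches the snp/indel tags in the example above):

```python
# The {output-prefix}.{variant-type}.{file-suffix} naming scheme above.
def output_filename(output_prefix, variant_type, file_suffix):
    return f"{output_prefix}.{variant_type.lower()}.{file_suffix}"

assert output_filename("train", "SNP", "calibrationScores.hdf5") == \
    "train.snp.calibrationScores.hdf5"
```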

Usage examples

Train SNP and INDEL models using the default Python IsolationForest model backend with a positive-only approach, given an input labeled-annotations HDF5 file generated by {@link ExtractVariantAnnotations} that contains labels for both training and calibration sets, producing the outputs 1) train.snp.scorer.pkl, 2) train.snp.trainingScores.hdf5, and 3) train.snp.calibrationScores.hdf5, as well as analogous files for the INDEL model. Note that the "--mode" arguments are made explicit here, although both SNP and INDEL modes are selected by default.

     gatk TrainVariantAnnotationsModel \
          --annotations-hdf5 extract.annot.hdf5 \
          --mode SNP \
          --mode INDEL \
          -O train
 

Custom modeling/scoring backends (ADVANCED)

The primary modeling functionality performed by this tool is accomplished by a "modeling backend" whose fundamental contract is to take an input HDF5 file containing an annotation matrix for sites of a single variant type (i.e., SNP or INDEL) (as well as an analogous HDF5 file for unlabeled sites, if a positive-unlabeled modeling approach has been specified) and to output a serialized scorer for that variant type. Rather than using one of the available, implemented backends, advanced users may provide their own backend via the "--python-script" argument. See documentation in the modeling and scoring interfaces ({@link VariantAnnotationsModel} and {@link VariantAnnotationsScorer}, respectively), as well as the default Python IsolationForest implementation at {@link PythonVariantAnnotationsModel} and src/main/resources/org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py.
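A hypothetical backend skeleton might look like the following. All names and I/O details here are assumptions for illustration; the real contract is defined by the {@link VariantAnnotationsModel} and {@link VariantAnnotationsScorer} interfaces, and a real script would read the annotation matrix from the input HDF5 file rather than take an array. The essential shape, annotations in, serialized scorer out, is what the sketch shows:

```python
# Hypothetical custom-backend skeleton (illustrative only): train a
# trivial scorer on an annotation matrix and serialize it to disk.
import pickle

import numpy as np

class MedianDistanceScorer:
    """Scores a site by negative L1 distance from per-annotation medians."""
    def __init__(self, medians):
        self.medians = medians

    def score(self, annotations):
        return -np.abs(annotations - self.medians).sum(axis=1)

def train_and_serialize(annotations, scorer_path):
    # Fit the (toy) model, then persist a scorer for downstream use.
    scorer = MedianDistanceScorer(np.median(annotations, axis=0))
    with open(scorer_path, 'wb') as f:
        pickle.dump(scorer, f)

# A downstream step would later deserialize the scorer and call .score()
# on the annotations of subsequent sites.
```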

Extremely advanced users could potentially substitute their own implementation for the entire {@link TrainVariantAnnotationsModel} tool, while still making use of the up/downstream {@link ExtractVariantAnnotations} and {@link ScoreVariantAnnotations} tools. To do so, one would additionally have to implement functionality for subsetting training/calibration sets by variant type, calling modeling backends as appropriate, and scoring calibration sets.

@author Samuel Lee <slee@broadinstitute.org>

TrainVariantAnnotationsModel specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table or click on an argument name to jump directly to that entry in the list.

Argument name(s) Default value Summary
Required Arguments
--annotations-hdf5
HDF5 file containing annotations extracted with ExtractVariantAnnotations.
--output
 -O
Output prefix.
Optional Tool Arguments
--arguments_file
read one or more arguments files and add them to the command line
--gcs-max-retries
 -gcs-retries
20 If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection
--gcs-project-for-requester-pays
Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.
--help
 -h
false display the help message
--hyperparameters-json
JSON file containing hyperparameters. Optional if the PYTHON_IFOREST backend is used (if not specified, a default set of hyperparameters will be used); otherwise required.
--mode
[SNP, INDEL] Variant types for which to train models. Duplicate values will be ignored.
--model-backend
PYTHON_IFOREST Backend to use for training models. JAVA_BGMM will use a pure Java implementation (ported from Python scikit-learn) of the Bayesian Gaussian Mixture Model. PYTHON_IFOREST will use the Python scikit-learn implementation of the IsolationForest method and will require that the corresponding Python dependencies are present in the environment. PYTHON_SCRIPT will use the script specified by the python-script argument. See the tool documentation for more details.
--python-script
Python script used for specifying a custom scoring backend. If provided, model-backend must also be set to PYTHON_SCRIPT.
--unlabeled-annotations-hdf5
HDF5 file containing annotations extracted with ExtractVariantAnnotations. If specified, a positive-unlabeled modeling approach will be used; otherwise, a positive-only modeling approach will be used.
--version
false display the version number for this tool
Optional Common Arguments
--gatk-config-file
A configuration file to use with the GATK.
--QUIET
false Whether to suppress job-summary info on System.err.
--tmp-dir
Temp directory to use.
--use-jdk-deflater
 -jdk-deflater
false Whether to use the JdkDeflater (as opposed to IntelDeflater)
--use-jdk-inflater
 -jdk-inflater
false Whether to use the JdkInflater (as opposed to IntelInflater)
--verbosity
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--annotations-hdf5

HDF5 file containing annotations extracted with ExtractVariantAnnotations.

R File  null


--arguments_file

read one or more arguments files and add them to the command line

List[File]  []


--gatk-config-file

A configuration file to use with the GATK.

String  null


--gcs-max-retries / -gcs-retries

If the GCS bucket channel errors out, how many times it will attempt to re-initiate the connection

int  20  [-∞, ∞]


--gcs-project-for-requester-pays

Project to bill when accessing "requester pays" buckets. If unset, these buckets cannot be accessed. User must have storage.buckets.get permission on the bucket being accessed.

String  ""


--help / -h

display the help message

boolean  false


--hyperparameters-json

JSON file containing hyperparameters. Optional if the PYTHON_IFOREST backend is used (if not specified, a default set of hyperparameters will be used); otherwise required.

File  null


--mode

Variant types for which to train models. Duplicate values will be ignored.

The --mode argument is an enumerated type (List[VariantType]), which can have one of the following values:

SNP
INDEL

List[VariantType]  [SNP, INDEL]


--model-backend

Backend to use for training models. JAVA_BGMM will use a pure Java implementation (ported from Python scikit-learn) of the Bayesian Gaussian Mixture Model. PYTHON_IFOREST will use the Python scikit-learn implementation of the IsolationForest method and will require that the corresponding Python dependencies are present in the environment. PYTHON_SCRIPT will use the script specified by the python-script argument. See the tool documentation for more details.

The --model-backend argument is an enumerated type (VariantAnnotationsModelBackend), which can have one of the following values:

JAVA_BGMM
PYTHON_IFOREST
Use the script at org/broadinstitute/hellbender/tools/walkers/vqsr/scalable/isolation-forest.py
PYTHON_SCRIPT
Use a user-provided script.

VariantAnnotationsModelBackend  PYTHON_IFOREST


--output / -O

Output prefix.

R String  null


--python-script

Python script used for specifying a custom scoring backend. If provided, model-backend must also be set to PYTHON_SCRIPT.

File  null


--QUIET

Whether to suppress job-summary info on System.err.

Boolean  false


--showHidden / -showHidden

display hidden arguments

boolean  false


--tmp-dir

Temp directory to use.

GATKPath  null


--unlabeled-annotations-hdf5

HDF5 file containing annotations extracted with ExtractVariantAnnotations. If specified, a positive-unlabeled modeling approach will be used; otherwise, a positive-only modeling approach will be used.

File  null


--use-jdk-deflater / -jdk-deflater

Whether to use the JdkDeflater (as opposed to IntelDeflater)

boolean  false


--use-jdk-inflater / -jdk-inflater

Whether to use the JdkInflater (as opposed to IntelInflater)

boolean  false


--verbosity / -verbosity

Control verbosity of logging.

The --verbosity argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version

display the version number for this tool

boolean  false




See also General Documentation | Tool Documentation Index | Support Forum

GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.