Showing tool doc from version 4.6.2.0 | The latest version is 4.6.2.0

CollectSamErrorMetrics (Picard)

Program to collect error metrics on bases stratified in various ways.

Sequencing errors come in different 'flavors'. For example, some occur during sequencing while others happen during library construction, prior to sequencing. They may be correlated with various aspects of the sequencing experiment: position in the read, base context, length of insert and so on.

This program collects two different kinds of error metrics (one which attempts to distinguish between pre- and post-sequencer errors, and one which doesn't) and a collection of 'stratifiers', each of which assigns bases into various bins. The stratifiers can be used together to generate a composite stratification.

For example:

The BASE_QUALITY stratifier will place bases in bins according to their declared base quality. The READ_ORDINALITY stratifier will place bases in one of two bins depending on whether their read is 'first' or 'second'. One could generate a composite stratifier BASE_QUALITY:READ_ORDINALITY which will do both stratifications at the same time.
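The idea of a composite stratifier can be sketched in a few lines of Python. This is only an illustration of the concept, not the tool's implementation: the dictionary layout of each base and the function names are hypothetical.

```python
from collections import Counter

def base_quality(base):
    """BASE_QUALITY-like stratifier: bin by the declared base quality."""
    return base["qual"]

def read_ordinality(base):
    """READ_ORDINALITY-like stratifier: 'first' or 'second' read of the pair."""
    return "first" if base["is_first"] else "second"

def composite(*stratifiers):
    """Combine stratifiers; the composite bin is the tuple of the individual bins."""
    return lambda base: tuple(s(base) for s in stratifiers)

# Hypothetical observations, one dict per sequenced base.
bases = [
    {"qual": 30, "is_first": True},
    {"qual": 30, "is_first": False},
    {"qual": 30, "is_first": True},
    {"qual": 20, "is_first": False},
]

stratify = composite(base_quality, read_ordinality)
bin_counts = Counter(stratify(b) for b in bases)
```

Each base ends up in exactly one composite bin, so the number of bins is at most the product of the bin counts of the individual stratifiers.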

The resulting metric file will be named according to a provided prefix and a suffix which is generated automatically according to the error metric. The tool can collect multiple metrics in a single pass and there should be hardly any performance loss when specifying multiple metrics at the same time; the default includes a large collection of metrics.

To estimate the error rate the tool assumes that all differences from the reference are errors. For this to be a reasonable assumption the tool needs to know the sites at which the sample is actually polymorphic, and a set of high-confidence intervals within which the user is relatively certain that the polymorphic sites are known and accurate. These two inputs are provided through the VCF and INTERVALS arguments. The program will only process sites that lie in the intersection of the interval lists given in the INTERVALS argument and that are not polymorphic in the VCF.
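The site selection described above amounts to a set intersection followed by a set difference. The sketch below assumes a single contig, 1-based inclusive intervals, and a hypothetical function name. It does not model how the tool behaves when no INTERVALS are supplied.

```python
def eligible_sites(interval_lists, polymorphic_sites):
    """Positions present in every interval list and absent from the VCF sites.

    interval_lists: list of lists of (start, end) pairs, 1-based inclusive,
                    all on one contig for simplicity (an assumption).
    polymorphic_sites: iterable of 1-based positions taken from the VCF.
    """
    per_list = [
        {pos for start, end in intervals for pos in range(start, end + 1)}
        for intervals in interval_lists
    ]
    if not per_list:
        # No interval lists given: this sketch simply returns nothing.
        return set()
    shared = set.intersection(*per_list)
    return shared - set(polymorphic_sites)

# Two interval lists overlapping on positions 5..10, with a known variant at 7.
sites = eligible_sites([[(1, 10)], [(5, 20)]], polymorphic_sites={7})
```

Expanding intervals into position sets is only practical for small examples; a real implementation would work on sorted intervals instead.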

Category Diagnostics and Quality Control



CollectSamErrorMetrics (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.

Required Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --INPUT, -I | | Input SAM or BAM file. |
| --OUTPUT, -O | | Base name for output files. Actual file names will be generated from the basename and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]* where 'error' is ERROR's extension, and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRICS value of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE. |
| --REFERENCE_SEQUENCE, -R | | Reference sequence file. |

Optional Tool Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --arguments_file | | Read one or more arguments files and add them to the command line. |
| --ERROR_METRICS | [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT, INDEL_ERROR] | Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE. |
| --ERROR_VALUE | | A fake argument used to show the options of ERROR (in ERROR_METRICS). |
| --FILE_EXTENSION, -EXT | | Append the given file extension to all metric file names (ex. OUTPUT.insert_size_metrics.EXT). No extension by default. |
| --help, -h | false | Display the help message. |
| --INTERVAL_ITERATOR | false | Iterate through the file assuming it consists of a pre-created subset interval of the full genome. This enables fast processing of files with reads at disparate parts of the genome. Requires that the provided VCF file is indexed. |
| --INTERVALS, -L | | Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will *intersect* inputs if multiple are given. When this argument is supplied, the VCF provided must be *indexed*. |
| --LOCATION_BIN_SIZE, -LBS | 2500 | Size of location bins. Used by the FLOWCELL_X and FLOWCELL_Y stratifiers. |
| --LONG_HOMOPOLYMER, -LH | 6 | Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier. |
| --MAX_LOCI, -MAX | 0 | Maximum number of loci to process (or unlimited if 0). |
| --MIN_BASE_Q, -BQ | 20 | Minimum base quality to include base. |
| --MIN_MAPPING_Q, -MQ | 20 | Minimum mapping quality to include read. |
| --PRIOR_Q, -PE | 30 | The prior error, in phred-scale (used for calculating empirical error rates). |
| --PROBABILITY, -P | 1.0 | The probability of selecting a locus for analysis (for downsampling). |
| --PROGRESS_STEP_INTERVAL | 100000 | The interval between which progress will be displayed. |
| --STRATIFIER_VALUE | | A fake argument used to show the options of STRATIFIER (in ERROR_METRICS). |
| --VCF, -V | | VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci. |
| --version | false | Display the version number for this tool. |

Optional Common Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF). |
| --CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output. |
| --CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created. |
| --MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files. |
| --USE_JDK_DEFLATER, -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output. |
| --USE_JDK_INFLATER, -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input. |
| --VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. |
| --VERBOSITY | INFO | Control verbosity of logging. |

Advanced Arguments

| Argument name(s) | Default value | Summary |
| --- | --- | --- |
| --showHidden | false | Display hidden arguments. |

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--arguments_file

read one or more arguments files and add them to the command line

List[File]  []


--COMPRESSION_LEVEL

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX

Whether to create an index when writing VCF or coordinate sorted BAM output.

Boolean  false


--CREATE_MD5_FILE

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--ERROR_METRICS

Errors to collect in the form of "ERROR(:STRATIFIER)*". To see the values available for ERROR and STRATIFIER look at the documentation for the arguments ERROR_VALUE and STRATIFIER_VALUE.

List[String]  [ERROR, ERROR:BASE_QUALITY, ERROR:INSERT_LENGTH, ERROR:GC_CONTENT, ERROR:READ_DIRECTION, ERROR:PAIR_ORIENTATION, ERROR:HOMOPOLYMER, ERROR:BINNED_HOMOPOLYMER, ERROR:CYCLE, ERROR:READ_ORDINALITY, ERROR:READ_ORDINALITY:CYCLE, ERROR:READ_ORDINALITY:HOMOPOLYMER, ERROR:READ_ORDINALITY:GC_CONTENT, ERROR:READ_ORDINALITY:PRE_DINUC, ERROR:MAPPING_QUALITY, ERROR:READ_GROUP, ERROR:MISMATCHES_IN_READ, ERROR:ONE_BASE_PADDED_CONTEXT, OVERLAPPING_ERROR, OVERLAPPING_ERROR:BASE_QUALITY, OVERLAPPING_ERROR:INSERT_LENGTH, OVERLAPPING_ERROR:READ_ORDINALITY, OVERLAPPING_ERROR:READ_ORDINALITY:CYCLE, OVERLAPPING_ERROR:READ_ORDINALITY:HOMOPOLYMER, OVERLAPPING_ERROR:READ_ORDINALITY:GC_CONTENT, INDEL_ERROR]
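The "ERROR(:STRATIFIER)*" form can be read as an error type followed by zero or more colon-separated stratifier names. A minimal parsing sketch, with a hypothetical function name and validating only the error type:

```python
# The three error types listed under --ERROR_VALUE.
ERROR_TYPES = {"ERROR", "OVERLAPPING_ERROR", "INDEL_ERROR"}

def parse_error_metric(spec):
    """Split 'ERROR(:STRATIFIER)*' into (error_type, [stratifier, ...])."""
    error_type, *stratifiers = spec.split(":")
    if error_type not in ERROR_TYPES:
        raise ValueError(f"unknown error type: {error_type!r}")
    return error_type, stratifiers

parsed = parse_error_metric("ERROR:READ_ORDINALITY:CYCLE")
```

Here `parsed` holds the error type `ERROR` and the two stratifiers `READ_ORDINALITY` and `CYCLE`, which together form one composite stratification.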


--ERROR_VALUE

A fake argument used to show the options of ERROR (in ERROR_METRICS).

The --ERROR_VALUE argument is an enumerated type (ErrorType), which can have one of the following values:

ERROR
Collects the average (SNP) error at the bases provided. Suffix is: 'error'.
OVERLAPPING_ERROR
Only considers bases from the overlapping parts of reads from the same template. For those bases, it calculates the error that can be attributable to pre-sequencing, versus during-sequencing. Suffix is: 'overlapping_error'.
INDEL_ERROR
Collects insertion and deletion errors at the bases provided. Suffix is: 'indel_error'.

ErrorType  null
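The reasoning behind OVERLAPPING_ERROR can be illustrated as follows: when both reads of a template cover the same position, a non-reference base seen in both reads was probably present in the molecule before sequencing, while a non-reference base seen in only one read more likely arose during sequencing. The sketch below is one way to express that logic. The category labels and function name are hypothetical and it is not the tool's actual calculation.

```python
def classify_overlap(read_base, mate_base, ref_base):
    """Classify one base in the overlapping part of two reads from one template.

    Labels are illustrative only (an assumption, not the tool's output names).
    """
    if read_base == ref_base:
        return "match"
    if read_base == mate_base:
        # Both reads agree on a non-reference base: likely pre-sequencing.
        return "pre_sequencing_candidate"
    if mate_base == ref_base:
        # Only this read differs from the reference: likely a sequencing error.
        return "sequencing_candidate"
    # Read, mate and reference all differ from each other.
    return "three_way_disagreement"

labels = [
    classify_overlap("A", "A", "A"),
    classify_overlap("T", "T", "A"),
    classify_overlap("T", "A", "A"),
    classify_overlap("T", "G", "A"),
]
```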


--FILE_EXTENSION / -EXT

Append the given file extension to all metric file names (ex. OUTPUT.insert_size_metrics.EXT). No extension by default.

String  ""


--help / -h

display the help message

boolean  false


--INPUT / -I

Input SAM or BAM file.

R String  null


--INTERVAL_ITERATOR

Iterate through the file assuming it consists of a pre-created subset interval of the full genome. This enables fast processing of files with reads at disparate parts of the genome. Requires that the provided VCF file is indexed.

boolean  false


--INTERVALS / -L

Region(s) to limit analysis to. Supported formats are VCF or interval_list. Will *intersect* inputs if multiple are given. When this argument is supplied, the VCF provided must be *indexed*.

List[File]  []


--LOCATION_BIN_SIZE / -LBS

Size of location bins. Used by the FLOWCELL_X and FLOWCELL_Y stratifiers.

int  2500  [ [ -∞  ∞ ] ]


--LONG_HOMOPOLYMER / -LH

Shortest homopolymer which is considered long. Used by the BINNED_HOMOPOLYMER stratifier.

int  6  [ [ -∞  ∞ ] ]
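As a rough illustration of how a long/short threshold could be applied, the sketch below measures the homopolymer run that precedes a base in the read and bins it against the threshold. The function names and the 'long'/'short' labels are hypothetical, and treating a run of exactly LONG_HOMOPOLYMER bases as long follows the wording "shortest homopolymer which is considered long".

```python
def preceding_homopolymer(read, i):
    """Return (base, length) of the run of identical bases ending just before read[i]."""
    if i == 0:
        return None, 0
    base = read[i - 1]
    length = 0
    j = i - 1
    while j >= 0 and read[j] == base:
        length += 1
        j -= 1
    return base, length

def homopolymer_scale(length, long_homopolymer=6):
    """Bin a homopolymer length as 'long' or 'short' (labels are illustrative)."""
    return "long" if length >= long_homopolymer else "short"

# The final T follows a run of six Gs.
base, length = preceding_homopolymer("ACGGGGGGT", 8)
scale = homopolymer_scale(length)
```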


--MAX_LOCI / -MAX

Maximum number of loci to process (or unlimited if 0).

long  0  [ [ -∞  ∞ ] ]


--MAX_RECORDS_IN_RAM

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--MIN_BASE_Q / -BQ

Minimum base quality to include base.

int  20  [ [ -∞  ∞ ] ]


--MIN_MAPPING_Q / -MQ

Minimum mapping quality to include read.

int  20  [ [ -∞  ∞ ] ]
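The two minimum-quality arguments act as inclusion filters. A minimal sketch, assuming both thresholds are inclusive (the text says "minimum" but does not spell out the comparison) and using a hypothetical function name:

```python
def passes_filters(base_quality, mapping_quality, min_base_q=20, min_mapping_q=20):
    """True if a base is kept: its quality and its read's mapping quality
    both reach the minimums (inclusive comparison is an assumption)."""
    return base_quality >= min_base_q and mapping_quality >= min_mapping_q

kept = passes_filters(base_quality=25, mapping_quality=60)
dropped = passes_filters(base_quality=10, mapping_quality=60)
```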


--OUTPUT / -O

Base name for output files. Actual file names will be generated from the basename and suffixes from the ERROR and STRATIFIER by adding a '.' and then error_by_stratifier[_and_stratifier]* where 'error' is ERROR's extension, and 'stratifier' is STRATIFIER's suffix. For example, an ERROR_METRICS value of ERROR:BASE_QUALITY:GC_CONTENT will produce an extension '.error_by_base_quality_and_gc'. The suffixes can be found in the documentation for ERROR_VALUE and STRATIFIER_VALUE.

R File  null
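The naming rule above can be sketched as a small function. The suffix tables below contain only a subset of the suffixes listed under ERROR_VALUE and STRATIFIER_VALUE, and the function name is hypothetical. A metric with no stratifier is not covered by the description, so the sketch rejects it rather than guessing.

```python
# Suffixes copied from the ERROR_VALUE and STRATIFIER_VALUE listings (subset).
ERROR_SUFFIX = {
    "ERROR": "error",
    "OVERLAPPING_ERROR": "overlapping_error",
    "INDEL_ERROR": "indel_error",
}
STRATIFIER_SUFFIX = {
    "BASE_QUALITY": "base_quality",
    "GC_CONTENT": "gc",
    "READ_ORDINALITY": "read_ordinality",
    "CYCLE": "cycle",
}

def metric_extension(spec):
    """Build '.error_by_stratifier[_and_stratifier]*' from an ERROR_METRICS value."""
    error_type, *stratifiers = spec.split(":")
    if not stratifiers:
        raise ValueError("naming without a stratifier is not described here")
    joined = "_and_".join(STRATIFIER_SUFFIX[s] for s in stratifiers)
    return "." + ERROR_SUFFIX[error_type] + "_by_" + joined

extension = metric_extension("ERROR:BASE_QUALITY:GC_CONTENT")
```

For the example given in the text, `extension` is '.error_by_base_quality_and_gc'.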


--PRIOR_Q / -PE

The prior error, in phred-scale (used for calculating empirical error rates).

int  30  [ [ -∞  ∞ ] ]
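A phred-scaled value Q corresponds to an error probability of 10^(-Q/10), so the default of 30 is a prior error of 0.001. The conversion functions below are standard. The `smoothed_error_q` function is only an illustrative pseudo-count scheme showing how a prior might enter an empirical estimate; it is an assumption and not taken from the tool's documentation.

```python
import math

def phred_to_probability(q):
    """Convert a phred-scaled quality to an error probability."""
    return 10 ** (-q / 10)

def probability_to_phred(p):
    """Convert an error probability to a phred-scaled quality."""
    return -10 * math.log10(p)

def smoothed_error_q(errors, total, prior_q=30):
    """Illustrative pseudo-count smoothing (assumed, not the tool's formula):
    add the prior error to the error count and one to the base count."""
    prior = phred_to_probability(prior_q)
    return probability_to_phred((errors + prior) / (total + 1))

prior_error = phred_to_probability(30)
```

One effect of any such prior is that a stratum with zero observed errors still gets a finite quality score instead of an infinite one.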


--PROBABILITY / -P

The probability of selecting a locus for analysis (for downsampling).

double  1.0  [ [ -∞  ∞ ] ]
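Downsampling by locus can be pictured as an independent coin flip per locus. The sketch below uses a seeded generator for reproducibility. The function name and the seed parameter are hypothetical, and the tool's own random number handling is not described here.

```python
import random

def downsample_loci(loci, probability, seed=0):
    """Keep each locus independently with the given probability."""
    rng = random.Random(seed)
    # rng.random() is in [0.0, 1.0), so probability 1.0 keeps every locus.
    return [locus for locus in loci if rng.random() < probability]

all_loci = list(range(1, 1001))
kept_all = downsample_loci(all_loci, 1.0)
kept_some = downsample_loci(all_loci, 0.1)
```

With a probability of 0.1, roughly a tenth of the loci are expected to be kept, with the exact count varying by seed.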


--PROGRESS_STEP_INTERVAL

The interval between which progress will be displayed.

int  100000  [ [ -∞  ∞ ] ]


--QUIET

Whether to suppress job-summary info on System.err.

Boolean  false


--REFERENCE_SEQUENCE / -R

Reference sequence file.

R PicardHtsPath  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--STRATIFIER_VALUE

A fake argument used to show the options of STRATIFIER (in ERROR_METRICS).

The --STRATIFIER_VALUE argument is an enumerated type (Stratifier), which can have one of the following values:

ALL
Puts all bases in the same stratum. Suffix is 'all'.
GC_CONTENT
The GC-content of the read. Suffix is 'gc'.
READ_ORDINALITY
The read ordinality (i.e. first or second). Suffix is 'read_ordinality'.
READ_BASE
The base in the original reading direction. Suffix is 'read_base'.
READ_DIRECTION
The alignment direction of the read (encoded as + or -). Suffix is 'read_direction'.
PAIR_ORIENTATION
The read-pair's orientation (encoded as '[FR]1[FR]2'). Suffix is 'pair_orientation'.
PAIR_PROPERNESS
The properness of the read-pair's alignment. Looks for indications of chimerism. Suffix is 'pair_proper'.
REFERENCE_BASE
The reference base in the read's direction. Suffix is 'ref_base'.
PRE_DINUC
The read base at the previous cycle, and the current reference base. Suffix is 'pre_dinuc'.
POST_DINUC
The read base at the subsequent cycle, and the current reference base. Suffix is 'post_dinuc'.
HOMOPOLYMER_LENGTH
The length of homopolymer the base is part of (only accounts for bases that were read prior to the current base). Suffix is 'homopolymer_length'.
HOMOPOLYMER
The length of homopolymer, the base that the homopolymer is comprised of, and the reference base. Suffix is 'homopolymer_and_following_ref_base'.
BINNED_HOMOPOLYMER
The scale of homopolymer (long or short), the base that the homopolymer is comprised of, and the reference base. Suffix is 'binned_length_homopolymer_and_following_ref_base'.
FLOWCELL_TILE
The flowcell and tile where the base was read (taken from the read name). Suffix is 'tile'.
FLOWCELL_Y
The y-coordinate of the read (taken from the read name). Suffix is 'y'.
FLOWCELL_X
The x-coordinate of the read (taken from the read name). Suffix is 'x'.
READ_GROUP
The read-group id of the read. Suffix is 'read_group'.
CYCLE
The machine cycle during which the base was read. Suffix is 'cycle'.
BINNED_CYCLE
The binned machine cycle. Similar to CYCLE, but binned into 5 evenly spaced ranges across the size of the read. This stratifier may produce confusing results when used on datasets with variable sized reads. Suffix is 'binned_cycle'.
SOFT_CLIPS
The number of softclipped bases the read has. Suffix is 'softclipped_bases'.
INSERT_LENGTH
The insert size of the template the read came from (taken from the TLEN field). Suffix is 'insert_length'.
BASE_QUALITY
The base quality. Suffix is 'base_quality'.
MAPPING_QUALITY
The read's mapping quality. Suffix is 'mapping_quality'.
MISMATCHES_IN_READ
The number of bases in the read that mismatch the reference, excluding the current base. This stratifier requires the NM tag. Suffix is 'mismatches_in_read'.
ONE_BASE_PADDED_CONTEXT
The current reference base and a one base padded region from the read resulting in a 3-base context. Suffix is 'one_base_padded_context'.
TWO_BASE_PADDED_CONTEXT
The current reference base and a two base padded region from the read resulting in a 5-base context. Suffix is 'two_base_padded_context'.
CONSENSUS
Whether or not duplicate reads were used to form a consensus read. This stratifier makes use of the aD, bD, and cD tags for duplex consensus reads. If the reads are single index consensus, only the cD tags are used. Suffix is 'consensus'.
NS_IN_READ
The number of Ns in the read. Suffix is 'ns_in_read'.
INSERTIONS_IN_READ
The number of Insertions in the read cigar. Suffix is 'cigar_elements_I_in_read'.
DELETIONS_IN_READ
The number of Deletions in the read cigar. Suffix is 'cigar_elements_D_in_read'.
INDELS_IN_READ
The number of INDELs in the read cigar. Suffix is 'indels_in_read'.
INDEL_LENGTH
The number of bases in an indel. Suffix is 'indel_length'.

Stratifier  null
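To make one of the context stratifiers concrete, the sketch below builds the 3-base context described for ONE_BASE_PADDED_CONTEXT: the reference base at the current position, padded by one read base on each side. It assumes the read and reference are already aligned position by position with no indels and in the same orientation, and the function name is hypothetical.

```python
def one_base_padded_context(read, ref, i):
    """Return read[i-1] + ref[i] + read[i+1], or None at the read ends.

    Assumes read and ref are equal-length, gap-free and in the same direction.
    """
    if i < 1 or i + 1 >= len(read):
        return None
    return read[i - 1] + ref[i] + read[i + 1]

# The read has G where the reference has C at index 2.
context = one_base_padded_context("ACGTA", "ACCTA", 2)
```

Here `context` is 'CCT': the flanking bases come from the read and the middle base from the reference.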


--TMP_DIR

One or more directories with space available to be used by this program for temporary storage of working files.

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VCF / -V

VCF of known variation for sample. The program will skip over polymorphic sites in this VCF and avoid collecting data on these loci.

String  null


--VERBOSITY

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version

display the version number for this tool

boolean  false



GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.