ExtractIlluminaBarcodes (Picard)

Determines the barcode for each read in an Illumina lane.

This tool determines the numbers of reads containing barcode-matching sequences and provides statistics on the quality of these barcode matches.

Illumina sequences can contain at least two types of barcodes: sample and molecular (index). Sample barcodes (B in the read structure) are used to demultiplex pooled samples, while molecular barcodes (M in the read structure) are used to differentiate multiple reads of a template when carrying out paired-end sequencing. Note that this tool extracts only sample barcodes (B), not molecular barcodes (M).

Barcodes can be provided in the form of a list (BARCODE_FILE) or a string representing the barcode (BARCODE). The BARCODE_FILE contains multiple fields including 'barcode_sequence' (or 'barcode_sequence_1'), 'barcode_sequence_2' (optional), 'barcode_name', and 'library_name'. In contrast, the BARCODE argument is used for runs with reads containing a single barcode (nonmultiplexed) and can be added directly as a string of text e.g. BARCODE=CAATAGCG.
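As an illustration of the BARCODE_FILE layout described above, the following Python sketch writes a small tab-delimited file using the documented column headers. The sample names, library names, and second barcode sequence are hypothetical placeholders, and the sketch assumes a single-barcode layout (no 'barcode_sequence_2' column).

```python
import csv
import io

# Column headers taken from the BARCODE_FILE description; single-barcode layout assumed.
FIELDNAMES = ["barcode_sequence_1", "barcode_name", "library_name"]


def write_barcode_file(rows, handle):
    """Write barcode rows (dicts keyed by FIELDNAMES) as tab-delimited text."""
    writer = csv.DictWriter(
        handle, fieldnames=FIELDNAMES, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical samples, for illustration only.
rows = [
    {"barcode_sequence_1": "CAATAGCG", "barcode_name": "sample_A", "library_name": "lib_A"},
    {"barcode_sequence_1": "TTGCCAGA", "barcode_name": "sample_B", "library_name": "lib_B"},
]

buffer = io.StringIO()
write_barcode_file(rows, buffer)
barcode_file_text = buffer.getvalue()
```

To produce a file on disk, pass a handle opened with `open("barcodes.txt", "w", newline="")` instead of the in-memory buffer.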

Data is output per lane/tile within the BaseCalls directory with the file name format of 's_{lane}_{tile}_barcode.txt'. These files contain the following tab-separated columns:

- read subsequence at barcode position
- Y or N indicating whether there was a barcode match
- matched barcode sequence (empty if the read did not match one of the barcodes)
- distance to the best matching barcode ("mismatches")
- distance to the second-best matching barcode ("mismatchesToSecondBest")

If there is no match but the read is close to the threshold for calling a match, the barcode that would have been matched is output in lower case. Threshold values can be adjusted to accommodate barcode sequence mismatches from the reads. The metrics file produced by the ExtractIlluminaBarcodes program indicates the number of matches (and mismatches) between the barcode reads and the actual barcodes. These metrics are provided both per-barcode and per lane and can be found in the BaseCalls directory.
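The thresholds mentioned above correspond to the MAX_MISMATCHES, MIN_MISMATCH_DELTA, and MAX_NO_CALLS arguments. The Python sketch below shows one way the match decision could be expressed, based only on the argument descriptions on this page. It is not Picard's actual code, the function name is hypothetical, and the rule for lower-case near-matches is not modeled.

```python
def is_barcode_match(
    mismatches,
    mismatches_to_second_best,
    no_calls,
    max_mismatches=1,
    min_mismatch_delta=1,
    max_no_calls=2,
):
    """Return True if a barcode read would count as a match.

    Assumed rule, inferred from the argument descriptions:
    - no more than max_no_calls no-calls in the barcode read,
    - the best barcode has at most max_mismatches mismatches,
    - the second-best barcode is worse by at least min_mismatch_delta.
    """
    return (
        no_calls <= max_no_calls
        and mismatches <= max_mismatches
        and mismatches_to_second_best - mismatches >= min_mismatch_delta
    )


# One mismatch to the best barcode, three to the runner-up: accepted with defaults.
clear_winner = is_barcode_match(1, 3, no_calls=0)
# Best and second-best are equally distant: ambiguous, so rejected.
ambiguous = is_barcode_match(1, 1, no_calls=0)
```

Raising MAX_MISMATCHES accepts noisier barcode reads, while raising MIN_MISMATCH_DELTA demands a clearer separation between the best and second-best barcodes.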

For poorly matching barcodes, the order of specification of barcodes can cause arbitrary output differences.

Usage example:

java -jar picard.jar ExtractIlluminaBarcodes \
BASECALLS_DIR=/BaseCalls/ \
LANE=1 \
READ_STRUCTURE=25T8B25T \
BARCODE_FILE=barcodes.txt \
METRICS_FILE=metrics_output.txt

Please see the ExtractIlluminaBarcodes.BarcodeMetric definitions for a complete description of the metrics produced by this tool.


Category: Base Calling


Overview

Determine the barcode for each read in an Illumina lane. For each tile, a file is written to the basecalls directory of the form 's_{lane}_{tile}_barcode.txt'. The output file contains a line for each read in the tile, aligned with the regular basecall output. It contains the following tab-separated columns:

- read subsequence at barcode position
- Y or N indicating if there was a barcode match
- matched barcode sequence (empty if read did not match one of the barcodes). If there is no match but the read is close to the threshold of calling it a match, the barcode that would have been matched is output in lower case
- distance to best matching barcode, "mismatches" (*)
- distance to second-best matching barcode, "mismatchesToSecondBest" (*)

NOTE (*): Due to an optimization, the reported mismatches and mismatchesToSecondBest values may be inaccurate, as long as the conclusion (match vs. no-match) isn't affected. For example, reported mismatches and mismatchesToSecondBest may be smaller than their true value if mismatches is truly larger than MAX_MISMATCHES. Also, mismatchesToSecondBest might be smaller than its true value if its true value is greater than mismatches + MIN_MISMATCH_DELTA.
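For downstream inspection, a line of a per-tile barcode file can be split into the five columns listed above. The Python sketch below assumes exactly five tab-separated fields per line in the documented order. The type name, function name, and sample lines are hypothetical.

```python
from typing import NamedTuple


class BarcodeCall(NamedTuple):
    read_bases: str                 # read subsequence at barcode position
    matched: bool                   # True for "Y", False for "N"
    barcode: str                    # matched barcode; empty or lower case if no match
    mismatches: int                 # distance to best matching barcode
    mismatches_to_second_best: int  # distance to second-best matching barcode


def parse_barcode_line(line):
    """Parse one tab-separated line of a per-tile barcode file (assumed five columns)."""
    read_bases, flag, barcode, mismatches, second_best = line.rstrip("\n").split("\t")
    return BarcodeCall(read_bases, flag == "Y", barcode, int(mismatches), int(second_best))


# Hypothetical example lines, not taken from real output.
hit = parse_barcode_line("CAATAGCA\tY\tCAATAGCG\t1\t5\n")
miss = parse_barcode_line("GGGTTTAA\tN\t\t6\t7\n")
```

As the note above explains, the two distance columns may understate the true distances for non-matching reads, so they are best used for qualitative checks rather than exact statistics.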

ExtractIlluminaBarcodes (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the Argument details list below the table.

Argument name(s) | Default value | Summary

Required Arguments
--BARCODE | | Barcode sequence. These must be unique, and all the same length. This cannot be used with reads that have more than one barcode; use BARCODE_FILE in that case.
--BARCODE_FILE | | Tab-delimited file of barcode sequences, barcode name and, optionally, library name. Barcodes must be unique and all the same length. Column headers must be 'barcode_sequence' (or 'barcode_sequence_1'), 'barcode_sequence_2' (optional), 'barcode_name', and 'library_name'.
--BASECALLS_DIR / -B | | The Illumina basecalls directory.
--LANE / -L | | Lane number. This can be specified multiple times. Reads with the same index in multiple lanes will be added to the same output file.
--READ_STRUCTURE / -RS | | A description of the logical structure of clusters in an Illumina Run, given as integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip), e.g. "28T8M8B8S28T". See the argument details for a worked example.

Optional Tool Arguments
--arguments_file | | read one or more arguments files and add them to the command line
--COMPRESS_OUTPUTS / -GZIP | false | Compress output FASTQ files using gzip and append a .gz extension to the file names.
--DISTANCE_MODE | HAMMING | The distance metric that should be used to compare the barcode-reads and the provided barcodes for finding the best and second-best assignments.
--help / -h | false | display the help message
--INPUT_PARAMS_FILE | | The input file that defines parameters for the program. This is the BARCODE_FILE for `ExtractIlluminaBarcodes` or the MULTIPLEX_PARAMS or LIBRARY_PARAMS file for `IlluminaBasecallsToFastq` or `IlluminaBasecallsToSam`
--MAX_MISMATCHES | 1 | Maximum mismatches for a barcode to be considered a match.
--MAX_NO_CALLS | 2 | Maximum allowable number of no-calls in a barcode read before it is considered unmatchable.
--METRICS_FILE / -M | | Per-barcode and per-lane metrics written to this file.
--MIN_MISMATCH_DELTA | 1 | Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match.
--MINIMUM_BASE_QUALITY / -Q | 0 | Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match.
--MINIMUM_QUALITY | 2 | The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed.
--NUM_PROCESSORS | 1 | Run this many PerTileBarcodeExtractors in parallel. If NUM_PROCESSORS = 0, the number of cores used is set automatically to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of cores used will be the number available on the machine less the absolute value of NUM_PROCESSORS.
--OUTPUT_DIR | | Where to write _barcode.txt files. By default, these are written to BASECALLS_DIR.
--version | false | display the version number for this tool

Optional Common Arguments
--COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output.
--CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created.
--MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET | false | Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE / -R | | Reference sequence file.
--TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER / -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER / -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY | INFO | Control verbosity of logging.

Advanced Arguments
--showHidden | false | display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--arguments_file

read one or more arguments files and add them to the command line

List[File]  []


--BARCODE

Barcode sequence. These must be unique, and all the same length. This cannot be used with reads that have more than one barcode; use BARCODE_FILE in that case.

Exclusion: This argument cannot be used at the same time as BARCODE_FILE.

R List[String]  []


--BARCODE_FILE

Tab-delimited file of barcode sequences, barcode name and, optionally, library name. Barcodes must be unique and all the same length. Column headers must be 'barcode_sequence' (or 'barcode_sequence_1'), 'barcode_sequence_2' (optional), 'barcode_name', and 'library_name'.

Exclusion: This argument cannot be used at the same time as BARCODE.

R File  null
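The uniqueness and equal-length requirements can be checked before launching the tool. The Python sketch below is a minimal pre-flight check under those two stated rules only. The function name is hypothetical, and the tool's own validation may apply additional checks.

```python
def find_barcode_problems(barcodes):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    # All barcodes must have the same length.
    lengths = {len(barcode) for barcode in barcodes}
    if len(lengths) > 1:
        problems.append(f"barcodes have differing lengths: {sorted(lengths)}")
    # Barcodes must be unique.
    seen = set()
    duplicates = set()
    for barcode in barcodes:
        if barcode in seen:
            duplicates.add(barcode)
        seen.add(barcode)
    if duplicates:
        problems.append(f"duplicate barcodes: {sorted(duplicates)}")
    return problems


# Hypothetical barcode lists for illustration.
ok = find_barcode_problems(["CAATAGCG", "TTGCCAGA"])
bad = find_barcode_problems(["CAATAGCG", "CAATAGCG", "TTGCC"])
```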


--BASECALLS_DIR / -B

The Illumina basecalls directory.

R File  null


--COMPRESS_OUTPUTS / -GZIP

Compress output FASTQ files using gzip and append a .gz extension to the file names.

boolean  false


--COMPRESSION_LEVEL

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX

Whether to create an index when writing VCF or coordinate sorted BAM output.

Boolean  false


--CREATE_MD5_FILE

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--DISTANCE_MODE

The distance metric that should be used to compare the barcode-reads and the provided barcodes for finding the best and second-best assignments.

The --DISTANCE_MODE argument is an enumerated type (DistanceMetric), which can have one of the following values:

HAMMING
Hamming distance: The n-th base in the read is compared against the n-th base in the barcode. Unequal bases and low quality bases are considered mismatches. No-call read-bases are not considered mismatches.
LENIENT_HAMMING
Lenient Hamming distance: The n-th base in the read is compared against the n-th base in the barcode. Unequal bases are considered mismatches. No-call read-bases, or those with low quality are not considered mismatches.
FREE
FREE Metric: A Levenshtein-like metric that performs a simple Smith-Waterman with mismatch, gap open, and gap extend costs all equal to 1. Insertions or deletions at the ends of the read or barcode do not count toward the distance. No-call read-bases, or those with low quality are not considered mismatches.

DistanceMetric  HAMMING
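The difference between HAMMING and LENIENT_HAMMING lies in how low-quality bases are treated. The Python sketch below follows the descriptions above and is not Picard's implementation. It assumes upper-case bases, that no-calls appear as "N" or ".", and that "low quality" means a quality below MINIMUM_BASE_QUALITY. The FREE metric is not covered.

```python
NO_CALL_BASES = {"N", "."}  # assumed representations of a no-call


def hamming(read, quals, barcode, min_base_quality=0):
    """HAMMING: unequal or low-quality bases are mismatches; no-calls are skipped."""
    mismatches = 0
    for base, qual, expected in zip(read, quals, barcode):
        if base in NO_CALL_BASES:
            continue
        if base != expected or qual < min_base_quality:
            mismatches += 1
    return mismatches


def lenient_hamming(read, quals, barcode, min_base_quality=0):
    """LENIENT_HAMMING: only unequal bases count; no-calls and low-quality bases are skipped."""
    mismatches = 0
    for base, qual, expected in zip(read, quals, barcode):
        if base in NO_CALL_BASES or qual < min_base_quality:
            continue
        if base != expected:
            mismatches += 1
    return mismatches


# Hypothetical barcode read: one low-quality base (index 2), one no-call, one wrong base.
read = "CAATNGCA"
quals = [30, 30, 5, 30, 30, 30, 30, 30]
barcode = "CAATAGCG"
strict = hamming(read, quals, barcode, min_base_quality=10)
lenient = lenient_hamming(read, quals, barcode, min_base_quality=10)
```

In this example the low-quality base matches the barcode, so it counts as a mismatch only under HAMMING. With the default MINIMUM_BASE_QUALITY of 0, no base is low quality and the two sketches return the same distance.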


--help / -h

display the help message

boolean  false


--INPUT_PARAMS_FILE

The input file that defines parameters for the program. This is the BARCODE_FILE for `ExtractIlluminaBarcodes` or the MULTIPLEX_PARAMS or LIBRARY_PARAMS file for `IlluminaBasecallsToFastq` or `IlluminaBasecallsToSam`

File  null


--LANE / -L

Lane number. This can be specified multiple times. Reads with the same index in multiple lanes will be added to the same output file.

R List[Integer]  []


--MAX_MISMATCHES

Maximum mismatches for a barcode to be considered a match.

int  1  [ [ -∞  ∞ ] ]


--MAX_NO_CALLS

Maximum allowable number of no-calls in a barcode read before it is considered unmatchable.

int  2  [ [ -∞  ∞ ] ]


--MAX_RECORDS_IN_RAM

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--METRICS_FILE / -M

Per-barcode and per-lane metrics written to this file.

File  null


--MIN_MISMATCH_DELTA

Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match.

int  1  [ [ -∞  ∞ ] ]


--MINIMUM_BASE_QUALITY / -Q

Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match.

int  0  [ [ -∞  ∞ ] ]


--MINIMUM_QUALITY

The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed.

int  2  [ [ -∞  ∞ ] ]


--NUM_PROCESSORS

Run this many PerTileBarcodeExtractors in parallel. If NUM_PROCESSORS = 0, the number of cores used is set automatically to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of cores used will be the number available on the machine less the absolute value of NUM_PROCESSORS.

int  1  [ [ -∞  ∞ ] ]
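The three cases for NUM_PROCESSORS can be summarized in a few lines. The Python sketch below mirrors the description above under the assumption that a negative value is added to the available core count. The function name is hypothetical, and no clamping is applied if the result would drop below 1.

```python
import os


def resolve_num_processors(requested, available=None):
    """Translate a NUM_PROCESSORS value into an effective worker count."""
    if available is None:
        available = os.cpu_count() or 1
    if requested == 0:
        return available              # use every available core
    if requested < 0:
        return available + requested  # leave abs(requested) cores free
    return requested                  # use exactly the requested number


# On a hypothetical 8-core machine:
all_cores = resolve_num_processors(0, available=8)
leave_two_free = resolve_num_processors(-2, available=8)
```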


--OUTPUT_DIR

Where to write _barcode.txt files. By default, these are written to BASECALLS_DIR.

File  null


--QUIET

Whether to suppress job-summary info on System.err.

Boolean  false


--READ_STRUCTURE / -RS

A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). For example, if the input data consists of 80 base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence may be split up into four reads, with one skipped segment:

* read one with 28 cycles (bases) of template
* read two with 8 cycles (bases) of molecular barcode (e.g. unique molecular barcode)
* read three with 8 cycles (bases) of sample barcode
* 8 cycles (bases) skipped
* read four with 28 cycles (bases) of template

The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein.

R String  null
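A read structure string can be broken into its integer/character pairs with a short parser. The Python sketch below is illustrative only: it accepts the four segment types named above (B, M, T, S), and the function name is hypothetical.

```python
import re

# One segment: a cycle count followed by a segment type (B, M, T or S).
SEGMENT_PATTERN = re.compile(r"(\d+)([BMTS])")


def parse_read_structure(structure):
    """Split a read structure such as '28T8M8B8S28T' into (cycles, type) pairs."""
    segments = []
    position = 0
    for match in SEGMENT_PATTERN.finditer(structure):
        if match.start() != position:
            raise ValueError(f"unexpected text at offset {position} in {structure!r}")
        segments.append((int(match.group(1)), match.group(2)))
        position = match.end()
    if not segments or position != len(structure):
        raise ValueError(f"invalid read structure: {structure!r}")
    return segments


segments = parse_read_structure("28T8M8B8S28T")
total_cycles = sum(cycles for cycles, _ in segments)
sample_barcode_cycles = sum(cycles for cycles, kind in segments if kind == "B")
```

For the example structure, the segment lengths add up to the 80 cycles of the cluster, of which 8 belong to the sample barcode that this tool extracts.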


--REFERENCE_SEQUENCE / -R

Reference sequence file.

PicardHtsPath  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--TMP_DIR

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version

display the version number for this tool

boolean  false



GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.