Showing tool doc from version 4.6.2.0 | The latest version is 4.6.2.0

IlluminaBasecallsToFastq (Picard)

Generate FASTQ file(s) from Illumina basecall read data.

This tool generates FASTQ files from data in an Illumina BaseCalls output directory. Separate FASTQ files are created for each template, barcode, and index (molecular barcode) read. Briefly, the template reads are the target sequence of your experiment, the barcode sequence reads facilitate sample demultiplexing, and the index reads help mitigate instrument phasing errors. For additional information on the read types, please see the following reference here.

In the absence of sample pooling (multiplexing) and/or barcodes, an OUTPUT_PREFIX (file directory) must be provided as the sample identifier. For multiplexed samples, a MULTIPLEX_PARAMS file must be specified. The MULTIPLEX_PARAMS file contains the list of sample barcodes used to sort template, barcode, and index reads. It is essentially the same as the BARCODE_FILE used in the ExtractIlluminaBarcodes tool.

Barcode matching can be done inline without requiring barcode files generated by `ExtractIlluminaBarcodes`. When MATCH_BARCODES_INLINE is set to true, barcodes are matched as they are parsed and converted. This does not require BARCODES_DIR.

Files from this tool use the following naming format: {prefix}.{type}_{number}.fastq, where {prefix} indicates the sample barcode and {type} indicates the type of read: index, barcode, or blank (if the file contains a template read). The {number} indicates the read number, either first (1) or second (2) for paired-end sequencing.
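The naming scheme above can be sketched in a few lines of Python. This is an illustration, not the tool's own code: the function name `fastq_names` is hypothetical, and it assumes that a blank {type} also drops the underscore (so a template read becomes `{prefix}.1.fastq`), which the format string above does not spell out. It also assumes that {number} counts reads of each type separately and that skipped cycles produce no file.

```python
def fastq_names(prefix, read_types, compress=False):
    """Return the FASTQ file names expected for one output prefix.

    read_types is the sequence of read-type letters in run order,
    e.g. "TBT" for READ_STRUCTURE=25T8B25T.
    """
    # Assumed mapping from read-structure letter to the {type} label.
    labels = {"T": "", "B": "barcode", "M": "index"}
    counters = {"T": 0, "B": 0, "M": 0}
    names = []
    for read_type in read_types:
        if read_type == "S":
            continue  # skipped cycles are assumed to produce no output
        counters[read_type] += 1
        label = labels[read_type]
        number = counters[read_type]
        # Assumption: a blank type drops the underscore as well.
        middle = f"{label}_{number}" if label else str(number)
        suffix = ".fastq.gz" if compress else ".fastq"
        names.append(f"{prefix}.{middle}{suffix}")
    return names


if __name__ == "__main__":
    for name in fastq_names("noBarcode.1", "TBT"):
        print(name)
```

With COMPRESS_OUTPUTS enabled, the same names would carry an additional .gz extension, which the `compress` flag mimics.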

Usage examples:

Example 1: Sample(s) with either no barcode or barcoded without multiplexing 
java -jar picard.jar IlluminaBasecallsToFastq \
READ_STRUCTURE=25T8B25T \
BASECALLS_DIR=basecallDirectory \
LANE=001 \
OUTPUT_PREFIX=noBarcode.1 \
RUN_BARCODE=run15 \
FLOWCELL_BARCODE=abcdeACXX

Example 2: Multiplexed samples
java -jar picard.jar IlluminaBasecallsToFastq \
READ_STRUCTURE=25T8B25T \
BASECALLS_DIR=basecallDirectory \
LANE=001 \
MULTIPLEX_PARAMS=demultiplexed_output.txt \
RUN_BARCODE=run15 \
FLOWCELL_BARCODE=abcdeACXX

The FLOWCELL_BARCODE is required if emitting Casava 1.8-style read name headers.


Category Base Calling


Overview

IlluminaBasecallsToFastq (Picard) specific arguments

This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the argument details list below the table.

Argument name(s) Default value Summary
Required Arguments
--BASECALLS_DIR
 -B
The Illumina basecalls directory.
--LANE
 -L
Lane number. This can be specified multiple times. Reads with the same index in multiple lanes will be added to the same output file.
--MULTIPLEX_PARAMS
Tab-separated file for creating all output FASTQs demultiplexed by barcode for a lane with a single IlluminaBasecallsToFastq invocation. The columns are OUTPUT_PREFIX and BARCODE_1, BARCODE_2 ... BARCODE_X, where X = number of barcodes per cluster (optional). A row with BARCODE_1 set to 'N' is used to specify an output_prefix for no barcode match.
--OUTPUT_PREFIX
 -O
The prefix for output FASTQs. Extensions as described above are appended. Use this option for a non-barcoded run, or for a barcoded run in which it is not desired to demultiplex reads into separate files by barcode.
--READ_STRUCTURE
 -RS
A description of the logical structure of clusters in an Illumina run, i.e. the structure this tool assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for sample barcode, M for molecular barcode, T for template, and S for skip). E.g. if the input data consists of 80-base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence is split into four reads, with 8 cycles skipped: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in the output files.
--RUN_BARCODE
The barcode of the run. Prefixed to read names.
Optional Tool Arguments
--ADAPTERS_TO_CHECK
Which adapters to look for in the reads. The default value is null, meaning that no adapters will be looked for in the reads.
--APPLY_EAMSS_FILTER
true Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.
--arguments_file
read one or more arguments files and add them to the command line
--BARCODES_DIR
 -BCD
The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.
--COMPRESS_OUTPUTS
 -GZIP
false Compress output FASTQ files using gzip and append a .gz extension to the file names.
--DISTANCE_MODE
HAMMING The distance metric that should be used to compare the barcode-reads and the provided barcodes for finding the best and second-best assignments.
--FIRST_TILE
If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.
--FIVE_PRIME_ADAPTER
For specifying adapters other than standard Illumina
--FLOWCELL_BARCODE
The barcode of the flowcell that was sequenced; required if emitting Casava1.8-style read name headers
--FORCE_GC
true If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory.
--help
 -h
false display the help message
--IGNORE_UNEXPECTED_BARCODES
 -INGORE_UNEXPECTED
false Whether to ignore reads whose barcodes are not found in MULTIPLEX_PARAMS. Useful when outputting FASTQs for only a subset of the barcodes in a lane.
--INCLUDE_NON_PF_READS
 -NONPF
true Whether to include non-PF reads
--INPUT_PARAMS_FILE
The input file that defines parameters for the program. This is the BARCODE_FILE for `ExtractIlluminaBarcodes` or the MULTIPLEX_PARAMS or LIBRARY_PARAMS file for `IlluminaBasecallsToFastq` or `IlluminaBasecallsToSam`
--MACHINE_NAME
The name of the machine on which the run was sequenced; required if emitting Casava1.8-style read name headers
--MATCH_BARCODES_INLINE
false If true, match barcodes on the fly. Otherwise parse the barcodes from the barcodes file.
--MAX_MISMATCHES
1 Maximum mismatches for a barcode to be considered a match.
--MAX_NO_CALLS
2 Maximum allowable number of no-calls in a barcode read before it is considered unmatchable.
--MAX_READS_IN_RAM_PER_TILE
-1 Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices. Deprecated: use `MAX_RECORDS_IN_RAM`
--METRICS_FILE
 -M
Per-barcode and per-lane metrics written to this file.
--MIN_MISMATCH_DELTA
1 Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match.
--MIN_TRIMMED_LENGTH
20 The minimum length for a trimmed read. If trimming would create a smaller read, then trim to this length instead
--MINIMUM_BASE_QUALITY
 -Q
0 Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match.
--MINIMUM_QUALITY
2 The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed.
--NUM_PROCESSORS
0 The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is set automatically to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of threads used is the number of cores available on the machine less the absolute value of NUM_PROCESSORS.
--READ_NAME_FORMAT
CASAVA_1_8 The read name header formatting to emit. Casava1.8 formatting has additional information beyond Illumina, including: the passing-filter flag value for the read, the flowcell name, and the sequencer name.
--SORT
true If true, the output records are sorted by read name. Otherwise they are output in the same order that the data was produced on the sequencer (ordered by tile and position).
--THREE_PRIME_ADAPTER
For specifying adapters other than standard Illumina
--TILE_LIMIT
If set, process no more than this many tiles (used for debugging).
--TRIMMING_QUALITY
The quality to use as a threshold for trimming.
--version
false display the version number for this tool
Optional Common Arguments
--COMPRESSION_LEVEL
5 Compression level for all compressed files created (e.g. BAM and VCF).
--CREATE_INDEX
false Whether to create an index when writing VCF or coordinate sorted BAM output.
--CREATE_MD5_FILE
false Whether to create an MD5 digest for any BAM or FASTQ files created.
--MAX_RECORDS_IN_RAM
500000 When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
--QUIET
false Whether to suppress job-summary info on System.err.
--REFERENCE_SEQUENCE
 -R
Reference sequence file.
--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files
--USE_JDK_DEFLATER
 -use_jdk_deflater
false Use the JDK Deflater instead of the Intel Deflater for writing compressed output
--USE_JDK_INFLATER
 -use_jdk_inflater
false Use the JDK Inflater instead of the Intel Inflater for reading compressed input
--VALIDATION_STRINGENCY
STRICT Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
--VERBOSITY
INFO Control verbosity of logging.
Advanced Arguments
--showHidden
false display hidden arguments

Argument details

Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.


--ADAPTERS_TO_CHECK

Which adapters to look for in the reads. The default value is null, meaning that no adapters will be looked for in the reads.

The --ADAPTERS_TO_CHECK argument is an enumerated type (List[IlluminaAdapterPair]), which can have one of the following values:

PAIRED_END
The following sequences can be found in https://support.illumina.com/content/dam/illumina-support/documents/documentation/chemistry_documentation/experiment-design/illumina-adapter-sequences_1000000002694-01.pdf and are protected by the following copyright notice: Oligonucleotide sequences (c) 2016 Illumina, Inc. All rights reserved. Derivative works created by Illumina customers are authorized for use with Illumina instruments and products only. All other uses are strictly prohibited.
INDEXED
SINGLE_END
NEXTERA_V1
NEXTERA_V2
DUAL_INDEXED
FLUIDIGM
TRUSEQ_SMALLRNA
ALTERNATIVE_SINGLE_END

List[IlluminaAdapterPair]  []


--APPLY_EAMSS_FILTER

Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.

boolean  true


--arguments_file

read one or more arguments files and add them to the command line

List[File]  []


--BARCODES_DIR / -BCD

The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.

File  null


--BASECALLS_DIR / -B

The Illumina basecalls directory.

R File  null


--COMPRESS_OUTPUTS / -GZIP

Compress output FASTQ files using gzip and append a .gz extension to the file names.

boolean  false


--COMPRESSION_LEVEL

Compression level for all compressed files created (e.g. BAM and VCF).

int  5  [ [ -∞  ∞ ] ]


--CREATE_INDEX

Whether to create an index when writing VCF or coordinate sorted BAM output.

Boolean  false


--CREATE_MD5_FILE

Whether to create an MD5 digest for any BAM or FASTQ files created.

boolean  false


--DISTANCE_MODE

The distance metric that should be used to compare the barcode-reads and the provided barcodes for finding the best and second-best assignments.

The --DISTANCE_MODE argument is an enumerated type (DistanceMetric), which can have one of the following values:

HAMMING
Hamming distance: The n-th base in the read is compared against the n-th base in the barcode. Unequal bases and low quality bases are considered mismatches. No-call read-bases are not considered mismatches.
LENIENT_HAMMING
Lenient Hamming distance: The n-th base in the read is compared against the n-th base in the barcode. Unequal bases are considered mismatches. No-call read-bases, or those with low quality, are not considered mismatches.
FREE
FREE Metric: A Levenshtein-like metric that performs a simple Smith-Waterman with mismatch, gap open, and gap extend costs all equal to 1. Insertions or deletions at the ends of the read or barcode do not count toward the distance. No-call read-bases, or those with low quality are not considered mismatches.

DistanceMetric  HAMMING
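As a rough illustration of how HAMMING and LENIENT_HAMMING differ, the sketch below counts mismatches according to the two descriptions above. It is not Picard's implementation: the function name, the use of a list of integer qualities, and the treatment of 'N' and '.' as no-call characters are assumptions made for this example. "Low quality" is taken to mean a quality below MINIMUM_BASE_QUALITY.

```python
NO_CALLS = {"N", "."}  # assumed no-call characters


def barcode_mismatches(read, quals, barcode, min_base_quality=0, lenient=False):
    """Count mismatches between a barcode read and an expected barcode."""
    if not (len(read) == len(quals) == len(barcode)):
        raise ValueError("read, quals and barcode must have the same length")
    mismatches = 0
    for base, qual, expected in zip(read, quals, barcode):
        if base in NO_CALLS:
            continue  # no-calls are not mismatches in either mode
        low_quality = qual < min_base_quality
        if lenient:
            # LENIENT_HAMMING: low-quality bases are ignored entirely.
            if not low_quality and base != expected:
                mismatches += 1
        else:
            # HAMMING: low-quality bases count as mismatches even if equal.
            if low_quality or base != expected:
                mismatches += 1
    return mismatches


if __name__ == "__main__":
    read, quals, barcode = "ACGN", [30, 5, 30, 30], "ACTT"
    print(barcode_mismatches(read, quals, barcode, min_base_quality=10))
    print(barcode_mismatches(read, quals, barcode, min_base_quality=10, lenient=True))
```

In this example the second base matches but has low quality, so it counts as a mismatch only in the strict mode.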


--FIRST_TILE

If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.

Integer  null


--FIVE_PRIME_ADAPTER

For specifying adapters other than standard Illumina

String  null


--FLOWCELL_BARCODE

The barcode of the flowcell that was sequenced; required if emitting Casava1.8-style read name headers

String  null


--FORCE_GC

If true, call System.gc() periodically. This is useful in cases in which the -Xmx value passed is larger than the available memory.

Boolean  true


--help / -h

display the help message

boolean  false


--IGNORE_UNEXPECTED_BARCODES / -INGORE_UNEXPECTED

Whether to ignore reads whose barcodes are not found in MULTIPLEX_PARAMS. Useful when outputting FASTQs for only a subset of the barcodes in a lane.

boolean  false


--INCLUDE_NON_PF_READS / -NONPF

Whether to include non-PF reads

boolean  true


--INPUT_PARAMS_FILE

The input file that defines parameters for the program. This is the BARCODE_FILE for `ExtractIlluminaBarcodes` or the MULTIPLEX_PARAMS or LIBRARY_PARAMS file for `IlluminaBasecallsToFastq` or `IlluminaBasecallsToSam`

File  null


--LANE / -L

Lane number. This can be specified multiple times. Reads with the same index in multiple lanes will be added to the same output file.

R List[Integer]  []


--MACHINE_NAME

The name of the machine on which the run was sequenced; required if emitting Casava1.8-style read name headers

String  null


--MATCH_BARCODES_INLINE

If true, match barcodes on the fly. Otherwise parse the barcodes from the barcodes file.

Boolean  false


--MAX_MISMATCHES

Maximum mismatches for a barcode to be considered a match.

int  1  [ [ -∞  ∞ ] ]


--MAX_NO_CALLS

Maximum allowable number of no-calls in a barcode read before it is considered unmatchable.

int  2  [ [ -∞  ∞ ] ]


--MAX_READS_IN_RAM_PER_TILE

Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices. Deprecated: use `MAX_RECORDS_IN_RAM`

int  -1  [ [ -∞  ∞ ] ]


--MAX_RECORDS_IN_RAM

When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.

Integer  500000  [ [ -∞  ∞ ] ]


--METRICS_FILE / -M

Per-barcode and per-lane metrics written to this file.

File  null


--MIN_MISMATCH_DELTA

Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match.

int  1  [ [ -∞  ∞ ] ]
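Taken together, MAX_NO_CALLS, MAX_MISMATCHES, and MIN_MISMATCH_DELTA describe a three-part test for accepting a barcode assignment. The following sketch combines them as the descriptions read; the function name and the way the conditions are joined are assumptions for illustration, not code taken from the tool.

```python
def is_barcode_match(best_mismatches, second_best_mismatches, no_calls,
                     max_mismatches=1, min_mismatch_delta=1, max_no_calls=2):
    """Decide whether the best-scoring barcode is accepted as a match.

    Defaults mirror the documented defaults of MAX_MISMATCHES,
    MIN_MISMATCH_DELTA, and MAX_NO_CALLS.
    """
    return (
        no_calls <= max_no_calls
        and best_mismatches <= max_mismatches
        and second_best_mismatches - best_mismatches >= min_mismatch_delta
    )


if __name__ == "__main__":
    print(is_barcode_match(1, 2, 0))  # clear winner within the limits
    print(is_barcode_match(1, 1, 0))  # tie between best and second best
```

The delta check is what rejects ambiguous reads: a read that is equally close to two barcodes is left unmatched even if it is within MAX_MISMATCHES of both.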


--MIN_TRIMMED_LENGTH

The minimum length for a trimmed read. If trimming would create a smaller read, then trim to this length instead

Integer  20  [ [ -∞  ∞ ] ]


--MINIMUM_BASE_QUALITY / -Q

Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match.

int  0  [ [ -∞  ∞ ] ]


--MINIMUM_QUALITY

The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice lower values have been observed.

int  2  [ [ -∞  ∞ ] ]


--MULTIPLEX_PARAMS

Tab-separated file for creating all output FASTQs demultiplexed by barcode for a lane with a single IlluminaBasecallsToFastq invocation. The columns are OUTPUT_PREFIX and BARCODE_1, BARCODE_2 ... BARCODE_X, where X = number of barcodes per cluster (optional). A row with BARCODE_1 set to 'N' is used to specify an output_prefix for no barcode match.

Exclusion: This argument cannot be used at the same time as OUTPUT_PREFIX.

R File  null
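A MULTIPLEX_PARAMS file with the columns described above could be generated along these lines. The sample prefixes and barcode sequences are made up, and the helper name is hypothetical. For runs with more than one barcode per cluster, the sketch writes 'N' in every barcode column of the no-match row, which is an assumption: the description above only mentions BARCODE_1.

```python
import csv
import io


def multiplex_params_text(samples, unmatched_prefix=None):
    """Build tab-separated MULTIPLEX_PARAMS content.

    samples is a list of (output_prefix, [barcode_1, barcode_2, ...]).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    n_barcodes = len(samples[0][1])
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    header = ["OUTPUT_PREFIX"] + [f"BARCODE_{i + 1}" for i in range(n_barcodes)]
    writer.writerow(header)
    for prefix, barcodes in samples:
        if len(barcodes) != n_barcodes:
            raise ValueError(f"{prefix}: expected {n_barcodes} barcode(s)")
        writer.writerow([prefix, *barcodes])
    if unmatched_prefix is not None:
        # Row that collects reads whose barcode matches no sample.
        writer.writerow([unmatched_prefix] + ["N"] * n_barcodes)
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical samples, for illustration only.
    print(multiplex_params_text(
        [("sampleA", ["ACGTACGT"]), ("sampleB", ["TTGCATGC"])],
        unmatched_prefix="unmatched",
    ), end="")
```

The returned text can be written to a file and passed as MULTIPLEX_PARAMS in place of OUTPUT_PREFIX.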


--NUM_PROCESSORS

The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is set automatically to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of threads used is the number of cores available on the machine less the absolute value of NUM_PROCESSORS.

Integer  0  [ [ -∞  ∞ ] ]
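The three cases for NUM_PROCESSORS can be written out as a small function. This is a sketch of the rule as described, with a hypothetical function name. It assumes a negative value is added to the core count, and it does not guess what the tool does when the result would be zero or negative.

```python
import os


def effective_threads(num_processors, available=None):
    """Resolve a NUM_PROCESSORS value to a thread count."""
    if available is None:
        available = os.cpu_count() or 1
    if num_processors == 0:
        return available  # use every available core
    if num_processors < 0:
        return available + num_processors  # leave |num_processors| cores free
    return num_processors  # explicit thread count


if __name__ == "__main__":
    print(effective_threads(0, available=8))
    print(effective_threads(-2, available=8))
    print(effective_threads(4, available=8))
```

On an 8-core machine, NUM_PROCESSORS=-2 would therefore run 6 threads under this reading.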


--OUTPUT_PREFIX / -O

The prefix for output FASTQs. Extensions as described above are appended. Use this option for a non-barcoded run, or for a barcoded run in which it is not desired to demultiplex reads into separate files by barcode.

Exclusion: This argument cannot be used at the same time as MULTIPLEX_PARAMS.

R File  null


--QUIET

Whether to suppress job-summary info on System.err.

Boolean  false


--READ_NAME_FORMAT

The read name header formatting to emit. Casava1.8 formatting has additional information beyond Illumina, including: the passing-filter flag value for the read, the flowcell name, and the sequencer name.

The --READ_NAME_FORMAT argument is an enumerated type (ReadNameFormat), which can have one of the following values:

CASAVA_1_8
ILLUMINA

ReadNameFormat  CASAVA_1_8
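For orientation, the sketch below assembles a read name in the commonly documented Casava 1.8 layout, `instrument:run:flowcell:lane:tile:x:y read:filtered:control:index`. How the tool's MACHINE_NAME, RUN_BARCODE, and FLOWCELL_BARCODE arguments map onto the first three fields is an assumption made here, as are the function name and the example values. The tool's exact output should be checked against real data.

```python
def casava18_read_name(machine, run, flowcell, lane, tile, x, y,
                       read_number, passed_filter, index="", control=0):
    """Format a read name in the Casava 1.8 style."""
    # In this layout the flag is "Y" when the read was filtered out,
    # i.e. when it did NOT pass filter.
    filtered = "N" if passed_filter else "Y"
    return (f"{machine}:{run}:{flowcell}:{lane}:{tile}:{x}:{y} "
            f"{read_number}:{filtered}:{control}:{index}")


if __name__ == "__main__":
    # Hypothetical values, reusing the identifiers from the usage examples.
    print(casava18_read_name("M01", "run15", "abcdeACXX", 1, 1101,
                             1000, 2000, 1, True, "ACGTACGT"))
```

The flowcell and sequencer fields are the reason FLOWCELL_BARCODE and MACHINE_NAME are required when this format is selected.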


--READ_STRUCTURE / -RS

A description of the logical structure of clusters in an Illumina run, i.e. the structure this tool assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for sample barcode, M for molecular barcode, T for template, and S for skip). E.g. if the input data consists of 80-base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence is split into four reads, with 8 cycles skipped:

* read one with 28 cycles (bases) of template
* read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode)
* read three with 8 cycles (bases) of sample barcode
* 8 cycles (bases) skipped
* read four with 28 cycles (bases) of template

The skipped cycles would NOT be included in the output files.

R String  null
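To make the read-structure notation concrete, here is a minimal parser that splits a string such as "28T8M8B8S28T" into (cycles, type) pairs and then cuts a cluster's bases accordingly. The function names are hypothetical and the code is only a sketch of the notation described above, not the tool's parser.

```python
import re

_SEGMENT = re.compile(r"(\d+)([TBMS])")


def parse_read_structure(read_structure):
    """Return a list of (cycles, type) pairs, e.g. [(28, "T"), (8, "M")]."""
    if not re.fullmatch(r"(\d+[TBMS])+", read_structure):
        raise ValueError(f"invalid read structure: {read_structure!r}")
    return [(int(n), kind) for n, kind in _SEGMENT.findall(read_structure)]


def split_cluster(bases, read_structure):
    """Cut one cluster's bases into reads, dropping skipped cycles."""
    segments = parse_read_structure(read_structure)
    total = sum(n for n, _ in segments)
    if len(bases) != total:
        raise ValueError(f"expected {total} bases, got {len(bases)}")
    reads, offset = [], 0
    for n, kind in segments:
        if kind != "S":  # skipped cycles are not emitted
            reads.append((kind, bases[offset:offset + n]))
        offset += n
    return reads


if __name__ == "__main__":
    segments = parse_read_structure("28T8M8B8S28T")
    print(segments, sum(n for n, _ in segments))
```

For the example structure the cycle counts add up to 80, and `split_cluster` returns four reads (template, molecular barcode, sample barcode, template).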


--REFERENCE_SEQUENCE / -R

Reference sequence file.

PicardHtsPath  null


--RUN_BARCODE

The barcode of the run. Prefixed to read names.

R String  null


--showHidden / -showHidden

display hidden arguments

boolean  false


--SORT

If true, the output records are sorted by read name. Otherwise they are output in the same order that the data was produced on the sequencer (ordered by tile and position).

Boolean  true


--THREE_PRIME_ADAPTER

For specifying adapters other than standard Illumina

String  null


--TILE_LIMIT

If set, process no more than this many tiles (used for debugging).

Integer  null


--TMP_DIR

One or more directories with space available to be used by this program for temporary storage of working files

List[File]  []


--TRIMMING_QUALITY

The quality to use as a threshold for trimming.

Integer  null


--USE_JDK_DEFLATER / -use_jdk_deflater

Use the JDK Deflater instead of the Intel Deflater for writing compressed output

Boolean  false


--USE_JDK_INFLATER / -use_jdk_inflater

Use the JDK Inflater instead of the Intel Inflater for reading compressed input

Boolean  false


--VALIDATION_STRINGENCY

Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.

The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:

STRICT
LENIENT
SILENT

ValidationStringency  STRICT


--VERBOSITY

Control verbosity of logging.

The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:

ERROR
WARNING
INFO
DEBUG

LogLevel  INFO


--version

display the version number for this tool

boolean  false



GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.