Transforms raw Illumina sequencing data into an unmapped SAM, BAM or CRAM file.
The IlluminaBasecallsToSam program collects, demultiplexes, and sorts reads across all of the tiles of a lane via barcode to produce an unmapped SAM, BAM or CRAM file. An unmapped BAM file is often referred to as a uBAM. All barcode, sample, and library data is provided in the LIBRARY_PARAMS file, which must follow the format described under the LIBRARY_PARAMS argument below. The following is an example of a properly formatted (tab-separated) LIBRARY_PARAMS file:

    BARCODE_1   OUTPUT               SAMPLE_ALIAS     LIBRARY_NAME
    AAAAAAAA    SA_AAAAAAAA.bam      SA_AAAAAAAA      LN_AAAAAAAA
    AAAAGAAG    SA_AAAAGAAG.bam      SA_AAAAGAAG      LN_AAAAGAAG
    AACAATGG    SA_AACAATGG.bam      SA_AACAATGG      LN_AACAATGG
    N           SA_non_indexed.bam   SA_non_indexed   LN_NNNNNNNN

The barcode files in BARCODES_DIR are produced by the ExtractIlluminaBarcodes tool for each lane of a flow cell.
Barcode matching can also be done inline, without the barcode files generated by `ExtractIlluminaBarcodes`. When MATCH_BARCODES_INLINE is set to true, barcodes are matched as they are parsed and converted. This mode does not require BARCODES_DIR.
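Because LIBRARY_PARAMS is a tab-separated file with a header row, it can be generated by a script instead of by hand. The following Python sketch reproduces the example file shown above. The function name `library_params_text` is hypothetical, and the `SA_`/`LN_` naming only mirrors the example; it is not something the tool requires.

```python
import csv
import io

# Column headers taken from the example LIBRARY_PARAMS file above.
COLUMNS = ["BARCODE_1", "OUTPUT", "SAMPLE_ALIAS", "LIBRARY_NAME"]


def library_params_text(barcodes):
    """Return tab-separated LIBRARY_PARAMS content for the given barcodes.

    A final row with BARCODE_1 set to 'N' names the output file for reads
    that match no barcode, as in the example.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(COLUMNS)
    for bc in barcodes:
        writer.writerow([bc, f"SA_{bc}.bam", f"SA_{bc}", f"LN_{bc}"])
    writer.writerow(["N", "SA_non_indexed.bam", "SA_non_indexed", "LN_NNNNNNNN"])
    return buf.getvalue()


if __name__ == "__main__":
    # Write the result to a file (e.g. library.params) to use it with the tool.
    print(library_params_text(["AAAAAAAA", "AAAAGAAG", "AACAATGG"]), end="")
```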
java -jar picard.jar IlluminaBasecallsToSam \
BASECALLS_DIR=/BaseCalls/ \
LANE=001 \
READ_STRUCTURE=25T8B25T \
RUN_BARCODE=run15 \
IGNORE_UNEXPECTED_BARCODES=true \
LIBRARY_PARAMS=library.params
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Arguments | | |
| --BARCODE_PARAMS | | Deprecated (use LIBRARY_PARAMS). Tab-separated file for creating all output SAM, BAM or CRAM files for barcoded run with single IlluminaBasecallsToSam invocation. Columns are BARCODE, OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME. Row with BARCODE=N is used to specify a file for no barcode match. |
| --BASECALLS_DIR -B | | The Illumina basecalls directory. |
| --LANE -L | | Lane number. This can be specified multiple times. Reads with the same index in multiple lanes will be added to the same output file. |
| --LIBRARY_PARAMS | | Tab-separated file for creating all output SAM, BAM or CRAM files for a lane with single IlluminaBasecallsToSam invocation. The columns are OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME, BARCODE_1, BARCODE_2 ... BARCODE_X where X = number of barcodes per cluster (optional). Row with BARCODE_1 set to 'N' is used to specify a file for no barcode match. You may also provide any 2 letter RG header attributes (excluding PU, CN, PL, and DT) as columns in this file and the values for those columns will be inserted into the RG tag for the SAM, BAM or CRAM file created for a given row. |
| --OUTPUT -O | | Deprecated (use LIBRARY_PARAMS). The output SAM, BAM or CRAM file. Format is determined by extension. |
| --READ_STRUCTURE -RS | | A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). For example, if the input data consists of 80-base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence is split into four reads, with one segment skipped: read one with 28 cycles (bases) of template; read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode); read three with 8 cycles (bases) of sample barcode; 8 cycles (bases) skipped; read four with 28 cycles (bases) of template. The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein. |
| --RUN_BARCODE | | The barcode of the run. Prefixed to read names. |
| --SAMPLE_ALIAS -ALIAS | | Deprecated (use LIBRARY_PARAMS). The name of the sequenced sample. |
| --SEQUENCING_CENTER | | The name of the sequencing center that produced the reads. Used to set the @RG->CN header tag. |
| Optional Tool Arguments | | |
| --ADAPTERS_TO_CHECK | [INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM] | Which adapters to look for in the read. |
| --APPLY_EAMSS_FILTER | true | Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2. |
| --arguments_file | | Read one or more arguments files and add them to the command line. |
| --BARCODE_POPULATION_STRATEGY | ORPHANS_ONLY | When should the sample barcode (as read by the sequencer) be placed on the reads in the BC tag? |
| --BARCODES_DIR -BCD | | The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR. |
| --COMPRESS_OUTPUTS -GZIP | false | Compress output FASTQ files using gzip and append a .gz extension to the file names. |
| --DISTANCE_MODE | HAMMING | The distance metric that should be used to compare the barcode-reads and the provided barcodes for finding the best and second-best assignments. |
| --FIRST_TILE | | If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order. |
| --FIVE_PRIME_ADAPTER | | For specifying adapters other than standard Illumina. |
| --help -h | false | Display the help message. |
| --IGNORE_UNEXPECTED_BARCODES -IGNORE_UNEXPECTED | false | Whether to ignore reads whose barcodes are not found in LIBRARY_PARAMS. Useful when outputting SAM, BAM or CRAM files for only a subset of the barcodes in a lane. |
| --INCLUDE_BARCODE_QUALITY | false | Should the barcode quality be included when the sample barcode is included? |
| --INCLUDE_BC_IN_RG_TAG | false | Whether to include the barcode information in the @RG->BC header tag. Defaults to false until included in the SAM spec. |
| --INCLUDE_NON_PF_READS -NONPF | true | Whether to include non-PF reads. |
| --INPUT_PARAMS_FILE | | The input file that defines parameters for the program. This is the BARCODE_FILE for `ExtractIlluminaBarcodes` or the MULTIPLEX_PARAMS or LIBRARY_PARAMS file for `IlluminaBasecallsToFastq` or `IlluminaBasecallsToSam`. |
| --LIBRARY_NAME -LIB | | Deprecated (use LIBRARY_PARAMS). The name of the sequenced library. |
| --MATCH_BARCODES_INLINE | false | If true, match barcodes on the fly. Otherwise parse the barcodes from the barcodes file. |
| --MAX_MISMATCHES | 1 | Maximum mismatches for a barcode to be considered a match. |
| --MAX_NO_CALLS | 2 | Maximum allowable number of no-calls in a barcode read before it is considered unmatchable. |
| --MAX_READS_IN_RAM_PER_TILE | -1 | Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices. Deprecated: use `MAX_RECORDS_IN_RAM`. |
| --METRICS_FILE -M | | Per-barcode and per-lane metrics written to this file. |
| --MIN_MISMATCH_DELTA | 1 | Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match. |
| --MINIMUM_BASE_QUALITY -Q | 0 | Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match. |
| --MINIMUM_QUALITY | 2 | The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice the value has been observed lower. |
| --MOLECULAR_INDEX_BASE_QUALITY_TAG | QX | The tag to use to store any molecular index base qualities. If more than one molecular index is found, their qualities will be concatenated and stored here (i.e. the number of "M" operators in the READ_STRUCTURE). |
| --MOLECULAR_INDEX_TAG | RX | The tag to use to store any molecular indexes. If more than one molecular index is found, they will be concatenated and stored here. |
| --NUM_PROCESSORS | 0 | The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is set to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of threads is the number of cores available on the machine minus the absolute value of NUM_PROCESSORS. |
| --PLATFORM | ILLUMINA | The name of the sequencing technology that produced the read. |
| --PROCESS_SINGLE_TILE | | If set, process only the tile number given and prepend the tile number to the output file name. |
| --READ_GROUP_ID -RG | | ID used to link RG header record with RG tag in SAM record. If these are unique in SAM files that get merged, merge performance is better. If not specified, READ_GROUP_ID will be set to |
| --RUN_START_DATE | | The start date of the run. |
| --SORT | true | If true, the output records are sorted by read name. Otherwise they are unsorted. |
| --TAG_PER_MOLECULAR_INDEX | | The list of tags to store each molecular index. The number of tags should match the number of molecular indexes. |
| --THREE_PRIME_ADAPTER | | For specifying adapters other than standard Illumina. |
| --TILE_LIMIT | | If set, process no more than this many tiles (used for debugging). |
| --version | false | Display the version number for this tool. |
| Optional Common Arguments | | |
| --COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF). |
| --CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output. |
| --CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created. |
| --MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --REFERENCE_SEQUENCE -R | | Reference sequence file. |
| --TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files. |
| --USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output. |
| --USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input. |
| --VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. |
| --VERBOSITY | INFO | Control verbosity of logging. |
| Advanced Arguments | | |
| --showHidden | false | Display hidden arguments. |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
Which adapters to look for in the read.
The --ADAPTERS_TO_CHECK argument is an enumerated type (List[IlluminaAdapterPair]), which can have one of the following values:
List[IlluminaAdapterPair] [INDEXED, DUAL_INDEXED, NEXTERA_V2, FLUIDIGM]
Apply EAMSS filtering to identify inappropriately quality scored bases towards the ends of reads and convert their quality scores to Q2.
boolean true
read one or more arguments files and add them to the command line
List[File] []
Deprecated (use LIBRARY_PARAMS). Tab-separated file for creating all output SAM, BAM or CRAM files for barcoded run with single IlluminaBasecallsToSam invocation. Columns are BARCODE, OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME. Row with BARCODE=N is used to specify a file for no barcode match
Exclusion: This argument cannot be used at the same time as OUTPUT, SAMPLE_ALIAS, LIBRARY_NAME, LIBRARY_PARAMS.
R File null
When should the sample barcode (as read by the sequencer) be placed on the reads in the BC tag?
The --BARCODE_POPULATION_STRATEGY argument is an enumerated type (PopulateBarcode), which can have one of the following values:
PopulateBarcode ORPHANS_ONLY
The barcodes directory with _barcode.txt files (generated by ExtractIlluminaBarcodes). If not set, use BASECALLS_DIR.
File null
The Illumina basecalls directory.
R File null
Compress output FASTQ files using gzip and append a .gz extension to the file names.
boolean false
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
Whether to create an index when writing VCF or coordinate sorted BAM output.
Boolean false
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
The distance metric that should be used to compare the barcode-reads and the provided barcodes for finding the best and second-best assignments.
The --DISTANCE_MODE argument is an enumerated type (DistanceMetric), which can have one of the following values:
DistanceMetric HAMMING
If set, this is the first tile to be processed (used for debugging). Note that tiles are not processed in numerical order.
Exclusion: This argument cannot be used at the same time as PROCESS_SINGLE_TILE.
Integer null
For specifying adapters other than standard Illumina
String null
display the help message
boolean false
Whether to ignore reads whose barcodes are not found in LIBRARY_PARAMS. Useful when outputting SAM, BAM or CRAM files for only a subset of the barcodes in a lane.
boolean false
Should the barcode quality be included when the sample barcode is included?
boolean false
Whether to include the barcode information in the @RG->BC header tag. Defaults to false until included in the SAM spec.
boolean false
Whether to include non-PF reads
boolean true
The input file that defines parameters for the program. This is the BARCODE_FILE for `ExtractIlluminaBarcodes` or the MULTIPLEX_PARAMS or LIBRARY_PARAMS file for `IlluminaBasecallsToFastq` or `IlluminaBasecallsToSam`
File null
Lane number. This can be specified multiple times. Reads with the same index in multiple lanes will be added to the same output file.
R List[Integer] []
Deprecated (use LIBRARY_PARAMS). The name of the sequenced library
Exclusion: This argument cannot be used at the same time as BARCODE_PARAMS, LIBRARY_PARAMS.
String null
Tab-separated file for creating all output SAM, BAM or CRAM files for a lane with single IlluminaBasecallsToSam invocation. The columns are OUTPUT, SAMPLE_ALIAS, and LIBRARY_NAME, BARCODE_1, BARCODE_2 ... BARCODE_X where X = number of barcodes per cluster (optional). Row with BARCODE_1 set to 'N' is used to specify a file for no barcode match. You may also provide any 2 letter RG header attributes (excluding PU, CN, PL, and DT) as columns in this file and the values for those columns will be inserted into the RG tag for the SAM, BAM or CRAM file created for a given row.
Exclusion: This argument cannot be used at the same time as OUTPUT, SAMPLE_ALIAS, LIBRARY_NAME, BARCODE_PARAMS.
R File null
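The column rules for LIBRARY_PARAMS described above can be checked before a run. The following Python sketch reads the file content, separates the BARCODE_N columns from the fixed columns, and treats any remaining two-letter column as an extra RG attribute. The function name `parse_library_params` and the returned dictionary layout are hypothetical, and the validation is only one reading of the description, not the tool's own parser.

```python
import csv
import io
import re

REQUIRED = ("OUTPUT", "SAMPLE_ALIAS", "LIBRARY_NAME")
EXCLUDED_RG = {"PU", "CN", "PL", "DT"}  # not allowed as extra columns, per the text above


def parse_library_params(text):
    """Parse tab-separated LIBRARY_PARAMS text into one dict per output file."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED if c not in header]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    # BARCODE_1 ... BARCODE_X, ordered by their numeric suffix.
    barcode_cols = sorted(
        (c for c in header if re.fullmatch(r"BARCODE_\d+", c)),
        key=lambda c: int(c.split("_")[1]),
    )
    # Assumption: every other column is meant as a two-letter RG attribute.
    rg_cols = [c for c in header if c not in REQUIRED and c not in barcode_cols]
    for c in rg_cols:
        if len(c) != 2 or c in EXCLUDED_RG:
            raise ValueError(f"unsupported column: {c!r}")
    rows = []
    for row in reader:
        rows.append({
            "barcodes": [row[c] for c in barcode_cols],
            "output": row["OUTPUT"],
            "sample_alias": row["SAMPLE_ALIAS"],
            "library_name": row["LIBRARY_NAME"],
            "rg": {c: row[c] for c in rg_cols},
            # A first barcode of 'N' marks the file for reads with no barcode match.
            "unmatched": bool(barcode_cols) and row[barcode_cols[0]] == "N",
        })
    return rows
```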
If true, match barcodes on the fly. Otherwise parse the barcodes from the barcodes file.
Boolean false
Maximum mismatches for a barcode to be considered a match.
int 1 [ [ -∞ ∞ ] ]
Maximum allowable number of no-calls in a barcode read before it is considered unmatchable.
int 2 [ [ -∞ ∞ ] ]
Configure SortingCollections to store this many records before spilling to disk. For an indexed run, each SortingCollection gets this value/number of indices. Deprecated: use `MAX_RECORDS_IN_RAM`
int -1 [ [ -∞ ∞ ] ]
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
Per-barcode and per-lane metrics written to this file.
File null
Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match.
int 1 [ [ -∞ ∞ ] ]
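To illustrate how MAX_NO_CALLS, MAX_MISMATCHES, and MIN_MISMATCH_DELTA might interact under the HAMMING distance mode, here is a minimal Python sketch. It is not the tool's implementation: the function names are hypothetical, and two details are assumptions, namely that no-called bases (`N` or `.`) are not counted as mismatches and that a lone candidate barcode is compared against a worst-case second-best score.

```python
def count_no_calls(read_barcode):
    """Number of no-called bases in the barcode read."""
    return sum(1 for b in read_barcode if b in "N.")


def hamming_mismatches(read_barcode, expected):
    """Mismatching positions, skipping no-calls (an assumption of this sketch)."""
    return sum(
        1 for r, e in zip(read_barcode, expected) if r not in "N." and r != e
    )


def match_barcode(read_barcode, expected_barcodes,
                  max_mismatches=1, min_mismatch_delta=1, max_no_calls=2):
    """Return the matched barcode, or None if the read is unmatched."""
    if not expected_barcodes:
        return None
    # Too many no-calls: the read is considered unmatchable.
    if count_no_calls(read_barcode) > max_no_calls:
        return None
    scored = sorted((hamming_mismatches(read_barcode, e), e) for e in expected_barcodes)
    best_mm, best = scored[0]
    # With a single candidate, use the barcode length as the second-best score.
    second_mm = scored[1][0] if len(scored) > 1 else len(read_barcode)
    if best_mm <= max_mismatches and second_mm - best_mm >= min_mismatch_delta:
        return best
    return None
```

With the three barcodes from the LIBRARY_PARAMS example, a read of `AAAAGAAA` is one mismatch away from both `AAAAAAAA` and `AAAAGAAG`, so the delta between best and second best is 0 and the read is left unmatched under the default MIN_MISMATCH_DELTA of 1.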
Minimum base quality. Any barcode bases falling below this quality will be considered a mismatch even if the bases match.
int 0 [ [ -∞ ∞ ] ]
The minimum quality (after transforming 0s to 1s) expected from reads. If qualities are lower than this value, an error is thrown. The default of 2 is what Illumina's spec describes as the minimum, but in practice the value has been observed lower.
int 2 [ [ -∞ ∞ ] ]
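The order of operations in the MINIMUM_QUALITY description (first map 0 to 1, then compare against the minimum) can be sketched as follows. The function name `check_qualities` is hypothetical and the error type is an arbitrary choice for illustration.

```python
def check_qualities(quals, minimum_quality=2):
    """Map quality 0 to 1, then raise if any value is below minimum_quality."""
    adjusted = [1 if q == 0 else q for q in quals]
    for i, q in enumerate(adjusted):
        if q < minimum_quality:
            raise ValueError(
                f"quality {q} at position {i} is below the minimum {minimum_quality}"
            )
    return adjusted
```

Under this reading, a raw quality of 0 still fails the default minimum of 2 because it becomes 1 after the transformation; lowering MINIMUM_QUALITY to 1 would let it pass.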
The tag to use to store any molecular index base qualities. If more than one molecular index is found, their qualities will be concatenated and stored here (i.e. the number of "M" operators in the READ_STRUCTURE).
String QX
The tag to use to store any molecular indexes. If more than one molecular index is found, they will be concatenated and stored here.
String RX
The number of threads to run in parallel. If NUM_PROCESSORS = 0, the number of threads is set to the number of cores available on the machine. If NUM_PROCESSORS < 0, the number of threads is the number of cores available on the machine minus the absolute value of NUM_PROCESSORS (for example, -2 on an 8-core machine gives 6 threads).
Integer 0 [ [ -∞ ∞ ] ]
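The three cases for NUM_PROCESSORS can be written out as a small function. This is a sketch of the rule as described, not the tool's code; the name `resolve_thread_count` is hypothetical, and clamping the result to at least 1 is an added assumption.

```python
import os


def resolve_thread_count(num_processors, available=None):
    """Translate a NUM_PROCESSORS value into a thread count."""
    if available is None:
        available = os.cpu_count() or 1
    if num_processors == 0:
        # Use every core on the machine.
        return available
    if num_processors < 0:
        # Leave |num_processors| cores unused; the floor of 1 is an assumption.
        return max(1, available + num_processors)
    return num_processors
```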
Deprecated (use LIBRARY_PARAMS). The output SAM, BAM or CRAM file. Format is determined by extension.
Exclusion: This argument cannot be used at the same time as BARCODE_PARAMS, LIBRARY_PARAMS.
R File null
The name of the sequencing technology that produced the read.
String ILLUMINA
If set, process only the tile number given and prepend the tile number to the output file name.
Exclusion: This argument cannot be used at the same time as FIRST_TILE.
Integer null
Whether to suppress job-summary info on System.err.
Boolean false
ID used to link RG header record with RG tag in SAM record. If these are unique in SAM files that get merged, merge performance is better. If not specified, READ_GROUP_ID will be set to
String null
A description of the logical structure of clusters in an Illumina Run, i.e. a description of the structure IlluminaBasecallsToSam assumes the data to be in. It should consist of integer/character pairs describing the number of cycles and the type of those cycles (B for Sample Barcode, M for molecular barcode, T for Template, and S for skip). For example, if the input data consists of 80-base clusters and we provide a read structure of "28T8M8B8S28T", then the sequence is split into four reads, with one segment skipped:
* read one with 28 cycles (bases) of template
* read two with 8 cycles (bases) of molecular barcode (e.g. a unique molecular barcode)
* read three with 8 cycles (bases) of sample barcode
* 8 cycles (bases) skipped
* read four with 28 cycles (bases) of template
The skipped cycles would NOT be included in an output SAM/BAM file or in read groups therein.
R String null
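The READ_STRUCTURE notation is easy to parse mechanically, which can help when checking that a structure accounts for every cycle in a run. The Python sketch below parses a structure into (length, type) pairs and splits one cluster's bases accordingly, dropping skipped cycles. The function names and segment labels are hypothetical; the tool's own parser may accept or reject inputs differently.

```python
import re

SEGMENT_NAMES = {
    "T": "template",
    "B": "sample_barcode",
    "M": "molecular_barcode",
    "S": "skip",
}


def parse_read_structure(read_structure):
    """Parse e.g. '28T8M8B8S28T' into [(28, 'T'), (8, 'M'), ...]."""
    segments = []
    pos = 0
    for m in re.finditer(r"(\d+)([TBMS])", read_structure):
        if m.start() != pos:
            raise ValueError(f"unexpected text at offset {pos}")
        length = int(m.group(1))
        if length <= 0:
            raise ValueError("segment lengths must be positive")
        segments.append((length, m.group(2)))
        pos = m.end()
    if not segments or pos != len(read_structure):
        raise ValueError(f"invalid read structure: {read_structure!r}")
    return segments


def split_cluster(bases, read_structure):
    """Split one cluster's bases into named reads, omitting skipped cycles."""
    segments = parse_read_structure(read_structure)
    if sum(n for n, _ in segments) != len(bases):
        raise ValueError("read structure does not cover all cycles")
    reads = []
    offset = 0
    for length, kind in segments:
        if kind != "S":
            reads.append((SEGMENT_NAMES[kind], bases[offset:offset + length]))
        offset += length
    return reads
```

For the "28T8M8B8S28T" example, the segment lengths sum to 80 cycles and `split_cluster` returns four reads: template, molecular barcode, sample barcode, and template.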
Reference sequence file.
PicardHtsPath null
The barcode of the run. Prefixed to read names.
R String null
The start date of the run.
Date null
Deprecated (use LIBRARY_PARAMS). The name of the sequenced sample
Exclusion: This argument cannot be used at the same time as BARCODE_PARAMS, LIBRARY_PARAMS.
R String null
The name of the sequencing center that produced the reads. Used to set the @RG->CN header tag.
R String null
display hidden arguments
boolean false
If true, the output records are sorted by read name. Otherwise they are unsorted.
Boolean true
The list of tags to store each molecular index. The number of tags should match the number of molecular indexes.
List[String] []
For specifying adapters other than standard Illumina
String null
If set, process no more than this many tiles (used for debugging).
Integer null
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
ValidationStringency STRICT
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
LogLevel INFO
display the version number for this tool
boolean false
GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.