Checks the odds that all data in the set of input files come from the same individual. Can be used to cross-check readgroups, libraries, samples, or files. Acceptable inputs include BAM/SAM/CRAM and VCF/GVCF files. Output delivers LOD scores in the form of a CrosscheckMetric file.
java -jar picard.jar CrosscheckFingerprints \
INPUT=sample.with.many.readgroups.bam \
HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
LOD_THRESHOLD=-5 \
OUTPUT=sample.crosscheck_metrics
java -jar picard.jar CrosscheckFingerprints \
INPUT=sample.one.with.many.readgroups.bam \
INPUT=sample.two.with.many.readgroups.bam \
HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
LOD_THRESHOLD=-5 \
EXPECT_ALL_GROUPS_TO_MATCH=true \
OUTPUT=sample.crosscheck_metrics
This table summarizes the command-line arguments that are specific to this tool. For more details on each argument, see the list below the table.
| Argument name(s) | Default value | Summary |
|---|---|---|
| Required Arguments | | |
| --HAPLOTYPE_MAP -H | | The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://gatk.broadinstitute.org/hc/en-us/articles/360035531672-Haplotype-map-format for details. |
| --INPUT -I | | One or more input files (or lists of files) with which to compare fingerprints. |
| Optional Tool Arguments | | |
| --ALLOW_DUPLICATE_READS | false | Allow the use of duplicate reads in performing the comparison. Can be useful when duplicate marking has been overly aggressive and coverage is low. |
| --arguments_file | | read one or more arguments files and add them to the command line |
| --CALCULATE_TUMOR_AWARE_RESULTS | true | Specifies whether the tumor-aware results should be calculated. These are time-consuming and can roughly double the runtime of the tool. When crosschecking many groups, skipping the tumor-aware results can give a significant speedup. |
| --CROSSCHECK_BY | READGROUP | Specifies which data-type should be used as the basic comparison unit. Fingerprints from readgroups can be "rolled-up" to the LIBRARY, SAMPLE, or FILE level before being compared. Fingerprints from VCF can be compared by SAMPLE or FILE. |
| --CROSSCHECK_MODE | CHECK_SAME_SAMPLE | An argument that controls how crosschecking with both INPUT and SECOND_INPUT should occur. |
| --EXIT_CODE_WHEN_MISMATCH | 1 | When one or more mismatches between groups is detected, exit with this value instead of 0. |
| --EXIT_CODE_WHEN_NO_VALID_CHECKS | 1 | When all LOD scores are zero, exit with this value. |
| --EXPECT_ALL_GROUPS_TO_MATCH | false | Expect all groups' fingerprints to match, irrespective of their sample names. By default (with this value set to false), groups (readgroups, libraries, files, or samples) with different sample names are expected to mismatch, and those with the same sample name are expected to match. |
| --GENOTYPING_ERROR_RATE | 0.01 | (Deprecated) Assumed genotyping error rate that provides a floor on the probability that a genotype comes from the expected sample. Must be greater than zero. |
| --help -h | false | display the help message |
| --INPUT_INDEX_MAP | | A tsv with two columns and no header which maps the input files to corresponding indices; to be used when index files are not located next to input files. First column must match the list of inputs. |
| --INPUT_SAMPLE_FILE_MAP | | A tsv with two columns representing the sample as it should be used for comparisons to SECOND_INPUT (in the first column) and the source file (in INPUT) for the fingerprint (in the second column). Need only include the samples that change. Values in column 1 should be unique even in union with the remaining unmapped samples. Values in column 2 should be unique in the file. Will error if more than one sample is found in a file (multi-sample VCF) pointed to in column 2. Should only be used in the presence of SECOND_INPUT. |
| --INPUT_SAMPLE_MAP | | A tsv with two columns representing the sample as it appears in the INPUT data (in column 1) and the sample as it should be used for comparisons to SECOND_INPUT (in the second column). Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT. |
| --LOD_THRESHOLD -LOD | 0.0 | If any two groups with the same sample name match with a LOD score lower than the threshold, the tool exits with a non-zero code to indicate an error. The program also exits with an error if it finds two groups with different sample names that match with a LOD score greater than -LOD_THRESHOLD. A LOD score of 0 means the groups are equally likely to match as to come from different individuals; a negative LOD score of -N means the groups are 10^N times more likely to come from different individuals; and +N means they are 10^N times more likely to come from the same individual. |
| --LOSS_OF_HET_RATE | 0.5 | The rate at which a heterozygous genotype in a normal sample turns into a homozygous genotype (via loss of heterozygosity) in the tumor (the model assumes independent events, so this needs to be larger than reality). |
| --MATRIX_OUTPUT -MO | | Optional output file to write matrix of LOD scores to. This is less informative than the metrics output and only contains Normal-Normal LOD score (i.e. doesn't account for Loss of Heterozygosity). It is however sometimes easier to use visually. |
| --MAX_EFFECT_OF_EACH_HAPLOTYPE_BLOCK | 3.0 | Maximal effect of any single haplotype block on outcome (-log10 of maximal likelihood difference between the different values for the three possible genotypes). |
| --NUM_THREADS | 1 | The number of threads to use to process files and generate fingerprints. |
| --OUTPUT -O | | Optional output file to write metrics to. Default is to write to stdout. |
| --OUTPUT_ERRORS_ONLY | false | If true, then only groups that do not relate to each other as expected will have their LODs reported. |
| --REQUIRE_INDEX_FILES | false | A boolean value to determine whether input files should only be parsed if index files are available. Without turning this option on, the tool will need to read through the entirety of input files without index files either provided via the INPUT_INDEX_MAP or locally accessible relative to the input, which significantly increases runtime. If set to true and no index is found for a file, an exception will be thrown. This applies for both the INPUT and SECOND_INPUT files. |
| --SAMPLE_INDIVIDUAL_MAP | | A tsv with two columns representing the individual with which each sample is associated. The first column is the sample id, and the second column is the associated individual id. Values in the first column must be unique. If INPUT_SAMPLE_MAP or SECOND_INPUT_SAMPLE_MAP is also specified, then the values in the first column of this file should be the sample aliases specified in the second columns of INPUT_SAMPLE_MAP and SECOND_INPUT_SAMPLE_MAP, respectively. When this input is specified, expectations for matches will be based on the equality or inequality of the individual ids associated with two samples, as opposed to the sample ids themselves. Samples which are not listed in this file will have their sample id used as their individual id, for the purposes of match expectations. This means that one sample id could be used as the individual id for another sample, but not included in the map itself, and these two samples would be considered to have come from the same individual. Note that use of this parameter only affects labelling of matches and mismatches as EXPECTED or UNEXPECTED. It has no effect on how data is grouped for crosschecking. |
| --SECOND_INPUT -SI | | A second set of input files (or lists of files) with which to compare fingerprints. If this option is provided the tool compares each sample in INPUT with the sample from SECOND_INPUT that has the same sample ID. In addition, data will be grouped by SAMPLE regardless of the value of CROSSCHECK_BY. When operating in this mode, each sample in INPUT must also have a corresponding sample in SECOND_INPUT. If this is violated, the tool will proceed to check the matching samples, but report the missing samples and return a non-zero error-code. |
| --SECOND_INPUT_INDEX_MAP | | A tsv with two columns and no header which maps the second input files to corresponding indices; to be used when index files are not located next to second input files. First column must match the list of second inputs. |
| --SECOND_INPUT_SAMPLE_MAP | | A tsv with two columns representing the sample as it appears in the SECOND_INPUT data (in column 1) and the sample as it should be used for comparisons to INPUT (in the second column). Note that in case of unrolling files (file-of-filenames) one would need to reference the final file, i.e. the file that contains the genomic data. Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT. |
| --version | false | display the version number for this tool |
| Optional Common Arguments | | |
| --COMPRESSION_LEVEL | 5 | Compression level for all compressed files created (e.g. BAM and VCF). |
| --CREATE_INDEX | false | Whether to create an index when writing VCF or coordinate sorted BAM output. |
| --CREATE_MD5_FILE | false | Whether to create an MD5 digest for any BAM or FASTQ files created. |
| --MAX_RECORDS_IN_RAM | 500000 | When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed. |
| --QUIET | false | Whether to suppress job-summary info on System.err. |
| --REFERENCE_SEQUENCE -R | | Reference sequence file. |
| --TMP_DIR | | One or more directories with space available to be used by this program for temporary storage of working files |
| --USE_JDK_DEFLATER -use_jdk_deflater | false | Use the JDK Deflater instead of the Intel Deflater for writing compressed output |
| --USE_JDK_INFLATER -use_jdk_inflater | false | Use the JDK Inflater instead of the Intel Inflater for reading compressed input |
| --VALIDATION_STRINGENCY | STRICT | Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded. |
| --VERBOSITY | INFO | Control verbosity of logging. |
| Advanced Arguments | | |
| --showHidden | false | display hidden arguments |
| Deprecated Arguments | | |
| --GENOTYPING_ERROR_RATE | 0.01 | (Deprecated) Assumed genotyping error rate that provides a floor on the probability that a genotype comes from the expected sample. Must be greater than zero. |
Arguments in this list are specific to this tool. Keep in mind that other arguments are available that are shared with other tools (e.g. command-line GATK arguments); see Inherited arguments above.
--ALLOW_DUPLICATE_READS
Allow the use of duplicate reads in performing the comparison. Can be useful when duplicate marking has been overly aggressive and coverage is low.
boolean false
--arguments_file
read one or more arguments files and add them to the command line
List[File] []
--CALCULATE_TUMOR_AWARE_RESULTS
Specifies whether the tumor-aware results should be calculated. These are time-consuming and can roughly double the runtime of the tool. When crosschecking many groups, skipping the tumor-aware results can give a significant speedup.
boolean true
--COMPRESSION_LEVEL
Compression level for all compressed files created (e.g. BAM and VCF).
int 5 [ [ -∞ ∞ ] ]
--CREATE_INDEX
Whether to create an index when writing VCF or coordinate sorted BAM output.
Boolean false
--CREATE_MD5_FILE
Whether to create an MD5 digest for any BAM or FASTQ files created.
boolean false
--CROSSCHECK_BY
Specifies which data-type should be used as the basic comparison unit. Fingerprints from readgroups can be "rolled-up" to the LIBRARY, SAMPLE, or FILE level before being compared. Fingerprints from VCF can be compared by SAMPLE or FILE.
The --CROSSCHECK_BY argument is an enumerated type (DataType), which can have one of the following values:
DataType READGROUP
--CROSSCHECK_MODE
An argument that controls how crosschecking with both INPUT and SECOND_INPUT should occur.
The --CROSSCHECK_MODE argument is an enumerated type (CrosscheckMode), which can have one of the following values:
CrosscheckMode CHECK_SAME_SAMPLE
--EXIT_CODE_WHEN_MISMATCH
When one or more mismatches between groups is detected, exit with this value instead of 0.
int 1 [ [ -∞ ∞ ] ]
--EXIT_CODE_WHEN_NO_VALID_CHECKS
When all LOD scores are zero, exit with this value.
int 1 [ [ -∞ ∞ ] ]
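Because both exit-code arguments default to 1, a calling script cannot tell a mismatch from a run with no valid checks unless it sets them to different values. The following is a minimal sketch of a wrapper that does so. The values 3 and 4 and the function name `describe_crosscheck_exit` are assumptions chosen for illustration; they are not part of Picard.

```shell
# Hypothetical exit codes chosen by the caller; the tool's defaults are both 1.
MISMATCH_CODE=3          # value to pass as EXIT_CODE_WHEN_MISMATCH
NO_VALID_CHECKS_CODE=4   # value to pass as EXIT_CODE_WHEN_NO_VALID_CHECKS

# Translate an exit status (first argument) into a short label.
describe_crosscheck_exit() {
    case "$1" in
        0) echo "ok" ;;
        "$MISMATCH_CODE") echo "mismatch" ;;
        "$NO_VALID_CHECKS_CODE") echo "no-valid-checks" ;;
        *) echo "other-failure" ;;
    esac
}

# Intended use (commented out so the sketch has no side effects):
# java -jar picard.jar CrosscheckFingerprints \
#     INPUT=sample.with.many.readgroups.bam \
#     HAPLOTYPE_MAP=fingerprinting_haplotype_database.txt \
#     EXIT_CODE_WHEN_MISMATCH="$MISMATCH_CODE" \
#     EXIT_CODE_WHEN_NO_VALID_CHECKS="$NO_VALID_CHECKS_CODE" \
#     OUTPUT=sample.crosscheck_metrics
# describe_crosscheck_exit "$?"
```

Any status other than 0 and the two chosen codes is reported as a generic failure, since it would come from something other than these two arguments.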
--EXPECT_ALL_GROUPS_TO_MATCH
Expect all groups' fingerprints to match, irrespective of their sample names. By default (with this value set to false), groups (readgroups, libraries, files, or samples) with different sample names are expected to mismatch, and those with the same sample name are expected to match.
boolean false
--GENOTYPING_ERROR_RATE
(Deprecated)
Assumed genotyping error rate that provides a floor on the probability that a genotype comes from the expected sample. Must be greater than zero.
double 0.01 [ [ -∞ ∞ ] ]
--HAPLOTYPE_MAP -H
The file lists a set of SNPs, optionally arranged in high-LD blocks, to be used for fingerprinting. See https://gatk.broadinstitute.org/hc/en-us/articles/360035531672-Haplotype-map-format for details.
R File null
--help -h
display the help message
boolean false
--INPUT -I
One or more input files (or lists of files) with which to compare fingerprints.
R List[String] []
--INPUT_INDEX_MAP
A tsv with two columns and no header which maps the input files to corresponding indices; to be used when index files are not located next to input files. First column must match the list of inputs.
File null
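As an illustration of the two-column, header-less layout described for INPUT_INDEX_MAP, the sketch below prints one tab-separated line per input file. The function name `make_index_map`, the example paths, and the naming rule (index stored in a separate directory as the input's base name plus `.bai`) are assumptions for this example only, not requirements of the tool.

```shell
# Print one "input<TAB>index" line per input file, with no header row.
# Assumption (not from the tool documentation): every index sits in the
# directory given as the first argument and is named "<input basename>.bai".
make_index_map() {
    index_dir=$1
    shift
    for input in "$@"; do
        printf '%s\t%s\n' "$input" "$index_dir/$(basename "$input").bai"
    done
}

# Example with hypothetical paths:
# make_index_map /indices /data/a.bam /data/b.bam > input_index_map.tsv
```

The first column repeats the inputs exactly as they are passed to the function, which keeps it aligned with the stated requirement that the first column match the list of inputs.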
--INPUT_SAMPLE_FILE_MAP
A tsv with two columns representing the sample as it should be used for comparisons to SECOND_INPUT (in the first column) and the source file (in INPUT) for the fingerprint (in the second column). Need only include the samples that change. Values in column 1 should be unique even in union with the remaining unmapped samples. Values in column 2 should be unique in the file. Will error if more than one sample is found in a file (multi-sample VCF) pointed to in column 2. Should only be used in the presence of SECOND_INPUT.
Exclusion: This argument cannot be used at the same time as INPUT_SAMPLE_MAP.
File null
--INPUT_SAMPLE_MAP
A tsv with two columns representing the sample as it appears in the INPUT data (in column 1) and the sample as it should be used for comparisons to SECOND_INPUT (in the second column). Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT.
Exclusion: This argument cannot be used at the same time as INPUT_SAMPLE_FILE_MAP.
File null
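The sample map files above require unique values within a column. A quick pre-flight check along these lines may catch duplicates before the tool is run. This is a sketch only: the function name `duplicates_in_column` is hypothetical, and it checks uniqueness within the file itself, not "in union with the remaining unmapped samples", which would also need the sample names from the input data.

```shell
# Print each value that occurs more than once in a column of a tab-separated
# file, and return status 1 if any duplicate was found.
#   $1 = path to the tsv, $2 = 1-based column number
duplicates_in_column() {
    awk -F '\t' -v col="$2" '
        seen[$col]++ == 1 { print $col; dup = 1 }   # report on second sighting only
        END { exit dup + 0 }
    ' "$1"
}

# Example with a hypothetical file name:
# duplicates_in_column input_sample_map.tsv 1 || echo "column 1 is not unique"
```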
--LOD_THRESHOLD -LOD
If any two groups with the same sample name match with a LOD score lower than the threshold, the tool exits with a non-zero code to indicate an error. The program also exits with an error if it finds two groups with different sample names that match with a LOD score greater than -LOD_THRESHOLD.
A LOD score of 0 means the groups are equally likely to match as to come from different individuals; a negative LOD score of -N means the groups are 10^N times more likely to come from different individuals; and +N means they are 10^N times more likely to come from the same individual.
double 0.0 [ [ -∞ ∞ ] ]
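The two error conditions stated for LOD_THRESHOLD can be written out as a small check. The sketch below is a direct transcription of those two sentences, not the tool's source code; the function name `lod_is_error` and its argument convention are assumptions for illustration.

```shell
# Succeed (status 0) when a pair of groups breaks the LOD_THRESHOLD rule,
# fail (status 1) otherwise.
#   $1 = LOD score, $2 = LOD_THRESHOLD,
#   $3 = "same" if the two groups share a sample name, anything else if not
lod_is_error() {
    awk -v lod="$1" -v thr="$2" -v rel="$3" 'BEGIN {
        # same name: error when the score is below the threshold
        # different names: error when the score is above -threshold
        bad = (rel == "same") ? (lod < thr) : (lod > -thr)
        exit !bad
    }'
}

# With LOD_THRESHOLD=-5 as in the usage examples:
# lod_is_error -7.2 -5 same       -> error: same sample name, score below -5
# lod_is_error 8 -5 different     -> error: different names, score above 5
# lod_is_error 2 -5 different     -> no error: score is not above 5
```

Under this reading, a threshold of -5 flags a same-name pair only when the data are more than 10^5 times more likely to come from different individuals.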
--LOSS_OF_HET_RATE
The rate at which a heterozygous genotype in a normal sample turns into a homozygous genotype (via loss of heterozygosity) in the tumor (the model assumes independent events, so this needs to be larger than reality).
double 0.5 [ [ -∞ ∞ ] ]
--MATRIX_OUTPUT -MO
Optional output file to write matrix of LOD scores to. This is less informative than the metrics output and only contains Normal-Normal LOD score (i.e. doesn't account for Loss of Heterozygosity). It is however sometimes easier to use visually.
Exclusion: This argument cannot be used at the same time as SECOND_INPUT.
File null
--MAX_EFFECT_OF_EACH_HAPLOTYPE_BLOCK
Maximal effect of any single haplotype block on outcome (-log10 of maximal likelihood difference between the different values for the three possible genotypes).
double 3.0 [ [ 0 ∞ ] ]
--MAX_RECORDS_IN_RAM
When writing files that need to be sorted, this will specify the number of records stored in RAM before spilling to disk. Increasing this number reduces the number of file handles needed to sort the file, and increases the amount of RAM needed.
Integer 500000 [ [ -∞ ∞ ] ]
--NUM_THREADS
The number of threads to use to process files and generate fingerprints.
int 1 [ [ -∞ ∞ ] ]
--OUTPUT -O
Optional output file to write metrics to. Default is to write to stdout.
File null
--OUTPUT_ERRORS_ONLY
If true, then only groups that do not relate to each other as expected will have their LODs reported.
boolean false
--QUIET
Whether to suppress job-summary info on System.err.
Boolean false
--REFERENCE_SEQUENCE -R
Reference sequence file.
PicardHtsPath null
--REQUIRE_INDEX_FILES
A boolean value to determine whether input files should only be parsed if index files are available. Without turning this option on, the tool will need to read through the entirety of input files without index files either provided via the INPUT_INDEX_MAP or locally accessible relative to the input, which significantly increases runtime. If set to true and no index is found for a file, an exception will be thrown. This applies for both the INPUT and SECOND_INPUT files.
boolean false
--SAMPLE_INDIVIDUAL_MAP
A tsv with two columns representing the individual with which each sample is associated. The first column is the sample id, and the second column is the associated individual id. Values in the first column must be unique. If INPUT_SAMPLE_MAP or SECOND_INPUT_SAMPLE_MAP is also specified, then the values in the first column of this file should be the sample aliases specified in the second columns of INPUT_SAMPLE_MAP and SECOND_INPUT_SAMPLE_MAP, respectively. When this input is specified, expectations for matches will be based on the equality or inequality of the individual ids associated with two samples, as opposed to the sample ids themselves. Samples which are not listed in this file will have their sample id used as their individual id, for the purposes of match expectations. This means that one sample id could be used as the individual id for another sample, but not included in the map itself, and these two samples would be considered to have come from the same individual. Note that use of this parameter only affects labelling of matches and mismatches as EXPECTED or UNEXPECTED. It has no effect on how data is grouped for crosschecking.
File null
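The fallback rule for samples missing from SAMPLE_INDIVIDUAL_MAP is easy to misread, so here is a minimal sketch of the lookup as the description states it. The function name `individual_for` and the sample names are hypothetical; the tool performs this resolution internally, and this code only mirrors the documented behaviour.

```shell
# Print the individual id for a sample: column 2 of the map when the sample
# appears in column 1, otherwise the sample id itself.
#   $1 = sample id, $2 = path to the two-column tsv
individual_for() {
    awk -F '\t' -v sample="$1" '
        # appending "" forces a string comparison, so "001" and "1" stay distinct
        ($1 "") == (sample "") { print $2; found = 1; exit }
        END { if (!found) print sample }
    ' "$2"
}

# With a map containing the single line "A<TAB>B":
#   individual_for A map.tsv   prints B (listed in the map)
#   individual_for B map.tsv   prints B (not listed, falls back to its own id)
# so samples A and B would be expected to match.
```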
--SECOND_INPUT -SI
A second set of input files (or lists of files) with which to compare fingerprints. If this option is provided the tool compares each sample in INPUT with the sample from SECOND_INPUT that has the same sample ID. In addition, data will be grouped by SAMPLE regardless of the value of CROSSCHECK_BY. When operating in this mode, each sample in INPUT must also have a corresponding sample in SECOND_INPUT. If this is violated, the tool will proceed to check the matching samples, but report the missing samples and return a non-zero error-code.
Exclusion: This argument cannot be used at the same time as MATRIX_OUTPUT.
List[String] []
--SECOND_INPUT_INDEX_MAP
A tsv with two columns and no header which maps the second input files to corresponding indices; to be used when index files are not located next to second input files. First column must match the list of second inputs.
File null
--SECOND_INPUT_SAMPLE_MAP
A tsv with two columns representing the sample as it appears in the SECOND_INPUT data (in column 1) and the sample as it should be used for comparisons to INPUT (in the second column). Note that in case of unrolling files (file-of-filenames) one would need to reference the final file, i.e. the file that contains the genomic data. Need only include the samples that change. Values in column 1 should be unique. Values in column 2 should be unique even in union with the remaining unmapped samples. Should only be used with SECOND_INPUT.
File null
--showHidden
display hidden arguments
boolean false
--TMP_DIR
One or more directories with space available to be used by this program for temporary storage of working files
List[File] []
--USE_JDK_DEFLATER -use_jdk_deflater
Use the JDK Deflater instead of the Intel Deflater for writing compressed output
Boolean false
--USE_JDK_INFLATER -use_jdk_inflater
Use the JDK Inflater instead of the Intel Inflater for reading compressed input
Boolean false
--VALIDATION_STRINGENCY
Validation stringency for all SAM files read by this program. Setting stringency to SILENT can improve performance when processing a BAM file in which variable-length data (read, qualities, tags) do not otherwise need to be decoded.
The --VALIDATION_STRINGENCY argument is an enumerated type (ValidationStringency), which can have one of the following values:
ValidationStringency STRICT
--VERBOSITY
Control verbosity of logging.
The --VERBOSITY argument is an enumerated type (LogLevel), which can have one of the following values:
LogLevel INFO
--version
display the version number for this tool
boolean false
GATK version 4.6.2.0 built at Sun, 13 Apr 2025 13:21:43 -0400.