Function

VariantRecalibrator

This function applies machine learning (variant quality score recalibration, VQSR) to recalibrate and filter the input variants. The input variant files are assumed to be "raw" in the sense that they come straight from the variant caller. The Genome Analysis Toolkit (GATK) is used along with several background ("true site") files that can be downloaded from the GATK resource bundle. For more information about the specific annotations available, please see the GATK documentation. The default annotations used here are the recommended ones for most data sets.

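The two-step procedure of VariantRecalibrator: a recalibration model is first built from the training sites, and the model is then applied to the input variants to filter them.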

Complete documentation:

Also check out the additional discussion on VQSR and the FAQ describing the recommended arguments and training sets.
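The two steps can be sketched as a pair of GATK 3 command lines. This is a minimal illustration, not the component's actual implementation: the function names, the intermediate file names (`SNP.recal`, `SNP.tranches`) and the mapping of the `truth` parameter to `--ts_filter_level` are assumptions made for the example.

```python
from typing import List, Sequence


def build_model_cmd(jar: str, reference: str, variants: str,
                    resource_args: Sequence[str], annotations: Sequence[str],
                    mode: str, memory: str = "4g") -> List[str]:
    """Step 1: fit the recalibration model with the VariantRecalibrator walker."""
    cmd = ["java", f"-Xmx{memory}", "-jar", jar,
           "-T", "VariantRecalibrator",
           "-R", reference, "-input", variants, "-mode", mode,
           # Hypothetical intermediate file names, derived from the mode.
           "-recalFile", f"{mode}.recal",
           "-tranchesFile", f"{mode}.tranches"]
    cmd += list(resource_args)          # pre-formatted -resource arguments
    for name in annotations:            # one -an flag per annotation
        cmd += ["-an", name]
    return cmd


def apply_model_cmd(jar: str, reference: str, variants: str, mode: str,
                    truth: float = 99.0, memory: str = "4g",
                    output: str = "calls.vcf") -> List[str]:
    """Step 2: filter the variants with the ApplyRecalibration walker."""
    return ["java", f"-Xmx{memory}", "-jar", jar,
            "-T", "ApplyRecalibration",
            "-R", reference, "-input", variants, "-mode", mode,
            # Assumption: 'truth' is passed on as the truth sensitivity level.
            "--ts_filter_level", str(truth),
            "-recalFile", f"{mode}.recal",
            "-tranchesFile", f"{mode}.tranches",
            "-o", output]


if __name__ == "__main__":
    jar = "GenomeAnalysisTK.jar"
    step1 = build_model_cmd(jar, "ref.fasta", "raw.vcf", [], ["QD", "FS"], "SNP")
    step2 = apply_model_cmd(jar, "ref.fasta", "raw.vcf", "SNP")
    print(" ".join(step1))
    print(" ".join(step2))
```

The second command reads the recalibration and tranches files written by the first, so the two must be run in order for each mode (SNP and indel).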

Version 1.0
Bundle sequencing
Categories VariationAnalysis
Authors Rony Lindell (rony.lindell@helsinki.fi)
Issue tracker View/Report issues
Source files component.xml function.scala
Usage Example with default values

Inputs

Name Type Mandatory Description
reference FASTA Mandatory The reference FASTA file.
variants VCF Optional Input (merged) VCF file. See the 'files' parameter for adding multiple files. The file can be a single-sample or a merged multi-sample VCF file.
hapmap VCF Optional File with very high confidence HapMap training data.
omni VCF Optional File with true polymorphic SNP sites from the Omni genotyping array.
hcsnp VCF Optional File with high confidence SNPs from the 1000 Genomes project.
dbsnp VCF Optional File with lower confidence SNPs from the latest dbSNP distribution.
mills VCF Optional File with high confidence indel training data from the Mills dataset.
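Because every training file is optional, only the connected inputs can contribute a training resource. The sketch below shows one way to turn the connected inputs into GATK 3 `-resource` arguments. The known/training/truth/prior values are illustrative and follow commonly cited GATK recommendations; the tables, the function name and the assignment of inputs to the SNP or indel model are assumptions, not taken from the component's source.

```python
from typing import Dict, List, Optional, Tuple

# (known, training, truth, prior) per input port.
# Illustrative values only; check them against the GATK documentation.
Flags = Tuple[bool, bool, bool, float]

SNP_RESOURCES: Dict[str, Flags] = {
    "hapmap": (False, True, True, 15.0),
    "omni": (False, True, True, 12.0),
    "hcsnp": (False, True, False, 10.0),
    "dbsnp": (True, False, False, 2.0),
}
INDEL_RESOURCES: Dict[str, Flags] = {
    "mills": (False, True, True, 12.0),
    "dbsnp": (True, False, False, 2.0),
}


def resource_args(inputs: Dict[str, Optional[str]],
                  table: Dict[str, Flags]) -> List[str]:
    """Build -resource arguments for the inputs that are actually connected."""
    args: List[str] = []
    for name, (known, training, truth, prior) in table.items():
        path = inputs.get(name)
        if not path:
            continue  # optional input not provided: skip this resource
        spec = (f"-resource:{name},known={str(known).lower()},"
                f"training={str(training).lower()},"
                f"truth={str(truth).lower()},prior={prior}")
        args += [spec, path]
    return args


if __name__ == "__main__":
    connected = {"hapmap": "hapmap.vcf", "omni": None, "dbsnp": "dbsnp.vcf"}
    print(resource_args(connected, SNP_RESOURCES))
```

A model needs at least one training and one truth resource, so a run with too few connected inputs can be expected to fail at the GATK level.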

Outputs

Name Type Description
calls VCF Final recalibrated VCF file.

Parameters

Name Type Default Description
capture boolean true Adjusts various parameters for exome sequencing (or another similar "capture" technology). If 'false', the data are assumed to be whole-genome or similar.
files string "" A list of paths to multiple VCF files (single- or multi-sample), each preceded by an "-input" tag, e.g. files="-input FILE1.vcf -input FILE2.vcf ... -input FILEN.vcf".
gatk string "" Path to the GATK directory containing the 'GenomeAnalysisTK.jar' file. If an empty string is given (default), the GATK_HOME environment variable is assumed to point to the GATK directory where GenomeAnalysisTK.jar is located.
indelAnno string "QD,FS,HaplotypeScore,ReadPosRankSum,InbreedingCoeff,MQRankSum" Names of the annotations used in the indel model, given as a comma-separated list. Note that MQ (RMS mapping quality) and MQRankSum should usually be left out of the indel model.
memory string "4g" The amount of Java heap memory allocated to the GATK process, given in the format "4g" for 4 gigabytes or "2560m" for 2560 megabytes (2.5 GB), etc.
snpAnno string "QD,HaplotypeScore,MQRankSum,ReadPosRankSum,FS,MQ,InbreedingCoeff" Names of the annotations used in the SNP model, given as a comma-separated list. Note that DP (depth of coverage) should not be used for capture data (e.g. exome). The DP annotation is, however, added to the list automatically when 'capture' is false.
threads int 1 The number of parallel threads allocated to each run.
truth float 99.0 Level of true variant probability at which to start filtering. A lower value should add to the sensitivity but decrease the specificity.
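How the parameters above might be interpreted can be sketched with a few small helpers. This is a hedged illustration of the documented behaviour (the "-input" list in 'files', the GATK_HOME fallback, the automatic DP annotation and the memory format); all function names are hypothetical and the component's real logic may differ.

```python
import os
import re
from typing import List, Mapping, Optional


def input_args(variants: Optional[str], files: str) -> List[str]:
    """Combine the 'variants' input and the 'files' parameter into -input flags."""
    args = ["-input", variants] if variants else []
    # 'files' already carries its own "-input" tags; paths with spaces
    # are not handled in this simple sketch.
    args += files.split()
    if not args:
        raise ValueError("no variant file given via 'variants' or 'files'")
    return args


def gatk_jar(gatk: str, env: Mapping[str, str] = os.environ) -> str:
    """Resolve the jar path, falling back to GATK_HOME when 'gatk' is empty."""
    home = gatk or env.get("GATK_HOME", "")
    if not home:
        raise ValueError("neither 'gatk' nor GATK_HOME is set")
    return os.path.join(home, "GenomeAnalysisTK.jar")


def snp_annotations(snp_anno: str, capture: bool) -> List[str]:
    """Split the comma-separated list and add DP for non-capture data."""
    names = [n.strip() for n in snp_anno.split(",") if n.strip()]
    if not capture and "DP" not in names:
        names.append("DP")
    return names


def check_memory(memory: str) -> str:
    """Accept values such as "4g" or "2560m"; reject anything else."""
    if not re.fullmatch(r"[1-9]\d*[gGmM]", memory):
        raise ValueError(f"unexpected memory format: {memory!r}")
    return memory


if __name__ == "__main__":
    print(input_args("merged.vcf", "-input extra1.vcf -input extra2.vcf"))
    print(snp_annotations("QD,MQ,FS", capture=False))
    print(check_memory("2560m"))
```

Validating the values before GATK is launched gives an early, readable error instead of a failure inside the Java process.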

Test cases

Test case Parameters IN reference IN variants IN hapmap IN omni IN hcsnp IN dbsnp IN mills OUT calls
case1 properties reference variants hapmap omni (missing) dbsnp mills (expecting failure)

# Run using less memory and simple annotations,
memory=1g,
snpAnno=MQ,
indelAnno=ReadPosRankSum


Generated 2018-12-12 07:42:06 by Anduril 2.0.0