
QCFasta

Quality control function for RNA-Seq data. It takes an array of single-end or paired-end FASTQ reads and filters the low-quality sequences out of them. The function carries out the following steps:

Three tools are available for quality trimming: TrimGalore, Trimmomatic, and FastX. All of them remove adapters, trim low-quality bases at both ends, and discard sequences that are too short. FastX only works with single-end data; TrimGalore and Trimmomatic can be used for both single-end and paired-end data. TrimGalore has options for RRBS libraries. Trimmomatic is faster than TrimGalore and has a sliding window that is useful for trimming sequences whose low-quality bases are not at the ends of the sequence. FastX seems to be better at removing known adapters in miRNA sets.

Most parameters work the same way with all tools, but some behave slightly differently (e.g. stringency) or are specific to one of the tools.
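
The tool-specific restrictions stated on this page can be collected into a small pre-flight check. The following is a minimal Python sketch, not part of QCFasta itself: the helper name check_settings and the TOOL_ONLY table are hypothetical, and the restrictions are simply copied from the descriptions in the Parameters table and the tool overview.

```python
from typing import Any, Dict, List

# Restrictions copied from the parameter descriptions on this page.
# The table itself is a hypothetical helper, not part of QCFasta.
TOOL_ONLY = {
    "slidingWindow": {"trimmomatic"},
    "trailing": {"trimmomatic"},
    "palindromeClip": {"trimmomatic"},
    "simpleClip": {"trimmomatic"},
    "keepBothReads": {"trimmomatic"},
    "minPercent": {"fastx"},
    "crop": {"trimmomatic", "fastx"},
    "extra": {"trimGalore", "fastx"},
}

KNOWN_TOOLS = {"trimmomatic", "trimGalore", "fastx"}


def check_settings(tool: str, paired: bool, params: Dict[str, Any]) -> List[str]:
    """Return warnings for a tool/parameter combination (illustrative only)."""
    if tool not in KNOWN_TOOLS:
        return [f"unknown tool: {tool}"]
    warnings = []
    # FastX is documented as single-end only.
    if tool == "fastx" and paired:
        warnings.append("fastx only works with single-end data")
    # The crop default (-1) is documented as valid for Trimmomatic only.
    if tool == "fastx" and params.get("crop", -1) < 0:
        warnings.append("crop must be set to the desired length for fastx")
    # Flag parameters that are documented for other tools only.
    for name in params:
        allowed = TOOL_ONLY.get(name)
        if allowed is not None and tool not in allowed:
            warnings.append(f"{name} is documented only for: {', '.join(sorted(allowed))}")
    return warnings
```

For example, check_settings("fastx", True, {"crop": 20}) reports the single-end restriction, while check_settings("trimmomatic", True, {"slidingWindow": "4:15"}) returns an empty list.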

QCFasta may have issues creating the quality report if the keys for the samples are single numbers from 1-10; it is recommended to use keys that are clearly strings, i.e. not just numbers. If the fastQCfolders input is used, the parameters readKey and mateKey need to be set. These parameters are substrings of the names that FastQC gave to the folders containing the read and mate statistics. Usually read files end in "_1.fq" while mate files end in "_2.fq"; in that case readKey="_1" and mateKey="_2". If you used SeqQC, the words "read" and "mate" were added to the folder names, and you do not need to modify readKey and mateKey because "read" and "mate" are their default values.
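
To illustrate how substring keys such as readKey and mateKey can separate FastQC folders, here is a minimal Python sketch. It is an assumption about the general idea, not QCFasta's actual code: the function name split_fastqc_folders and the folder names in the example are hypothetical.

```python
from typing import Dict, List


def split_fastqc_folders(folder_names: List[str],
                         read_key: str = "read",
                         mate_key: str = "mate") -> Dict[str, List[str]]:
    """Group folder names by whether they contain read_key or mate_key.

    Hypothetical helper: names containing both keys or neither key are
    reported as 'unmatched' instead of being guessed.
    """
    groups: Dict[str, List[str]] = {"reads": [], "mates": [], "unmatched": []}
    for name in folder_names:
        is_read = read_key in name
        is_mate = mate_key in name
        if is_read and not is_mate:
            groups["reads"].append(name)
        elif is_mate and not is_read:
            groups["mates"].append(name)
        else:
            groups["unmatched"].append(name)
    return groups


# Hypothetical folder names derived from files ending in _1.fq and _2.fq.
example = split_fastqc_folders(
    ["sampleA_1.fq_fastqc", "sampleA_2.fq_fastqc"],
    read_key="_1",
    mate_key="_2",
)
```

With the default keys, folder names that contain the words "read" and "mate" are separated without any extra configuration.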

Overview of tricky parameters: More accurate descriptions of the parameters can be found in the documentation of each tool or in the documentation of the individual components for Trimmomatic, TrimGalore, and FastX (SmallRNAprep).

If you add more files to your input and then rerun QCFasta, the whole function is executed again from the start, because changes in the input array propagate from the beginning. This could be improved later: changing trimming parameters, for example, should in principle not trigger a rerun of the first part (the first call to FastQC), only of the steps from trimming onwards.

Version: 5.0
Bundle: sequencing
Authors: Alejandra Cervera (alejandra.cervera@helsinki.fi), Erkka Valo (erkka.valo@helsinki.fi)
Source files: component.xml, function.scala

Inputs

Name Type Mandatory Description
reads Array<FASTQ> Mandatory An array of reads.
mates Array<FASTQ> Optional An array of mates for the reads if the data is paired-end. The matching of reads to mates is done by the array keys.
fastQCfolders Array<BinaryFolder> Optional If FastQC output is available you can skip running it again by providing the output folders as an array.
adapter FASTA Optional Adapter file in FASTA format. It can only be used with Trimmomatic, for either single-cell or bulk data.

Outputs

Name Type Description
qcReads Array<FASTQ> An array of reads that passed the quality control step.
qcMates Array<FASTQ> An array of mates that passed the quality control step.
report HTML Quality control report
table CSV Statistics of all processed samples.
qcUnpairedReads Array<FASTQ> An array of sequences of which only the read passed the quality control step
qcUnpairedMates Array<FASTQ> An array of sequences of which only the mate passed the quality control step
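
The relation between the four sequence outputs can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the component's implementation: sequences are plain strings, reads and mates of a sample are stored in the same order, samples are matched by array key, and the pass criterion is only a length check against minLength. The names route_sample and route_all are hypothetical; the actual decision is made by the selected trimming tool.

```python
from typing import Callable, Dict, List

Sequences = List[str]


def route_sample(reads: Sequences, mates: Sequences,
                 passes: Callable[[str], bool]) -> Dict[str, Sequences]:
    """Distribute read/mate pairs of one sample over the four outputs."""
    out: Dict[str, Sequences] = {
        "qcReads": [], "qcMates": [],
        "qcUnpairedReads": [], "qcUnpairedMates": [],
    }
    for read, mate in zip(reads, mates):
        read_ok, mate_ok = passes(read), passes(mate)
        if read_ok and mate_ok:
            # Both members survive: they stay paired.
            out["qcReads"].append(read)
            out["qcMates"].append(mate)
        elif read_ok:
            out["qcUnpairedReads"].append(read)
        elif mate_ok:
            out["qcUnpairedMates"].append(mate)
        # If neither passes, the pair is dropped entirely.
    return out


def route_all(reads_by_key: Dict[str, Sequences],
              mates_by_key: Dict[str, Sequences],
              min_length: int = 20) -> Dict[str, Dict[str, Sequences]]:
    """Match samples by array key and route each sample's pairs."""
    def passes(seq: str) -> bool:
        return len(seq) >= min_length

    return {
        key: route_sample(reads_by_key[key], mates_by_key[key], passes)
        for key in reads_by_key
        if key in mates_by_key
    }
```

A pair in which only the read is long enough ends up in qcUnpairedReads, a pair in which only the mate is long enough ends up in qcUnpairedMates, and only pairs in which both members pass appear in qcReads and qcMates.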

Parameters

Name Type Default Description
adapterSeq string "" Adapter specified directly as a string for TrimGalore or FastX; for Trimmomatic you can either provide the fasta file as input or specify here the Illumina adapter to use: TruSeq2-SE.fa, TruSeq2-PE.fa, TruSeq3-SE.fa, or TruSeq3-PE.fa.
crop int -1 Trim bases at the end of the read so that it is at most crop bases long. Only for Trimmomatic and FastX. The default value (-1) only works for Trimmomatic; for FastX it must be set to the desired length (e.g. 32).
extra string "" Extra parameters for TrimGalore or for FastX.
gzip boolean false Defines if the output sequences should be gzipped or not.
headcrop int 0 The number of bases to remove from the start of the read. No trimming is done if the value is set to 0.
isSinglecell boolean false Defines whether the read files are from Linnarsson's single-cell RNA-seq protocol (STRT).
keepBothReads boolean false Defines whether to keep the reverse read. After read-through has been detected by palindrome mode and the adapter sequence has been removed, the reverse read contains the same sequence information as the forward read, albeit in reverse complement. Only for Trimmomatic.
mateKey string "mate" Key that distinguishes mates from reads in the fastQCfolders input.
minLength int 20 Reads shorter than minLength are removed. In paired-end sequencing the corresponding mate is also removed. No length filtering is done if the value is set to 0.
minPercent int 20 Minimum percentage of bases that must have at least minQuality for a read to be kept. Only FastX.
minQuality int 20 Bases below this quality threshold will be trimmed from the 5' end of the sequence.
palindromeClip int 30 Specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment. Only Trimmomatic.
percent float 0.3 Percentage of good quality reads needed to keep the file.
qual string "" Quality encoding used by the sequencer (phred33 or phred64). If empty, FastQC is used to guess the encoding.
readKey string "read" Key that distinguishes reads from mates in the fastQCfolders input.
simpleClip int 12 Threshold that specifies how accurate the match between any adapter sequence and a read must be. Each matching base adds just over 0.6. Only Trimmomatic.
slidingWindow string "null" Sliding window trimming where the sequence is cut if the average quality of the bases within the sliding window falls below the defined threshold. A string specifies the window size and the average required quality in the sliding window. The format is windowSize:requiredQuality. For example, 4:15 (window size = 4; required quality = 15). No trimming is done if value is set to 'null'. Only Trimmomatic.
stringency int 2 Minimum overlap of the sequence with the adapter for the bases to be trimmed (TrimGalore and FastX); allowed mismatches with the adapter (Trimmomatic).
temSwitPrimer string "GGG" Minimal run of Gs generated by template switching.
threads int 2 Number of threads to use for the multi-threading components.
tool string "trimmomatic" Tool used for adapter removal and quality trimming: trimmomatic, trimGalore, or fastx.
trailing int 30 Remove bases from the end of the read if their quality value is below the given threshold. Only Trimmomatic.
umiLen int 6 The length of unique molecular identifiers (UMIs), only used when isSingleCell is true.
umiSliding string "1:17" Sliding window for UMIs. For example, when umiSliding = 1:17, any UMI bases with a quality lower than 17 will be removed. Only used when isSingleCell is true.
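
The windowSize:requiredQuality format used by slidingWindow can be illustrated with a short Python sketch. It is a simplified model of sliding-window trimming, written for this page: the function names parse_window and sliding_window_cut are hypothetical, quality values are assumed to be already decoded to integers, and Trimmomatic's own implementation may differ in details such as exactly where the cut is placed.

```python
from typing import List, Optional, Tuple


def parse_window(spec: str) -> Optional[Tuple[int, int]]:
    """Parse 'windowSize:requiredQuality'; the string 'null' disables trimming."""
    if spec == "null":
        return None
    size, quality = spec.split(":")
    return int(size), int(quality)


def sliding_window_cut(qualities: List[int], spec: str) -> int:
    """Return the number of leading bases to keep.

    The window moves from the start of the read; the read is cut at the
    start of the first window whose average quality is below the threshold.
    Reads shorter than the window are kept unchanged in this sketch.
    """
    window = parse_window(spec)
    if window is None:
        return len(qualities)
    size, required = window
    for start in range(len(qualities) - size + 1):
        chunk = qualities[start:start + size]
        if sum(chunk) / size < required:
            return start
    return len(qualities)


# Five good bases followed by four poor ones, with the 4:15 example setting.
kept = sliding_window_cut([30, 30, 30, 30, 30, 10, 10, 10, 10], "4:15")
```

In this example the first window whose average drops below 15 starts at position 5, so the first five bases are kept. Because a window averages several bases, a single poor base inside an otherwise good stretch does not necessarily trigger a cut.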

Test cases

Each test case has a properties entry and the inputs listed below. Inputs that are not listed, and all expected outputs (qcReads, qcMates, report, table, qcUnpairedReads, qcUnpairedMates), are marked as (missing).

case1
Inputs: reads
Parameters: readKey=_1, tool=fastx, crop=20

case2
Inputs: reads
Parameters: readKey=_1, mateKey=_mates, tool=trimmomatic

case3
Inputs: reads
Parameters: readKey=_1, mateKey=_mates, tool=trimGalore

case4
Inputs: reads, mates
Parameters: readKey=_reads, mateKey=_mates, tool=trimGalore

case5
Inputs: reads, mates
Parameters: readKey=_reads, mateKey=_mates, tool=trimmomatic

case6
Inputs: reads, mates, fastQCfolders
Parameters: tool=trimmomatic


Generated 2018-12-18 07:42:34 by Anduril 2.0.0