
FastQScreen

FastQScreen allows you to screen a library of sequences in FastQ format against a set of sequence databases (for example vectors, viruses or ribosomal RNA), so you can see whether the composition of the library matches what you expect.

FastQScreen is intended to be used as part of a QC pipeline. It takes a sequence dataset and searches it against a set of Bowtie databases. It then generates both a text and a graphical summary of the results, so you can check whether the dataset contains the kind of sequences you expect.

Ideally the output will show a high percentage of reads that did not align to the sequences provided in dbList (since most of your reads should align to the genome of interest which should not be included in dbList).

To parallelize the component execution, use the "custom_cpu" metadata annotation.

Version 1.0
Bundle sequencing
Categories Preprocessing
Authors Gabriele Partel (gabrielepartel@gmail.com)
Requires Bowtie ; Bowtie2 ; GD::Graph ; installer (bash)
Source files component.xml main.sh

Inputs

Name Type Mandatory Description
dbList CSV Mandatory Tab-separated file that configures the databases to search against in your screen. For each database, provide a database name (which cannot contain spaces) and the location of the Bowtie indices that you created for that database.
reads FASTQ Mandatory Reads in FASTQ format.
mates FASTQ Optional Mates in FASTQ format.
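
As a rough illustration of the dbList input described above, the following Python sketch writes a tab-separated file with one row per database. The column headers `name` and `index`, the database names, and the index paths are all assumptions made for this example; check component.xml or the bundled test cases for the layout the component actually expects.

```python
import csv
import io


def write_db_list(databases, handle):
    """Write (name, index_path) pairs as a tab-separated table.

    The header names "name" and "index" are assumptions for this sketch,
    not confirmed column names of the component.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "index"])
    for name, index_path in databases:
        # Database names must not contain spaces (see the dbList description).
        if not name or any(ch.isspace() for ch in name):
            raise ValueError(f"database name must not contain spaces: {name!r}")
        writer.writerow([name, index_path])


# Hypothetical databases and index locations, for illustration only.
buffer = io.StringIO()
write_db_list(
    [("PhiX", "/data/indices/phix"), ("rRNA", "/data/indices/rrna")],
    buffer,
)
db_list_text = buffer.getvalue()
print(db_list_text, end="")
```

Writing to a file instead of an in-memory buffer only requires passing an open file handle, for example `open("dbList.csv", "w", newline="")`.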

Outputs

Name Type Description
folder BinaryFolder Output folder.
NoHitPercentage CSV Percentage of reads that did not align to any of the genomes provided in dbList.

Parameters

Name Type Default Description
aligner string "bowtie2" Specify the aligner to use for the mapping. Valid arguments are 'bowtie' or 'bowtie2'.
bisulfite boolean false Set to true when processing bisulfite libraries. Either conventional or bisulfite libraries may be specified, but not both simultaneously.
illumina1_3 boolean false If true, assumes that the quality values are encoded in Illumina v1.3 format. If false, Sanger format is assumed.
nohits boolean false If true, writes the sequences that did not map to any of the specified genomes to a file. If the subset option is also specified, only reads from the temporary dataset that failed to align to the reference genomes are written to the output file.
subset int 100000 Do not use the whole sequence file; instead, create a temporary dataset of the specified number of reads. The dataset created will be approximately (within a factor of 2) this size. If the real dataset is smaller than twice the specified size, the whole dataset is used. Subsets are taken evenly from throughout the original dataset. To process the whole dataset, set this to 0.
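
The sizing rules of the subset parameter can be illustrated with a short Python sketch. It assumes that reads are sampled at a fixed integer interval, which is one simple way to satisfy the description ("evenly from throughout", "within a factor of 2"); the function name `plan_subset` is hypothetical and this is not necessarily how FastQ Screen implements the sampling.

```python
def plan_subset(total_reads, subset):
    """Return (interval, reads_used) for a requested subset size.

    Assumed scheme: keep every `interval`-th read. A subset of 0, or a
    dataset smaller than twice the subset size, means all reads are used.
    """
    if subset <= 0 or total_reads < 2 * subset:
        return 1, total_reads
    interval = total_reads // subset
    # Number of reads at positions 0, interval, 2*interval, ... (ceiling division).
    reads_used = -(-total_reads // interval)
    return interval, reads_used


for total in (150_000, 250_000, 1_000_000):
    interval, used = plan_subset(total, 100_000)
    print(f"total={total} interval={interval} used={used}")
```

Under this scheme a dataset of 250,000 reads with the default subset of 100,000 would be sampled at every second read, giving 125,000 reads, which is within the stated factor of 2.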

Test cases

Test case  Parameters  IN dbList  IN reads  IN mates   OUT folder  OUT NoHitPercentage
case1      properties  dbList     reads     mates      (missing)   (missing)
case2      (missing)   dbList     reads     (missing)  (missing)   (missing)
case3      properties  dbList     reads     (missing)  (missing)   (missing)

nohits=true


Generated 2018-12-11 07:42:06 by Anduril 2.0.0