
SGA

Perform de novo assembly for sequencing reads using String Graph Assembler (SGA). SGA is designed for experiments using a read length of 100 base pairs or more. Minimum recommended coverage is 20-30x.

The following SGA executables must be on the PATH: sga, sga-align, sga-bam2de.pl and sga-astat.py. In addition, abyss-fixmate, bwa and samtools must be on the PATH. Please follow the SGA documentation for installation.
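Before running the component, it can help to check that the executables listed above are actually reachable. The following is a minimal sketch, not part of the component itself: the helper name missing_tools is hypothetical, and only the tool names come from the text above.

```python
import shutil

# Executable names as listed in the component description.
REQUIRED_TOOLS = [
    "sga", "sga-align", "sga-bam2de.pl", "sga-astat.py",
    "abyss-fixmate", "bwa", "samtools",
]


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the tools that cannot be found on the search path.

    `path` defaults to the PATH environment variable (hypothetical helper).
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]
```

Calling missing_tools() with no arguments returns an empty list when every required executable is found, and otherwise the names that still need to be installed or added to the PATH.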

SGA parameters may need to be tuned for both performance and assembly quality reasons. The most important parameters are correctionK, minOverlapAssemble and minOverlapMerge. For machines with limited memory, indexBatchSize may need to be set. SGA is parallelized at two levels: assembly of a single genome is multithreaded, and multiple genomes can be assembled using a cluster.
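The indexBatchSize parameter (see the Parameters table) selects between three index construction modes depending on its sign. The sketch below only restates that documented rule in code; the function name index_mode and the returned dictionary layout are hypothetical and do not reflect how the component is implemented.

```python
def index_mode(index_batch_size):
    """Describe the BWT construction mode implied by indexBatchSize.

    Hypothetical helper that mirrors the documented rule:
      0        -> in-memory "ropebwt" algorithm
      negative -> in-memory "sais" algorithm (long reads)
      positive -> disk-based construction, that many reads per batch
    """
    if index_batch_size == 0:
        return {"algorithm": "ropebwt", "storage": "memory", "batch_size": None}
    if index_batch_size < 0:
        return {"algorithm": "sais", "storage": "memory", "batch_size": None}
    return {"algorithm": None, "storage": "disk", "batch_size": index_batch_size}
```

For example, index_mode(5_000_000) describes disk-based construction in batches of five million reads, which falls within the range suggested for very large data sets.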

Below are benchmarks for SGA 0.10.12 that provide hints on CPU and memory requirements:

Version 1.0
Bundle sequencing
Categories Assembly
Authors Kristian Ovaska (kristian.ovaska@helsinki.fi)
Requires SGA ; BWA ; SAMtools ; ABySS ; pysam ; Ruffus ; Scala
Source files component.xml function.scala

Inputs

Name Type Mandatory Description
reads Array<SequenceSet> Mandatory Sequencing reads. SGA accepts input as FASTA, FASTQ and their gzipped variants. When paired-end sequencing is used, either these files contain the paired reads after one another, or paired reads are in the second input.
mates Array<SequenceSet> Optional Paired sequencing reads. When paired-end sequencing is used and paired reads are not interlaced in the primary input, these files contain the paired reads in the same order as the primary reads. The length of the mates array must be equal to that of the reads array, and read-mate files must be in the same order.
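The relationship between the two inputs can be sketched as a small check. This is an illustration under stated assumptions, not the component's own code: pair_inputs is a hypothetical helper, and file names are treated as plain strings.

```python
def pair_inputs(reads, mates=None):
    """Pair each read file with its mate file, or with None.

    Hypothetical helper. When `mates` is omitted, paired reads (if any)
    are assumed to be interlaced in the primary input.
    """
    if not reads:
        raise ValueError("at least one read file is required")
    if not mates:
        return [(read_file, None) for read_file in reads]
    if len(mates) != len(reads):
        raise ValueError(
            "mates has %d files but reads has %d; the arrays must have "
            "equal length" % (len(mates), len(reads))
        )
    # Files are matched by position, so both arrays must use the same order.
    return list(zip(reads, mates))
```

With reads ["a_1.fq", "b_1.fq"] and mates ["a_2.fq", "b_2.fq"], the helper pairs a_1.fq with a_2.fq and b_1.fq with b_2.fq; a mates array of a different length raises an error.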

Outputs

Name Type Description
contigs FASTA Assembled contigs, i.e., contiguous sequences constructed from the reads.
scaffolds FASTA Assembled scaffolds, i.e., concatenated contigs. In non-paired-end mode (pairedEnd=false), this file is empty.

Parameters

Name Type Default Description
correctionK int 41 Error correction k-mer size. Corresponds to the -k argument of "sga correct".
indexBatchSize int 0 When constructing BWT indexes, this many reads are indexed in one batch, using disk-based BWT construction. If the value is 0, an efficient in-memory "ropebwt" algorithm is used; this is suitable for reads up to 200 bp and can index 1.5 billion reads using 64 GB of memory. If the value is negative, an in-memory "sais" algorithm is used (suitable for long reads). For very large data sets, use a value in the range of 2-10 million.
memoryGB int 16 An estimate of the maximum amount of memory in gigabytes that an individual SGA process will need. This is only used in a cluster environment.
minBranchLength int 150 In the assembly step, branches shorter than this number of base pairs are removed. Corresponds to the -l argument of "sga assemble". The default is tuned for 100 bp reads; experiments using longer reads should increase this value.
minContigLength int 200 Minimum contig length in scaffolding. Corresponds to the -m argument in sga-bam2de.pl, sga-astat.py, "sga scaffold" and "sga scaffold2fasta".
minOverlap int 45 Minimum length of read overlap in the overlap computation step. Corresponds to the -m argument of "sga overlap". The value must be smaller than the read length.
minOverlapAssemble int 75 Minimum length of read overlap in the assembly step. Corresponds to the -m argument of "sga assemble". The value must be smaller than the read length. Values that are too small increase memory and CPU usage. SGA recommends a value of 75 for 100 bp reads.
minOverlapMerge int 65 Minimum length of read overlap in the FM index merge step. Corresponds to the -m argument of "sga fm-merge". The value must be smaller than both the read length and minOverlapAssemble. Values that are too small increase memory and CPU usage. SGA recommends a value of 65 for 100 bp reads.
pairedEnd boolean true If false, paired-end sequencing is not used. If true, paired reads are either contained in the primary input, or split between the primary and secondary inputs.
quality string "phred33" For FASTQ input files, type of base quality values present. Legal values are phred33 and phred64.
threads int 8 Maximum number of threads used by one process.
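The constraints stated in the table above (overlap values smaller than the read length, minOverlapMerge smaller than minOverlapAssemble, and the two legal quality encodings) can be checked before a run. The sketch below is an assumption-laden illustration: check_parameters is a hypothetical helper, the read length must be supplied by the caller, and the defaults are copied from the table.

```python
LEGAL_QUALITY = ("phred33", "phred64")


def check_parameters(read_length, min_overlap=45, min_overlap_assemble=75,
                     min_overlap_merge=65, quality="phred33"):
    """Return a list of violated constraints (empty if all hold).

    Hypothetical helper; defaults mirror the parameter table.
    """
    problems = []
    overlaps = [
        ("minOverlap", min_overlap),
        ("minOverlapAssemble", min_overlap_assemble),
        ("minOverlapMerge", min_overlap_merge),
    ]
    # Every overlap threshold must be smaller than the read length.
    for name, value in overlaps:
        if value >= read_length:
            problems.append("%s (%d) must be smaller than the read length (%d)"
                            % (name, value, read_length))
    # The merge threshold must also be smaller than the assembly threshold.
    if min_overlap_merge >= min_overlap_assemble:
        problems.append("minOverlapMerge (%d) must be smaller than "
                        "minOverlapAssemble (%d)"
                        % (min_overlap_merge, min_overlap_assemble))
    if quality not in LEGAL_QUALITY:
        problems.append("quality must be one of %s" % ", ".join(LEGAL_QUALITY))
    return problems
```

With the default values, check_parameters(100) reports no problems, while check_parameters(70) flags minOverlapAssemble, because the default of 75 is not smaller than a 70 bp read length.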

Test cases

Test case Parameters IN reads IN mates OUT contigs OUT scaffolds
case1 (missing) reads (missing) contigs scaffolds
case2_split (missing) reads mates contigs scaffolds
case3_fq properties (quality=phred64) reads mates contigs scaffolds
case4_unpaired properties (pairedEnd=false) reads (missing) contigs scaffolds


Generated 2018-12-11 07:42:07 by Anduril 2.0.0