abyss-sealer

abyss-sealer(1)                      ABySS                     abyss-sealer(1)



Name
       abyss-sealer - Close gaps within scaffolds

Synopsis
       abyss-sealer -b <Bloom filter size> -k <kmer size> -k <kmer size>... -o <output_prefix> -S <path to scaffold file> [options]... <reads1> [reads2]...

       For example:

       abyss-sealer -b20G -k64 -k80 -k96 -k112 -k128 -o test -S scaffold.fa read1.fa read2.fa

Description
       Sealer is an application of Konnector that closes intra-scaffold gaps.
       It performs three sequential functions.  First, regions with Ns are
       identified from an input scaffold.  Flanking nucleotues (2 x 100bp) are
       extracted from those regions while respecting the strand (5' to 3')
       direction on the sequence immediately downstream of each gap.  In the
       second step, flanking sequence pairs are used as input to Konnector
       along with a set of reads with a high level of coverage redundancy.
       Ideally, the reads should represent the original dataset from which the
       draft assembly is generated, or further whole genome shotgun (WGS)
       sequencing data generated from the same sample.  Within Konnector, the
       input WGS reads are used to populate a Bloom filter, tiling the reads
       with a sliding window of length k, thus generating a probabilistic
       representation of all the k-mers in the reads.  Konnector also uses
       crude error removal and correctional algorithms, eliminating singletons
       (k-mers that are observed only once) and fixing base mismatches in the
       flanking sequence pairs.  Sealer launches Konnector processes using a
       user-input range of k-mer lengths.  In the third and final operation,
       succesfully merged sequences are inserted into the gaps of the original
       scaffolds, and Sealer outputs a new gap-filled scaffold file.

Installation
       See ABySS installation instructions.

How to run as stand-alone application
       abyss-sealer [-b bloom filter size][-k values...] [-o outputprefix] [-S assembly file] [options...] [reads...]

       Sealer requires the following information to run: - draft assembly -
       user-supplied k values (>0) - output prefix - WGS reads (for building
       Bloom Filters)

Sample commands
       Without pre-built bloom filters:

       abyss-sealer -b20G -k64 -k96 -o run1 -S test.fa read1.fq.gz read2.fq.gz

       With pre-built bloom filters:

       abyss-sealer -k64 -k96 -o run1 -S test.fa -i k64.bloom -i k96.bloom read1.fq.gz read2.fq.gz

       Reusable Bloom filters can be pre-built with abyss-bloom build, e.g.:

       abyss-bloom build -vv -k64 -j12 -b20G -l2 k64.bloom read1.fq.gz read2.fq.gz

       Note: when using pre-built bloom filters generated by
       abyss-bloom build, Sealer must be compiled with the same maxk value
       that abyss-bloom was compiled with.  For example, if a Bloom filter was
       built with a maxk of 64, Sealer must be compiled with a maxk of 64 as
       well.  If different values are used between the pre-built bloom filter
       and Sealer, any sequences generated will be nonsensical and incorrect.

Output files
       · prefix_log.txt

       · prefix_scaffold.fa

       · prefix_merged.fa

       · prefix_flanks_1.fq -> if --print-flanks option used

       · prefix_flanks_2.fq -> if --print-flanks option used

       The log file contains results of each Konnector run.  The structure of
       one run is as follows:

       · ## unique gaps closed for k##

       · No start/goal kmer: ###

       · No path: ###

       · Unique path: ###

       · Multiple paths: ###

       · Too many paths: ###

       · Too many branches: ###

       · Too many path/path mismatches: ###

       · Too many path/read mismatches: ###

       · Contains cycle: ###

       · Exceeded mem limit: ###

       · Skipped: ###

       · ### flanks left

       · k## run complete

       · Total gaps closed so far = ###

       The scaffold.fa file is a gap-filled version of the draft assembly
       inserted into Sealer.  The merged.fa file contains every newly
       generated sequence that were inserted into gaps, including the flanking
       sequences.  Negative sizes of new sequences indicate Konnector
       collapsed the pair of flanking sequences.  For example:

       >[scaffold ID]_[original start position of gap on scaffold]_[size of
       new sequence] ACGCGACGAGCAGCGAGCACGAGCAGCGACGAGCGACGACGAGCAGCGACGAGCG

       If --print-flanks option is enabled, Sealer outputs the flanking
       sequences used to insert into Konnector.  This may be useful should
       users which to double check if this tool is extracting the correct
       sequences surrounding gaps.  The structure of these files are as
       follows:

       >[scaffold ID]_[original start position of gap on scaffold]_[size of
       gap]/[1 or 2 indicating whether left or right flank]
       GCTAGCTAGCTAGCTGATCGATCGTAGCTAGCTAGCTGACTAGCTGATCAGTCGA

How to optimize for gap closure
       To optimize Sealer, users can observe the log files generated after a
       run and adjust parameters accordingly.  If k runs are showing gaps
       having too many paths or branches, consider increasing -P or -B
       parameters, respectively.

       Also consider increasing the number of k values used.  Generally, large
       k-mers are better able to address highly repetitive genomic regions,
       while smaller k-mers are better able to resolve areas of low coverage.

Runtime and memory usage
       More k values mean more bloom filters will be required, which will
       increase runtime as it takes time to build/load each bloom filter at
       the beginning of each k run.  Memory usage is not affected by using
       more bloom filters.

       The larger value used for parameters such as -P, -B or -F will increase
       runtime.

Options
       Parameters of abyss-sealer

       · --print-flanks: outputs flank files

       · -S,--input-scaffold=FILE: load scaffold from FILE

       · -L,--flank-length=N: length of flanks to be used as pseudoreads [100]

       · -j,--threads=N: use N parallel threads [1]

       · -k,--kmer=N: the size of a k-mer

       · -b,--bloom-size=N: size of bloom filter.  Required when not using
         pre-built Bloom filter(s).

       · -B,--max-branches=N: max branches in de Bruijn graph traversal; use
         'nolimit' for no limit [1000]

       · -d,--dot-file=FILE: write graph traversals to a DOT file

       · -e,--fix-errors: find and fix single-base errors when reads have no
         kmers in bloom filter [disabled]

       · -f,--min-frag=N: min fragment size in base pairs [0]

       · -F,--max-frag=N: max fragment size in base pairs [1000]

       · -i,--input-bloom=FILE: load bloom filter from FILE

       · --mask: mask new and changed bases as lower case

       · --no-mask: do not mask bases [default]

       · --chastity: discard unchaste reads [default]

       · --no-chastity: do not discard unchaste reads

       · --trim-masked: trim masked bases from the ends of reads

       · --no-trim-masked: do not trim masked bases from the ends of reads
         [default]

       · -l,--long-search: start path search as close as possible to the
         beginnings of reads.  Takes more time but improves results when bloom
         filter false positive rate is high [disabled]

       · -m,--flank-mismatches=N`: max mismatches between paths and flanks;
         use 'nolimit' for no limit [nolimit]

       · -M,--max-mismatches=N`: max mismatches between all alternate paths;
         use 'nolimit' for no limit [nolimit]

       · -n--no-limits`: disable all limits; equivalent to '-B nolimit -m
         nolimit -M nolimit -P nolimit'

       · -o,--output-prefix=FILE`: prefix of output FASTA files [required]

       · -P,--max-paths=N`: merge at most N alternate paths; use 'nolimit' for
         no limit [2]

       · -q,--trim-quality=N`: trim bases from the ends of reads whose quality
         is less than the threshold

       · --standard-quality: zero quality is `!' (33) default for FASTQ and
         SAM files

       · --illumina-quality: zero quality is `@' (64) default for qseq and
         export files

       · -r,--read-name=STR`: only process reads with names that contain STR

       · -s,--search-mem=N`: mem limit for graph searches; multiply by the
         number of threads (-j) to get the total mem used for graph traversal
         [500M]

       · -t,--trace-file=FILE`: write graph search stats to FILE

       · -v,--verbose`: display verbose output

       · --help: display this help and exit

       · --version: output version information and exit

       k is the size of k-mer for the de Bruijn graph.  You may specify
       multiple values of k, which will increase the nubmer of gaps closed at
       the cost of increased run time.  Multiple values of k ought to be
       specified in increasing order, as lower values of k have fewer coverage
       gaps and are less likely to misassemble.

       P is the threshold for number of paths allowed to be traversed.  When
       set to 10, Konnector will attempt to close gaps even when there are 10
       different paths found.  It would attempt to create a consensus sequence
       between these paths.  The default setting is 2.

AUTHORS
       Daniel Paulino.



ABySS                             2014-11-13                   abyss-sealer(1)