abyss-sealer








     abyss−sealer − Close gaps within scaffolds



     abyss−sealer −b <Bloom filter size> −k <kmer size> −k <kmer size>... −o <output_prefix> −S <path to scaffold file> [options]... <reads1> [reads2]...

     For example:

     abyss−sealer −b20G −k64 −k80 −k96 −k112 −k128 −o test −S scaffold.fa read1.fa read2.fa



     Sealer is an application of Konnector that closes
intra−scaffold gaps.  It performs three sequential
functions.  First, regions with Ns are identified from an
input scaffold.  Flanking nucleotues (2 x 100bp) are
extracted from those regions while respecting the strand (5'
to 3') direction on the sequence immediately downstream of
each gap.  In the second step, flanking sequence pairs are
used as input to Konnector along with a set of reads with a
high level of coverage redundancy.  Ideally, the reads
should represent the original dataset from which the draft
assembly is generated, or further whole genome shotgun (WGS)
sequencing data generated from the same sample.  Within
Konnector, the input WGS reads are used to populate a Bloom
filter, tiling the reads with a sliding window of length k,
thus generating a probabilistic representation of all the
k−mers in the reads.  Konnector also uses crude error
removal and correctional algorithms, eliminating singletons
(k−mers that are observed only once) and fixing base
mismatches in the flanking sequence pairs.  Sealer launches
Konnector processes using a user−input range of k−mer
lengths.  In the third and final operation, succesfully
merged sequences are inserted into the gaps of the original
scaffolds, and Sealer outputs a new gap−filled scaffold
file.



     See ABySS installation instructions.



     abyss−sealer [−b bloom filter size][−k values...] [−o outputprefix] [−S assembly file] [options...] [reads...]

     Sealer requires the following information to run: −
draft assembly − user−supplied k values (>0) − output prefix
− WGS reads (for building Bloom Filters)













                             ‐2‐




     Without pre−built bloom filters:

     abyss−sealer −b20G −k64 −k96 −o run1 −S test.fa read1.fq.gz read2.fq.gz

     With pre−built bloom filters:

     abyss−sealer −k64 −k96 −o run1 −S test.fa −i k64.bloom −i k96.bloom read1.fq.gz read2.fq.gz

     Reusable Bloom filters can be pre−built with
abyss−bloom build, e.g.:

     abyss−bloom build −vv −k64 −j12 −b20G −l2 k64.bloom read1.fq.gz read2.fq.gz

     Note: when using pre−built bloom filters generated by
abyss−bloom build, Sealer must be compiled with the same
maxk value that abyss−bloom was compiled with.  For example,
if a Bloom filter was built with a maxk of 64, Sealer must
be compiled with a maxk of 64 as well.  If different values
are used between the pre−built bloom filter and Sealer, any
sequences generated will be nonsensical and incorrect.



• prefix_log.txt

• prefix_scaffold.fa

• prefix_merged.fa

• prefix_flanks_1.fq −> if −−print−flanks option used

• prefix_flanks_2.fq −> if −−print−flanks option used

     The log file contains results of each Konnector run.
The structure of one run is as follows:

• ## unique gaps closed for k##

• No start/goal kmer: ###

• No path: ###

• Unique path: ###

• Multiple paths: ###

• Too many paths: ###

• Too many branches: ###

• Too many path/path mismatches: ###










                             ‐3‐


• Too many path/read mismatches: ###

• Contains cycle: ###

• Exceeded mem limit: ###

• Skipped: ###

• ### flanks left

• k## run complete

• Total gaps closed so far = ###

     The scaffold.fa file is a gap−filled version of the
draft assembly inserted into Sealer.  The merged.fa file
contains every newly generated sequence that were inserted
into gaps, including the flanking sequences.  Negative sizes
of new sequences indicate Konnector collapsed the pair of
flanking sequences.  For example:

     >[scaffold ID]_[original start position of gap on
scaffold]_[size of new sequence]
ACGCGACGAGCAGCGAGCACGAGCAGCGACGAGCGACGACGAGCAGCGACGAGCG

     If −−print−flanks option is enabled, Sealer outputs the
flanking sequences used to insert into Konnector.  This may
be useful should users which to double check if this tool is
extracting the correct sequences surrounding gaps.  The
structure of these files are as follows:

     >[scaffold ID]_[original start position of gap on
scaffold]_[size of gap]/[1 or 2 indicating whether left or
right flank]
GCTAGCTAGCTAGCTGATCGATCGTAGCTAGCTAGCTGACTAGCTGATCAGTCGA



     To optimize Sealer, users can observe the log files
generated after a run and adjust parameters accordingly.  If
k runs are showing gaps having too many paths or branches,
consider increasing −P or −B parameters, respectively.

     Also consider increasing the number of k values used.
Generally, large k−mers are better able to address highly
repetitive genomic regions, while smaller k−mers are better
able to resolve areas of low coverage.



     More k values mean more bloom filters will be required,
which will increase runtime as it takes time to build/load
each bloom filter at the beginning of each k run.  Memory
usage is not affected by using more bloom filters.









                             ‐4‐


     The larger value used for parameters such as −P, −B or
−F will increase runtime.



     Parameters of abyss−sealer

• −−print−flanks: outputs flank files

• −S,−−input−scaffold=FILE: load scaffold from FILE

• −L,−−flank−length=N: length of flanks to be used as
  pseudoreads [100]

• −j,−−threads=N: use N parallel threads [1]

• −k,−−kmer=N: the size of a k−mer

• −b,−−bloom−size=N: size of bloom filter.  Required when
  not using pre−built Bloom filter(s).

• −B,−−max−branches=N: max branches in de Bruijn graph
  traversal; use 'nolimit' for no limit [1000]

• −d,−−dot−file=FILE: write graph traversals to a DOT file

• −e,−−fix−errors: find and fix single−base errors when
  reads have no kmers in bloom filter [disabled]

• −f,−−min−frag=N: min fragment size in base pairs [0]

• −F,−−max−frag=N: max fragment size in base pairs [1000]

• −i,−−input−bloom=FILE: load bloom filter from FILE

• −−mask: mask new and changed bases as lower case

• −−no−mask: do not mask bases [default]

• −−chastity: discard unchaste reads [default]

• −−no−chastity: do not discard unchaste reads

• −−trim−masked: trim masked bases from the ends of reads

• −−no−trim−masked: do not trim masked bases from the ends
  of reads [default]

• −l,−−long−search: start path search as close as possible
  to the beginnings of reads.  Takes more time but improves
  results when bloom filter false positive rate is high
  [disabled]











                             ‐5‐


• −m,−−flank−mismatches=N‘: max mismatches between paths and
  flanks; use 'nolimit' for no limit [nolimit]

• −M,−−max−mismatches=N‘: max mismatches between all
  alternate paths; use 'nolimit' for no limit [nolimit]

• −n−−no−limits‘: disable all limits; equivalent to '−B
  nolimit −m nolimit −M nolimit −P nolimit'

• −o,−−output−prefix=FILE‘: prefix of output FASTA files
  [required]

• −P,−−max−paths=N‘: merge at most N alternate paths; use
  'nolimit' for no limit [2]

• −q,−−trim−quality=N‘: trim bases from the ends of reads
  whose quality is less than the threshold

• −−standard−quality: zero quality is ‘!' (33) default for
  FASTQ and SAM files

• −−illumina−quality: zero quality is ‘@' (64) default for
  qseq and export files

• −r,−−read−name=STR‘: only process reads with names that
  contain STR

• −s,−−search−mem=N‘: mem limit for graph searches; multiply
  by the number of threads (−j) to get the total mem used
  for graph traversal [500M]

• −t,−−trace−file=FILE‘: write graph search stats to FILE

• −v,−−verbose‘: display verbose output

• −−help: display this help and exit

• −−version: output version information and exit

     k is the size of k−mer for the de Bruijn graph.  You
may specify multiple values of k, which will increase the
nubmer of gaps closed at the cost of increased run time.
Multiple values of k ought to be specified in increasing
order, as lower values of k have fewer coverage gaps and are
less likely to misassemble.

     P is the threshold for number of paths allowed to be
traversed.  When set to 10, Konnector will attempt to close
gaps even when there are 10 different paths found.  It would
attempt to create a consensus sequence between these paths.
The default setting is 2.












                             ‐6‐


Daniel Paulino.