Alignment of long reads by MALT


#1

Hello,

I want to use MEGAN on NextSeq 2x150 reads that I have assembled (R1 and R2 files trimmed and assembled together), thus obtaining longer reads. I’ve read that DIAMOND is not recommended for long reads. In fact, I’ve already run DIAMOND on my datasets and only a low proportion of reads is assigned. My question is whether MALT could be used on long reads. I’ve read that LAST is also recommended for long reads.
Looking forward to a reply. Thanks in advance.


#2

DIAMOND will do fine on those reads. If you are not getting any alignments against NR, then it is not a problem of the aligner.

Long reads start at around 500bp, say.
And for those, DIAMOND has new options for dealing with them.


#3

Hello Daniel,

I’ve been checking the contig lengths and they range from 300 to 2,000 bp, sometimes reaching up to 10,000 bp. Aren’t those considered long reads? In that case, should I use LAST or MALT?

I’d also like to ask you another thing (it may be pretty basic). My R1.fastq file contains 3,412,750 sequences (152 bp). After diamond blastx (against NCBI-nr), the number of queries aligned is 1,653,102. Isn’t this too low a percentage of the total reads? This is the output from diamond blastx:

Reported 27513386 pairwise alignments, 27513529 HSPs.
1653102 queries aligned.


#4

Use DIAMOND with the following options:

-F 15 --range-culling --top 10
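
As a rough sketch of how these options fit into a complete call, the Python snippet below only assembles and prints a diamond blastx command line; it does not run DIAMOND. The file names (contigs.fasta, nr.dmnd, contigs.daa) and the helper name build_diamond_command are hypothetical, and the choice of -f 100 (DAA output) is an assumption made here because MEGAN tools read that format.

```python
import shlex


def build_diamond_command(query, database, output, frameshift=15, top=10):
    """Return the argument list for a long-read style diamond blastx run.

    All paths are placeholders supplied by the caller; nothing is executed.
    """
    return [
        "diamond", "blastx",
        "-q", query,              # query reads or contigs (hypothetical path)
        "-d", database,           # DIAMOND-formatted reference database
        "-o", output,             # output file
        "-f", "100",              # assumption: DAA output for MEGAN
        "-F", str(frameshift),    # frame-shift aware alignment
        "--range-culling",        # report alignments along the whole query
        "--top", str(top),        # keep hits within top% of the best local score
    ]


cmd = build_diamond_command("contigs.fasta", "nr.dmnd", "contigs.daa")
print(shlex.join(cmd))
```

The printed line can be pasted into a shell once the placeholder paths are replaced with real ones.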

-F 15 triggers frame-shift aware alignments (rather than straight translated alignments). Use it if you suspect that your long sequences contain errors that cause frame shifts. This is probably less of an issue for contigs and more of an issue for long reads (Nanopore or PacBio).

--range-culling and --top 10 change the way DIAMOND decides which alignments to report. With range culling turned on, DIAMOND reports alignments along the whole length of the read, locally reporting all alignments within 10% of the best local bit score (--top 10). If it is not turned on, DIAMOND reports the best-scoring alignments globally for the read, which tend to concentrate on one region of the query (the most conserved gene).
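
To make the difference between the two reporting modes concrete, here is a toy Python illustration. It is a simplified model, not DIAMOND's actual implementation: the Hit type, the function names, the example hits, and the 50% overlap threshold are all assumptions made for this sketch. In the "local" mode a hit is dropped only if a hit scoring more than 10% higher covers at least half of its query range; in the "global" mode a hit is dropped if it scores more than 10% below the best hit anywhere on the read.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    subject: str
    start: int    # query start (inclusive)
    end: int      # query end (exclusive)
    bits: float   # bit score


def overlap_fraction(a, b):
    """Fraction of a's query range that is covered by b."""
    shared = min(a.end, b.end) - max(a.start, b.start)
    return max(shared, 0) / (a.end - a.start)


def global_top(hits, top=10.0):
    """Keep hits within top% of the best score on the whole read."""
    best = max(h.bits for h in hits)
    return [h for h in hits if h.bits >= best * (1 - top / 100)]


def range_culled(hits, top=10.0, min_cover=0.5):
    """Keep a hit unless a much better hit covers most of its query range."""
    kept = []
    for h in hits:
        dominated = any(
            o.bits * (1 - top / 100) > h.bits
            and overlap_fraction(h, o) >= min_cover
            for o in hits
            if o is not h
        )
        if not dominated:
            kept.append(h)
    return kept


# Made-up hits on one long query: a well-conserved gene A near the start
# and a weaker-scoring gene B further along the read.
hits = [
    Hit("geneA_1", 0, 900, 500.0),
    Hit("geneA_2", 50, 880, 470.0),
    Hit("geneA_3", 100, 850, 300.0),
    Hit("geneB_1", 1200, 1900, 200.0),
    Hit("geneB_2", 1250, 1850, 150.0),
]

print("global:", [h.subject for h in global_top(hits)])
print("local: ", [h.subject for h in range_culled(hits)])
```

In this toy data the global mode reports only gene A hits, while the local mode also keeps the best hit for gene B, which is the behaviour you want when a contig spans several genes.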

There is no publication on this yet, but in our hands running DIAMOND in this mode is slightly more sensitive than running LAST.


#5

Whether that alignment rate is low depends on the environment that you sampled. For a human gut sample it would be low, but for a soil sample it seems about right.
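
For reference, the rate under discussion can be worked out from the numbers posted in #3. The short Python sketch below parses the two summary lines quoted there and divides by the read count, assuming the 3,412,750-read R1 file was the query for that run. It comes to roughly 48% of reads with at least one alignment.

```python
import re

# Summary lines as quoted in post #3.
summary = """Reported 27513386 pairwise alignments, 27513529 HSPs.
1653102 queries aligned."""

# Read count quoted for R1.fastq (assumed to be the query file).
total_reads = 3_412_750

aligned = int(re.search(r"(\d+) queries aligned", summary).group(1))
alignments = int(re.search(r"Reported (\d+) pairwise alignments", summary).group(1))

fraction = aligned / total_reads
per_read = alignments / aligned

print(f"{fraction:.1%} of reads aligned")
print(f"{per_read:.1f} alignments per aligned read")
```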