Plugin for PIA output into MEGAN or non-redundant/incomplete database MEGAN compensation feature

I was curious whether it would be possible to create a plugin for MEGAN that generates a kind of pseudo-rma6 file from the summary output of PIA (https://github.com/Allaby-lab/PIA). The rationale is that I’m observing a number of potential false-positive hits in MEGAN (namely plants) that I suspect are driven in part by database incompleteness (as similarly reported recently in the PIA paper: https://doi.org/10.3389/fevo.2020.00084). I like the MEGAN LCA approach, but I would like to be able to view both programs’ taxon node classifications together in a MEGAN comparison file, as a kind of ensemble means of support for a node’s classification (the idea being that, in theory, nodes classified by both PIA and MEGAN have the best support for being ‘real’). Alternatively, is there a means of incorporating a PIA-like approach for dealing with database incompleteness and/or non-redundant databases directly in MEGAN, to help mitigate the database limitations that can drive inaccurate classifications?

Thanks!


Sorry, I should clarify: I meant to write a means of compensating for redundant databases, not ‘non-redundant’ as I wrote above.

Thank you for pointing out the PIA paper.

MEGAN allows import of results in a number of different CSV formats. If none of these are suitable, then please send me a typical output file for PIA and I will write an importer for it.

I took a look at the PIA paper and the ideas and I think that it would be good to incorporate the algorithm, or parts of it, into MEGAN. I will look into this further…


Thanks Daniel! I tried using the CSV importer to import the file into MEGAN, but I haven’t been successful thus far. I’ve uploaded an example of the PIA output here.
PIA-example-output.fasta.header_out.intersects.txt (1.4 MB)

You can use the Linux `sed` program to create a new CSV file, which can then be imported using MEGAN’s File->Import->Text File menu item.

For example, this call extracts the taxon id provided as “taxonomic range”:

sed "s/Query: \(.*\), top hit:.*range: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/ "

whereas this extracts the taxon id provided as “phylogenetic intersection”:

sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/ "

In both cases, the command extracts the read name and the taxon id, and appends a fake bitscore of 50, and writes out in comma-separated format: read,taxon-id,50.

For example, the line

Query: SNL153:253:h5jm7bcx3:1:2111:13277:53133, top hit: cellular organisms (131567), expect: 7.71e-08, identities: 100.000, next hit: Agrobacterium sp. ATCC 31749 (82789), last hit: Agrobacterium sp. ATCC 31749 (82789), taxon count: 2, phylogenetic range: cellular organisms (131567), raw hit count: 38, taxonomic diversity (up to cap if met): 30, taxonomic diversity score: 0.0029, phylogenetic intersection: cellular organisms (131567)

is transformed into

SNL153:253:h5jm7bcx3:1:2111:13277:53133,131567,50

by both the first and second sed command.
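
For anyone who wants to try this end to end, here is a minimal sketch that wraps the second sed command in a shell function and runs it on the example line above. The function name `pia_intersects_to_csv` is made up for illustration. The only change to the command itself is `-n` together with the `p` flag, so that lines the pattern does not match are dropped instead of being copied through unchanged.

```shell
# Hypothetical helper (the name is an assumption, not part of PIA or MEGAN):
# reads PIA intersects lines on stdin and writes read,taxon-id,50 lines,
# taking the taxon id from the "phylogenetic intersection" field.
pia_intersects_to_csv() {
  sed -n 's/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/p'
}

# The example line quoted in this thread.
example='Query: SNL153:253:h5jm7bcx3:1:2111:13277:53133, top hit: cellular organisms (131567), expect: 7.71e-08, identities: 100.000, next hit: Agrobacterium sp. ATCC 31749 (82789), last hit: Agrobacterium sp. ATCC 31749 (82789), taxon count: 2, phylogenetic range: cellular organisms (131567), raw hit count: 38, taxonomic diversity (up to cap if met): 30, taxonomic diversity score: 0.0029, phylogenetic intersection: cellular organisms (131567)'

printf '%s\n' "$example" | pia_intersects_to_csv
# prints: SNL153:253:h5jm7bcx3:1:2111:13277:53133,131567,50
```

On a real file this would presumably be run as `pia_intersects_to_csv < PIA-example-output.fasta.header_out.intersects.txt > pia.csv`, where `pia.csv` is an arbitrary output name, and the resulting file is what goes into File->Import->Text File.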

This shows you how to set up the import command and what the resulting tree looks like:

Perfect, that works great! Thanks so much for your help.

Just as a quick aside, I had an error with a couple of reads where the hit was to:

phylogenetic intersection: Poeae Chloroplast Group 2 (Poeae type) (1652081)

In this case, the double parenthetic statement was creating a misformatted line when using the sed command from above, and then those hits to ‘Poeae Chloroplast Group 2’ weren’t getting counted. In case this is of use to others, I modified that command slightly and it seems to work fine now. I barely understand how sed works to be honest, so I’m happy that this small modification sorts out that problem, haha:

sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9. ]* (\([0-9]*\)).*/\1,\2,50/ "
#original sed command from previous post

sed "s/Query: \(.*\), top hit:.*phylogenetic intersection: [a-zA-Z0-9(). ]* (\([0-9]*\)).*/\1,\2,50/ "
#added parentheses to fix taxon nodes with double parenthetic statements
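
A note on why this happens: the bracketed character class lists which characters may appear in the taxon name, so a name containing anything outside that list makes the whole pattern fail and the line is left unconverted. Names with a hyphen or an apostrophe would presumably still fall through. Below is a minimal sketch of a looser pattern that accepts any name and takes the last parenthesised number on the line. It assumes that "phylogenetic intersection" is always the final field and always ends in a taxon id in parentheses, as in the examples in this thread. The function name and the shortened sample line are made up for illustration.

```shell
# Hypothetical helper (the name is an assumption): like the commands above,
# but the taxon name may contain any characters.
# Assumption: "phylogenetic intersection: <name> (<id>)" ends the line.
pia_intersects_to_csv_any_name() {
  sed -n 's/^Query: \(.*\), top hit:.*phylogenetic intersection: .* (\([0-9][0-9]*\))[[:space:]]*$/\1,\2,50/p'
}

# Shortened, made-up line; only the intersection field is taken from this thread.
sample="Query: read1, top hit: made-up (0), phylogenetic intersection: Poeae Chloroplast Group 2 (Poeae type) (1652081)"

printf '%s\n' "$sample" | pia_intersects_to_csv_any_name
# prints: read1,1652081,50
```

Because `-n` with the `p` flag drops lines that do not match, comparing the number of `Query:` lines in the input with the number of lines in the CSV is a quick way to spot reads that were not converted.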

But yeah, if there was a means of incorporating elements of a PIA-like algorithm into MEGAN in the future, that would be helpful for dealing with incomplete and redundant databases.

Hello both,

I’m currently maintaining PIA. Making its output readable by MEGAN sounds very useful, so I followed @Daniel’s lead and wrote a little script that converts Summary Basics or Summary Reads into appropriate CSVs. The intersects file is less useful because those reads haven’t been through the final filter - it’s mostly for investigating when things go wrong.

The script is up on our PIA accessories GitHub page (https://github.com/Allaby-lab/PIA-accessories), which has now had two very helpful contributions from Tyler. Thanks!

Finally, great to hear you found our paper interesting, Daniel. Please feel free to get in touch. We like talking about it.

Hope that helps,
Becky
