PR2 Database Compatibility


#1

Hello,

I’d like to request that MEGAN6 be able to handle BLAST results obtained with the Protist Ribosomal Reference database (PR2) (https://figshare.com/articles/PR2_rRNA_gene_database/3803709). The sequence header always contains 8 taxonomic fields, which may make finding the taxonomy easier, but I noticed that MEGAN6 doesn’t recognize the species names because their words are joined by an underscore. Also, in at least one case MEGAN6 assigned a sequence too specifically, though this may be due to taxonomic disagreement between NCBI and PR2: it assigned the sequence to the genus Thalassema, while one hit was to Thalassema thalassemum and the other to Arhinchite pugettensis.

Thank you,
Katie


#2

Hi Katie,

I have added code that first attempts to match a name “as is” and then tries again with all underscores replaced by spaces.

This will work nearly as well as when the words are separated by spaces, with one exception: MEGAN tries to match subsets of the words in a name if it cannot match all of them. Using underscores to tie words together interferes with this, so matching becomes “all or nothing”. For example, if the name is given as X_Y_Z, MEGAN will try to match X Y Z to some name in the NCBI taxonomy, but it will not go on to try X Y, Y Z, and then X, Y or Z, as it would usually do…
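
As a rough illustration of the matching order described above, here is a small Python sketch. It is not MEGAN's actual code: the function names, the toy taxonomy set, and the assumption that "subsets of words" means contiguous runs of words tried longest first are all hypothetical.

```python
def word_runs(words):
    """Yield contiguous runs of words, longest first, excluding the full name.

    Assumption: "subsets" means contiguous runs (X Y, Y Z, then X, Y, Z).
    """
    n = len(words)
    for size in range(n - 1, 0, -1):
        for start in range(n - size + 1):
            yield " ".join(words[start:start + size])


def match_name(name, taxonomy):
    """Return the taxonomy name matched for `name`, or None.

    `taxonomy` is a hypothetical set of known names standing in for
    the NCBI taxonomy.
    """
    # Step 1: try the name exactly as given.
    if name in taxonomy:
        return name

    # Step 2: for underscore-joined names, retry with spaces.
    # This is "all or nothing": no fallback to sub-word matches.
    if "_" in name:
        spaced = name.replace("_", " ")
        return spaced if spaced in taxonomy else None

    # Usual behaviour for space-separated names: fall back to word runs.
    for candidate in word_runs(name.split()):
        if candidate in taxonomy:
            return candidate
    return None


# Toy taxonomy, for illustration only.
taxonomy = {"Thalassema thalassemum", "Thalassema"}

print(match_name("Thalassema_thalassemum", taxonomy))
print(match_name("Thalassema_sp", taxonomy))
print(match_name("Thalassema sp", taxonomy))
```

In this sketch the underscore-joined name Thalassema_sp finds no match at all, whereas the space-separated Thalassema sp falls back to the genus.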
D