Incorrect assignments to InterPro families


#1

I’m using MEGAN CE version 6.12.1 on Linux 16.04.4. I analyzed reads from a microbiome sample with DIAMOND in blastx mode against a very recent download of the NCBI nr database. The daa file was meganized using acc2interpro-June2018X.abin and prot_acc2tax-June2018X1.abin, along with the other, older abin files. I was interested in one particular functional class, IPR010662 (putative hydrolase RBBP9/YdeN). I extracted the reads with this classification, ran them through DIAMOND again and meganized the resulting daa file.

Of the 1412 reads extracted, 1393 were again assigned to IPR010662. A close examination of the hits showed that many were incorrectly assigned. For example, here is the output from the read inspector for one of the hits:

NB501071:77:HFYMFBGX3:1:11101:21312:8141 [length=150, matches=10]
DATA[length=150]
Pseudomonas; score=104.0
Pseudomonas fluorescens; COG1466 DNA polymerase III (delta’ subunit); score=104.0
Pseudomonas; IPR010662 Putative hydrolase RBBP9/YdeN; score=104.0
Pseudomonas fluorescens; COG1466 DNA polymerase III (delta’ subunit); score=104.0
Pseudomonas; IPR004995 Bacillus/Clostridium Ger spore germinat…; score=104.0
>WP_041477726.1
Pseudomonas koreensis; score=104.0
Pseudomonas moraviensis; score=104.0
Pseudomonas sp. Irchel 3F6; score=104.0
Pseudomonas sp. Choline-02u-1; score=104.0
Pseudomonas; score=103.0

The third hit has 100% identity to my query and, because it carries the IPR010662 annotation, it was used to annotate my read:

Pseudomonas; IPR010662 Putative hydrolase RBBP9/YdeN; score=104.0

WP_041477726.1
Length = 345

This annotation is clearly wrong. The protein is actually a DNA polymerase III subunit (as indicated by the COG classification of the hits) and has no homology to IPR010662 proteins. Running InterProScan on the complete protein yields IPR027417, IPR008921, IPR005790, IPR010372 and IPR032780, all of which are correct.

There are many other examples of this in my output. WP_076564309.1, for instance, is also annotated as belonging to IPR010662 but is actually a DNA helicase that hits IPR027417, IPR014001, IPR011545, IPR001650 and IPR013701, and certainly not IPR010662.

Other proteins with the IPR010662 annotation are actually transcriptional regulators, oxidoreductases, carbonic anhydrases and other proteins with no relation to IPR010662. Many, if not most, of the assignments to IPR010662 appear to be correct, particularly for species other than Pseudomonas. However, I estimate that more than 200 of them are incorrect.
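
A check like the one described above could be done in bulk by comparing the suspect accessions against InterProScan output. The following is only a rough sketch and not part of MEGAN: it assumes InterProScan's tab-separated output, with the protein accession in column 1 and the InterPro accession in column 12 (filled in only when InterPro lookup was enabled for the run). The function names `interpro_ids_by_protein` and `lacking_family` are hypothetical.

```python
from collections import defaultdict


def interpro_ids_by_protein(tsv_lines):
    """Collect the InterPro accessions reported for each protein.

    Assumes InterProScan TSV rows: column 1 is the protein accession and
    column 12 is the InterPro accession ("-" or missing if there is none).
    """
    ids = defaultdict(set)
    for line in tsv_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip rows that carry no InterPro accession in column 12.
        if len(fields) >= 12 and fields[11].startswith("IPR"):
            ids[fields[0]].add(fields[11])
    return ids


def lacking_family(ids_by_protein, accessions, family="IPR010662"):
    """Return the accessions whose InterProScan results do not include `family`.

    Accessions with no InterPro hits at all are reported as lacking it too.
    """
    return sorted(
        acc for acc in accessions
        if family not in ids_by_protein.get(acc, set())
    )
```

Passing the lines of an InterProScan TSV file to `interpro_ids_by_protein`, and then the accessions that MEGAN assigned to IPR010662 to `lacking_family`, would list the proteins for which InterProScan itself does not report that family.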


#2

Thanks for your detailed bug report, I will look into this.