Protein accession to NCBI-taxonomy mapping file update


#1

Hi,
when will the latest (2017.05) protein accession to NCBI-taxonomy mapping file be available?
Thank you in advance for valuable information.

Cheers: Balázs


#2

Dear Balázs,

I agree that it is time to update the mapping files and will look into doing this before the end of the month.
D


#3

I have updated the taxonomy mapping files (and the taxonomy used by MEGAN) to May 22nd, 2017.


#4

Hi Daniel,

thank you, but unfortunately I can still see only the old version at http://ab.inf.uni-tuebingen.de/data/software/megan6/download/welcome.html.

Bests: Balázs


#5

The files are now in place.


#6

Great news! :beers:

Thanks: B


#7

Dear Daniel,

is there any possibility to update mapping files to the latest NCBI protein accession numbers?

Best, Denis


#8

Dear Denis,
I have put that on my todo list and will upload new mapping files later this week or early next week.


#9

Dear Daniel,

that's great! Thank you very much!
Denis


#10

I have just uploaded a new mapping file.


#11

Dear Daniel,

Is it possible to update the protein accession mapping file?

Also, for the previous mapping file release, the following was indicated in the release notes:

— Release notes MEGAN6 /V6_10_2: —

  • Updated the NCBI protein and nucleotide mapping files. Note that due to large number of protein accessions (over 400 million), the new protein mapping file only maps the first identifier on each line of the NCBI-nr database, and thus only contains just over 100 million accessions.
    For a given header line, the first accession is mapped to the LCA of all taxa that are referenced on the header line, so some assignments will be less specific than when using previous mapping files

Can you please explain what this means? Is this removing redundancy in the database?

Thanks!


#12

The release note comment means that the mapping file only keeps the first accession for each NCBI-nr entry. This change will make little or no difference if you are aligning against the NCBI-nr database.
However, if you are aligning against some other database and are using our NCBI-based mapping file to identify taxa, then this might cause problems if the other database uses accessions that do not appear as the first accession for NCBI-nr references.
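
To make the release note concrete, here is a minimal Python sketch of the idea: only the first accession of a header line is kept, and it is assigned the lowest common ancestor (LCA) of all taxa referenced on that line. This is not MEGAN's actual code. The toy taxonomy, the accession names, and the function names are all hypothetical and only serve as an illustration.

```python
def lineage(taxon, parent):
    """Return the path from the root down to `taxon`."""
    path = [taxon]
    while taxon in parent:
        taxon = parent[taxon]
        path.append(taxon)
    return path[::-1]


def lca(taxa, parent):
    """Lowest common ancestor of a collection of taxa."""
    lineages = [lineage(t, parent) for t in taxa]
    common = None
    # Walk down all lineages in parallel until they diverge.
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common


def map_first_accession(header_entries, parent):
    """header_entries: list of (accession, taxon) pairs from ONE header line.

    Returns (first_accession, lca_taxon); all other accessions are dropped.
    """
    first_accession = header_entries[0][0]
    return first_accession, lca([taxon for _, taxon in header_entries], parent)


# Hypothetical, heavily simplified taxonomy (child -> parent).
PARENT = {
    "Bacteria": "root",
    "Proteobacteria": "Bacteria",
    "Escherichia coli": "Proteobacteria",
    "Salmonella enterica": "Proteobacteria",
    "Firmicutes": "Bacteria",
    "Bacillus subtilis": "Firmicutes",
}

# Hypothetical header line with three accessions from three species.
entries = [
    ("ACC_A", "Escherichia coli"),
    ("ACC_B", "Salmonella enterica"),
    ("ACC_C", "Bacillus subtilis"),
]
print(map_first_accession(entries, PARENT))
```

In this toy case only `ACC_A` ends up in the mapping, and it points to "Bacteria" rather than to a species, which is why some assignments become less specific. A lookup for `ACC_B` or `ACC_C` would find nothing.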


#13

I will build new mapping files in the next week or two.


#14

Thank you for your response Daniel!

Do NCBI-nr entries have more than one accession number? I thought each entry had a unique accession number and I can’t seem to find an example of an entry with multiple accession numbers. I am asking because I used the NCBI-nr database with the Oct 2017 mapping file and the number of hits was drastically reduced compared to the results with the May 2017 mapping file, especially for viruses.


#15

Take a look inside the NCBI-nr file nr.gz, which can be downloaded from NCBI. The header lines usually contain many, and sometimes thousands of, accessions each, and often the header line is much longer than the actual sequence. That is why alignment programs such as DIAMOND only keep the first word of the header line of a reference sequence by default…
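
As a rough illustration of what such a header line looks like, the following Python sketch splits one nr-style header into its accessions. In the nr FASTA file the merged entries on a header line are separated by the Ctrl-A character (`\x01`). The example header, its accessions, and the function names are made up for this sketch.

```python
def split_header(header_line):
    """Return the accession (first word) of every entry on one header line."""
    line = header_line.rstrip("\n").lstrip(">")
    entries = line.split("\x01")  # Ctrl-A separates the merged entries
    return [entry.split(None, 1)[0] for entry in entries if entry.strip()]


def first_word(header_line):
    """The only identifier that survives when just the first word is kept."""
    return split_header(header_line)[0]


# Hypothetical header line with two merged entries.
header = (
    ">XX_000001.1 hypothetical protein [Species A]"
    "\x01YY_000002.1 hypothetical protein [Species B]"
)

print(split_header(header))
print(first_word(header))
```

Running something like `split_header` over the header lines of nr.gz (for example via `gzip.open`) gives a quick impression of how many accessions share a single sequence.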


#16

Do the different accession numbers for a given sequence represent redundancy between databases? When you say DIAMOND only keeps the first accession, is there any particular reason to keep the first one, or are the multiple accession numbers in the NCBI-nr database in random order for a given sequence?


#17

But as far as I understand, nr is non-redundant and all the different accessions represent the different non-redundant instances.
The first accession is indeed a special accession that points to the non-redundant sequence rather than to any of the redundant instances.

Probably best to read the NCBI documentation on this…


#18

May I ask what the difference is between the mapping files from May 2017 and October 2017? I am still unsure why the number of assigned reads is dropping significantly for all my samples when I update the mapping file from May 2017 to October 2017.


#19

Hi,

When did you download your nr database? The proteins you are looking for may have been deprecated by NCBI between May and October 2017, so they may no longer be in the mapping files. I guess the best way to find out will be to export matches from those nodes which differ a lot between the two mapping files, compare the accessions, and check those which don’t appear in the October 2017 file on NCBI’s website.
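
The comparison could be sketched in Python roughly as follows. This assumes that both mapping files have been unpacked to plain text with one accession per line, followed by a tab and the taxon id; please check the actual file layout first. The function names and file names are hypothetical.

```python
def accessions_present(mapping_path, wanted):
    """Stream a mapping file and return those `wanted` accessions it contains.

    Assumes a plain-text file with lines of the form: accession<TAB>taxon_id
    """
    found = set()
    with open(mapping_path, encoding="utf-8") as handle:
        for line in handle:
            accession = line.split("\t", 1)[0].strip()
            if accession in wanted:
                found.add(accession)
    return found


def dropped_accessions(wanted, old_mapping_path, new_mapping_path):
    """Accessions that are mapped in the old file but not in the new one."""
    wanted = set(wanted)
    in_old = accessions_present(old_mapping_path, wanted)
    in_new = accessions_present(new_mapping_path, wanted)
    return sorted(in_old - in_new)
```

Here `wanted` would be the set of reference accessions exported from the nodes that differ, so that only those are held in memory while the large mapping files are read line by line. A call might look like `dropped_accessions(exported, "may2017.map", "oct2017.map")`, and the returned accessions are the ones worth checking on NCBI's website.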

Caner.


#20

Thanks for the suggestion Caner. I will try to track down the reads that disappeared from the assigned portion with the new mapping file update. I used an nr database from February 2018 with both the May and October 2017 mapping files.