Planned change in bacterial strain-level information management

Please be aware that NCBI will change how it manages organism strain information in January 2014. The change is driven by significant increases in the volume of strain-specific sequencing.

Next generation sequencing has already changed the way microbial genomes are being used. The scope of microbial sequencing projects has shifted from a single isolate representing an organism to multi-isolate and multi-species projects representing microbial communities. Consequently, in the first nine months of 2013 the sequences of more than 6000 prokaryotic genomes were released by INSDC (DDBJ/ENA/GenBank).

NCBI is introducing several changes in prokaryotic genomes and related resources such as Assembly, BioProject, BioSample, and Taxonomy that will affect your submissions, data downloads, analysis tools, and parsers.

Taxonomy

Assignment of strain-level TaxIDs will be discontinued in January 2014 because curating them is not sustainable at the current rate of growth. However, the thousands of existing strain-level TaxIDs will remain, and we will continue to add informal strain-specific names for genomes from specimens that have not yet been identified to the species level, e.g. “Rhizobium sp. CCGE 510” and “Micromonas sp. RCC299”. Strain information will continue to be collected and displayed.

BioSample

Submitters of genome sequences will be required to register sample metadata in the BioSample database for each organism that they are sequencing. The BioSample submission will include the strain information and other metadata, such as culture collection and isolation information, as appropriate. The BioSample accession will appear as a link on the GenBank records, and the GenBank records themselves will display the strain in the source information.

BioProject

Submitters of genome sequences are already required to register metadata about the research project in the BioProject database. We no longer require a one-to-one relationship between a BioProject accession and a genome. Instead, a research effort examining multiple strains of a species or multiple species of drug-resistant bacteria, for example, could be registered as a single BioProject.

Assembly

Each genome assembly is loaded to the Assembly database and assigned an Assembly accession. The Assembly accession is specific for a particular genome submission.

A BioProject ID or accession cannot be used to define a single genome, since many genomes may belong to one multi-isolate or multi-species project. Furthermore, a TaxID can no longer reliably define an individual genome, since unique TaxIDs will not be assigned for individual strains and isolates. The collection of DNA sequences of an individual sample (isolate) will be represented by a BioSample accession, and if the raw sequence reads are assembled and submitted to GenBank, the resulting assembly will get a unique Assembly accession. For instance, sequence data generated from a single sample (with one BioSample accession) could be assembled with two different algorithms, yielding two sets of GenBank accessions, each with its own Assembly accession.

For example, BioProject PRJNA203445 is a multi-species project with multiple strains and isolates of different food pathogens. Each isolate has its own BioSample accession and each assembled genome has its own Assembly accession. This BioProject includes an isolate of Listeria monocytogenes (TaxID 1639, strain R2-502) which was registered as BioSample SAMN02203126, and its genome is represented in GenBank records CP006595-CP006596, which are tracked as a group in the Assembly database under accession GCA_000438585.
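The relationships among these identifiers can be sketched in a few lines of Python. This is only an illustration of the model described above, not an NCBI schema: the class name, field names, and the helper `index_by` are assumptions, and the second and third records use placeholder accessions invented for the example. Only the first record uses the values given in the text.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeRecord:
    """One assembled genome. Field names are illustrative, not an NCBI schema."""
    assembly: str    # Assembly accession, specific to one genome submission
    biosample: str   # BioSample accession of the sequenced isolate
    bioproject: str  # BioProject accession, may cover many genomes
    taxid: int       # species-level TaxID, shared by every strain of the species
    strain: str
    genbank: tuple   # GenBank sequence accessions grouped under this assembly


records = [
    # Values taken from the Listeria monocytogenes example in the text.
    GenomeRecord("GCA_000438585", "SAMN02203126", "PRJNA203445",
                 1639, "R2-502", ("CP006595", "CP006596")),
    # Hypothetical second isolate of the same species in the same project
    # (placeholder accessions, not real records).
    GenomeRecord("GCA_EXAMPLE_2", "SAMN_EXAMPLE_2", "PRJNA203445",
                 1639, "example-strain", ()),
    # Hypothetical re-assembly of the first sample with a different algorithm
    # (placeholder accession, not a real record).
    GenomeRecord("GCA_EXAMPLE_3", "SAMN02203126", "PRJNA203445",
                 1639, "R2-502", ()),
]


def index_by(recs, field):
    """Group Assembly accessions by the value of another identifier."""
    groups = defaultdict(list)
    for rec in recs:
        groups[getattr(rec, field)].append(rec.assembly)
    return dict(groups)


by_assembly = index_by(records, "assembly")
by_biosample = index_by(records, "biosample")
by_bioproject = index_by(records, "bioproject")
by_taxid = index_by(records, "taxid")

for name, index in [("Assembly", by_assembly), ("BioSample", by_biosample),
                    ("BioProject", by_bioproject), ("TaxID", by_taxid)]:
    largest = max(len(assemblies) for assemblies in index.values())
    print(f"{name}: at most {largest} genome(s) per identifier")
```

In this toy data set only the Assembly accession maps to exactly one genome; the TaxID, the BioProject accession, and even the BioSample accession can each point to several.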

Genome text reports on the FTP site have been modified to include the BioSample and Assembly accessions. These two columns were added at the end of the tables to minimize problems for existing parsers. Initially, not all assemblies will have a BioSample accession because we are still in the process of back-filling BioSamples for genomes.
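For parser authors, one way to stay robust to appended columns is to look fields up by header name rather than by position, and to treat a missing BioSample as an expected case. The sketch below assumes a tab-delimited report whose first line is a header beginning with `#`; the column names and the `-` placeholder for an empty value are assumptions made for illustration, so check them against the actual files on the FTP site. The second data row is an invented placeholder.

```python
import csv
import io

# Hypothetical report excerpt. Column names and the "-" placeholder are
# assumptions; the first data row reuses the example accessions from the text.
SAMPLE_REPORT = (
    "#Organism\tTaxID\tBioProject\tStrain\tBioSample\tAssembly\n"
    "Listeria monocytogenes\t1639\tPRJNA203445\tR2-502\tSAMN02203126\tGCA_000438585\n"
    "Example organism\t0\tPRJNA_EXAMPLE\texample-strain\t-\tGCA_EXAMPLE\n"
)


def read_report(handle):
    """Parse a tab-delimited report into dicts keyed by header name."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = [name.lstrip("#").strip() for name in next(reader)]
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        row = dict(zip(header, fields))
        # BioSample may be absent (older file) or empty (not yet back-filled).
        biosample = row.get("BioSample", "")
        row["BioSample"] = biosample if biosample not in ("", "-") else None
        rows.append(row)
    return rows


rows = read_report(io.StringIO(SAMPLE_REPORT))
for row in rows:
    print(row["Assembly"], row["BioSample"])
```

Because fields are addressed by name, a parser written this way keeps working whether new columns are appended at the end or the BioSample column is missing from an older copy of the report.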

These changes will occur in January 2014. We will be releasing more information as the date approaches.
