To increase the utility of genomic information, we provide gene annotation and other features on Reference Sequence (RefSeq) genome records. Genome annotation is a multi-step process that includes prediction of protein-coding genes as well as other functional genome units, such as structural RNAs, tRNAs, small RNAs, pseudogenes, control regions, direct and inverted repeats, insertion sequences, transposons, and other mobile elements.
Depending on the genome, key genomic features and their locations on RefSeq genome records are either provided by outside sources (the submitter’s annotation copied from the GenBank genomic sequence records, or curated annotation provided by a model organism database such as FlyBase or WormBase) or generated by annotation pipelines developed at NCBI specifically for eukaryotic or prokaryotic genomes.
An overview of each pipeline is available in our web documentation of the eukaryotic genome annotation pipeline and the prokaryotic genome annotation process.
Our newest NCBI Handbook chapters on the eukaryotic and prokaryotic annotation pipelines describe the processes in greater detail, including information on algorithms, history, annotation standards, and special considerations like multiple annotation assemblies.
We also provide eukaryotic genome annotation policies and the status of genomes in the current pipeline, as well as information about prokaryotic genome annotation standards.