RefSeq

Database containing reference sequences of genes, proteins and transcripts

title: "RefSeq" type: doc version: 1 created: 2026-02-28 author: "Wikipedia contributors" status: active scope: public tags: ["genetics-databases", "national-institutes-of-health"] description: "Database containing reference sequences of genes, proteins and transcripts" topic_path: "technology/databases" source: "https://en.wikipedia.org/wiki/RefSeq" license: "CC BY-SA 4.0" wikipedia_page_id: 0 wikipedia_revision_id: 0

::summary Database containing reference sequences of genes, proteins and transcripts ::

::data[format=table title="infobox biodatabase"]

Field	Value
title	RefSeq
logo	US-NLM-NCBI-Logo.svg
description	Curated non-redundant sequence database of genomes, transcripts and proteins
center	National Center for Biotechnology Information
citation	Pruitt KD et al. (2005)
url	https://www.ncbi.nlm.nih.gov/refseq/
::

For each model organism, RefSeq aims to provide separate but linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts. RefSeq is limited to major organisms for which sufficient data are available (121,461 distinct "named" organisms as of July 2022), whereas GenBank includes sequences for any organism submitted (approximately 504,000 formally described species).

RefSeq categories

The RefSeq collection comprises different data types with different origins, so standard categories and identifiers are needed to store each data type. The most important categories are:

::data[format=table title="RefSeq accession categories and molecule types"]

Category	Description
NC	Complete genomic molecules
NG	Incomplete genomic region
NM	mRNA
NR	ncRNA
NP	Protein
XM	predicted mRNA model
XR	predicted ncRNA model
XP	predicted Protein model (eukaryotic sequences)
WP	predicted Protein model (prokaryotic sequences)
::

For more details and more categories, see Table 1 of "The Reference Sequence (RefSeq) Database", Chapter 18 of The NCBI Handbook.
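
As an illustration of how the prefixes in the table above can be used, the following minimal Python sketch maps an accession string to its category. It assumes that accessions consist of a two-letter prefix, an underscore, a run of digits and an optional version suffix (for example `NM_000546.6`). The function name `classify_accession`, the regular expression and the example strings are illustrative and are not part of any NCBI tool.

```python
import re

# Prefix descriptions follow the category table above.
REFSEQ_CATEGORIES = {
    "NC": "Complete genomic molecules",
    "NG": "Incomplete genomic region",
    "NM": "mRNA",
    "NR": "ncRNA",
    "NP": "Protein",
    "XM": "predicted mRNA model",
    "XR": "predicted ncRNA model",
    "XP": "predicted Protein model (eukaryotic sequences)",
    "WP": "predicted Protein model (prokaryotic sequences)",
}

# Assumed layout: two-letter prefix, underscore, digits, optional ".version".
ACCESSION_RE = re.compile(r"^([A-Z]{2})_(\d+)(?:\.(\d+))?$")


def classify_accession(accession):
    """Return (prefix, description, version), or None if not recognised."""
    match = ACCESSION_RE.match(accession.strip())
    if match is None:
        return None
    prefix, _digits, version = match.groups()
    description = REFSEQ_CATEGORIES.get(prefix)
    if description is None:
        # Prefix is well-formed but not one of the categories listed above.
        return None
    return prefix, description, int(version) if version else None


if __name__ == "__main__":
    # Illustrative accession strings only.
    for acc in ("NM_000546.6", "XP_123456", "not-an-accession"):
        print(acc, "->", classify_accession(acc))
```

Only the categories listed in the table are covered; accessions with other prefixes return `None` in this sketch.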

RefSeq projects

Several projects to improve RefSeq services are currently in development by the NCBI, often in collaboration with research centers such as EMBL-EBI:

Consensus CDS (CCDS): This project aims to identify a core set of human and mouse protein-coding regions and to standardize gene sets with high and consistent levels of genomic annotation quality. It was announced in 2009 and is still in development.
RefSeq Functional Elements (RefSeqFE): This project focuses on describing non-genic functional elements, including gene regulatory regions (such as enhancers, silencers and DNase I hypersensitive regions) and DNA replication origins. Its current scope is restricted to the human and mouse genomes.
RefSeqGene: Its main goal is to define genomic sequences to be used as reference standards for well-characterized genes. Previously described mRNA, protein and chromosome sequences have two weaknesses as reference standards: they do not provide explicit genomic coordinates for gene-flanking and intronic regions, and chromosome coordinates are awkwardly large and change with every new genome assembly. The RefSeqGene project is designed to address these weaknesses.
Targeted Loci: This project records molecular markers, especially protein-coding and ribosomal RNA loci, that are used for phylogenetic and barcoding analysis. Its scope includes sequences for Archaea, Bacteria and Fungi, accessible via Entrez and BLAST queries. It also includes GenBank sequences for animals, plants and protists, accessible via BLAST queries.
Virus Variation (ViV): A resource of sequence data processing pipelines and analysis tools for displaying and retrieving sequences from several viral groups, such as influenza virus, ebolavirus, MERS coronavirus and Zika virus. New viruses, processing pipelines, tools and other features are added regularly.
RefSeq Select: Many genes are represented by multiple RefSeq transcripts and proteins because of alternative splicing, and this complexity is problematic for studies such as comparative genomics and for the exchange of clinical variant data. This project therefore selects a RefSeq Select transcript as the most representative for each protein-coding gene, based on multiple criteria such as prior use in clinical databases, transcript expression and evolutionary conservation of the coding region.
MANE (Matched Annotation from the NCBI and EMBL-EBI): A collaborative project between NCBI and EMBL-EBI whose main goal is to define a set of transcripts and their proteins for all protein-coding genes in the human genome, thereby reducing the differences in transcript annotation between the RefSeq and Ensembl/GENCODE annotation systems. The MANE Select transcript set is intended as a universal standard for clinical reporting and for comparative or evolutionary genomics. A second set, MANE Plus Clinical, contains additional transcripts needed to report all Pathogenic (P) or Likely Pathogenic (LP) clinical variants available in public resources. The project was announced in 2018 and was expected to finish in 2022.

Statistics

According to RefSeq release 213 (July 2022), the number of species represented in the database, counted as distinct taxonomic IDs, is as follows:

::data[format=table title=""]

Group	Species (distinct taxonomic IDs)
Archaea	1443
Bacteria	69122
Fungi	16869
Invertebrate	5715
Mitochondrion	13648
Plant	9177
Plasmid	6073
Plastid	9430
Protozoa	746
Vertebrate (mammalian)	1509
Viral	11620
Vertebrate (other)	5237
Other	4
Complete	121461
::
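
The per-group counts can be checked against the "Complete" row with a few lines of Python. The figures are copied from the table above, and the variable names are illustrative. The groups add up to more than the "Complete" figure, which would be consistent with some taxonomic IDs being counted in more than one group (for instance an organism together with its mitochondrion or plasmids). This interpretation is an assumption and is not stated in the figures themselves.

```python
# Species counts per group, copied from the release 213 table above.
GROUP_COUNTS = {
    "Archaea": 1443,
    "Bacteria": 69122,
    "Fungi": 16869,
    "Invertebrate": 5715,
    "Mitochondrion": 13648,
    "Plant": 9177,
    "Plasmid": 6073,
    "Plastid": 9430,
    "Protozoa": 746,
    "Vertebrate (mammalian)": 1509,
    "Viral": 11620,
    "Vertebrate (other)": 5237,
    "Other": 4,
}

# The "Complete" row of the same table.
COMPLETE = 121461

group_sum = sum(GROUP_COUNTS.values())
excess = group_sum - COMPLETE

print(group_sum)  # 150593
print(excess)     # 29132

# Share of the "Complete" figure contributed by the largest group.
largest = max(GROUP_COUNTS, key=GROUP_COUNTS.get)
print(largest, round(GROUP_COUNTS[largest] / COMPLETE, 3))
```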

The counts of accessions and base pairs (or residues) per molecule type are:

::data[format=table title=""]

Molecule type	Accessions	Basepairs/residues
Genomics
RNA
Protein
::

References

(January 2005). "NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins". Nucleic Acids Research.
(January 2000). "NCBI's LocusLink and RefSeq". Nucleic Acids Research.
(January 2000). "Introducing RefSeq and LocusLink: curated human genome resources at the NCBI". Trends in Genetics.
(11 July 2022). "RefSeq Release 213 Statistics". National Library of Medicine.
(January 2022). "GenBank". Nucleic Acids Research.
(July 2009). "The consensus coding sequence (CCDS) project: Identifying a common protein-coding gene set for the human and mouse genomes". Genome Research.
(January 2018). "Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation". Nucleic Acids Research.
(January 2022). "RefSeq Functional Elements as experimentally assayed nongenic reference standards and functional interactions in human and mouse". Genome Research.
(June 2007). "Clinical laboratory reports in molecular pathology". Archives of Pathology & Laboratory Medicine.
"NCBI RefSeq Targeted Loci Project".
(January 2017). "Virus Variation Resource - improved response to emergent viral outbreaks". Nucleic Acids Research.
"NCBI RefSeq Select".
(April 2022). "A joint NCBI and EMBL-EBI transcript set for clinical genomics and research". Nature.

::callout[type=info title="Wikipedia Source"] This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page. ::