Gene nomenclature is the scientific naming of genes, the units of heredity in living organisms. An international committee published recommendations for genetic symbols and nomenclature in 1957. The need to develop formal guidelines for human gene names and symbols was recognized in the 1960s, and full guidelines were issued in 1979 (Edinburgh Human Genome Meeting). Several other species-specific research communities (e.g., Drosophila, mouse) have adopted nomenclature standards as well and have published them on the relevant model organism websites and in scientific journals, including the Trends in Genetics Genetic Nomenclature Guide. Scientists familiar with a particular gene family may work together to revise the nomenclature for the entire set of genes when new information becomes available. For many genes and their corresponding proteins, an assortment of alternate names is in use across the scientific literature and public biological databases, which poses a challenge to the effective organization and exchange of biological information. Standardization of nomenclature thus tries to achieve the benefits of vocabulary control and bibliographic control, although adherence is voluntary.
The HUGO Gene Nomenclature Committee is responsible for providing human gene naming guidelines and approving new, unique human gene names and symbols (short-form abbreviations). For some non-human species, model organism databases serve as central repositories of guidelines and help resources, including advice from curators and nomenclature committees. In addition to species-specific databases, approved gene names and symbols for many species can be located in the National Center for Biotechnology Information's Entrez Gene database. The nomenclature for bacteria differs from that used for eukaryotic species (see Bacterial genetic nomenclature).
Vertebrate gene and protein symbol conventions
|Gene and protein symbol conventions ("sonic hedgehog" gene)|
|Species||Gene symbol||Protein symbol|
|Mus musculus, Rattus norvegicus||Shh||SHH|
|Xenopus laevis, X. tropicalis||shh||Shh|
The research communities of vertebrate model organisms have adopted guidelines whereby genes in these species are given, whenever possible, the same names as their human orthologs. The use of prefixes on gene symbols to indicate species (e.g., "Z" for zebrafish) is discouraged. The recommended formatting of printed gene and protein symbols varies between species.
Human
Gene symbols generally are italicised, with all letters in uppercase (e.g., SHH, for sonic hedgehog). Italics are not necessary in gene catalogs. Protein designations are the same as the gene symbol, but are not italicised, with all letters in uppercase (SHH). mRNAs and cDNAs use the same formatting conventions as the gene symbol.
Mouse and rat
Gene symbols generally are italicised, with only the first letter in uppercase and the remaining letters in lowercase (Shh). Italics are not required on web pages. Protein designations are the same as the gene symbol, but are not italicised; all letters are in uppercase (SHH).
Chicken (Gallus sp.)
Nomenclature generally follows the conventions of human nomenclature. Gene symbols generally are italicised, with all letters in uppercase (e.g., NLGN1, for neuroligin1). Protein designations are the same as the gene symbol, but are not italicised; all letters are in uppercase (NLGN1). mRNAs and cDNAs use the same formatting conventions as the gene symbol.
Anole lizard (Anolis sp.)
Gene symbols are italicised and all letters are in lowercase (shh). Protein designations are the same as the gene symbol, but are not italicised; all letters are in uppercase (SHH).
Frog (Xenopus sp.)
Gene symbols are italicised and all letters are in lowercase (shh). Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase (Shh).
Zebrafish
Gene symbols are italicised, with all letters in lowercase (shh). Protein designations are the same as the gene symbol, but are not italicised; the first letter is in uppercase and the remaining letters are in lowercase (Shh).
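The letter-case conventions listed above can be summarised as a small lookup table. The following Python sketch is illustrative only: the species keys, the function names, and the use of asterisks to mark italics are assumptions made for this example and do not come from any nomenclature committee or official tool. It also ignores exceptions that the full guidelines allow.

```python
# Case rules per species as (gene style, protein style), following the
# conventions described in the text. Keys and names are hypothetical.
CASE_RULES = {
    "human":     ("upper", "upper"),
    "mouse":     ("title", "upper"),
    "rat":       ("title", "upper"),
    "chicken":   ("upper", "upper"),
    "anole":     ("lower", "upper"),
    "xenopus":   ("lower", "title"),
    "zebrafish": ("lower", "title"),
}


def _apply_case(symbol, style):
    """Return the symbol in the requested letter case."""
    if style == "upper":
        return symbol.upper()
    if style == "lower":
        return symbol.lower()
    # "title": first letter uppercase, remaining letters lowercase
    return symbol[:1].upper() + symbol[1:].lower()


def format_symbol(symbol, species, kind="gene", italic_marker="*"):
    """Format a gene or protein symbol for the given species.

    Gene symbols are wrapped in italic_marker (an assumed plain-text
    stand-in for italics); protein symbols are left in roman type.
    """
    gene_style, protein_style = CASE_RULES[species]
    if kind == "gene":
        return italic_marker + _apply_case(symbol, gene_style) + italic_marker
    if kind == "protein":
        return _apply_case(symbol, protein_style)
    raise ValueError("kind must be 'gene' or 'protein'")


if __name__ == "__main__":
    for species in CASE_RULES:
        print(species,
              format_symbol("shh", species, "gene"),
              format_symbol("shh", species, "protein"))
```

A lookup table keeps the per-species rules in one place, so adding another organism means adding one entry rather than another branch of logic.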
See also Bacterial Genetic Nomenclature
Gene and protein symbol and description in copyediting
A nearly universal rule in copyediting of articles for public health journals is that abbreviations and acronyms must be expanded at first use, to provide a glossing type of explanation. Typically no exceptions are permitted except for small lists of especially well known terms (such as DNA or HIV). Although readers with high subject-matter expertise do not need most of these expansions, those with intermediate or (especially) low expertise are appropriately served by them.
One complication that gene and protein symbols bring to this general rule is that they are not, accurately speaking, abbreviations or acronyms, despite the fact that many were originally coined via abbreviating or acronymic etymology. They are pseudoacronyms (as SAT and KFC also are) because they do not "stand for" any expansion. Rather, the relationship of a gene symbol to the gene name is functionally the relationship of a nickname to a formal name (both are complete identifiers)—it is not the relationship of an acronym to its expansion. In fact, many official gene symbol–gene name pairs do not even share their initial-letter sequences (although some do). Nevertheless, gene and protein symbols "look just like" abbreviations and acronyms, which presents the problem that "failing" to "expand" them (even though it is not actually a failure and there are no true expansions) creates the appearance of violating the spell-out-all-acronyms rule.
One common way of reconciling these two opposing forces is simply to exempt all gene and protein symbols from the glossing rule. This is fast and easy to do, and in highly specialized journals it is also justified, because the entire target readership has high subject-matter expertise. Experts are not confused by the presence of symbols, whether known or novel, and they know where to look them up online for further details if needed. But for journals with broader and more general target readerships, this exemption leaves readers without any explanatory annotation and can leave them wondering what the apparent abbreviation stands for and why it was not explained. A good alternative is therefore to put either the official gene name or a suitable short description (gene alias/other designation) in parentheses after the first use of the official gene/protein symbol. This meets both the formal requirement (the presence of a gloss) and the functional requirement (helping the reader to know what the symbol refers to). The same guideline applies to shorthand names for sequence variations; AMA says, "In general medical publications, textual explanations should accompany the shorthand terms at first mention." Thus "188del11" is glossed as "an 11-bp deletion at nucleotide 188."
This corollary rule (which forms an adjunct to the spell-everything-out rule) often also follows the "abbreviation-leading" style of expansion that has become more prevalent in recent years. Traditionally, the abbreviation always followed the fully expanded form in parentheses at first use. This is still the general rule. But for certain classes of abbreviations or acronyms (such as clinical trial acronyms [e.g., ECOG] or standardized polychemotherapy regimens [e.g., CHOP]), this pattern may be reversed, because the short form is more widely used and the expansion is merely parenthetical to the discussion at hand. The same is true of gene/protein symbols.
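The parenthetical-gloss approach described above (symbol first, explanation in parentheses at first mention only) can be sketched in Python. This is a minimal illustration, not a description of any journal's actual tooling: the function name `gloss_first_mention` and the gloss dictionary are hypothetical, and the glosses still have to be supplied by someone who has checked them against a database.

```python
import re


def gloss_first_mention(text, glosses):
    """Append '(gloss)' after the first occurrence of each symbol.

    glosses maps a symbol (e.g. "BRI3") to its name or short description.
    Later occurrences of the same symbol are left untouched.
    """
    for symbol, gloss in glosses.items():
        # Match the symbol only as a whole token, so that a symbol
        # embedded in a longer word or hyphenated term is not glossed.
        pattern = re.compile(r"(?<![\w-])" + re.escape(symbol) + r"(?![\w-])")
        text = pattern.sub(
            lambda m, g=gloss: m.group(0) + " (" + g + ")",
            text,
            count=1,
        )
    return text


if __name__ == "__main__":
    sample = "Expression of BRI3 was reduced. BRI3 carried the 188del11 variant."
    print(gloss_first_mention(sample, {
        "BRI3": "brain protein I3",
        "188del11": "an 11-bp deletion at nucleotide 188",
    }))
```

Because the symbols are processed one after another, a gloss that itself contains a later symbol would be matched as well; a production tool would need to guard against that, and it could not decide on its own whether a given mention refers to the gene or to the protein.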
Some basic conventions, such as (1) that animal/human homolog (ortholog) pairs differ in letter case (title case and all caps, respectively) and (2) that the symbol is italicized when referring to the gene but nonitalic when referring to the protein, are often not followed by contributors to public health journals. Many journals have the copyeditors restyle the casing and formatting to the extent feasible, although in complex genetics discussions only subject-matter experts (SMEs) can effortlessly parse them all. One example that illustrates the potential for ambiguity among non-SMEs is that some official gene names have the word "protein" within them, so the phrase "brain protein I3 (BRI3)" with the symbol italicized (referring to the gene) and the phrase "brain protein I3 (BRI3)" with the symbol nonitalic (referring to the protein) are both valid. The AMA Manual gives another example: both "the TH gene" with TH nonitalic and "the TH gene" with TH italicized can validly be parsed as correct ("the gene for tyrosine hydroxylase"), because the first mentions the alias (description) and the latter mentions the symbol. This seems confusing on the surface, although it is easier to understand when explained as follows: in this gene's case, as in many others, the alias (description) "happens to use the same letter string" that the symbol uses. (The matching of the letters is of course acronymic in origin, and thus the phrase "happens to" implies more coincidence than is actually present; but phrasing it that way helps to make the explanation clearer.) There is no way for a non-SME to know that this is the case for any particular letter string without looking up every gene from the manuscript in a database such as NCBI Gene, reviewing its symbol, name, and alias list, and doing some mental cross-referencing and double-checking, a task that is easier with biochemical knowledge. Most medical journals do not (in some cases cannot) pay for that level of fact-checking as part of their copyediting service level; therefore, it remains the author's responsibility.
However, as pointed out earlier, many authors make little attempt to follow the letter case or italic guidelines, and for protein symbols they often do not use the official symbol at all. For example, although the guidelines would call p53 protein "TP53" in humans or "Tp53" in rats, most authors call it "p53" in both (and even resist changing it to "TP53" when edits or queries request it), although they are usually willing to call the gene TP53, and may even do so without being prompted by a query. The end result is that the published literature often does not follow the nomenclature guidelines.
Notes and references
- Report of the International Committee on Genetic Symbols and Nomenclature (1957). Union of International Sci Biol Ser B, Colloquia No. 30.
- About the HGNC
- Genetic nomenclature guide (1995). Trends Genet.
- The Trends In Genetics Nomenclature Guide (1998). Elsevier, Cambridge.
- Guidelines for Human Gene Nomenclature
- Fundel and Zimmer (2006). Gene and protein nomenclature in public databases. BMC Bioinformatics 7:372.
- Rules for Nomenclature of Genes, Genetic Markers, Alleles, and Mutations in Mouse and Rat
- The chicken gene nomenclature committee report
- Developing a community-based genetic nomenclature for anole lizards
- Suggested Xenopus Gene Name Guidelines
- Zebrafish Nomenclature Guidelines
- Iverson, Cheryl, et al. (eds) (2007). "15.6.1 Nucleic Acids and Amino Acids". AMA Manual of Style (10th ed.). Oxford, Oxfordshire: Oxford University Press. ISBN 978-0-19-517633-9.
External links
- The Council of Science Editors (CSE) - Resources for Genetic and Cytogenetic Nomenclature
- The Protein Naming Utility, a rules database for protein nomenclature