Wikipedia:WikiProject Molecular Biology/Style guide (gene and protein articles)

From Wikipedia, the free encyclopedia

This is a guideline for the structure of gene and protein articles on Wikipedia. It contains the articles naming conventions and the general recommended outline of an article, as well as useful information to bring an article to good article or featured article status.

General considerations[edit]

The scope of a gene/protein article is the human gene/protein (including all splice variants derived from that gene) as well as orthologs (as listed in HomoloGene) that exist in other species. If there are paralogs in humans (and by extension other species), then a gene family article in addition to the gene specific articles (see for example dopamine receptor) would be appropriate.

In general, do not hype a study by listing the names, credentials, institutions, or other "qualifications" of their authors. Wikipedia is not a press release. Article prose should focus on what a cited study says about the structure, function, clinical significance, etc. of the gene or protein, not what the gene or protein says about a particular study or the research group who conducted that study. Particularly notable contributions along with who made the discovery however should be mentioned in the discovery/history section.

Article name[edit]

If relatively short, the recommended UniProt protein name should be used as the article name. If the protein name is verbose, either a widely used protein acronym or the official HUGO gene symbol, followed by "(gene)" if necessary to disambiguate. UniProt names generally follow the IUBMB recommendations:

When naming proteins which can be grouped into a family based on homology or according to a notion of shared function (like the interleukins), the different members should be enumerated with a dash "-" followed by an Arabic number, e.g. "desmoglein-1", "desmoglein-2", etc.
— "Protein Naming Guidelines". Recommendations on Biochemical & Organic Nomenclature, Symbols & Terminology etc. International Union of Biochemistry and Molecular Biology.

If the article is about a viral protein, it is recommended to include the taxon in the title, as "nonstructual protein 2" and "viral protease" can mean many things. A parenthesized term added to disambiguate common symbols does not constitute unnecessary disambiguation even when it is the only article with such a name.

Gene nomenclature[edit]

The abbreviations of genes are according to HUGO Gene Nomenclature Committee and written in italic font style (the full names are also written in italic). It is recommended that abbreviations instead of the full name are used. Human gene names are written in capitals, for example ALDOA, INS, etc. For orthologs of human genes in other species, only the initial letter is capitalised, for example mouse Aldoa, bovine Ins, etc.

The following usages of gene symbols are recommended:

  • "the ALDOA gene is regulated...",
  • "the rat gene for Aldoa is regulated..." or
  • "ALDOA is regulated...",

while the following is not recommended:

  • "the gene ALDOA is regulated" since it is redundant.

Images and diagrams[edit]

Where possible, diagrams should keep to a standard format. If the diagram guide does not give sufficient guidance on the style for the images in an article, consider suggesting expansions to the standardised formatting.

Infoboxes[edit]

One or more of the following infoboxes as appropriate should be included at the top of each article:

template description / suggested use example article containing this template template filling tool
{{Infobox GNF protein}} for genes/proteins for which an ortholog is present within the human genome (articles containing this template were created as part of the Gene Wiki project) Reelin GeneWikiGenerator
(input: HUGO gene symbol)
{{Infobox protein}} smaller box appropriate for protein family articles where more than one protein is discussed in the same article (e.g., paralogs) Estrogen receptor Wikipedia template filling
(input: HGNC ID)
{{Infobox nonhuman protein}} for proteins without a human ortholog Uterine serpin
{{Infobox protein family}} for protein families (evolutionary related proteins that share a common 3D structure) that are listed in Pfam T-box
{{Infobox rfam}} for RNA families (evolutionary related non-coding RNAs that share a common 3D structure) that are listed in Rfam U1 spliceosomal RNA
{{Infobox enzyme}} for enzymes based on EC number (more properly refers to the reaction catalyzed by the enzyme rather than the enzyme itself)[a] Alcohol dehydrogenase

If there is only one human paralog assigned to a given EC number (the ExPASy database maintains EC number to protein mappings), then in addition to a protein infobox, it may be appropriate to also add the corresponding enzyme infobox. Likewise, if there is only one human paralog that has been assigned to Pfam family, then including a protein family infobox may also be appropriate.

There exist some cases where a large number of infoboxes may apply to an article. You may put less useful ones in a section at the end, laid side-by-side with a table. Collapsing or horizontally scrolling the said table is doubtful, as MOS:COLLAPSE may or may not apply depending on how "extraneous" the boxes are.

Sections[edit]

  1. Lead
    The lead section is defined as "the section before the first headline. The table of contents, if displayed, appears between the lead section and the first headline."
    The first sentence of the lead should define what the scope of the article is. For genes/proteins in which a human ortholog exists, "<recommended UniProt name> is a protein that in humans is encoded by the <approved HUGO gene symbol> gene." would be appropriate.
  2. Gene
    Specific information about the gene (on which human chromosome it is located, regulation, etc.). Much of this basic information may already contained in the infobox and should not be unnecessarily repeated in this section unless especially notable.
  3. Protein
    Specific information about the protein (splice variants, post translational modifications, etc.). Again, much of this basic information may already contained in the infobox and should not be unnecessarily repeated unless especially notable.
  4. Species, tissue, and subcellular distribution
    Optional section that concisely describes what species this gene is expressed (e.g., wide species distribution, bacteria, fungi, vertebrates, mammals, etc.), what tissue the protein is expressed, and which subcellular compartments or organelles the protein is found (excreted, cytoplasm, nucleus, mitochondria, cell membrane).
  5. Function
    Describe the function of the transcribed protein.
  6. Interactions
    Optional section that lists proteins that the protein that is the subject of the article is known to interact with.
  7. Clinical significance
    List diseases or conditions that are a result of a mutation in the gene or a deficiency or excess of the expressed protein.
  8. History/Discovery
    In general, it is not appropriate to mention the research group or institution that conducted a study directly in the text of the article. However it is appropriate to list the names of those who made key discoveries concerning the gene or protein in this section (e.g., the scientist or group that originally cloned the gene, determined its function, linked it to a disease, won a major award for the discovery, etc.).

Example articles of what such an organization may look like are: Protein C, Gonadotropin-releasing hormone or Rubisco.

Wikidata item[edit]

The Wikipedia article should be linked to a Wikidata item of the entity first mentioned in the first sentence of the lead section, which should be written as defined in WP:MCBMOSSECTIONS. Suppose that the first sentence is "Steroid 21-hydroxylase is a protein that in humans is encoded by the CYP21A2 gene." In this case, the Wikipedia article should be linked to a Wikidata item of the steroid 21-hydroxylase protein rather than the gene.

Citing sources[edit]

For guidance on choosing and using reliable sources, see Wikipedia:Identifying reliable sources (natural sciences) and Wikipedia:Reliable sources.

For general guidance on citing sources see Wikipedia:Citing sources, Wikipedia:Footnotes and Wikipedia:Guide to layout.

MCB articles should be relatively dense with inline citations, using either <ref> tags (footnotes) or parenthetical citations. It is not acceptable to write substantial amounts of prose and then add your textbook to the References section as a non-specific or general reference. It is too easy for a later editor to change the body text and then nobody is sure which statements are backed up by which sources.

There is no standard for formatting citations on Wikipedia, but the format should be consistent within any one article. Some editors format their citations by hand, which gives them control over the presentation. Others prefer to use citation templates such as {{Cite journal}}, {{Cite book}}, {{Cite web}}, {{Cite press release}} and {{Cite news}}. Citations in the Vancouver format can be produced using the |vauthors= or |veditors= parameters. The Uniform Requirements for Manuscripts Submitted to Biomedical Journals (URM) citation guidelines list up to six authors, followed by et al. if there are more than six.[1] Some editors prefer to expand the abbreviated journal name; others prefer concise standard abbreviations.

Abstracts of most MCB related journals are freely available at PubMed, which includes a means of searching the MEDLINE database. The easiest way to populate the journal and book citation templates is to use Diberri's template-filling web site or the Universal reference formatter. Search PubMed for your journal article and enter the PMID (PubMed Identifier) into Diberri's template filler or the Universal reference formatter[dead link]. If you use Internet Explorer or Mozilla Firefox (2.0+), then Wouterstomp's bookmarklet can automate this step from the PubMed abstract page. Take care to check that all the fields are correctly populated, since the tool does not always work 100%. For books, enter the ISBN into Diberri's tool. Multiple references to the same source citation can be achieved by ensuring the inline reference is named uniquely. Diberri's tool can format a reference with the PMID or ISBN as the name.

In addition to the standard citation text, it is useful to supply hyperlinks. If the journal abstract is available on PubMed, add a link by typing PMID xxxxxxxxx. If the article has a digital object identifier (DOI), use the {{doi}} template. If and only if the article's full text is freely available online, supply a uniform resource locator (URL) to this text by hyperlinking the article title in the citation. If the full text is freely available on the journal's website and on PubMed Central, prefer to link the former as PubMed central's copy is often a pre-publication draft. When the source text is available in both HTML and PDF, the former should be preferred as it is compatible with a larger range of browsers. If citation templates are used, these links can be supplied via the |pmid=, |doi=, |url= and |pmc= parameters. Do not add a "Retrieved on" date for convenience links to online editions of paper journals (however "Retrieved on" dates are needed on other websources).

For example:

Markup Renders as
Levy R, Cooper P. [http://www.mrw.interscience.wiley.com/cochrane/clsysrev/articles/CD001903/frame.html Ketogenic diet for epilepsy.] Cochrane Database Syst Rev. 2003;(3):CD001903. {{doi|10.1002/14651858.CD001903}}. {{PMID|12917915}}.

Levy R, Cooper P. Ketogenic diet for epilepsy. Cochrane Database Syst Rev. 2003;(3):CD001903. doi:10.1002/14651858.CD001903. PMID 12917915.

A citation using {{cite journal}}:

Markup Renders as
{{cite journal |vauthors=Bannen RM, Suresh V, Phillips GN Jr, Wright SJ, Mitchell JC |title=Optimal design of thermally stable proteins |journal=Bioinformatics |volume=24 |issue=20 |pages=2339–43 |year=2008 |pmid=18723523 |pmc=2562006 |doi=10.1093/bioinformatics/btn450 |url=http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2339}}

Bannen RM, Suresh V, Phillips GN Jr, Wright SJ, Mitchell JC (2008). "Optimal design of thermally stable proteins". Bioinformatics. 24 (20): 2339–43. doi:10.1093/bioinformatics/btn450. PMC 2562006. PMID 18723523.

Or the alternative {{vcite journal}}:

Markup Renders as
{{vcite journal |author=Bannen RM, Suresh V, Phillips GN Jr, Wright SJ, Mitchell JC |title=Optimal design of thermally stable proteins |journal=Bioinformatics |volume=24 |issue=20 |pages=2339–43 |year=2008 |pmid=18723523 |pmc=2562006 |doi=10.1093/bioinformatics/btn450 |url=http://bioinformatics.oxfordjournals.org/cgi/content/full/24/20/2339}}

Bannen RM, Suresh V, Phillips GN Jr, Wright SJ, Mitchell JC. Optimal design of thermally stable proteins. Bioinformatics. 2008;24(20):2339–43. doi:10.1093/bioinformatics/btn450. PMID 18723523. PMC 2562006.


References

Navigation box[edit]

Articles about related proteins may be cross linked by including one or more navigation boxes as appropriate. Examples include:

Categories[edit]

Every Wikipedia article should be added to at least one category. Categories or subcategories that may be appropriate for gene and protein articles include:

Notes[edit]

  1. ^ It is very important to realize that EC describes the reaction, not a protein family. The editor should keep this distinction in mind, even when a source (often the IUBMB itself) describes an EC enzyme as belonging to a certain evolutionarily-defined family or using a certain mechanism. Checking Expacy or BioCyc for proteins to compare in InterPro may help.