= Chromosome 18 open reading frame 12 =

Chromosome 18 open reading frame 12, also known as C18orf12, is a protein encoded in humans by the C18orf12 gene.

== Function ==
The biological function of C18orf12 is not well understood. According to NCBI Gene, the gene encodes a small, predicted intracellular protein with no experimentally characterized role. Computational analysis of the protein sequence suggests that it may localize to the cytoplasm and nucleus and could be involved in regulatory processes within the cell membrane. A genome-wide association study has linked a locus near C18orf12 to regulatory factors associated with human height. Further studies are needed to clarify its molecular function and biological significance.

== Gene ==
C18orf12 refers to the 12th open reading frame on chromosome 18 of the human genome, at 18q21.1. The gene has a single exon and spans base pairs 45,778,672 to 45,779,208 (537 bp). The encoded protein, C18orf12, is also referred to as HEIL1 and HsT2508 in some sequence databases. The gene has occasionally been listed as an alias of ZBTB7C-AS2 (Zinc Finger and BTB Domain Containing 7C antisense RNA 2), a long non-coding RNA transcript overlapping the same genomic locus on the (+) strand. This overlap has caused some ambiguity in gene annotations across resources, but human C18orf12 and its strict orthologs (primates) were found to transcribe an mRNA that is translated into a protein.
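
The coordinates above can be checked against the reported protein length. The sketch below assumes 1-based, inclusive coordinates and a coding sequence made of 178 codons (the protein length given in the Transcript section) plus one stop codon. The function names are illustrative and do not come from any annotation tool.

```python
def inclusive_span(start: int, end: int) -> int:
    """Length in bp of a 1-based, inclusive genomic interval."""
    return end - start + 1


def orf_length(n_residues: int, include_stop: bool = True) -> int:
    """Nucleotides needed to encode n_residues, optionally plus one stop codon."""
    codons = n_residues + (1 if include_stop else 0)
    return 3 * codons


# Coordinates and protein length as quoted in this article.
span = inclusive_span(45_778_672, 45_779_208)
orf = orf_length(178)

print(span, orf)
```

Under these assumptions both values come out to 537, so the single-exon span is exactly the size of an open reading frame for a 178 aa protein.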

=== Gene neighborhood summary ===
Human C18orf12 is located ~400 kb upstream of SMAD2 (SMAD family member 2; the name combines Sma from Caenorhabditis elegans and Mad from Drosophila melanogaster), a transcription factor involved in TGF-β signaling and osteoblast differentiation, and ~300 kb downstream of CTIF (cap-binding complex-dependent translation initiation factor), also known as KIAA0427.

== Transcript ==
The gene's mRNA is consistent with annotations from AceView and NCBI, and the full transcript is translated into a 178 amino acid (aa) protein with no alternative isoforms or variants reported. Gene expression data are limited. One study focused on invariant genes, rather than the variant genes usually studied in connection with genetic diseases or disorders, and identified C18orf12 as invariant across all human populations studied. This suggests strong evolutionary constraint and a possibly essential function, since the gene has not changed across many human populations. It is also consistent with the homology analysis: the gene has only been reported in primates and is relatively conserved among them (Table 3).

A meta-analysis of genome-wide association studies in Han Chinese, the largest ethnic population in the world, found three loci that reached genome-wide statistical significance (p < 5 × 10⁻⁸) for association with human height, a complex genetic trait. One of them is a single nucleotide polymorphism (SNP) near C18orf12. The SNP is located roughly 400 kb upstream of the SMAD2 gene, which mediates TGF-β signaling and can affect cell proliferation, differentiation, and apoptosis, so C18orf12 may be part of a regulatory locus relevant to skeletal growth and bone development.

=== Tissue expression ===
NCBI Gene does not report RNA-sequencing expression histograms across tissues for C18orf12, apparently because of the alias confusion with ZBTB7C-AS2, a non-coding gene. A gene expression database for animals maintained by the SIB Swiss Institute of Bioinformatics reported a high likelihood of expression in ligaments (Figure 2).

== Protein ==
The theoretical isoelectric point (pI) and molecular weight of C18orf12, computed from its protein sequence, are 6 and 19.7 kDa, respectively. The protein's sequence was also analyzed for its composition and internal repeats, as seen in Figure 3.
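
The reported molecular weight can be sanity-checked with a rough estimate. The sketch below assumes the common rule of thumb of about 110 Da per residue. It is not the method used to obtain the 19.7 kDa value, which depends on the actual sequence, and the names are illustrative.

```python
# Rule-of-thumb average residue mass in daltons (an assumption, not a measured value).
AVG_RESIDUE_MASS_DA = 110.0


def approx_mass_kda(n_residues: int, avg_residue_mass: float = AVG_RESIDUE_MASS_DA) -> float:
    """Rough protein mass in kDa from residue count alone."""
    return n_residues * avg_residue_mass / 1000.0


# 178 aa is the protein length reported in this article.
estimate = approx_mass_kda(178)
print(f"{estimate:.1f} kDa")
```

The estimate of roughly 19.6 kDa is close to the sequence-based value of 19.7 kDa, as expected for a protein without an unusual composition.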

C18orf12 is lysine (K) poor compared with the amino acid content of the average human protein in the HUMAN.q database. Serine (S) and leucine (L) make up a large portion of the protein, at 11.2% and 13.5%, respectively. C18orf12 also has a low content of the basic residues lysine and arginine (KR), suggesting that the protein is less basic than the average human protein. The KR-ED difference indicates a net negative charge, which is consistent with the slightly acidic theoretical isoelectric point of 6. No significant positive or negative charge clusters were found, but one mixed charge cluster exists from residues 169-197. The longest uncharged stretch spans 44 amino acids. These results point to a largely neutral to weakly acidic protein, which is compatible with a cytoplasmic role, where a highly charged protein is not required. Periodicity analysis showed leucine repeating every 7 amino acids from residues 76-103 and 254-281. This could suggest a leucine zipper or coiled-coil region, and a leucine zipper pattern is also found in the predicted secondary structure analysis. Finally, the protein sequences of close orthologs (see Table 3) were analyzed in the same way, with similar results and no significant differences.
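
The composition statistics described above are simple residue counts. The following minimal sketch shows how percent composition and the KR-ED difference can be computed. The sequence `toy` is a made-up example, not the C18orf12 sequence, and the function names are illustrative rather than taken from any composition tool.

```python
from collections import Counter


def composition_percent(seq: str) -> dict[str, float]:
    """Percentage of each residue type in a protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: 100.0 * n / len(seq) for aa, n in counts.items()}


def kr_minus_ed(seq: str) -> int:
    """Count of basic residues (K, R) minus count of acidic residues (E, D)."""
    seq = seq.upper()
    basic = seq.count("K") + seq.count("R")
    acidic = seq.count("E") + seq.count("D")
    return basic - acidic


# Hypothetical 16-residue toy sequence, used only to exercise the functions.
toy = "MSLLESDLRSLEDLAS"

percentages = composition_percent(toy)
net = kr_minus_ed(toy)
print(round(percentages["L"], 2), round(percentages["S"], 2), net)
```

A negative KR-ED value, as in the toy example, corresponds to the net negative charge described for C18orf12.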

=== Predicted secondary structure ===

Using C18orf12's protein sequence, one tool predicted subcellular localization in the cytoplasm with 65% probability and in the nucleus with 18% probability. A second tool, which focuses on signal importance, gave scores of 60% for the nucleus and 52% for the cytoplasm. Additionally, PSORT II found a leucine zipper pattern at amino acid positions 76-97. ALOM (Klein et al.'s method for transmembrane region allocation) indicated two potential transmembrane regions but a high likelihood that the protein is peripheral, which is consistent with a cytoplasmic and nuclear location. Both localization tools therefore place the protein in the cytoplasm and nucleus, in agreement with the other analyses of C18orf12.
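
A leucine zipper hit of the kind reported above can be illustrated with a simple pattern scan. The sketch assumes the classic pattern L-x(6)-L-x(6)-L-x(6)-L, which covers 22 residues, the same width as the reported 76-97 window. Whether PSORT II applies exactly this pattern is an assumption here. The sequence `toy` is invented and is not the C18orf12 sequence.

```python
import re

# Four leucines spaced seven residues apart; the lookahead also reports overlapping hits.
ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_zippers(seq: str) -> list[tuple[int, int]]:
    """Return 1-based, inclusive (start, end) positions of each pattern match."""
    hits = []
    for match in ZIPPER.finditer(seq.upper()):
        start = match.start() + 1
        end = match.start() + len(match.group(1))
        hits.append((start, end))
    return hits


# Hypothetical toy sequence with leucines at positions 4, 11, 18 and 25.
toy = "AAA" + "LAAAAAA" * 3 + "L" + "GG"

hits = find_leucine_zippers(toy)
print(hits)
```

A pattern match alone does not establish a functional zipper, which also requires a helical, coiled-coil context.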

=== Post translational modifications ===

**Table 1: Protein Motifs of C18orf12**

| ELM Name | Matched Sequence | Position | Description | Cell Compartment | Probability |
| CLV_PCSK_SKI1_1 | RDLTL | 142-146 | Subtilisin/kexin isozyme-1 (SKI1) cleavage site ([RK]-X-[hydrophobic]-[LTKF]-\|-X). | ER, Golgi, extracellular | 6.82E-03 |
| DEG_SPOP_SBC_1 | ASSST | 157-161 | The S/T rich motif known as the SPOP-binding consensus (SBC) of the MATH-BTB protein, SPOP, is present in substrates that undergo SPOP/Cul3-dependent ubiquitination. | nucleus, Cul3-RING ubiquitin ligase complex | 9.38E-04 |
| DOC_USP7_MATH_1 | PTASL | 120-124 | The USP7 MATH domain binding motif variant based on the MDM2 and p53 interactions. | nucleus | 1.24E-02 |
| LIG_FHA_1 | KVTSSLW, PPTASLN, SSTHGIS | 47-53, 119-125, 159-165 | Phosphothreonine motif binding a subset of FHA domains that show a preference for a large aliphatic amino acid at the pT+3 position. | nucleus | 8.66E-03 |
| LIG_LIR_Gen_1 | EYNTMASTF, SSLWASVSSFL | 19-27, 50-60 | Canonical LIR motif that binds to Atg8/LC3 protein family members to mediate processes involved in autophagy. | cytosol, cytoplasmic side of late endosome membrane | 3.63E-03 |
| LIG_SUMO_SIM_anti_2 | DLTLMP | 143-148 | Motif for the antiparallel beta augmentation mode of non-covalent binding to SUMO protein. | PML body, nucleus, nuclear body | 2.35E-03 |
| MOD_CK1_1 | SSSTHGI | 158-164 | CK1 phosphorylation site | cytosol, nucleus | 1.70E-02 |
| MOD_CK2_1 | NAESGRE | 167-173 | Casein kinase 2 (CK2) phosphorylation site | protein kinase CK2 complex, nucleus, cytosol | 1.46E-02 |
| MOD_GlcNHglycan | ESGR | 169-172 | Glycosaminoglycan attachment site | extracellular, Golgi | 1.79E-02 |
| MOD_GSK3_1 | EYNTMAST, VTSSLWAS, FVCSLSDT, SSSTHGIS | 19-26, 48-55, 106-113, 158-165 | GSK3 phosphorylation recognition site | cytosol, nucleus | 2.68E-02 |
| MOD_Plk_1 | RDLTLMP | 142-148 | Ser/Thr residue phosphorylated by the Plk1 kinase | nucleus, cytosol | 7.67E-03 |
| MOD_Plk_4 | VTSSLWA, RDLTLMP | 48-54, 142-148 | Ser or Thr residue phosphorylated by Plk4 | nucleus, cytosol | 6.02E-03 |

Table 1 shows the linear motifs returned when the human C18orf12 protein sequence was entered into an ELM search. Most motifs are nuclear or cytosolic (SPOP, USP7, FHA, SUMO, CK1/2, GSK3, Plk1/4), which again supports the predicted localization of C18orf12 in the cytoplasm and nucleus. Several kinase sites predicted by ELM (CK1/2, GSK3, Plk1/4) overlap with DTU Health predictions, lending support to both sets of predictions. Motifs such as SPOP and USP7 suggest that C18orf12 is regulated by ubiquitination, which occurs intracellularly and is again consistent with a nuclear or cytoplasmic location. One motif is a glycosaminoglycan attachment site, a modification associated with extracellular proteins, but this single motif conflicts with the absence of a predicted signal peptide and with the intracellular localization predictions, so it is likely a false positive.
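
The split between intracellular and secretory-pathway motifs in Table 1 can be tallied directly. The sketch below copies the compartment annotations from Table 1 and classifies a motif as nuclear or cytosolic if its annotation mentions either term. This keyword rule is a simplifying assumption, not part of ELM.

```python
# Compartment annotations copied from Table 1.
MOTIF_COMPARTMENTS = {
    "CLV_PCSK_SKI1_1": "ER, Golgi, extracellular",
    "DEG_SPOP_SBC_1": "nucleus, Cul3-RING ubiquitin ligase complex",
    "DOC_USP7_MATH_1": "nucleus",
    "LIG_FHA_1": "nucleus",
    "LIG_LIR_Gen_1": "cytosol, cytoplasmic side of late endosome membrane",
    "LIG_SUMO_SIM_anti_2": "PML body, nucleus, nuclear body",
    "MOD_CK1_1": "cytosol, nucleus",
    "MOD_CK2_1": "protein kinase CK2 complex, nucleus, cytosol",
    "MOD_GlcNHglycan": "extracellular, Golgi",
    "MOD_GSK3_1": "cytosol, nucleus",
    "MOD_Plk_1": "nucleus, cytosol",
    "MOD_Plk_4": "nucleus, cytosol",
}

# Keyword rule (an assumption): any mention of these terms counts as intracellular.
INTRACELLULAR_TERMS = ("nucleus", "cytosol")


def is_nuclear_or_cytosolic(compartments: str) -> bool:
    text = compartments.lower()
    return any(term in text for term in INTRACELLULAR_TERMS)


nuclear_or_cytosolic = [
    name for name, comp in MOTIF_COMPARTMENTS.items() if is_nuclear_or_cytosolic(comp)
]
secretory_only = [name for name in MOTIF_COMPARTMENTS if name not in nuclear_or_cytosolic]

print(len(nuclear_or_cytosolic), secretory_only)
```

By this rule, 10 of the 12 motif classes are nuclear or cytosolic, and the two exceptions are the SKI1 cleavage site and the glycosaminoglycan attachment site.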

**Table 2: Summary of Post Translational Modifications of C18orf12**

| Tool | PTM | Confidence |
| DTU Health & ELM | Predicted Phosphorylation Sites | 0.6-0.9 |
| DTU Health | Predicted O-Glycosylation Sites | 0.6-0.9 |
| DTU Health | Predicted C-mannosylation Sites | >0.5 |
| DTU Health & GPS & ELM | Predicted Kinases | 0.6-0.9 |
| Expasy | Sulfinator | 0 |
| PSORT II | Two Transmembrane Domains | 0.79 |
| DTU Health & Expasy | Arginine & Lysine Propeptide Cleavage Sites | 0 |
| DTU Health | N-Glycosylation | 0 |
| Expasy | Bioactive Small Molecules | 0 |
| Expasy | Oligosaccharide Structures | 0 |

N-linked glycosylation sites were searched for in human C18orf12 and close orthologs, and none were found. This is consistent with the subcellular localization predictions from several tools: N-linked glycosylation contributes to protein folding, stability, and trafficking, mostly for secreted proteins, whereas C18orf12 is predicted to reside in the cytoplasm and nucleus. In addition, no signal peptide was found in C18orf12, which supports the protein remaining intracellular, because N-glycosylation occurs in the endoplasmic reticulum and signal peptides usually direct proteins into that pathway by facilitating their movement across cellular membranes.

An O-GalNAc (mucin-type) glycosylation site search for mammalian proteins was run on human C18orf12 and close orthologs. Several sites scored above the 0.5 threshold and were annotated on the conceptual translation and schematic illustration (Figure 4); most lie near the C-terminus of the protein across close orthologs. This modification can affect protein stability. The conservation and clustering of these predicted O-GalNAc sites may indicate functional importance, perhaps in protein-protein interactions or stability within the cytoplasm or nucleus.

A DTU Health search for arginine and lysine propeptide cleavage sites in human C18orf12 and close orthologs returned no predicted cleavage sites. Such sites often occur in secretory proteins, so their absence is consistent with the predicted subcellular location in the nucleus and cytoplasm.

ALOM (Klein et al.'s method) again predicts two transmembrane domains in the human C18orf12 protein, and the prediction is conserved across close primate orthologs. Likewise, neither DTU Health nor Expasy predicted a signal peptide, which supports the absence of one.

=== Predicted Tertiary Structure ===

The human C18orf12 protein sequence was used to generate a predicted tertiary structure (Figure 5), which contains several beta sheets and alpha helices. In addition, a segment covering 4% of the full sequence (aa positions 26-30) matched the FhuA protein with 100% identity at 51.1% confidence. FhuA, found in Escherichia coli K-12, is a membrane protein involved in iron uptake, specifically ferrichrome transport. The match lends some support to the predicted fold of that segment (aa positions 26-30) and could imply FhuA-like characteristics for that region of C18orf12, but it does not mean that C18orf12 is a membrane transporter.

== Homology/Evolution ==
Table 3 lists 20 C18orf12 sequences: the human protein and 19 orthologs. C18orf12 was not found in any distant species; the protein sequence was found only in primates. Among them, orthologs with a median date of divergence of less than 20 million years ago (MYA) have 90.5% sequence identity to the human protein. Orthologs with a median date of divergence of 28.8 MYA have 79%-91% identity, and orthologs with a median date of divergence of 43 MYA have 71%-76% identity. No paralogs were found for this protein using NCBI HomoloGene, text-based searches, and NCBI BLAST.

**Table 3: C18orf12 Strict Orthologs and Related Properties**

| Abbreviation | Genus Species | Common Name | Taxonomic Group | Date of Divergence (MYA) | Accession # | Sequence Length (aa) | Sequence Identity to Human Protein (%) | Sequence Similarity to Human Protein (%) |
| Hsa | Homo sapiens | Human | Primate (Apes & Humans) | 0.0 | Q96KH6.1 | 178 | 100.0 | 100 |
| Pab | Pongo abelii | Sumatran Orangutan | Primate (Apes & Humans) | 15.2 | XP_024091948.1 | 175 | 90.5 | 92.7 |
| Nle | Nomascus leucogenys | Northern White-Cheeked Gibbon | Primate (Apes & Humans) | 19.5 | XP_012361647.1 | 178 | 90.5 | 92.7 |
| Pte | Piliocolobus tephrosceles | Ugandan Red Colobus | Primate | 28.8 | XP_023063357.1 | 180 | 91.1 | 92.2 |
| Tfr | Trachypithecus francoisi | Francois' Leaf Monkey | Primate | 28.8 | XP_033033693.1 | 178 | 91.0 | 92.1 |
| Pan | Papio anubis | Olive Baboon | Primate | 28.8 | XP_031514678.1 | 181 | 89.5 | 90.6 |
| Mle | Mandrillus leucophaeus | Drill | Primate | 28.8 | XP_011824377.1 | 181 | 89.5 | 90.6 |
| Mmu | Macaca mulatta | Indochinese Rhesus Macaque | Primate | 28.8 | XP_028693968.1 | 181 | 89.5 | 91.2 |
| Mne | Macaca nemestrina | Southern Pig-Tailed Macaque | Primate | 28.8 | XP_070941502.1 | 180 | 89.4 | 91.1 |
| Cap | Colobus angolensis palliatus | Angola Colobus | Primate | 28.8 | XP_011804754.1 | 180 | 89.4 | 91.1 |
| Mfa | Macaca fascicularis | Crab-Eating Macaque | Primate | 28.8 | XP_045233967.1 | 181 | 89.0 | 91.1 |
| Tge | Theropithecus gelada | Gelada | Primate | 28.8 | XP_025220923.1 | 181 | 87.9 | 90.1 |
| Rro | Rhinopithecus roxellana | Golden Snub-Nosed Monkey | Primate | 28.8 | XP_030782025.1 | 178 | 86.1 | 87.8 |
| Rbi | Rhinopithecus bieti | Black-and-White Snub-Nosed Monkey | Primate | 28.8 | XP_017717790.1 | 178 | 86.1 | 87.8 |
| Csa | Chlorocebus sabaeus | Green Monkey | Primate | 28.8 | XP_007972286.1 | 203 | 79.8 | 80.8 |
| Ana | Aotus nancymaae | Nancy Ma's Night Monkey | Primate | 43.0 | XP_021525747.2 | 179 | 75.8 | 80.2 |
| Sap | Sapajus apella | Tufted Capuchin | Primate | 43.0 | XP_032115299.1 | 179 | 74.7 | 79.7 |
| Cja | Callithrix jacchus | Common Marmoset | Primate | 43.0 | XP_035126892.1 | 172 | 73.3 | 77.8 |
| Sbo | Saimiri boliviensis | Black-Capped Squirrel Monkey | Primate | 43.0 | XP_039319791.1 | 178 | 71.4 | 77.6 |
| Cim | Cebus imitator | Panamanian White-Faced Capuchin | Primate | 43.0 | XP_037594901.1 | 178 | 71.0 | 77.5 |
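
The identity ranges quoted for each divergence group can be recomputed from Table 3. The sketch below copies the abbreviation, divergence date, and percent identity columns from the table and groups the non-human entries by divergence date. The grouping labels and function name are illustrative.

```python
from collections import defaultdict

# (abbreviation, median divergence in MYA, % identity to human), copied from Table 3.
ROWS = [
    ("Hsa", 0.0, 100.0), ("Pab", 15.2, 90.5), ("Nle", 19.5, 90.5),
    ("Pte", 28.8, 91.1), ("Tfr", 28.8, 91.0), ("Pan", 28.8, 89.5),
    ("Mle", 28.8, 89.5), ("Mmu", 28.8, 89.5), ("Mne", 28.8, 89.4),
    ("Cap", 28.8, 89.4), ("Mfa", 28.8, 89.0), ("Tge", 28.8, 87.9),
    ("Rro", 28.8, 86.1), ("Rbi", 28.8, 86.1), ("Csa", 28.8, 79.8),
    ("Ana", 43.0, 75.8), ("Sap", 43.0, 74.7), ("Cja", 43.0, 73.3),
    ("Sbo", 43.0, 71.4), ("Cim", 43.0, 71.0),
]


def identity_ranges(rows):
    """Map each divergence group to (min identity, max identity, number of orthologs)."""
    groups = defaultdict(list)
    for _abbr, mya, identity in rows:
        if mya == 0.0:  # skip the human reference sequence
            continue
        label = "<20" if mya < 20 else f"{mya:g}"
        groups[label].append(identity)
    return {label: (min(vals), max(vals), len(vals)) for label, vals in groups.items()}


ranges = identity_ranges(ROWS)
for label, (low, high, n) in ranges.items():
    print(f"{label} MYA: {low}-{high}% identity across {n} orthologs")
```

The three groups contain 2, 12, and 5 orthologs, and identity to the human protein falls as divergence time increases.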

=== Gene Evolution ===
This gene most likely appeared before the divergence of Old World monkeys, such as Macaca mulatta, and New World monkeys, such as Cebus imitator, around 43 million years ago (MYA). C18orf12 is a single-member gene family with no known isoforms, which may indicate that the gene is under functional constraint.

== Conceptual Translation ==
A conceptual translation of C18orf12 was created to showcase important and conserved features across primates. The predicted features were gathered from the computational analyses of the protein, which found several beta sheets and alpha helices. Predicted phosphorylation and O-glycosylation sites, two transmembrane domains, internal repeats, single nucleotide polymorphisms (SNPs), and a leucine zipper pattern were annotated where they aligned with highly conserved regions (bolded) across primate orthologs (Figure 7).
