User:Was a bee/Gene

From Wikipedia, the free encyclopedia
Human chromosome 7
Genomic location for AAA gene

1. Test

This is a test of automatically placing a marker icon on a chromosome ideogram image to show the location of a gene, based on the basepair position data stored in Wikidata.

In other words, the aim is to generate an image like the one below automatically.

Location of Sonic hedgehog gene

Wikimedia Commons currently has about 100 ideogram images that are used to show gene positions. See commons:category:Human chromosome ideograms which indicates gene location.

This is the marker icon.
This is the plain ideogram of chromosome 7. The whole ideogram set is at commons:Template:Human chromosome ideograms in svg.

2. Result

Test result
an arrow
Bp start 155,799,986
Bp end 155,812,273

Good (see the test case box on the right).

3. Calculation detail

The position where the marker should be placed is calculated as follows. Although the math expressions look somewhat complex, the actual calculation is not. The only concepts used are addition and subtraction for calculating lengths, and multiplication and division for scaling.

Calculation detail

Other concepts used here: the sum of gene-start and gene-end is divided by 2 to get the midpoint of the gene. Arrow-width is divided by 2 because an image is positioned by its leftmost point, not by its center. The conditional branch for arrow-width is just a technical detail: it is needed to choose a rectangle of the right shape from this series (commons:Template:Red rectangle series) as the marker, depending on the length of the target gene.

The calculation algorithm is as follows.

Math in one picture

Here the red terms are variables retrieved from Wikidata, the blue term is calculated from the variables retrieved from Wikidata, and the other (black) terms are constants.

  • Gene start position, measured from the terminus of the p-arm. (unit: basepair, example: 155799986 from wikidata:Q14860072)
  • Gene end position, measured from the terminus of the p-arm. (unit: basepair, example: 155812273 from wikidata:Q14860072)
  • Length of the chromosome which contains the target gene. (unit: basepair, example: 159345973 for chromosome 7)
  • Horizontal position of pter (tip of the p arm/short arm) in the ideogram image. (unit: pixel, example: 6 px)
  • Horizontal position of qter (tip of the q arm/long arm) in the ideogram image. (unit: pixel, example: 1109 px)
  • Width of the ideogram image. (unit: pixel, example: 1125 px)
  • Shown width of the ideogram image on the Wikipedia page. (unit: pixel, example: 300 px)
  • Calculated marker width, proportional to the gene length. This is generally far too small (e.g. 0.05px) and not an integer. (unit: pixel)
  • Actual marker width shown on the Wikipedia page. The minimum value is 2 and the value is always an integer (obtained with the ceiling function), that is, 2, 3, 4, 5... (unit: pixel)
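The equation picture itself is not reproduced in this text, so the following is only one possible reconstruction of the calculation from the term definitions above. The symbol names are assumptions introduced here for illustration and are not necessarily those used in the original picture; the order and exact form of the conditional branch may differ as well.

```latex
\begin{align*}
w_{\mathrm{calc}} &= \frac{g_{e} - g_{s}}{L_{\mathrm{chr}}}\,(x_{q} - x_{p})\,\frac{W_{\mathrm{shown}}}{W_{\mathrm{img}}} \\[4pt]
w_{\mathrm{arrow}} &=
  \begin{cases}
    2 & \text{if } w_{\mathrm{calc}} < 2 \\
    \lceil w_{\mathrm{calc}} \rceil & \text{otherwise}
  \end{cases} \\[4pt]
x_{\mathrm{arrow}} &= \left( x_{p} + (x_{q} - x_{p})\,\frac{g_{s} + g_{e}}{2\,L_{\mathrm{chr}}} \right)\frac{W_{\mathrm{shown}}}{W_{\mathrm{img}}} - \frac{w_{\mathrm{arrow}}}{2}
\end{align*}
```

Here $g_{s}$ and $g_{e}$ stand for the gene start and end positions, $L_{\mathrm{chr}}$ for the chromosome length, $x_{p}$ and $x_{q}$ for the pixel positions of pter and qter, $W_{\mathrm{img}}$ and $W_{\mathrm{shown}}$ for the original and shown image widths, $w_{\mathrm{calc}}$ and $w_{\mathrm{arrow}}$ for the calculated and actual marker widths, and $x_{\mathrm{arrow}}$ for the left edge of the marker on the shown image.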

By substituting the terms with example values, we get...

The form in the last line of each equation is the one used in the program.

The actual calculation for the SHH gene (wikidata:Q14860072) is as follows.

First, we need to calculate arrow-width.

Here we get 0.022px as the marker image width for the SHH gene. Is this wrong? No. This result comes from the fact that most genes are very short compared to the whole chromosome length. If the whole chromosome is shown in about 300px, most human genes (≈10kb) span only 0.01px to 0.05px, depending on the chromosome length. So the third equation applies here: its condition, that the calculated width is smaller than the 2px minimum (0.022 < 2), is true. Hence we get....

Arrow-width is 2px. We can then calculate the arrow position using this value.

Thus we have the answer: we should put the 2px-wide arrow at the 288.2px position.

The position coordinate runs from the left (0 px) to the right (300 px).
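The SHH calculation can be sketched in Python as follows. This is a minimal sketch and not the code of Module:Infobox gene/sandbox2; the function and parameter names are hypothetical, and the formula is inferred from the term definitions in this section and checked only against the SHH numbers given here.

```python
import math


def marker_width(gene_start, gene_end, chr_length,
                 pter_px, qter_px, image_width_px, shown_width_px):
    """Return (calculated width, actual marker width) in shown pixels."""
    scale = shown_width_px / image_width_px          # image px -> shown px
    calculated = ((gene_end - gene_start) / chr_length
                  * (qter_px - pter_px) * scale)
    # Minimum width is 2 px; otherwise round up to the next integer.
    actual = max(2, math.ceil(calculated))
    return calculated, actual


def marker_left(gene_start, gene_end, chr_length,
                pter_px, qter_px, image_width_px, shown_width_px,
                marker_width_px):
    """Return the left edge of the marker on the shown image, in pixels."""
    scale = shown_width_px / image_width_px
    midpoint = (gene_start + gene_end) / 2           # middle of the gene (bp)
    center = (pter_px + (qter_px - pter_px) * midpoint / chr_length) * scale
    # Images are positioned by their leftmost point, so shift by half a width.
    return center - marker_width_px / 2


# Example values for SHH on chromosome 7, taken from this section.
shh = dict(gene_start=155799986, gene_end=155812273, chr_length=159345973,
           pter_px=6, qter_px=1109, image_width_px=1125, shown_width_px=300)

calculated, actual = marker_width(**shh)
left = marker_left(**shh, marker_width_px=actual)
print(actual)            # 2
print(f"{left:.1f}")     # 288.2
```

Keeping the width and the position in separate functions mirrors the order of the text: the arrow-width has to be known before the left edge can be computed.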

4. See also

Effort for reader-friendliness for general readers

Introductory gene textbook website by the National Library of Medicine. It includes gene location data on each gene's page.


5. Test with Module

at Module:Infobox gene/sandbox2

{{#invoke:Infobox gene/sandbox2|getTemplateData}}

https://en.wikipedia.org/w/index.php?title=Sonic_hedgehog&diff=prev&oldid=795122559

Category:Pages with script errors - Article namespace

6. On the width of the marker

Human

| Basepairs | Approx. width in Chr.1 (longest chromosome) | Approx. width in Chr.21 (shortest chromosome) |
|---|---|---|
| 248.9Mb (Chr.1 length) | 300px | 1599px |
| 46.7Mb (Chr.21 length) | 56px | 300px |
| 5Mb | 6px | 32px |
| 2.3Mb (longest human gene length) | 2.8px | 15px |
| 1.6Mb | 2px | 10.6px |
| 0.93Mb | 1.1px | 6px |
| 0.31Mb | 0.38px | 2px |
| 10kb (typical human gene length) | 0.012px | 0.064px |

Mouse

| Basepairs | Approx. width in Chr.1 (longest chromosome) | Approx. width in Chr.19 (shortest chromosome) |
|---|---|---|
| 195.4Mb (Chr.1 length) | 300px | 954.4px |
| 61.4Mb (Chr.19 length) | 94.3px | 300px |
| 3.9Mb | 6px | 19px |
| 2.3Mb (longest? mouse gene length) | 3.5px | 11.2px |
| 1.3Mb | 2px | 6.3px |
| 1.2Mb | 1.8px | 6px |
| 0.4Mb | 0.6px | 2px |
| 10kb (typical mouse gene length) | 0.015px | 0.049px |

Gene length varies from gene to gene, so the marker width also has to change from page to page. Listed here are some data needed for thinking about marker width.

Experiments showed that the marker width must be at least 2px, because a 1px marker is difficult to see.

The largest marker width would be 15px (red area in the table at the right).
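The approximate pixel widths in the table can be reproduced by simple proportional scaling. The following is a minimal sketch, assuming the whole chromosome is drawn 300px wide and ignoring the small margins before pter and after qter; the function names are hypothetical.

```python
def width_px(length_bp, chr_length_bp, shown_width_px=300):
    """Width in pixels of a region when the whole chromosome spans shown_width_px."""
    return length_bp / chr_length_bp * shown_width_px


def length_for_width(width, chr_length_bp, shown_width_px=300):
    """Region length in basepairs that corresponds to a given pixel width."""
    return width / shown_width_px * chr_length_bp


# Human chromosome lengths as given in the table (basepairs).
HUMAN_CHR1 = 248.9e6    # longest chromosome
HUMAN_CHR21 = 46.7e6    # shortest chromosome

longest_gene = 2.3e6    # longest human gene length from the table
typical_gene = 10e3     # typical human gene length from the table

print(round(width_px(longest_gene, HUMAN_CHR1), 1))    # 2.8
print(round(width_px(longest_gene, HUMAN_CHR21)))      # 15
print(round(width_px(typical_gene, HUMAN_CHR1), 3))    # 0.012
print(round(width_px(typical_gene, HUMAN_CHR21), 3))   # 0.064

# Shortest region on Chr.21 that still reaches the 2px minimum width.
print(round(length_for_width(2, HUMAN_CHR21) / 1e6, 2))  # 0.31 (Mb)
```

The same two functions applied to the mouse chromosome lengths give the values of the mouse table to within rounding.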

The lengths of the longest human genes are listed at http://www.cshlp.org/ghg5_all/section/gene.shtml

Red rectangle series: markers of 1px, 2px, 3px, 4px, 5px, 6px, 7px and 8px width.

7. Ideograms

Human chromosome ideograms in svg (one image per chromosome): 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y

The currently used ideogram set is shown above. If you want to use a different ideogram set, the following 5 conditions must be met.

  1. 24 images are needed (1-22, X and Y).
  2. All 24 images must have the same image size (same height and same width).
  3. In all 24 images, pter (terminus of the p-arm, leftmost point) and qter (terminus of the q-arm, rightmost point) must be at the same positions.
  4. The banding pattern must be drawn in basepair-proportional style. Standard ideograms defined by ISCN are drawn based on the actual visual appearance of stained chromosomes under a microscope, and are not basepair-proportional (see the comparison below).
  5. All file names must have the same format, differing only in the chromosome number. For example, if you created a chromosome 1 image named MyPrettyNice_Chr1_Ideogram.png, then the rest of the file names should be as follows.
MyPrettyNice_Chr2_Ideogram.png
MyPrettyNice_Chr3_Ideogram.png
MyPrettyNice_Chr4_Ideogram.png
....
MyPrettyNice_Chr9_Ideogram.png
MyPrettyNice_Chr10_Ideogram.png
....
MyPrettyNice_Chr22_Ideogram.png
MyPrettyNice_ChrX_Ideogram.png
MyPrettyNice_ChrY_Ideogram.png

Once these 5 conditions are met, you can switch from the current images to the new ones by changing the part of the code where the ideogram file name is defined.
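The naming rule in condition 5 can be illustrated with a short sketch that builds all 24 file names from a single pattern. This is only an illustration: the function name and the `{chromosome}` placeholder are hypothetical, and the code is not taken from the module.

```python
# The 24 chromosome labels: 1-22, X and Y.
CHROMOSOMES = [str(n) for n in range(1, 23)] + ["X", "Y"]


def ideogram_file_names(pattern="MyPrettyNice_Chr{chromosome}_Ideogram.png"):
    """Return the 24 file names, which differ only in the chromosome label."""
    return [pattern.format(chromosome=label) for label in CHROMOSOMES]


names = ideogram_file_names()
print(len(names))   # 24
print(names[0])     # MyPrettyNice_Chr1_Ideogram.png
print(names[-1])    # MyPrettyNice_ChrY_Ideogram.png
```

With the file names produced from one pattern, switching to a new ideogram set only requires changing that pattern in one place.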

2 types of ideogram (we should use the bottom one)

  • We cannot use this type: the Chr.7 ideogram of the ISCN standard, which is drawn based on the actual visual appearance of the stained chromosome under a microscope.
  • We can use this type: the Chr.7 ideogram drawn in basepair-proportional style. As far as I know, all genome browsers (e.g. Ensembl, UCSC and so on) use this style of ideogram.

In common: in both images, the band order is the same. The band colors, from left to right, are in the following order:
  • white
  • light grey
  • white
  • black
  • white
  • black
  • ....

Difference: although the order is the same, the widths of the bands are different. The salient parts are highlighted in the image below. Bands that are connected to each other are the same band, and you can see how their widths differ between the upper and the lower ideogram.

Cause: basepair density is not homogeneous within the chromosome. In some parts basepairs are densely packed, and in other parts they are sparsely packed.

8. Forward and Reverse strands

https://www.biostars.org/p/210929/

https://www.biostars.org/p/3908/

http://seqanswers.com/forums/showthread.php?t=39388

In GRCh, by convention, the direction from the p-arm (short arm) to the q-arm (long arm) is forward. The opposite direction is reverse.


From Nelson, Sarah C., et al. Trends in Genetics 28.8 (2012): 361-363.[1]

In all human reference chromosomes, as for other eukaryotes, the plus (+) strand is defined as the strand with its 5' end at the tip of the short arm (Genome Reference Consortium, personal communication, March 27, 2012).

Images: forward strand and reverse strand.

9. Others

It would be very nice if the following kind of technology were available.

But currently no such technology seems to exist.

Perhaps the Graph extension can do something...

10. Pages for test

  • Dystrophin - long gene (2.3 Mb at Chr.X)
  • RUNX1 - long gene (1.2 Mb at Chr.21)
  • Oct-4 - has many Ensembl IDs
  • Aprataxin - has many RefSeq IDs
  • MT-ND1 - mitochondrial gene
  • IFNA8 - has human data, but no mouse data