Talk:Linkage disequilibrium

From Wikipedia, the free encyclopedia
WikiProject Genetics (Rated C-class, High-importance)
This article is within the scope of WikiProject Genetics, a collaborative effort to improve the coverage of Genetics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
This article has been rated as C-Class on the project's quality scale.
This article has been rated as High-importance on the project's importance scale.

Linkage Disequilibrium in bacteria

I would like to make a small suggestion to improve the article. In the last paragraph of the introduction it says:

For example, some organisms (such as bacteria) may show linkage disequilibrium because they reproduce asexually and there is no recombination to break down the linkage disequilibrium.

There is recombination in bacteria, as occurs during natural competence, for example. I do not question that bacteria do show LD and that this has something to do with the fact that they reproduce asexually (I am not an expert on LD, so I take it on trust that this is the case). Could this explanation be expanded upon, or reworded so that it does not imply that recombination does not occur in bacteria? The Red Lexicon (talk) 14:34, 17 September 2010 (UTC)

I really don't think it's correct to speak of linkage disequilibrium without recombination. You could have linkage disequilibrium in bacteria with recombination, but to say they have it by virtue of being haploid is incorrect. Trashbird1240 (talk) 17:53, 17 September 2010 (UTC)

Linkage disequilibria

I am a statistician and my remark is probably not of great interest, but from a statistical point of view it would be better to avoid NON-RANDOM association (because the association of alleles is a random process) and to replace that expression with NON-INDEPENDENT association. —Preceding unsigned comment added by (talk) 08:52, 5 October 2010 (UTC)


I removed the statement "and their genetic distance", since it is not relevant to LD. Lower linkage disequilibrium is expected in data sets the farther apart the loci are. However, LD reflects covariance between alleles, thereby comparing joint occurrence with a product distribution, irrespective of distance. -sboehringer

Could somebody put up a linkage disequilibrium plot with a description of what this plot shows? Thanks. —Preceding unsigned comment added by Sarhas (talk • contribs) 19:47, 24 June 2008 (UTC)

I reverted the wording to "non-random association" as this is the standard textbook definition. See Hartl & Clark Principles of Population Genetics 3rd edition 1997. Correlation may imply a particular measure of non-random association, whereas there are many measures of LD. --Lexor|Talk 05:01, 25 January 2006 (UTC)

Isn't there some way to reformat the 2x2 table of haplotype frequencies? The way it is, the entries on the rows abut, so that D is immediately followed by x_21 when in fact they are in different expressions. I tried some ways of putting blanks in between but none worked. Felsenst 23:40, 29 September 2006 (UTC)

I added a section on analysis software and added links to Haploview and PyPop. Jrandall 22:02, 24 October 2006 (UTC)

To Do

-Begin article with a definition of linkage disequilibrium, instead of stating what it is used for. (For instance, if I wrote an entry on bears, I would start off by saying "Massive plantigrade carnivorous or omnivorous mammals with long shaggy coats and strong claws." Rather than "Never feed hungry bears, because then they will start chasing you")

-Someone needs to add info about multiplicative and epistatic fitnesses for the two locus model -- which explains how selection can only produce LD at polymorphic equilibrium when haplotypes have epistatic fitness.

-Start off article with a small section at the top on Linkage EQUILibrium, this should give good contrast to then explain DISequilibrium.

-Add text on 'selective sweeps', it's a related topic to that of LD.

-Add example of LD, use HLA gene example.

-Text on how LD and 'selective sweep' can be used to find genes that cause drug resistance.

-Text on how LD can be selectively advantageous, neutral or disadvantageous.

--Mike Spenard 07:13, 1 January 2007 (UTC)

I've removed the statement that LD is generated by epistasis as this is untrue. LD arises as a consequence of mutations being syntenic to each other. When a mutation arises it is then in complete LD with other polymorphisms on that chromosome. This LD is broken down over time by recombination. It may be true that epistasis may selectively favor certain combinations of trans-acting genes and this would affect the fitness of certain haplotypes, but epistasis does not cause LD.

I also feel that using HLA as an example for LD is disingenuous. It's a highly polymorphic region with a relatively high number of recombination hot-spots. Starting off with bi-allelic loci (i.e. SNPs) would be a better demonstration of the principles of LD, rather than complicating it with multi-allelic loci. Given the HapMap has data on this in four populations, this would serve as a good start. It would also serve to demonstrate how population sub-structure affects the range of LD (viz. range of LD in populations of different ages, and the effect of population bottlenecks).

With regards to epistasis there is a formal genetic definition, but how to actually detect epistatic effects is still problematic; thus introducing the concept within a discussion of LD would, I feel, complicate the article unnecessarily. See Cordell H.J. (2002) Epistasis: what it means, what it doesn't mean, and statistical methods to detect it in humans. Hum. Mol. Genet. 11(20): 2463-2468 for a discussion of this. There is after all an article on epistasis, so this should be pointed to and epistasis between syntenic polymorphisms discussed there.

Slack---line 12:05, 2 August 2007 (UTC)


The statement that

When extending these formula for diploid cells rather than investigating the gametes/haplotypes directly, the laid out principle prevails, the recombination rate between the two loci A and B must be taken into account, though, which is commonly denoted by the letter c.

seems wrong to me. Aside from whether "the laid out principle" is good English, the estimate of D can be made without knowing anything about the recombination fraction between A and B (for example, by the EM method of Hill, 1974). Felsenst 01:34, 1 January 2007 (UTC)

Agreed, 'the laid out principle' needs to go. Also, recombination rate is usually denoted by r (in my experience); I think the article should use that, since r makes symbolic sense more easily. As for D, perhaps D' was meant (as in the value of D for the next generation) D'=(1-r)D  ??? But you are correct, D can be calculated without r. D is the deviation from 1 of the summation of all possible haplotype frequencies. r decays D, which is why I think the above article statement should be changed to apply to D'. [haplotypes A1B1=a,A1B2=b,A2B1=c,A2B2=d] ... D=(ad-bc) right? Could you toss the EM method into this thread? --Mike Spenard 07:03, 1 January 2007 (UTC)
I think you must have meant something else when you said "the [sum] of all possible haplotype frequencies", since the frequencies of everything always add to 1. D is the frequency of a particular (two-locus) haplotype (say A1B1), minus the expected frequency of that haplotype if the alleles at the two loci are distributed randomly, which is p(A1)p(B1). Felsenst 23:43, 4 September 2007 (UTC)
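
Since the EM method was asked for in this thread, here is a minimal sketch of gene-counting EM for two biallelic loci. It is not taken from Hill's 1974 paper; it is a generic version written under stated assumptions: random mating, genotype counts given as a 3x3 table indexed by the number of A1 and B1 alleles, and only the double heterozygotes being ambiguous in phase. The names `em_haplotype_freqs`, `disequilibrium` and `UNAMBIGUOUS`, and the example counts, are hypothetical. Note that the recombination fraction appears nowhere in it.

```python
# Haplotype order throughout: (A1B1, A1B2, A2B1, A2B2).
# Key: (copies of A1, copies of B1) -> haplotype counts that genotype must carry.
UNAMBIGUOUS = {
    (2, 2): (2, 0, 0, 0), (2, 1): (1, 1, 0, 0), (2, 0): (0, 2, 0, 0),
    (1, 2): (1, 0, 1, 0), (1, 0): (0, 1, 0, 1),
    (0, 2): (0, 0, 2, 0), (0, 1): (0, 0, 1, 1), (0, 0): (0, 0, 0, 2),
}


def em_haplotype_freqs(counts, iterations=200):
    """Estimate haplotype frequencies from two-locus genotype counts.

    counts[i][j] is the number of individuals with i copies of A1 and
    j copies of B1. Assumes random mating (an assumption of this sketch).
    """
    n_individuals = sum(sum(row) for row in counts)
    known = [0.0] * 4
    for (i, j), haps in UNAMBIGUOUS.items():
        for k in range(4):
            known[k] += counts[i][j] * haps[k]
    n_double_het = counts[1][1]

    freqs = [0.25] * 4  # starting guess: all haplotypes equally common
    for _ in range(iterations):
        # E-step: probability that a double heterozygote is A1B1/A2B2
        # rather than A1B2/A2B1, given the current frequencies.
        cis = freqs[0] * freqs[3]
        trans = freqs[1] * freqs[2]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: recount haplotypes with the double hets split by w.
        expected = [
            known[0] + w * n_double_het,
            known[1] + (1 - w) * n_double_het,
            known[2] + (1 - w) * n_double_het,
            known[3] + w * n_double_het,
        ]
        freqs = [e / (2 * n_individuals) for e in expected]
    return freqs


def disequilibrium(freqs):
    """D = f(A1B1) - p(A1) p(B1)."""
    f11, f12, f21, _ = freqs
    return f11 - (f11 + f12) * (f11 + f21)


# Hypothetical genotype counts, for illustration only.
example_counts = [[20, 8, 2], [10, 30, 9], [1, 7, 13]]
print(disequilibrium(em_haplotype_freqs(example_counts)))
```

EM can converge to a local maximum, so in practice one would probably try more than one starting point.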

What's with δ?

I think there is a problem with the section on δ. As stated, the formula for it is backwards -- it ought to be h12 - p1 p2, not the other way around. Furthermore there is the puzzling issue of why we have the two measures D and δ when they will be exactly the same number. Can someone enlighten me about this? (I have written a number of papers on linkage disequilibrium starting in 1965 so I believe that I know what I am talking about, but maybe I've missed something). Felsenst 05:38, 29 August 2007 (UTC)

Let me answer my own question. p1 p2 - h12 is not correct. But I had failed to notice that it was h12. However it is more conventional to express it in terms of h11, p1, and p2 whereupon it would be h11 - p1 p2. To express it in terms of h12 the correct formula would be p1(1-p2) - h12. But all this notation is not clear because h11 is the frequency of haplotype A1B1, but p2 is described as the marginal frequency at the B locus so that it is unclear whether it is for the B1 or the B2 allele. Felsenst 22:35, 26 September 2007 (UTC)
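
The algebra in this thread can be checked numerically. The sketch below assumes p1 is the frequency of A1 and p2 the frequency of B1 (the reading that makes h11 - p1 p2 work); the function name `d_expressions` and the example frequencies are made up for illustration.

```python
def d_expressions(h11, h12, h21, h22):
    """Return several candidate expressions for D from haplotype frequencies.

    h11, h12, h21, h22 are the frequencies of A1B1, A1B2, A2B1, A2B2
    and are assumed to sum to 1.
    """
    p1 = h11 + h12  # frequency of allele A1
    p2 = h11 + h21  # frequency of allele B1 (assumed meaning of p2)
    return {
        "h11 - p1*p2": h11 - p1 * p2,
        "p1*(1-p2) - h12": p1 * (1 - p2) - h12,
        "h11*h22 - h12*h21": h11 * h22 - h12 * h21,
        # The expression criticised above; it equals 2*p1*p2 - p1 + D, not D.
        "p1*p2 - h12": p1 * p2 - h12,
    }


for name, value in d_expressions(0.4, 0.1, 0.2, 0.3).items():
    print(f"{name:20s} {value:.4f}")
```

With these frequencies the first three expressions agree, while p1 p2 - h12 gives a different number.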

What's with r2?

In the discussion of the r2 measure of disequilibrium it is said that "This however is not adjusted to the loci having different allele frequencies." The whole point of r2 is that it is adjusted for gene frequencies, so this is incorrect. It may not be the best possible way of adjusting, but it tries to do so. This is followed up by a statement that if we take the square root of r2 and give it the sign of D, we get the measure D'. Which is fine, except that this isn't true (try doing both the r2 and D' computations in a case with gene frequencies away from 50% at one or more of the loci and you will see that they don't come out the same). Felsenst 07:08, 6 September 2007 (UTC)
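
To make the suggested check concrete, here is a small sketch that computes D' and the signed square root of r2 from the same haplotype frequencies. It assumes the usual definitions (D' = D/Dmax, with Dmax depending on the sign of D, and r2 = D^2 / (pA(1-pA)pB(1-pB))) and assumes both loci are polymorphic; the function name `ld_measures` and the example frequencies are hypothetical.

```python
import math


def ld_measures(h11, h12, h21, h22):
    """Return (D, D', signed r) for haplotypes A1B1, A1B2, A2B1, A2B2.

    Assumes the four frequencies sum to 1 and both loci are polymorphic.
    """
    p_a = h11 + h12
    p_b = h11 + h21
    d = h11 - p_a * p_b
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = d / d_max if d_max > 0 else 0.0  # keeps the sign of D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    signed_r = math.copysign(math.sqrt(r2), d)
    return d, d_prime, signed_r


# Both allele frequencies at 50%: the two measures coincide.
print(ld_measures(0.4, 0.1, 0.1, 0.4))
# Allele frequency 10% at the A locus: D' and signed r differ.
print(ld_measures(0.1, 0.0, 0.4, 0.5))
```

In the second case A1 occurs only with B1, so D' is at its maximum while r is well below 1.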

Not necessarily on the same chromosome

I have reverted the recent edit that described the loci whose disequilibrium is being calculated as "necessarily on the same chromosome arm". This is incorrect -- linkage disequilibrium can be calculated for loci on opposite arms of the same chromosome, or even on different chromosomes. Even when the recombination fraction becomes 50%, linkage disequilibrium can be nonzero. (When disequilibrium is calculated not within a single population but across populations, it can persist for substantial numbers of generations even for unlinked loci). Felsenst 17:19, 6 September 2007 (UTC)

I agree that they need not be on the same arm, but different chromosomes, that's a new one to me! Yes you can have alleles at unlinked loci in disequilibria, but it's not linkage, which is what the article is about (see my above comments with regards to epistasis).
Also why would you calculate LD across populations? Allele frequencies are key to calculating LD, and even under stochastic variation (Kimura's neutral theory) these will vary between populations, so it doesn't even make sense to calculate LD in a mixed pool --Slack---line 17:37, 23 September 2007 (UTC)
Your problem is that you are trying to have a logical terminology. The term "linkage disequilibrium" has always been unpopular among population geneticists because it implies that it will be zero when genes are unlinked (which is, as I mentioned, not quite true). Nevertheless the phrase has stuck. So yes, although D is small for unlinked genes, it is not zero and yes, we can calculate it. As for calculating it between populations, D can result from (among other causes) admixture of individuals from different populations, so it makes sense to compare within- and between-population D. Felsenst 06:04, 29 September 2007 (UTC)
I guess my issue is that linkage to me implies exactly that, a physical link. I never said that unlinked genes couldn't display disequilibria; is there an alternative term for describing such disequilibria when loci aren't linked? I've not come across any that I can think of, although "unlinked loci in disequilibria" would make the distinction (although I guess you are saying that there is no distinction, just a historically inaccurate artifact in the nomenclature). I shall have a dig through some books and references to see if I can find the first use of the term (if only to clarify things in my mind).
Fully aware of the effects of admixture and the confounding it causes when performing association mapping (and even the utility of admixture mapping in identifying disease loci). I think I misunderstood the point you were making, which if I understand correctly is that loci can appear to be in LD in admixed populations? Slack---line 15:19, 2 October 2007 (UTC)
I'd go a bit further and say that the unlinked loci don't just "appear" to be in LD in admixed populations, they are in LD in that case. LD is still called LD when the loci actually aren't linked. Furthermore that LD reflects covariation of allele frequencies across the source populations, and it is legitimate to analyze it by computing LD both within the population and across the source populations. This was first discussed, as far as I know, in the appendix by Timothy Prout to a paper by Jeff Mitton and Richard Koehn in Genetics, 1973 (73: 487-496). Felsenst 23:23, 3 October 2007 (UTC)
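
A small numerical sketch of the admixture point, under stated assumptions: two source populations that are each in linkage equilibrium but differ in allele frequencies at both loci, mixed in proportion m. Nothing in it depends on whether the loci are linked. The function names and the example frequencies are hypothetical.

```python
def equilibrium_haplotypes(p_a, p_b):
    """Haplotype frequencies (A1B1, A1B2, A2B1, A2B2) with D = 0."""
    return (p_a * p_b, p_a * (1 - p_b), (1 - p_a) * p_b, (1 - p_a) * (1 - p_b))


def disequilibrium(h):
    """D = f(A1B1) - p(A1) p(B1)."""
    h11, h12, h21, _ = h
    return h11 - (h11 + h12) * (h11 + h21)


def admixed_d(pop1, pop2, m):
    """D in a mixture taking a fraction m from pop1 and 1 - m from pop2.

    pop1 and pop2 are (p_a, p_b) pairs of allele frequencies; each source
    population is assumed to be in linkage equilibrium.
    """
    h1 = equilibrium_haplotypes(*pop1)
    h2 = equilibrium_haplotypes(*pop2)
    mixed = tuple(m * a + (1 - m) * b for a, b in zip(h1, h2))
    return disequilibrium(mixed)


pop1, pop2, m = (0.9, 0.8), (0.1, 0.2), 0.5
print(disequilibrium(equilibrium_haplotypes(*pop1)))  # within population 1
print(disequilibrium(equilibrium_haplotypes(*pop2)))  # within population 2
print(admixed_d(pop1, pop2, m))                       # in the mixture
# Closed form for comparison: m (1 - m) (difference at A) (difference at B)
print(m * (1 - m) * (pop1[0] - pop2[0]) * (pop1[1] - pop2[1]))
```

D is zero (up to rounding) within each source population and clearly nonzero in the mixture, matching the product form on the last line.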

Someone has once again modified the article to say that linkage disequilibrium refers to loci that are on the same chromosome. It applies to them, but it can also be calculated for pairs of loci that are on different chromosomes. So I reverted the edit, and modified the text there a little bit more. (Folks, I am a board-certified theoretical population geneticist and did my graduate work under the guy who invented the term linkage disequilibrium, so I know what I am talking about). Felsenst (talk) —Preceding undated comment added 12:44, 1 August 2012 (UTC)

First Paragraph is incomplete

I'm an undergrad genetics student... but I do know that the first para ends abruptly. Ashton 07:17, 19 November 2007 (UTC)

That's the least of the problems of this page. It is a complete mess. Here are some problems:
  1. The utterly mysterious statement that "It may be instructive to study genetic equilibrium, and its application in the Hardy-Weinberg principle." Yup, it may be instructive to study that, or astronomy, or linguistics, but why not discuss linkage disequilibrium instead?
  2. Linkage disequilibrium is given the symbol δ, then there is some wrong algebra. The expression p1p2-h12 is given, which is not equal to D but in fact works out to p1p2-p1(1-p2)+D, which is 2 p1 p2-p1+D. Then that is (for no reason) equated to h11 h22 - h12 h21, which actually is one way of writing the linkage disequilibrium but which is not at all equal to p1p2-h12.
  3. Next the linkage disequilibrium is called D, without any mention of why the letter suddenly changed, and the discussion goes off sideways into all sorts of discussion of genotype frequencies for no particular reason. In the middle of which is a casual definition of D that happens to be correct but is a tiny fraction of all the discursive stuff.
  4. Then there is a discussion of D' which also describes its rival r2 as not adjusted for gene frequencies, when of course it is (so is D', they're just different adjustments).
  5. Then for no reason whatsoever Tajima's measure of departure from neutrality is described. It uses the symbol D, but is not at all related to linkage disequilibrium.
I wish I had the guts to totally rewrite the page but I'd just get all these authors who were responsible for the mess mad at me. Felsenst (talk) 07:40, 20 November 2007 (UTC)

I've been trying to tidy bits and pieces up myself, by adding references, altering glaring mistakes and so forth but haven't made much headway. Personally I wouldn't get mad (it's rather futile ranting in cyber-space), but having been enlightened by yourself I now feel that the terminology in the field should be revised and standardised (viz. above query about linkage disequilibrium, although whether Wikipedia is the forum for this is highly unlikely :-) ). Personally I'd defer to yourself Felsenst to make the majority of the changes, but would be more than happy to contribute and help. There are a number of things that need to be addressed first, like

  1. Content : What is the remit of the article? The vague reference to HapMap isn't really useful to the article nor is its adjunct about Ensembl or dbSNP (there are other irrelevant items in the article as well). Worked examples of how admixture can lead to (spurious ;-) ) LD would be useful, as would a few graphs demonstrating the decay of LD over time as a function of the recombination rate/LD measure/strength of epistatic selection. Software for calculating and giving visual representation of LD should be included.
  2. Structure : How the content should be structured. Intro, background (which should give reference to HWeqm and genetic equilibrium and explain why they are pertinent), formulae, worked examples, software, references, links.
  3. Accuracy : All formulae should be accurate (and referenced).
  4. Referencing : I've attempted to add references where possible, but if the article is to be fully restructured then it should also be fully referenced.
  5. Anything Else? This is just off the top of my head and I've not sat down and given it any hard thought so there are bound to be glaring mistakes.

I wouldn't be too worried about making the changes; if there were some sort of sandbox facility then the article could be written before being changed. I believe there are ways of getting wiki pages locked down if it's clear that someone is consistently defacing them. I took it upon myself to merge two coalescent theory pages a while back and expected to have people complaining, but left a notice of the planned merge on each discussion page, waited a month or so, didn't hear anything and went ahead with it (besides, based on the discussion it seems you and I are the only ones watching these pages, the other changes are made by people stumbling across it saying "this is a bit shoddy, let's add my tuppence here"). Slack---line (talk) 19:19, 28 November 2007 (UTC)

Well, I'm too busy right now. I was hoping by ranting to impel someone to try to fix it. In general it ought to start out much as it does, defining LD as nonrandom association. Then it should simply define it in terms of the frequency of a gamete: LD is when f(AB) is not equal to p(A)p(B). Then one can talk about the D = f(AB) - p(A)p(B) measure. Then perhaps some discussion of the forces that can cause it including selection on interacting loci, genetic drift, and admixture (migration). Next perhaps some discussion of how genotype frequencies depend on the gene frequencies and D, as well as how there are separate D's for every pair of alleles at these two loci. Then some mention of how they aren't all independent (with m and n alleles at the two loci there are actually (m-1)(n-1) degrees of freedom for D, as the D's for all pairs of alleles are not independent quantities). Then I think only two more topics are essential: (1) the higher-order D's can be defined (there's a nice formula by Bill Hill for them), and (2) standardized measures of LD such as D' and r^2 need to be mentioned. Maybe somewhere in here the issue of estimating D from diploid genotype frequencies too. There are many peripheral topics but this would already be fairly long. BTW in the first paragraph both LD and linkage are defined as association. One is association across a population, the other association among gametes produced by a double heterozygote, but this is not clarified. Felsenst (talk) 08:03, 29 November 2007 (UTC)
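
The (m-1)(n-1) point can be illustrated with a short sketch: with one D for every pair of alleles, each row and each column of the table of D's sums to zero, so only (m-1)(n-1) of the entries are free. The function name `d_matrix` and the 3x2 table of haplotype frequencies are invented for illustration.

```python
def d_matrix(hap):
    """Return D[i][j] = f(A_i B_j) - p(A_i) q(B_j) for every allele pair.

    hap[i][j] is the frequency of the haplotype carrying allele i at the
    first locus and allele j at the second; entries are assumed to sum to 1.
    """
    m, n = len(hap), len(hap[0])
    p = [sum(hap[i]) for i in range(m)]                       # first locus
    q = [sum(hap[i][j] for i in range(m)) for j in range(n)]  # second locus
    return [[hap[i][j] - p[i] * q[j] for j in range(n)] for i in range(m)]


# Hypothetical frequencies: 3 alleles at the first locus, 2 at the second.
hap = [[0.2, 0.1],
       [0.1, 0.3],
       [0.2, 0.1]]
d = d_matrix(hap)
for row in d:
    print(["%.3f" % x for x in row])
print("row sums:", [round(sum(row), 12) for row in d])
print("column sums:", [round(sum(col), 12) for col in zip(*d)])
```

Here m = 3 and n = 2, so two of the six D values determine the rest.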

Linkage disequilibrium measure, D

Things seem to be improving. Some concerns:

  1. I am not sure why the haplotype frequencies have to be carefully described as "relative frequencies". They don't add up to 1? Actually they do add up to 1 (if there are two alleles at each of the two loci) so describing them as relative is not helpful. They are absolute frequencies, not relative frequencies.
  2. Lewontin and Kojima's paper gives LD its name. That was eliminated. I am not sure whether they were the first to use the letter D, as the reference in the text now implies.
  3. I am told by Monty Slatkin that the historical attribution to Robbins is in fact incorrect -- LD and its mathematics was introduced slightly earlier in a paper by H. S. Jennings, who was the great pioneer of genetics of protists.

Some of the problems I noticed earlier are still around but things are improving. Felsenst (talk) 14:16, 12 December 2007 (UTC)

On a reread of Jennings's paper I am not so sure. He did work out the math of two loci each with two alleles, and correctly. However he did not introduce D but just gave expressions in terms of the four haplotype frequencies. Robbins was the first to recast this in terms of D, which he called . Jennings's paper is: Jennings, H. S. 1917. The numerical results of diverse systems of breeding, with respect to two pairs of characters, linked or independent, with special relation to the effects of linkage. Genetics 2: 97-154. Felsenst (talk) 13:18, 13 December 2007 (UTC)

Oops, actually, Robbins used the symbol , not . Felsenst (talk) 12:56, 14 December 2007 (UTC)

Explained why D goes to zero

I have revised the explanation of why D approaches zero with random mating. The reference to Crow and Kimura for the derivation has been removed, as most readers will not have access to the book. I still don't understand why linkage disequilibrium is described as two quantities δ and D, with the former defined wrongly. Felsenst (talk) 00:10, 4 September 2008 (UTC)
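
For readers without the book, the decay can be sketched numerically. The sketch assumes the standard two-locus recursion under random mating with recombination fraction r, in which each generation A1B1 and A2B2 lose r*D and A1B2 and A2B1 gain r*D, so that D is multiplied by (1 - r). Function names and starting frequencies are hypothetical.

```python
def disequilibrium(h):
    """D = h11*h22 - h12*h21 for frequencies (A1B1, A1B2, A2B1, A2B2)."""
    h11, h12, h21, h22 = h
    return h11 * h22 - h12 * h21


def next_generation(h, r):
    """Haplotype frequencies after one generation of random mating."""
    h11, h12, h21, h22 = h
    d = disequilibrium(h)
    return (h11 - r * d, h12 + r * d, h21 + r * d, h22 - r * d)


def d_over_time(h, r, generations):
    """List of D values for generations 0..generations."""
    values = [disequilibrium(h)]
    for _ in range(generations):
        h = next_generation(h, r)
        values.append(disequilibrium(h))
    return values


start = (0.4, 0.1, 0.2, 0.3)  # D = 0.1
for r in (0.5, 0.1, 0.01):
    d_values = d_over_time(start, r, 10)
    closed_form = d_values[0] * (1 - r) ** 10
    print(r, d_values[10], closed_form)
```

Even with r = 0.5 (unlinked loci) D is only halved each generation rather than vanishing at once, while the allele frequencies stay fixed throughout.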

"Lost it" and made many changes

Sorry folks, for altering your prose but I made quite a few changes. The misbegotten quantity δ which was misdefined is gone, some wording improved, the discussion of the relationship of D' and r2 and whether they are or are not corrected for gene frequencies is changed. I hope you will find it in your hearts to forgive me. Felsenst (talk) 19:54, 5 September 2008 (UTC)


We now need a section showing how in diploids having LD means that we expect an association between genotypes at two loci (I removed a mention of diploids because it got this mixed up with the effect of recombination, something that happens in both haploids and diploids). We also need a section of multi-locus disequilibrium measures (at least, mentioning that they exist). Felsenst (talk) 19:59, 5 September 2008 (UTC)
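
As a rough illustration of the diploid point, the sketch below forms diploid genotypes by random union of two gametes and computes the covariance between the number of A1 alleles and the number of B1 alleles an individual carries. Under that assumption the covariance comes out as 2D, whatever the recombination fraction. The function name `genotype_covariance` and the example frequencies are hypothetical.

```python
from itertools import product

# (carries A1?, carries B1?) for haplotypes A1B1, A1B2, A2B1, A2B2.
HAPLOTYPES = [(1, 1), (1, 0), (0, 1), (0, 0)]


def genotype_covariance(h):
    """Covariance of A1 count and B1 count across diploid individuals.

    h gives the frequencies of A1B1, A1B2, A2B1, A2B2; genotypes are
    assumed to be formed by random union of two gametes.
    """
    gametes = list(zip(HAPLOTYPES, h))
    mean_x = mean_y = mean_xy = 0.0
    for (g1, f1), (g2, f2) in product(gametes, repeat=2):
        prob = f1 * f2
        x = g1[0] + g2[0]  # copies of A1 in this individual
        y = g1[1] + g2[1]  # copies of B1 in this individual
        mean_x += prob * x
        mean_y += prob * y
        mean_xy += prob * x * y
    return mean_xy - mean_x * mean_y


h = (0.4, 0.1, 0.2, 0.3)
d = h[0] - (h[0] + h[1]) * (h[0] + h[2])
print(genotype_covariance(h), 2 * d)
```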

I am adding a section using Burger/Nagylaki notation for multilocus linkage disequilibrium, but the notation is fairly complex (possibly inappropriate for an encyclopedia). Can you point me to any better sources for this? — Preceding unsigned comment added by Trashbird1240 (talk • contribs) 20:47, 13 January 2012 (UTC)

Ludicrous reference

A bot came along and changed a reference (one which I had added, so I feel paternal about it) to look like this:

Robbins, R.B. (07/01/1918). "Some applications of mathematics to breeding problems III". Genetics 3 (4): 375–389.

Now exactly why is it important to know that this journal issue was published on 1 July 1918? It's not a newspaper issue. It's a scientific journal volume (and volume 3 of that journal consisted of issues published at various dates of that year). No one but this overenthusiastic bot makes references like this. How do we stop it? It seems to have been reacting to the reference having had a "year" tag but not a "date" tag. So it put in the date. What next? Hours, minutes, and seconds? Felsenst (talk) 12:25, 14 October 2008 (UTC)

HLA stuff too complicated and too technical?

In my opinion, for what it's worth, the HLA examples are too long, too complex, and have too much technical jargon (such as "codominant") that the reader may not know. In addition there are distinctions not explained, such as LD within populations and LD among (between) populations -- the example used in the second paragraph of this page is between-population LD and the examples used in the HLA section of the article are within-population LD. That whole issue is not yet explained but the distinction is important. In the figure that has appeared at the top of the article the LD comes from assortative mating, an issue barely touched on in the article. The LD in that example is described as "randomly generated" without making it clear whether this is due to random genetic drift (it is mostly due to assortative mating with the differences between curves due to drift). In general, the article is getting too long and messy and is moving away from simple, lucid explanation. Felsenst (talk) 10:54, 18 September 2010 (UTC)

I agree. Ideally the article would have more prose and less mathematical notation. The amount of jargon should be decreased. Brief explanatory phrases should be provided upon introduction of remaining jargon. The worked examples that make up much of this article run counter to the WP:NOTTEXTBOOK policy by using "systematic problem solutions as examples". Emw (talk) 13:41, 18 September 2010 (UTC)

Incremental Revision

I have published a number of changes that I hope will improve this page. It is not intended to be anywhere near exhaustive; there is much more to be done (to which I hope to contribute someday).

1. I have added a reference to the term "gametic phase disequilibrium". I have cited Falconer & Mackay as the source; if anyone knows of a better one, please change it.

2. I have combined the paragraph beginning "The level of linkage disequilibrium is influenced by a number of factors" and the paragraph on Finnish disease heritage.

3. I have tried to improve some of the math typography, with indifferent success.

4. I have restated the definition of Lewontin's D. The previous version was incorrect, as a negative D would become a positive D' under that definition.

5. I have changed the section heading "Prevalence" to "Example: Human Leukocyte Antigen (HLA) alleles". I found the original heading rather confusing. I have not spent any time on the question of whether that section should exist, or whether it should be replaced by a different example.

6. I have corrected a number of minor typos and grammar or spelling errors.

I hope that this will be regarded as an overall improvement, and welcome any comments. --Glipton1 (talk) 20:29, 25 May 2011 (UTC)

Thanks. I'm embarrassed (as a "pro") to have gotten the D' formulas wrong. So in expiation I have corrected the formula for r as it only calculated r-squared and did not say what square root to take. The Finnish HLA example still has the defects I complained about above -- leaving us confused about within versus between-population disequilibrium, for example, and just being too damn long. Felsenst (talk) 03:10, 26 May 2011 (UTC)
Thanks for the comments. I have ambitions to revise or replace the HLA example, but, like everyone else, my time is limited. I will do what I can. I am not a particular expert on LD or population genetics, but I feel that such a central concept as LD merits a thoughtful and careful treatment in Wikipedia.--Glipton1 (talk) 17:37, 26 May 2011 (UTC)
... particularly since it is a concept that is not so easy for a lot of us to grasp. Felsenst (talk) 13:08, 27 May 2011 (UTC)

Independent because of what?

My original wording that the copies of A and B that are put together by recombination are to be regarded as independent of each other because they were located on "different haplotypes", has recently been changed by another user to "different loci". The phrase "different haplotypes" was probably unfortunate, I agree. But "different loci" does not work either. They are independent because they were on haploid genomes that were independently drawn from the population. If they had been on the same haploid genome, and nonindependent because they were in linkage disequilibrium, they still would be at different loci, so that is not a reason for them to be independent. But by saying "different haplotypes" I improperly implied more than just that they were on different copies of haploid genomes.

So what is a good phrase to use here? Suggestions? Felsenst (talk) 22:10, 8 August 2011 (UTC)

I have changed it to "in the two different gametes that formed the diploid genotype", which sort-of works. Felsenst (talk) 05:19, 18 March 2014 (UTC)

Changes to introduction

I have some proposed changes to the introduction for clarity and accuracy. The changes are up on my sandbox User:PaulGNelson/sandbox. Please let me know what you think. — Preceding undated comment added 20:44, 19 April 2015 (UTC)

Oops, I didn't see this comment before I substantially revised the intro. My apologies! See what you think of my version. I don't think your proposed example (deviation from the Hardy-Weinberg ratios) is an example of linkage disequilibrium, since it applies to two alleles at the same locus rather than alleles at different loci. There should definitely be a basic example but maybe it would be better if we simplified (and renamed) the "Definition" section since that is the classic example (i.e. a two locus, two allele model). (BTW, to sign your comments use four tildes in succession) Jasonbertram (talk) 05:39, 20 April 2015 (UTC)