# Talk:Likelihood function

WikiProject Statistics (Rated C-class, High-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

This article has been rated as C-Class on the quality scale.
This article has been rated as High-importance on the importance scale.
WikiProject Mathematics (Rated C-class, High-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating: C-Class, High-importance
Field: Probability and statistics

Article does not communicate well. A relatively simple matter turns out to be difficult to understand. —Preceding unsigned comment added by 80.212.104.206 (talk) 13:34, 18 September 2010 (UTC)

I adjusted the wiktionary entry so it doesn't say that the mathematical definition is 'likelihood = probability'. Someone more mathematical than I may want to check to see if the mathematical definition I gave is correct. I defined "likelihood" in the parameterized-model sense, because that is the only way in which I have ever seen it used (i.e., not in the more abstract Pr(A | B=b) sense currently given in the Wikipedia article). 128.231.132.2 03:06, 21 March 2007 (UTC)

This article needs integrating / refactoring with the other two on the likelihood principle and maximum likelihood method, and a good going-over by someone expert in the field. -- The Anome

I emphatically agree. I've rewritten some related articles and I may get to this one if I ever have time. -- Mike Hardy

All was going well until I hit

In statistics, a likelihood function is a conditional probability function considered as a function of its second argument with its first argument held fixed, thus:

Would it be possible for someone to elaborate on that sentence or to give an example? FarrelIThink 06:12, 21 February 2007 (UTC)

## The arrow

Can someone tell me what the arrow notation is supposed to mean? --Huggie (talk) 11:30, 3 April 2010 (UTC)

## Context tag

I added the context tag because the article starts throwing mathematical functions and jargon around from the very beginning with no explanation of what the letters and symbols mean. Rompe 04:40, 15 July 2006 (UTC)

The tag proposes making it more accessible to a general audience. A vernacular usage makes likelihood synonymous with probability, but that is not what is meant here. I doubt this topic can be made readily comprehensible to those not familiar at the very least with probability theory. So I question the appropriateness of the "context" tag. The article starts with the words "In statistics,...". That's enough to tell the general reader that it's not about criminology, church decoration, sports tactics, chemistry, fiction writing, etc. If no such preceding words were there, I'd agree with the "context" tag. Michael Hardy 23:55, 16 July 2006 (UTC)

## Which came first

Which came first: the common usage, as in "in all likelihood this will not occur", or the mathematical function?

## Backwards

An earlier version of this page said "In a sense, likelihood works backwards from probability: given B, we use the conditional probability Pr(A|B) to reason about A, and, given A, we use the likelihood function L(A|B) to reason about B." This makes sense; i.e. it says it's backwards, and it is.

The current version uses L(B|A) instead, i.e. it says: "In a sense, likelihood works backwards from probability: given B, we use the conditional probability Pr(A|B) to reason about A, and, given A, we use the likelihood function L(B|A) to reason about B." This does not make sense. It says it's backwards, but it talks as if Pr and L are interchangeable.

How about switching back to the earlier version, and providing a concrete example to help clarify it? Possible example: Given that a die is fair, we use the probability of getting 10 sixes in a row given that the die is fair to reason about getting 10 sixes in a row; or given that we got 10 sixes in a row, we use the likelihood of getting 10 sixes in a row given that the die is fair to reason about whether the die is fair. (Or should it say "the likelihood that the die is fair given that 10 sixes occur in a row"? What exactly is the definition of "likelihood" used in this sort of verbal context, anyway?) --Coppertwig 20:28, 24 August 2007 (UTC)

I agree. Similarly, in the "abstract", the last sentence currently ends in "...and indicates how likely a parameter value is in light of the observed outcome." I do not know if it is OK to use the word "likely" in this way. Clearly, replacing it with "probable" in this sentence would make it terribly wrong by committing the common reversal-of-conditional-probabilities mistake. So: is "likely" clearly distinct from "probable", and understood to be so? In any case, I would suggest rewriting it to say "... and indicates how likely the observed outcome is to occur for different parameter values." Or am I missing something here? Enlightenmentreloaded (talk) 10:01, 28 October 2011 (UTC)
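
The die example proposed above can be made concrete with a small sketch. It assumes independent throws and a single hypothetical parameter `p_six` (the chance of a six on one throw); the function name and the candidate values are illustrative only and do not come from the article.

```python
from fractions import Fraction

def likelihood_ten_sixes(p_six):
    """Probability of 10 sixes in a row, viewed as a function of p_six."""
    return p_six ** 10

# Probability use: the die is assumed fair, and we reason about the outcome.
prob_fair = likelihood_ten_sixes(Fraction(1, 6))

# Likelihood use: the outcome is fixed, and we compare parameter values.
candidates = [Fraction(1, 6), Fraction(1, 2), Fraction(9, 10)]
likelihoods = {p: likelihood_ten_sixes(p) for p in candidates}

# Only ratios of likelihoods are compared here: how much better does
# p_six = 1/2 explain the observed run than p_six = 1/6?
ratio = likelihoods[Fraction(1, 2)] / likelihoods[Fraction(1, 6)]
print(prob_fair, ratio)
```

The same expression is read in two ways: with `p_six` fixed it is a probability for the outcome, and with the outcome fixed it ranks values of `p_six`. The values over the candidates are not required to sum to 1.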

## Likelihood of continuous distributions is a problem

The contribution looks attractive; however, it ignores several basic mathematical facts:

1. Usually likelihood is assessed using not one realization, but a series of observed random variables (independently identically distributed). Then the likelihood expands to a large product. Usually this is transformed by a logarithm to a sum. This transformation is not linear (like that mentioned in the entry), but it attains its maximum at the same point.

2. Likelihood can easily be defined for discrete distributions, where its values are values of some probabilities. A problem arises with an analogue for continuous distributions. Then the probability density function (pdf) is used instead of probability (probability function, pf). This is incorrect unless we use additional assumptions, e.g., continuity of the pdf. Without it, the notion of likelihood does not make sense, although this error occurs in most textbooks. (Do you know of any that gets this right? I did not find any; I did it in my textbook.) In any case, there are two totally different and incomparable notions of likelihood, one for discrete, the other for continuous distributions. As a consequence, there is no notion of likelihood applicable to mixed distributions. (Nevertheless, the maximum likelihood method can be applied separately to the discrete and continuous parts.)

Mirko Navara, http://cmp.felk.cvut.cz/~navara —Preceding unsigned comment added by 88.146.54.129 (talk) 08:16, 22 February 2008 (UTC)
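
Point 1 above (a product over i.i.d. observations, turned into a sum by the logarithm, with the maximum at the same point) can be checked numerically. The following is a minimal sketch, assuming Bernoulli observations and a made-up sample; the data, the grid and the function names are hypothetical.

```python
import math

def likelihood(p, data):
    """Product of Bernoulli probabilities for i.i.d. 0/1 observations."""
    result = 1.0
    for x in data:
        result *= p if x == 1 else 1.0 - p
    return result

def log_likelihood(p, data):
    """Sum of log-probabilities; defined for 0 < p < 1."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)

data = [1, 1, 0, 1, 0, 1, 1, 0]            # made-up sample: 5 ones, 3 zeros
grid = [k / 1000 for k in range(1, 1000)]  # interior grid, avoids log(0)

# The logarithm is strictly increasing, so both searches pick the same p.
argmax_lik = max(grid, key=lambda p: likelihood(p, data))
argmax_log = max(grid, key=lambda p: log_likelihood(p, data))
print(argmax_lik, argmax_log)
```

On this grid both maximizers coincide with the sample proportion 5/8, although the two functions take very different values.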

Just to clarify, by "the contribution" are you referring to the whole article or a particular section or edit? I assume the former.
On (1), well, the log-likelihood isn't mentioned in this article but clearly it isn't itself a likelihood. The invariance of maximum likelihood estimates to transformation is surely a matter not for this article but for the one on maximum likelihood. (I haven't checked that article to see what it says on the topic, if anything).
On (2), I think you've got a point that this article lacks a rigorous definition. I think the more accessible definition is needed too and should be given first. If you want to add a more rigorous definition, go ahead. I'm sure I've seen a measure-theoretic definition somewhere but I'm afraid I've never got to grips with measure theory myself.
When you say "I did it in my textbook", is that Teorie Pravděpodobnosti Na Kvantových a Fuzzy Logikách? I'm afraid I can't locate a copy to consult. Qwfp (talk) 09:34, 22 February 2008 (UTC)

## Area under the curve

"...the integral of a likelihood function is not in general 1. In this example, the integral of the likelihood density over the interval [0, 1] in pH is 1/3, demonstrating again that the likelihood density function cannot be interpreted as a probability density function for pH."

Because the likelihood function is defined only up to a scalar factor, the fact that the integral is 1/3 isn't that meaningful. However, I think we could say that one possibility is twice as likely as another or, similarly, that $p_H$ being in the range [a,b] is six times as likely as its being in the disjoint range [c,d]. Given that $p_H$ can't be less than 0 or more than 1, it seems sensible to normalize the likelihood so that the integral over that range is 1. I think that we could then say that if

$\frac{L(p_H\in [a,b]\,|\,\mathrm{HHT})}{L(p_H\in [0,1]\,|\,\mathrm{HHT})} = \frac{1}{2}$

then there's a 50/50 chance of $p_H$ being in the range [a,b] which would correspond to a normalized likelihood of 0.5. Am I mistaken? Why can't we just normalize to 1.0 and then interpret the normalized likelihood function as a probability density function? —Ben FrantzDale (talk) 17:17, 14 August 2008 (UTC)

"Why can't we just normalize to 1.0?" There are several reasons. One is that the integral in general doesn't exist (isn't finite). If an appropriate weighting function can be found, then the scaled function becomes something else, with its own interpretation, which would move us away from "likelihood function". However, certain theoretical work has been done which makes use of a different scaling: scaling by a factor to make the maximum of the scaled likelihood equal to one. Melcombe (talk) 08:45, 15 August 2008 (UTC)
Interesting. Can you give an example of when that integral wouldn't be finite? (This question may be getting at the heart of the difference between "likelihood" and "probability", a difference which I don't yet fully understand.) —Ben FrantzDale (talk) 12:38, 15 August 2008 (UTC)
An example might be the case where an observation X is from a uniform distribution on (0,a) with a>0. The likelihood function is 1/a for a > (observed X) and zero otherwise, so it is not integrable over a. A simple change of parameterisation to b=1/a gives a likelihood which is integrable. Melcombe (talk) 13:25, 15 August 2008 (UTC)
Don't forget the simplest case of all: uniform support! Not possible to normalize in this case. Robinh (talk) 14:47, 15 August 2008 (UTC)
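
Melcombe's uniform example can be written out explicitly. This is a sketch for one fixed observation $x > 0$, treating the likelihood first as a function of $a$ and then of $b = 1/a$; integrating against plain Lebesgue measure on the parameter is an assumption made here, not something the likelihood itself supplies.

$L(a \mid x) = \frac{1}{a} \text{ for } a > x, \qquad \int_x^\infty \frac{da}{a} = \infty$

$L(b \mid x) = b \text{ for } 0 < b < \frac{1}{x}, \qquad \int_0^{1/x} b\,db = \frac{1}{2x^2} < \infty$

The likelihood values are the same in both parameterisations (no Jacobian factor is applied); only the measure used to integrate over the parameter changes, which is why the integral is not an intrinsic property of the likelihood.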

It doesn't make sense to speak of a "likelihood density function". Likelihoods are not densities. Density functions are not defined pointwise. One can convolve them, but not multiply them. Likelihoods are defined pointwise. One can multiply them but not convolve them. One can multiply a likelihood by a density and get another density (although not in general a probability density, until one normalizes). Michael Hardy (talk) 16:00, 15 August 2008 (UTC)

## Needs a simpler introduction?

I believe it is a good habit for mathematical articles on Wikipedia to start with a simple heuristic explanation of the concept before diving into details and formalism. In this case I think it should be made clearer that the likelihood is simply the pdf regarded as a function of the parameter rather than of the data.

Perhaps the fact that, while the pdf is a deterministic function, the likelihood is considered a random function should also be addressed. Thomas Tvileren (talk) 07:30, 17 April 2009 (UTC)

What is the scaling factor alpha in the introduction good for? If it is there to simplify the maximum likelihood method, then (a) it is a totally misplaced comment and (b) you could put any strictly increasing function there, not just scaling by a constant. --David Pal (talk) 01:35, 1 March 2011 (UTC)

## Median

For a Bernoulli trial, is there a significant meaning for the median of the likelihood function? —Preceding unsigned comment added by Fulldecent (talk • contribs) 16:30, 13 August 2009 (UTC)

The Bernoulli trial has a probability distribution function fP defined by fP(0) = 1−P and fP(1) = P. This means that the likelihood function is Lx defined by L0(P) = 1−P and L1(P) = P for 0≤P≤1. For x=0 the maximum likelihood estimate of P is 0; the median is 1−1/√2 = 0.29; and the mean value is 1/3=0.33. For x=1 the maximum likelihood estimate of P is 1; the median is 1/√2 = 0.71; and the mean value is 2/3=0.67. These are point estimates for P. Some likelihood functions have a well defined maximum likelihood value but no median. Other likelihood functions have median but no mean value. See for example the German tank problem#Likelihood function. Bo Jacoby (talk) 22:27, 3 September 2009 (UTC).

The above is wrong.

• First a minor point. The term "probability distribution function" usually means cumulative distribution function.
• What sense can it make to call the number proposed above the "median" of the likelihood function? That would be the answer if one treated the function as a probability density function, but that makes sense only if we assume a uniform measure on the line, in effect a prior, so the proposed median is actually the median of the posterior probability distribution, assuming a uniform prior. It's not a median of the likelihood function. If we assumed a different prior, we'd get a different median with the SAME likelihood function. Similar comments apply to the mean. There's no such thing as the mean or the median of a likelihood function. Michael Hardy (talk) 00:02, 4 September 2009 (UTC)
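
To make the dependence on the prior concrete, here is a worked sketch for the case x = 0, where $L_0(P) = 1 - P$. The second prior, with density $2P$ on [0, 1], is a hypothetical choice made only for contrast; it does not come from the discussion above.

$\text{uniform prior: } \pi(P \mid x=0) = 2(1-P), \quad \int_0^m 2(1-P)\,dP = 1-(1-m)^2 = \tfrac{1}{2} \;\Rightarrow\; m = 1 - \tfrac{1}{\sqrt{2}} \approx 0.29, \quad \text{mean} = \tfrac{1}{3}$

$\text{prior } 2P: \quad \pi(P \mid x=0) = 6P(1-P), \quad m = \tfrac{1}{2}, \quad \text{mean} = \tfrac{1}{2}$

The likelihood is identical in both lines. The median and mean move with the prior, so they are summaries of the posterior distribution rather than of the likelihood function.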

Comment to Michael:

• The article on probability distribution function allows for the interpretation as probability density function.
• The uniform prior likelihood function, f(P)=1 for 0≤P≤1, expresses prior ignorance of the actual value of P. A different prior likelihood function expresses some knowledge of the actual value of P, and no such knowledge is provided. It is correct that assuming a uniform prior distribution makes the likelihood function define a posterior distribution, in which the mode, median, mean value, standard deviation etc, are defined.

Your main objection seems to be that tacitly assuming a uniform prior distribution is unjustified. Consider the (Bernoulli) process of sampling from an infinite population as a limiting case of the (hypergeometric) process of sampling from a finite population. The J expression

  udaf=.!/&(i.@>:) * !/&(- i.@>:)


computes odds of the hypergeometric distribution.

The program call

  1 udaf 10
10 9 8 7 6 5 4 3 2 1  0
 0 1 2 3 4 5 6 7 8 9 10


computes the odds when you pick 1 pebble from a population of 10 red and white pebbles. The 11 columns are odds for getting 0 or 1 red pebble, when the number of red pebbles in the population is 0 through 10. The 2 rows are likelihoods for the population containing 0 through 10 red pebbles given that the sample contained 0 or 1 red pebble. The top row shows that 0 red pebbles in the population has the maximum likelihood (= 10). A median is about 2.5 red pebbles = 25% of the population. (10+9+8 = 27 < 27.5 < 28 = 7+6+5+4+3+2+1+0). The mean value is 30% and the standard deviation is 24%.

The prior likelihood function is (of course)

  0 udaf 10
1 1 1 1 1 1 1 1 1 1 1


expressing prior ignorance regarding the number of red pebbles in the population. The maximum likelihood value is undefined; the median and the mean are both equal to 50% of the population, and the standard deviation is 32% of the population.

In the limiting case where the number of pebbles in the population is large, you get (unnormalized) binomial distributions in the columns and (unnormalized) beta distributions in the rows.

  5 udaf 16
4368 3003 2002 1287  792  462  252  126   56   21    6    1    0    0    0    0    0
   0 1365 2002 2145 1980 1650 1260  882  560  315  150   55   12    0    0    0    0
   0    0  364  858 1320 1650 1800 1764 1568 1260  900  550  264   78    0    0    0
   0    0    0   78  264  550  900 1260 1568 1764 1800 1650 1320  858  364    0    0
   0    0    0    0   12   55  150  315  560  882 1260 1650 1980 2145 2002 1365    0
   0    0    0    0    0    1    6   21   56  126  252  462  792 1287 2002 3003 4368


Study the finite case first, and the infinite case as a limit of the finite case, rather than beginning with the infinite case, where a prior distribution is problematic. It is dangerous to assume that lim(f(x))=f(lim(x)). Bo Jacoby (talk) 10:00, 4 September 2009 (UTC).
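
For readers who do not use J, the tables above can be reproduced with a short Python sketch. It rests on one assumption: that `x udaf y` is the table of C(j, i)·C(y−j, x−i) for i = 0..x and j = 0..y, which is a reading of the J expression and not something stated in the comment. The helper `summarize` is hypothetical and only recomputes the quoted mean and standard deviation.

```python
from math import comb

def udaf(x, y):
    """Table of comb(j, i) * comb(y - j, x - i) for i = 0..x and j = 0..y.

    Row i: i red pebbles in a sample of size x.
    Column j: j red pebbles in a population of size y.
    """
    return [[comb(j, i) * comb(y - j, x - i) for j in range(y + 1)]
            for i in range(x + 1)]

def summarize(weights):
    """Mean and standard deviation after normalizing weights over 0..len-1.

    Normalizing the row amounts to assuming a uniform prior over the counts.
    """
    total = sum(weights)
    mean = sum(j * w for j, w in enumerate(weights)) / total
    var = sum((j - mean) ** 2 * w for j, w in enumerate(weights)) / total
    return mean, var ** 0.5

table = udaf(1, 10)
mean, sd = summarize(table[0])  # row for a sample containing 0 red pebbles
print(table[0], mean, sd)
```

With a population of 10, the mean of 3 and the standard deviation of about 2.45 correspond to the 30% and 24% quoted above. Both figures depend on normalizing the row, which is the step under dispute in this thread.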

## graph

[Figure: the likelihood function for estimating the probability of a coin landing heads-up, without prior knowledge, after observing HHT]

How was this graph generated? Is there a closed form for this calculation? Is there a closed form for a given number of heads and tails? —Preceding unsigned comment added by Fulldecent (talk • contribs) 17:46, 13 August 2009 (UTC)

The expression
$\binom n i p^i(1-p)^{n-i}$
is, for fixed n and p, a binomial probability function of i (i = 0, ..., n) and, for fixed n and i, a continuous (unnormalized) beta density in p (0 ≤ p ≤ 1). So, for HHT (n = 3, i = 2) and up to the constant factor $\binom{3}{2} = 3$, the graph is simply
$p^2(1-p)\,$
Bo Jacoby (talk) 12:33, 20 August 2009 (UTC).
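
On the closed-form question, the general case can be sketched as follows, writing $h$ and $t$ for the numbers of heads and tails (symbols introduced here for illustration) and assuming independent tosses with $h, t \ge 1$ for the maximizer:

$L(p \mid h, t) \propto p^{h}(1-p)^{t}, \qquad \hat{p} = \frac{h}{h+t}$

$\int_0^1 p^{h}(1-p)^{t}\,dp = B(h+1,\,t+1) = \frac{h!\,t!}{(h+t+1)!}$

For HHT ($h = 2$, $t = 1$) this gives a maximum at $\hat{p} = 2/3$ and an integral of $p^2(1-p)$ over [0, 1] equal to $1/12$.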

## Probability of causes and not probability of effects?

The definition given here is the opposite of that given by D'Agostini, Bayesian Reasoning in Data Analysis (2003). From pp. 34-35: "The possible values $x$ which may be observed are classified in belief by $f(x|\mu)$. This function is traditionally called 'likelihood' and summarizes all previous knowledge on that kind of measurement..." In other words, it is the probability of an effect $x$ given a parameter (cause) $\mu$. The definition given in this entry, proportional to the probability of a cause given the effect ($f(\mu|x)$), seems more useful, as the concept is more important, but is it possible that there is more than one definition in use in the literature? LiamH (talk) 02:10, 4 October 2009 (UTC)

## putting x and theta in bold

Since P(x|theta) describes sets of data points (as if a vector), shouldn't x and theta be put in bold?

theta represents a vector (or set) of parameters, and x represents a vector of data points from a sample.