# Talk:Maximum likelihood estimation

WikiProject Tree of Life (Rated B-class, Low-importance)
This article is within the scope of WikiProject Tree of Life, a collaborative effort to improve the coverage of taxonomy and the phylogenetic tree of life on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
B  This article has been rated as B-Class on the project's quality scale.
Low  This article has been rated as Low-importance on the project's importance scale.
WikiProject Statistics (Rated B-class, Top-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

B  This article has been rated as B-Class on the quality scale.
Top  This article has been rated as Top-importance on the importance scale.
WikiProject Mathematics (Rated B-class, High-importance)
This article is within the scope of WikiProject Mathematics, a collaborative effort to improve the coverage of Mathematics on Wikipedia. If you would like to participate, please visit the project page, where you can join the discussion and see a list of open tasks.
Mathematics rating:
 B Class
 High Importance
Field: Probability and statistics
One of the 500 most frequently viewed mathematics articles.

## removal

I removed this from the article, until it can be made more NPOV and more encyclopedic. Currently reads more like a list of observations than true encyclopedic content and needs more explanation. --Lexor|Talk 07:25, 5 Aug 2004 (UTC)

Maximum likelihood is one of the main methods used by frequentist (i.e. non-Bayesian) statisticians. Bayesian arguments against the ML and other point estimation methods are that
• all the information contained in the data is in the likelihood function, so why use just the maximum? Bayesian methods use ALL of the likelihood function and this is why they are optimal.
• ML methods have good asymptotic properties (consistency and attainment of the Cramer-Rao lower bound) but there is nothing to recommend them for analysis of small samples
• the method doesn't work so well with distributions that have many modes or unusual shapes. Apart from the practical difficulties of getting stuck in local modes, there is the difficulty of interpreting the output, which consists of a point estimate plus standard error. Suppose you have a distribution for a quantity that can only take positive values, and your ML estimate for the mean comes out at 1.0 with a standard error of 3? Bayesian methods give you the entire posterior distribution as output, so you can make sense of it and then decide what summaries are appropriate.

## Style

This whole article reads like an 80's high-school textbook. As a matter of fact, a lot of wikipedia's articles on difficult-to-understand subjects read like they're out of an 80's high-school textbook, making them useful only to people who already know the subject back to front, making the entire wikipedia project a failure. --Preceding unsigned comment added by 4.232.174.170 (talk - contribs)

Clarity issues in a statistics article (a subject that is less than clear) should not be used to make the inference that "the entire wikipedia project a failure" Rschulz 23:56, 1 Mar 2005 (UTC)
This article was never intended to be accessible to secondary school students generally (although certainly there are those among them who would understand it). I would consider this article successful if mathematicians who know probability theory but do not know statistics can understand it. And I think by that standard it is fairly successful, although more examples and more theory could certainly be added. If someone can make this particular topic comprehensible to most people who know high-school mathematics, through first-year calculus or perhaps through the prerequisites to first-year calculus, I would consider that a substantially greater achievement. But that would take more work. Michael Hardy 00:40, 10 Jan 2005 (UTC)
I would consider this article a failure if only "mathematicians who know probability theory" can understand it. I got to the article via a link from Constellation diagram, and learned nothing from it that I didn't already know from the other article—not even what it is used for. 121a0012 05:41, 14 June 2006 (UTC)
I think there is kind of a style struggle in mathematics, where some people prefer math text to say what it means and discuss itself introspectively, and others prefer a terse no-nonsense style. I find the discussion-based approach to be healthier. One of the problems, though, is that it isn't necessarily encyclopedic to use that tone. I mean, most important math can be stated in one or two sentences. That doesn't mean the sentence will be approachable, but it will be factually complete, saying all there is to say about the subject. So an encyclopedia editor struggles with the fact that you have to say more to make it approachable, but at the same time you can say less and say all there is to say. It is a problem unique to mathematics. Jeremiahrounds 13:01, 20 June 2007 (UTC)
But most mathematicians who know probability theory do not know this material, so it should not be considered a failure if they understand it. That is not to say it should not be made accessible to a broader audience. But that will take more work, so be patient. Michael Hardy 17:42, 14 June 2006 (UTC)
I totally agree with the original complaint that this, along with many other wikipedia math articles, is too heavy going. For that reason I replaced the first paragraph with something more digestible. I see no reason to dive into using math symbols right in the first paragraph. --Julian Brown 02:50, 30 August 2007 (UTC)
As a university student learning statistics, I think this article needs improvement. It would be good if a graph of the likelihood of different parameter values for p was added (with the maximum pointed out) to the example. This addition would require adding some specific data to the example. Also, the example should be separated from the discussion about MLE, to make sure people understand that the binomial distribution is only used for this case. The reasons why it is good to take the log of the likelihood are not discussed. Further, the discussion about what makes a good estimator (and how MLE is related to other estimators) could be expanded. Rschulz 23:56, 1 Mar 2005 (UTC)
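To make the two requests above concrete, here is a minimal sketch, assuming a binomial sample with made-up counts (7 successes in 20 trials, then 1750 in 5000); none of these numbers come from the article. It evaluates the log-likelihood on a grid of p values, which is the curve such a graph would show, and then illustrates one practical reason for taking logs: the raw likelihood of a larger sample underflows in floating point while the log-likelihood stays finite.

```python
import math


def binomial_log_likelihood(p, successes, trials):
    """Log-likelihood of success probability p (constant binomial coefficient dropped)."""
    return successes * math.log(p) + (trials - successes) * math.log(1.0 - p)


# Hypothetical data, chosen only for illustration.
successes, trials = 7, 20

# Grid of interior p values; 0 and 1 are excluded because log(0) is undefined.
grid = [i / 1000 for i in range(1, 1000)]
log_lik = [binomial_log_likelihood(p, successes, trials) for p in grid]

# The grid point with the largest log-likelihood approximates the MLE x/n.
best_p = grid[max(range(len(grid)), key=lambda i: log_lik[i])]
print("grid maximiser:", best_p, "x/n:", successes / trials)

# Larger hypothetical sample: the raw likelihood is a product of thousands of
# numbers below 1 and underflows, while the log-likelihood is a well-behaved sum.
big_successes, big_trials = 1750, 5000
p_hat = big_successes / big_trials
raw_likelihood = p_hat ** big_successes * (1.0 - p_hat) ** (big_trials - big_successes)
log_likelihood = binomial_log_likelihood(p_hat, big_successes, big_trials)
print("raw likelihood:", raw_likelihood, "log-likelihood:", log_likelihood)
```

Because the logarithm is strictly increasing, the likelihood and the log-likelihood are maximised at the same p, so nothing is lost by working with the sum.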
The "left as an exercise to the reader" part is definitely gratuitous and needs to go. I came to this page to learn about the subject, not for homework problems.
a user in CA: Good article; don't be so hard on the author(s). Of course it could be better, but most of us have day jobs. I would change the notation, though, as this was confusing: "The value (lower-case) x/n observed in a particular case is an estimate; the random variable (Capital) X/n is an estimator." It seems to conflict with the excellent example at the end for finding the maximum likelihood x/n in a binomial distribution of x voters in a sample of n (without replacement). Now, next question: can anybody explain the Viterbi algorithm to a high-schooler? 01 March 2004
I don't see the conflict. The lower-case x in the example at the end is not a random variable, but an observed realization. Michael Hardy 22:50, 2 Mar 2005 (UTC)

## Difference between likelihood function and probability density function

I guess the line "we may compute the probability associated with our observed data" is not correct. Because the probability of a continuous variable for any given point is always zero. The correct statement would be "we may compute the likelihood associated with our observed data". For more argument please see [1]

Your assertion is correct, but your header is not. That is NOT the difference between the likelihood function and the density function. One is a function of the parameter, with the data fixed; the other is a function of the data with the parameter fixed. Michael Hardy 02:16, 13 May 2006 (UTC)
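A small numerical sketch of the distinction described above, assuming an exponential model f(x; rate) = rate·exp(−rate·x) picked purely for illustration (it is not an example from the article). The same formula is integrated once over the data with the parameter fixed, and once over the parameter with the data fixed; only the first is a probability density and integrates to 1.

```python
import math


def exp_density(x, rate):
    """Exponential density f(x; rate) = rate * exp(-rate * x) for x >= 0."""
    return rate * math.exp(-rate * x)


def trapezoid(func, lower, upper, steps=100_000):
    """Plain trapezoidal rule on [lower, upper]."""
    width = (upper - lower) / steps
    total = 0.5 * (func(lower) + func(upper))
    for i in range(1, steps):
        total += func(lower + i * width)
    return total * width


# Density: parameter fixed at rate = 2, viewed as a function of the data x.
area_over_data = trapezoid(lambda x: exp_density(x, 2.0), 0.0, 50.0)

# Likelihood: data fixed at x = 2, viewed as a function of the parameter.
area_over_parameter = trapezoid(lambda rate: exp_density(2.0, rate), 0.0, 50.0)

print("integral over x (density):", area_over_data)
print("integral over rate (likelihood):", area_over_parameter)
```

Here the integral over the parameter is 1/x² = 0.25 in closed form, so the likelihood is not a density in the parameter, even though it is built from the same expression.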

where is the spanish version?

Is it possible to get the text shown as it should read by people who don't know latex code instead of symbols such as x_1 for subindexes or x^1 for superindexes, etc. I understand those but not other symbols which are used in this article, and in any case they hamper reading. Thanks! Xinelo 14:47, 21 September 2006 (UTC)

Maybe it was a temporary problem; they look OK to me now. Michael Hardy 15:06, 21 September 2006 (UTC)

## Untitled

Great article! Very, very clear. 18.243.6.178 04:08, 13 November 2006 (UTC)

## Thank You!!

This article is fantastic. It is more understandable than my class notes and has helped me greatly for my class. I also really appreciated the use of examples. Many many thanks to the people who wrote it! Poyan 8:09, 6 December 2006 --The preceding unsigned comment was added by 128.100.36.147 (talk) 13:09, 6 December 2006

## Sloppiness

You really should say something about the second derivative test. You nonchalantly claimed you reached a maximum. You could very well have found a minimum. Don't reinforce bad habits. This is especially true for the Normal case. Just because the gradient is zero does not mean you have a local maximum... and it's not especially trivial to just brush off. (ZioX 18:06, 19 March 2007 (UTC))

I don't know which "you" is addressed, but I agree with the spirit of the comment. But the details of the comment fall short. The second-derivative test may prove a local maximum, but here we need a global maximum. There are various ways to prove there is a global maximum, not all of them involving second derivatives. For example, suppose you show that L(θ) increases as θ goes from 0 to 3, and decreases as θ goes from 3 to ∞, and the parameter space is the interval from 0 to ∞. Then you've got a global maximum at 3, without benefit of second derivatives. Or suppose you've shown that L(θ) is differentiable everywhere, and because the parameter space is compact, there must be a global maximum somewhere, and furthermore L(θ) is 0 on the boundary and positive in the interior. Then the global maximum must be reached at a critical point in the interior. If next it turns out that there is only one critical point in the interior, then you've got it again, and again without second derivatives. This is not at all an unusual situation in elementary MLE problems. Michael Hardy 18:30, 19 March 2007 (UTC)
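As a worked illustration of the second argument above (a sketch, assuming a binomial likelihood with 0 < x < n, which is my choice of example and not necessarily the one in the article), take

${\displaystyle L(p)=p^{x}(1-p)^{n-x},\qquad 0\leq p\leq 1.}$

The parameter space [0, 1] is compact and L is continuous, so a global maximum exists. On the boundary and in the interior,

${\displaystyle L(0)=L(1)=0,\qquad L(p)>0{\text{ for }}0<p<1,}$

so the global maximum is attained at an interior critical point. Differentiating,

${\displaystyle L'(p)=p^{x-1}(1-p)^{n-x-1}\,(x-np),}$

which vanishes in the interior only at p = x/n. That single critical point is therefore the global maximum, and no second derivative was needed.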

## Is this true

From the bias section: "we can only be certain that it is greater than or equal to the drawn ticket number." Is this true? Wouldn't it be the less than or equal to?--Vince |Talk| 05:00, 15 May 2007 (UTC)

disregard: It was the wording that confused me. The article Bias of an estimator phrases the problem in the manner I was thinking. I may try to make it clearer. --Vince |Talk| 06:37, 15 May 2007 (UTC)
It was true but not relevant, so I dropped it when I expanded this section. RVS (talk) 00:17, 25 December 2008 (UTC)

Hi all, The article is very well written from a statistics POV.

MLE is used ubiquitously in phylogenetic analysis and cladistics in genetics and evolutionary biology. It would be great if someone could include a section on how MLE can actually be applied to such studies, with some examples.

Also, I, as an amateur biologist, know where MLE is used, but do not have an intuitive understanding of the technique. A section that would provide the layman with such a perspective (of whats actually happening) would be great...

Indiaman1 19:30, 30 June 2007 (UTC)indiaman1

I don't think the article is "well written". A statement such as "MLE is a popular statistical method" should be left to established textbooks; in Germany that could be Fahrmeir et al., and maybe others as well (the professors with third-plus editions of basic statistics texts). Competing approaches should be mentioned (confidence intervals, potential misuse of the p-value, such as deciding after the fact which confidence level to use, etc.). For a non-English (non-American) reader it would be interesting to know who are considered the leading statistics professors in the UK etc. -- Gaschroeder (talk) 14:47, 3 October 2010 (UTC) -- Gaschroeder (talk) 14:51, 3 October 2010 (UTC)

## Non-independent variables

I have added a section on non-independent variables. I hope this proves useful to someone :) Velocidex (talk) 04:44, 19 March 2008 (UTC)

I added a tie back to article topic by mentioning the likelihood function, which should possibly be the main thing being specifically discussed rather than the density function. Melcombe (talk) 10:45, 19 March 2008 (UTC)

For generality this section really needs something said about mixed discrete-continuous distributions. Melcombe (talk) 10:45, 19 March 2008 (UTC)

## Mathematical precision

Someone wrote above that this article reads (read) like a 1980's textbook. In my opinion that was the case. For instance, the discussion of the asymptotic properties of maximum likelihood estimation could have been taken straight out of many standard textbooks, but an intelligent person can realise that the authors of those books either don't know what they are talking about or are hiding things from the reader. I added a sentence referring to modern mathematical results on the maximum likelihood estimator (modern: these results have been known since the 60's but still did not permeate into standard textbooks). I hope the result still makes sense to the non-expert. Gill110951 (talk) 07:12, 5 May 2008 (UTC)

## Typography

I see mixed use of ${\displaystyle {\mathcal {L}}}$ and ${\displaystyle L}$ for the likelihood function. The likelihood function page itself only uses ${\displaystyle L}$. Which is it? (Or do they mean slightly different things?) —Ben FrantzDale (talk) 12:56, 14 August 2008 (UTC)

there is a pw requirement on the link to the tutorial —Preceding unsigned comment added by 193.157.202.64 (talk) 13:04, 20 August 2008 (UTC)

## applications list

The current article lists a mish-mash of applications. I think a distinction needs to be made between application areas that are statistical methods (univariate models, structural equation models, etc.) and application areas that are different fields (agriculture, communications, business research, etc.). Either type is perhaps too broad to list in this article, but I don't think the types should be mixed.

The article currently mentions "...applications in various fields, including:

Hope this comment is helpful. doncram (talk) 21:28, 30 December 2008 (UTC)

(discussion section separated out under this title later)

I think the second sentence in the "Principles" section is ambiguous. The sentence currently reads, in part, "We draw a sample x_1,x_2,\dots,x_n of n values from this distribution [...]". I think it would be much clearer to specify exactly which "distribution" is being referred to by "this distribution". The current wording makes it unclear whether we're looking at one member of the family of distributions, or instead the PDF itself. —Preceding unsigned comment added by Diracula (talkcontribs) 21:08, 27 January 2009 (UTC)

Your point escapes me. It is of course referring to one member of the family, and that one member does have a pdf. And that one could be ANY of the members of the family. "Ambiguous" normally means there are two ways to construe the sentence. Just what those two ways are is unclear. Your way of phrasing it seems to suggest that the two ways are (1) We're looking at one member of the family of distributions; and (2) We're looking at the pdf. I don't see how the sentence can be read as giving us that choice. Before that, you say the ambiguity is about the question of WHICH DISTRIBUTION it is. That's quite a different thing from choosing between the statements I labeled (1) and (2) above, since the choice between (1) and (2) is not a choice of which distribution it is. Nor can I find any ambiguity about which distribution it is. So I find your comments completely cryptic. Michael Hardy (talk) 22:31, 27 January 2009 (UTC)

## Consistency

Why is it listed that "The MLE is asymptotically unbiased, i.e., its bias tends to zero as the sample size increases to infinity" under "Under certain (fairly weak) regularity conditions" when one of the conditions listed is "The maximum likelihood estimator is consistent." Isn't being consistent the same as being asymptotically unbiased? cancan101 (talk) 00:11, 3 March 2009 (UTC)

Technically they are not the same thing; consistency includes the spread of the distribution shrinking, while the other only needs the mean to approach the right value. Your point, however, is broadly correct, as consistency almost implies asymptotic unbiasedness (one exclusion being that the mean still need not exist for consistency). I think what is meant is really that unbiasedness usually holds under similar conditions to those needed for consistency. The level of technical detail in the article does not naturally incline to including more detail. I think it is meant to be a general warning not to expect that maximum likelihood will always "work".
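A standard textbook-style counterexample may help separate the two notions. It is an artificial estimator chosen for illustration, not an MLE and not something taken from the article. Let the estimator T_n of θ satisfy

${\displaystyle \Pr(T_{n}=\theta )=1-{\tfrac {1}{n}},\qquad \Pr(T_{n}=\theta +n)={\tfrac {1}{n}}.}$

Then for every ε > 0,

${\displaystyle \Pr(|T_{n}-\theta |>\varepsilon )\leq {\tfrac {1}{n}}\to 0,\qquad \operatorname {E} [T_{n}]-\theta =n\cdot {\tfrac {1}{n}}=1,}$

so T_n is consistent while its bias stays at 1 for every n, even though the mean exists. In the other direction, using only the first observation X_1 to estimate a mean is unbiased for every n but not consistent, since its spread never shrinks.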

## Article title

Why was this article renamed "Maximum likelihood" instead of "Maximum likelihood estimation"? The latter seems more appropriate and I suggest renaming it back. -Roger (talk) 15:52, 25 March 2009 (UTC)

I agree - why the change? The article even starts "Maximum likelihood estimation (MLE) ..." 128.61.125.131 (talk) 14:55, 28 January 2010 (UTC)
Notionally, maximum likelihood is more general, with some things worth saying that are not about "maximum likelihood estimation", but there would probably not be very much. However, none of this appears (yet). There is also a question of whether it should be "estimate" or "estimation" .... this would preferably agree with other article titles, such as "minimum distance". Melcombe (talk) 10:51, 29 January 2010 (UTC)
We have M-estimator, Extremum estimator, Bayes estimator, but Maximum spacing estimation, Maximum a posteriori estimation, Minimum distance estimation, and then there is also Generalized method of moments. I’d vote for “Maximum likelihood estimation”, because that’s how it is usually defined in econometric textbooks (MLE). Although the abbreviation “ML” can also be seen occasionally.  … stpasha »  18:42, 29 January 2010 (UTC)

## Error in pdf formula?

Shouldn't this:

The joint probability density function of these ${\displaystyle n}$ random variables is then given by:
${\displaystyle f(x_{1},\ldots ,x_{n})={\frac {1}{2\pi {\sqrt {{\text{det}}(\Sigma )}}}}\exp \left(-{\frac {1}{2}}\left[x_{1}-\mu _{1},\ldots ,x_{n}-\mu _{n}\right]\Sigma ^{-1}\left[x_{1}-\mu _{1},\ldots ,x_{n}-\mu _{n}\right]^{T}\right)}$

be this?:

The joint probability density function of these ${\displaystyle n}$ random variables is then given by:
${\displaystyle f(x_{1},\ldots ,x_{n})={\frac {1}{(2\pi )^{n}{\sqrt {{\text{det}}(\Sigma )}}}}\exp \left(-{\frac {1}{2}}\left[x_{1}-\mu _{1},\ldots ,x_{n}-\mu _{n}\right]\Sigma ^{-1}\left[x_{1}-\mu _{1},\ldots ,x_{n}-\mu _{n}\right]^{T}\right)}$

as per Multivariate_normal_distribution#General_case? Aaron McDaid (talk - contribs) 22:10, 2 April 2009 (UTC)

I've changed it to this:
${\displaystyle f(x_{1},\ldots ,x_{n})={\frac {1}{(2\pi )^{n/2}{\sqrt {{\text{det}}(\Sigma )}}}}\exp \left(-{\frac {1}{2}}\left[x_{1}-\mu _{1},\ldots ,x_{n}-\mu _{n}\right]\Sigma ^{-1}\left[x_{1}-\mu _{1},\ldots ,x_{n}-\mu _{n}\right]^{T}\right)}$
(Alternatively, one could put (2π)^n under the radical.) Michael Hardy (talk) 01:42, 3 April 2009 (UTC)
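A quick numerical sanity check of the exponent, as a sketch under a simplifying assumption: the covariance matrix is taken to be diagonal, so the joint density must equal the product of n univariate normal densities. The data values, means and variances below are made up for illustration.

```python
import math


def univariate_normal_pdf(x, mean, variance):
    """Density of a one-dimensional normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def diagonal_mvn_pdf(xs, means, variances, two_pi_exponent):
    """Joint normal formula for Sigma = diag(variances), with the normalising
    constant (2*pi)**two_pi_exponent * sqrt(det(Sigma)) left adjustable."""
    det_sigma = math.prod(variances)
    quadratic_form = sum((x - m) ** 2 / v for x, m, v in zip(xs, means, variances))
    normaliser = (2.0 * math.pi) ** two_pi_exponent * math.sqrt(det_sigma)
    return math.exp(-0.5 * quadratic_form) / normaliser


# Hypothetical values, chosen only for illustration.
xs = [0.3, -1.2, 2.0]
means = [0.0, -1.0, 1.5]
variances = [1.0, 0.5, 2.0]
n = len(xs)

product_of_marginals = math.prod(
    univariate_normal_pdf(x, m, v) for x, m, v in zip(xs, means, variances)
)
with_half_n = diagonal_mvn_pdf(xs, means, variances, n / 2)  # (2*pi)^(n/2)
with_full_n = diagonal_mvn_pdf(xs, means, variances, n)      # (2*pi)^n

print(product_of_marginals, with_half_n, with_full_n)
```

Only the (2π)^(n/2) version reproduces the product of the marginal densities, which agrees with the formula as changed above.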

This is one of the rare articles that gives synopses of the related articles: Isn't it better just to give the links?

I humbly suggest these alternatives, which are usually shorter and which in some cases (e.g. abduction) avoid objectionable statements.

I put ?? after synopses that should be removed imho. Kiefer.Wolfowitz (talk) 18:59, 24 May 2009 (UTC)

"Isn't it better just to give the links?" No. It is at least helpful to have some indication of relevance. For example, why is there is link to censoring? But it might be better to reduce the number of "see alsos" by creating brief subsection with headings like "alternative estimation methods", so as to better group the topics. Melcombe (talk) 09:42, 27 May 2009 (UTC)

## Weak consistency or strong consistency?

As I understand it, consistency is actually weak consistency. I would like to know if there is a proof of strong consistency of the maximum likelihood estimator? --Preceding unsigned comment added by 132.72.51.67 (talk) 16:38, 25 August 2009 (UTC)

## Insufficient content

The article provides a great layman description of the concept of maximum likelihood, however the rigorous mathematical treatment of the same topic is also necessary. What use is the claim “under certain fairly weak conditions MLE is consistent” if those conditions are never spelled out. What use is the “asymptotic normality” property if the article doesn’t even state the asymptotic variance of the estimator.

The “Maximum spacing estimation” article lists several examples when the MLE fails: certain heavy-tailed distributions, certain mixture distributions, certain 3-parameter models. It would be nice to know why those examples don’t work (at least they are not the case of boundary parameters). ... stpasha » talk » 10:31, 2 September 2009 (UTC)

## (Dis?)motivational example

So there used to be a section called “Motivational example”. I've looked at it for a long time, and finally concluded that the best way to improve it would be to remove it altogether. This example simply breaks the flow. It starts using the MLE method without explaining how and why it is used. It uses an example with 1 observation, which, although legal, is quite counterintuitive, since no properties of such an estimator can be stated. I think we have enough examples in the “Examples” section, so that we don't need this one too.  … stpasha »  07:50, 24 January 2010 (UTC)

## “Convergence of maximum likelihood estimator” gone

This section ([2]) has been an ugly duckling from the beginning, but despite the selfless attempts of Michael Hardy, it can never grow into a swan… There is just so much wrong with that piece, for example:

1. It silently assumes that θ is a scalar, although the article states otherwise;
2. It concludes the “proof” with Pr[Sn(θ)>0] → 0, while in fact this probability is equal to 1 for all n.
3. The general idea to prove a theorem without stating it, and without specifying the assumptions needed for it, is faulty.

Sentence: removed.  … stpasha »  08:11, 24 January 2010 (UTC)

## Uniqueness?

This article made a lot of statements about "the" mle, when this seems contrary to the literature. Old examples of Cramér show the importance of including local minima (and the problems of restricting attention to only maxima).

Second, this article stated a lot of nice properties of the mle, and included interpretations that the (?) MLE method was the best. But many other estimators have those properties, which may be proved under wider conditions---see the maximum probability estimator of Wolfowitz (no relation!!), and Le Cam on the superior properties of Bayes methods and a lot of examples where the MLE has trouble. I tried to soften uniqueness claims ("the"), where this was possible. No doubt others can improve my edits. Thanks! Kiefer.Wolfowitz (talk) 00:36, 25 January 2010 (UTC)

I'm not sure what it is that you're trying to say. The MLE is one of the generic estimation techniques, applicable to a wide range of problems. There are typically other methods which can be applied to same problems, and some of those methods are as good as MLE. But they can't be better than MLE (at least up to the second-order), because MLE is already efficient — at least as long as we're talking about the MSE loss function.
The maximum probability estimator: what is it? As for the “Bayes estimator”, there is no such thing. There is a Bayesian school of thought for estimation, and within that school many estimators can be constructed. Neither of those estimators is “superior” to MLE, because they work with a different model: they require that additional prior knowledge about the value of the parameter be available.
I'm not saying that MLE is like “best from the best” or anything like that, but the article should be kept encyclopedic, and any comparison to other estimation methods be done in a separate section (or maybe even a separate article).  … stpasha »  08:18, 25 January 2010 (UTC)
On the first point, the article does have "For certain problems the maximum likelihood estimates may not be unique, or even may not exist" fairly early on. There does seem scope to add some new sections to cover the non-uniqueness case from both a theoretical-asymptotic viewpoint and a practical one. Another thing possibly not mentioned fully enough is the case where estimates are on the boundary of the parameter space (asymptotic properties). Melcombe (talk) 10:48, 25 January 2010 (UTC)

## Proposed restructure

Earlier versions of this article were fairly readable (for example this version of early January), but the recent addition of so-called "proofs" has ruined this. Can we improve things by just moving most of this over-technical stuff towards the end of the article (after the existing "see also" section, which could be renamed) under a heading like "Theoretical details"? This would be in line with the guidance for maths articles, that it is OK to start simple and become more technical later, even at the risk of some repetition. Melcombe (talk) 12:39, 26 January 2010 (UTC)

This is a good idea. Another idea would be to have a self-standing article entitled "Proofs of properties of MLE". Some articles on technical subjects have informal and formal versions; for example, see the gentle Introduction to M-theory and the high fallutin' M-theory. Kiefer.Wolfowitz (talk) 15:42, 26 January 2010 (UTC)
I agree that the subsections “Consistency” and “Asymptotic normality” still require lots of work. And that’s why I put the tags {{cleanup}} there. In particular, the material from the “Asymptotics” subsection should be distributed between the first two. But I’m not sure if moving the material all the way out of the article is a good idea… Perhaps that guidance for mathematical articles can be interpreted section-wise, like start the section simple, and then proceed to more advanced theory.  … stpasha »  18:52, 26 January 2010 (UTC)
The previous structure of the article was that there was a readable outline of all the properties of MLE, which must be retained but which is being destroyed by these supposed "proof" intrusions. My own view is that "proofs" should typically not appear on wikipedia, particularly when they are textbook stuff, but that the theory necessary for a proper description of what is being discussed should appear. Thus I am against "Proofs of ..." articles when they have no encyclopedic content (i.e. no proofs for the sake of having proofs), as they would here. Proofs would be much better dealt with by just giving appropriate citations. Melcombe (talk) 10:23, 27 January 2010 (UTC)

## Discrete case

So, does anybody know if there is a justification for the use of MLE in case when either the parameters or the data belong to a discrete set?  … stpasha »  19:48, 26 January 2010 (UTC)

MLE can be proved to have good properties for one-parameter exponential families, e.g. binomial, negative binomial, Poisson, etc. A good reference is the work of Erling Andersen, who especially emphasized the importance of conditional maximum likelihood estimation, which avoids the inconsistency of the MLE in many cases. (For a counter-example to conditional mle, see Deb. Basu's selected writings (Ghosh, ed.).) MLE has trouble with multiple parameters, and the bestiary of likelihoods (partial, profile, etc.) have worse properties than MLE with one parameter. Kiefer.Wolfowitz (talk) 22:49, 26 January 2010 (UTC)
Cox's Principles of Statistical Inference ISBN 0-521-68576-2 has (page 143) discussion of a case where the mean of a normal distribution is known to be an integer... but this more in an "inference" context than "estimation". Page 144 in the same book has the example of a mixture of two normals being a case where standard asymptotic theory doesn't apply, which relates to another earlier question. Melcombe (talk) 10:32, 27 January 2010 (UTC)
Perhaps the most important condition for the success of MLE in irregular cases is the following:
* MLE was used on this irregular problem by David Cox!
Cox's recent Principles mentions a practical example where the mle has substantial bias, making it dangerous in application, for optimal design of experiments for logistic regression, citing Rolf Sundberg (EM method) and Vågero, I believe. (Cox and Sundberg are usually fond of MLE.) Kiefer.Wolfowitz (talk) 14:17, 27 January 2010 (UTC)
On discrete parameters, the case where the mean of a normal distribution is known to be an integer, is also used for two Examples by Kendall&Stuart (Advanced Th. of Statistics, Vol 2, 3rd Edition, Examples 18.21-2). They indicate that the second provides an example where the MLE is not consistent. Melcombe (talk) 14:37, 1 February 2010 (UTC)

## I object

I object to the blatantly sexist and ageist example being used as an example of a normal distribution in nature, namely the height of adult female giraffes. I would like to suggest a less controversial example: the mass of mouse turds divided by the mass of the mouse producing them. This will be approximately normally distributed, and has the added advantage of being independent of age and (very probably) sex, thus avoiding an unhealthy obsession with age and sex. It does however suffer from a particularly mouse-o-centric view of the universe, and is therefore not perfect. I am wracked with guilt over this flaw, but lacking a better example, I offer it as perhaps a temporary improvement. PAR (talk) 22:50, 24 March 2010 (UTC)

## Who is Pfanzagl?

One of the references is to Pfanzagl. Is this a made-up name or something? It shows up once in the entire page, so it's not a valid reference/citation. —Preceding unsigned comment added by 130.58.92.245 (talk) 03:04, 29 June 2010 (UTC)

The inadequate reference has been improved, now reading "Pages 207-208 in Pfanzagl, Johann; with the assistance of R. Hamböker (1994). Parametric Statistical Theory. Berlin: Walter de Gruyter. ISBN 3-11-01-3863-8, 3-11-014030-6. MR 1291393.". Kiefer.Wolfowitz (talk) 11:37, 29 June 2010 (UTC)
Who is Pfanzagl? Professor Dr. Johann Pfanzagl may have been the leading mathematical statistician in Austria and perhaps all of the German speaking countries from 1970-1995, say. In his early work, Pfanzagl solved the outstanding problem of von Neumann and Morgenstern of showing that expected utility theory could be axiomatized with subjective probability (essentially compatible with von Neumann's approach), and wrote a related monograph on measurement theory. He then worked on statistical inference, with deep results on median-unbiased estimation and exponential families. Following Le Cam and Hajek, Pfanzagl has been one of the architects of asymptotic theory, including both parametric and semiparametric approaches; Le Cam credits Pfanzagl with introducing tangent cones and spaces, and these objects are now standard in advanced graduate books. Although Pfanzagl clearly states that MLE has no good finite-sample properties, he has contributed one of the best convergence analyses of maximizing the conditional likelihood; Pfanzagl also includes simulation studies for assessing the performance of MLE and other methods, which often show that the MLE behaves quite well in moderately large samples. Kiefer.Wolfowitz (talk) 11:37, 29 June 2010 (UTC)

## Vandalism: Administrators, consider raising the protection level

The vandalism intensity has increased lately. Could the level of protection be increased to prevent damage by IP editors? Thanks!  Kiefer.Wolfowitz  (Discussion) 13:11, 13 March 2011 (UTC)

## Revert

I'm reverting the last edit [3] by Kiefer.Wolfowitz for several reasons: (1) For many purposes, the maximum-likelihood estimator has poor theoretical properties --- not explained in the following text (what poor properties, which purposes?), and this sentence didn't fit into the paragraph anyway; (2) ... mathematical statisticians consider a related estimator that finds critical points of the log-likelihood function. --- not true, since right after you find the MLE you compute the information matrix (−Hessian of the objective function) and verify that this matrix is positive definite, which means that you have the maximum; (3) When evaluated on a sequence of observations from the true distribution, a subsequence of the estimator-evaluations ... --- saying the same as before, only in far more complicated language; (4) ... , for some (unspecified) sample size n --- stated as such, the property becomes false: for a given n (even unspecified) you cannot estimate with arbitrary precision; however, for arbitrary precision you can find the right sample size n, as was stated initially (and why is n suddenly bold?); (5) if the limiting likelihood function (θ|·) has any global maximum at θ0 then it is unique. --- again not true. The identification condition is necessary and sufficient for it to have the global maximum (assuming the function is well-defined); for example, see Lemma 2.2 in Handbook of Econometrics 36. // stpasha » 18:43, 16 March 2011 (UTC)
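To make point (2) concrete, here is a minimal sketch of the check described there, assuming an i.i.d. normal model parameterized by (μ, σ²). The model, the sample, and all function names are illustrative assumptions and are not taken from the article. It computes the closed-form MLE, forms the observed information matrix (the negative Hessian of the log-likelihood), and tests positive definiteness with Sylvester's criterion.

```python
import random

def normal_mle(xs):
    """Closed-form MLE (mean, variance) for an i.i.d. normal sample."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # divides by n, not n - 1
    return mu, var

def observed_information(xs, mu, var):
    """Negative Hessian of the normal log-likelihood at (mu, var)."""
    n = len(xs)
    s1 = sum(x - mu for x in xs)
    s2 = sum((x - mu) ** 2 for x in xs)
    i_mm = n / var                              # -d2l/dmu2
    i_mv = s1 / var ** 2                        # -d2l/dmu dvar
    i_vv = -n / (2 * var ** 2) + s2 / var ** 3  # -d2l/dvar2
    return [[i_mm, i_mv], [i_mv, i_vv]]

def is_positive_definite_2x2(m):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return m[0][0] > 0 and det > 0

# Hypothetical data: 500 draws from N(5, 2^2) with a fixed seed.
rng = random.Random(0)
sample = [rng.gauss(5.0, 2.0) for _ in range(500)]
mu_hat, var_hat = normal_mle(sample)
info = observed_information(sample, mu_hat, var_hat)
print(is_positive_definite_2x2(info))
```

In this particular model the check always succeeds at the MLE, because the matrix reduces to diag(n/σ̂², n/(2σ̂⁴)) there; for likelihoods with several critical points the same test separates local maxima from saddle points and minima, but it does not by itself show that a maximum is global.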

I am on the road and so I cannot reference everything. I shall reply more fully later. Considering zero-score estimators is standard: See Ferguson's article,
• Ferguson, Thomas S. (1982). "An inconsistent maximum likelihood estimate". Journal of the American Statistical Association. 77 (380): 831–834. JSTOR 2287314.
which I believe is cited in the enthusiastic book of Pawitan.  Kiefer.Wolfowitz  (Discussion) 08:17, 17 March 2011 (UTC)
BTW, it is possible to link to Handbook of Economics chapters:

## Are there other ways of estimating?

It basically starts something like "MLE is a method for making estimates of ...". Well, are there other ways of doing so? Other generic ways? Do these methods all give the same estimates, or is making estimates not an exact science? 80.162.194.33 (talk) 09:22, 4 August 2011 (UTC)

Please ask such questions at the help desk or at an internet chat site. Talk pages are for discussing improvements to the article. 10:55, 4 August 2011 (UTC)
The anon user might have meant that we should address these concerns in the article itself. If so, that's a legitimate use of a talk page. - dcljr (talk) 00:53, 3 April 2013 (UTC)
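As one possible illustration for the article of the question raised above (whether different generic methods give the same estimates), here is a small sketch comparing the MLE with the method-of-moments estimator for the upper endpoint θ of a Uniform(0, θ) sample. The choice of model, the sample size, and the function names are assumptions made only for this example.

```python
import random

def mle_uniform_upper(xs):
    """MLE of theta for Uniform(0, theta): the sample maximum."""
    return max(xs)

def mom_uniform_upper(xs):
    """Method of moments: E[X] = theta / 2, so theta is estimated by 2 * mean."""
    return 2 * sum(xs) / len(xs)

# Hypothetical data: 50 draws from Uniform(0, 10) with a fixed seed.
rng = random.Random(1)
theta = 10.0
sample = [rng.uniform(0.0, theta) for _ in range(50)]
mle = mle_uniform_upper(sample)
mom = mom_uniform_upper(sample)
print(mle, mom)
```

The two estimators generally disagree on the same data: the MLE can never exceed the true θ, while the method-of-moments estimate can fall on either side of it.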

## Kolmogorov in the introduction

The introduction section provides a very useful starting point for non-experts reading the page, with the exception of the following sentence: "In the Kolmogorov structure function one deals with individual strings." I personally find it a bit strange that this appears here. If I understood what it is dealing with, I'd relegate it to further down in the article, but I don't. Would anyone care to move this sentence or improve it to make it more congruent with the rest of the introduction? Jimjamjak (talk) 15:15, 25 January 2012 (UTC)

Good point. I've just removed that sentence. It was added last November by Vitanyi (talk · contribs), who heavily edited 'Kolmogorov structure function'. Its relationship to maximum likelihood isn't clear, and it's certainly not suitable for the lead. If someone else understands it better and thinks it deserves a mention further down the article, that's fine with me. Qwfp (talk) 15:39, 25 January 2012 (UTC)

## Subsequence of the sequence converges in probability

In the article it says: 'Consistency: a subsequence of the sequence of MLEs converges in probability to the value being estimated.' I don't agree with it. The sequence converges in probability to the value. Why do you have to take the subsequence? 95.113.187.130 (talk) 21:05, 29 October 2012 (UTC)

Agreed. I have changed the article. As discussed in the Consistent estimator article, it is the sequence of estimators that converges in probability to the parameter. If only "a subsequence" did so, that would be a hideously weak result. Now, if it said "all subsequences", that would be different: that would imply that the sequence itself converges... but there's no reason to say it that way, AFAIK. - dcljr (talk) 00:49, 3 April 2013 (UTC)
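For anyone who wants to see what convergence in probability of the whole sequence looks like numerically, here is a rough simulation sketch. It assumes an exponential model with true rate 1, whose MLE is n divided by the sum of the observations; the tolerance ε = 0.2, the replication count, and the function names are arbitrary choices for illustration. It estimates P(|θ̂_n − θ| > ε) at several sample sizes.

```python
import random

def exponential_rate_mle(xs):
    """MLE of the rate of an exponential distribution: n / sum(x)."""
    return len(xs) / sum(xs)

def miss_frequency(n, true_rate=1.0, eps=0.2, reps=200, seed=0):
    """Fraction of replications in which the MLE misses the true rate by more than eps."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        xs = [rng.expovariate(true_rate) for _ in range(n)]
        if abs(exponential_rate_mle(xs) - true_rate) > eps:
            misses += 1
    return misses / reps

# Estimated miss probabilities for increasing sample sizes.
freqs = {n: miss_frequency(n) for n in (10, 100, 1000)}
for n, f in freqs.items():
    print(n, f)
```

Consistency says these miss probabilities tend to zero along the entire sequence of sample sizes, not merely along some subsequence; a simulation like this can only suggest the trend, not prove it.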

## Error in the article: Consistency

Hey!

There is an error in the conditions that are demanded for the MLE to be consistent. The second condition states that the parameter space must be compact, which is not true. The parameter space must be an open subset of ${\displaystyle \mathbb {R} ^{p}}$, which implies completeness, but not compactness.

For example the parameter space of ${\displaystyle \sigma }$ in the normal distribution is ${\displaystyle (0,\infty )}$. That subset is open, and thus complete but not compact.

Repairing this would call for replacing the whole part about consistency. Suggested sources are Lehmann (1999) Elements of Large Sample Theory, or Knight (2000) Mathematical Statistics. — Preceding unsigned comment added by Emogstad (talk · contribs) 10:09, 25 May 2013 (UTC)

## Problem in "Higher-order properties" section

I believe that the bias calculation has a typo. The tensor ${\displaystyle J_{j,ik}}$ should actually be ${\displaystyle J_{i,jk}}$. The cited source is equation (20) from Cox and Snell (1968) (http://www.jstor.org/stable/2984505). This problem is there as well.

I checked two other sources, which comply with the correction I propose:

• Equation (16) from Shenton, L. R., & Wallington, P. A. (1962). The Bias of Moment Estimators with an Application to the Negative Binomial Distribution. Biometrika, 49(1/2), 193. http://doi.org/10.2307/2333481
• 7th equation in page 29 from Firth, D. (1993). Bias reduction of maximum likelihood estimates. Biometrika, 80(1), 27–38. http://doi.org/10.1093/biomet/80.1.27

I'm not an expert on this topic, but rather an advanced user. If somebody confirms this, I would make the changes. — Preceding unsigned comment added by Fbalzarotti (talk · contribs) 21:55, 28 February 2016 (UTC)

Wikipedia:Be bold (but not reckless): You've done some good homework. Make the change. I don't have time to check the math or the sources you cite.
I suggest including a single footnote citing all three sources with a brief comment listing first Shenton and Wallington (1962) and Firth (1993), then noting that Cox and Snell (1968) write ${\displaystyle J_{j,ik}}$ where the other sources write ${\displaystyle J_{i,jk}}$, and you believe the other sources (because ...???)
Do you have an example where using ${\displaystyle J_{i,jk}}$ would give a different answer from using ${\displaystyle J_{j,ik}}$? If you do, then you might do a simulation to confirm -- and it would be worth mentioning that example in the text and your simulation in a footnote.
For the normal distribution, ${\displaystyle K_{ijk}}$ = 0 = ${\displaystyle J_{j,ik}}$, for all ${\displaystyle i}$, ${\displaystyle j}$, and ${\displaystyle k}$. For Poisson regression and logistic regression, ${\displaystyle J_{j,ik}}$ = 0; all the bias comes from ${\displaystyle K_{ijk}}$. These three examples don't expose the difference.
(I'm not an expert in this either.) DavidMCEddy (talk)
Thanks for the advice, I'll be bold and go ahead. I believe the other authors and not Cox because I went through the calculations myself, and I'm quite confident given that I found the other articles confirming them.
As for adding an example, I'm not sure. The situation with which I'm working is a bit odd.
I'm using a set ${\displaystyle \{n_{1},...,n_{m}\}}$ of random variables following a multinomial distribution with parameters ${\displaystyle \{p_{1},...,p_{m}\}}$ where ${\displaystyle p_{m}=1-\sum _{i=1}^{m-1}p_{i}}$ and ${\displaystyle N=\sum _{i=1}^{m}n_{i}}$. When working with maximum likelihood estimators for the parameters ${\displaystyle p_{i}}$ the necessary tensors for the bias estimation are ${\displaystyle K_{ijk}=2N[\delta _{ijk}\,p_{i}^{-2}-p_{m}^{-2}]}$ and ${\displaystyle J_{j,ik}=N\,(1-2p_{m})\,p_{m}^{-2}}$, where the indices ${\displaystyle \{i,j,k\}}$ run from 1 to m-1. They're both symmetric in ${\displaystyle \{i,j,k\}}$ and the problem doesn't show. So far, so good.
I'm actually working in a situation in which the parameters ${\displaystyle \{p_{1},...,p_{m}\}}$ depend on a set of variables ${\displaystyle \{x_{1},...,x_{k}\}}$. I'm using a maximum likelihood estimator for these variables ${\displaystyle x_{i}}$ and I need to calculate an approximation of their bias. In this context the result for ${\displaystyle J_{j,ik}}$ is
${\displaystyle J_{j,ik}=NE\left[\left(\sum _{u=1}^{m}{\frac {\partial p_{u}}{\partial x_{j}}}\right)\left(\sum _{u=1}^{m}-{\frac {1}{p_{u}}}{\frac {\partial p_{u}}{\partial x_{i}}}{\frac {\partial p_{u}}{\partial x_{k}}}+{\frac {\partial ^{2}p_{u}}{\partial x_{i}\partial x_{k}}}\right)+\sum _{u=1}^{m}{\frac {\partial p_{u}}{\partial x_{j}}}\left(-{\frac {1}{p_{u}}}{\frac {\partial p_{u}}{\partial x_{i}}}{\frac {\partial p_{u}}{\partial x_{k}}}+{\frac {\partial ^{2}p_{u}}{\partial x_{i}\partial x_{k}}}\right)\,\left({\frac {1}{p_{u}}}-2\right)\right]}$
which is not symmetric. Interchanging ${\displaystyle j}$ and ${\displaystyle i}$ changes the result completely. In general, the functional dependence ${\displaystyle {\bar {p}}({\bar {x}})}$ makes this tensor non-zero and non-symmetric, and those are the cases I'm working with.
I have calculated and simulated all this, but I'm afraid that posting this whole thing as an example may not be exactly informative, but quite disruptive. Fbalzarotti (talk) 15:42, 29 February 2016 (UTC)
I'm impressed.
After you get your work published -- at least as a tech report on the web -- I might like to see a line added saying something like what I wrote above: "For the multivariate normal distribution, ${\displaystyle K_{ijk}}$ = 0 = ${\displaystyle J_{j,ik}}$, for all ${\displaystyle i}$, ${\displaystyle j}$, and ${\displaystyle k}$. For Poisson regression and logistic regression, ${\displaystyle J_{j,ik}}$ = 0; all the bias comes from ${\displaystyle K_{ijk}}$. For an example in which ${\displaystyle J_{j,ik}}$ is not zero, see" your tech report.
Wikipedia:Conflict of interest policy "strongly discourages" "contributing to Wikipedia about yourself, family, friends, clients, employers, or your financial or other relationships." For examples like this, they suggest you use the {{request edit}} template on a talk page like this: Provide the suggested change with a link to your paper, noting that it's your work. Then hope that someone else will make the actual change.
And good luck with your research. DavidMCEddy (talk) 17:40, 29 February 2016 (UTC)

## Move to hyphenated title

All the literature calls it "maximum likelihood" without a hyphen. Why have you moved it? It is not a compound adjective requiring hyphenation (we don't say this estimation is maximum-likelihood). It is a compound noun and therefore should not be hyphenated. Tayste (edits) 02:30, 20 June 2016 (UTC)

@Tayste: I moved maximum likelihood to maximum-likelihood estimation, because estimation was part of the acronym (MLE) mentioned in the lead. The hyphen was not intentional. I've now requested a technical move to maximum likelihood estimation. fgnievinski (talk) 02:50, 20 June 2016 (UTC)
Oh I see, thank you. Tayste (edits) 20:09, 20 June 2016 (UTC)