# Talk:Rule of succession

WikiProject Statistics (Rated C-class, Mid-importance)

This article is within the scope of the WikiProject Statistics, a collaborative effort to improve the coverage of statistics on Wikipedia. If you would like to participate, please visit the project page or join the discussion.

## the correct formula of Euler integral

Michael, the formula

$\int_0^1 p^s(1-p)^{n-s}\,dp={(s+1)!(n-s+1)! \over (n+2)!}$

should be corrected as

$\int_0^1 p^s(1-p)^{n-s}\,dp={(s)!(n-s)! \over (n+1)!}$

May I correct this for you? -- Jung dalglish 2 July 2005 08:35 (UTC)

Go ahead -- I'll check it closely later. Michael Hardy 3 July 2005 00:17 (UTC)

I changed it. - April 2006, after some idiot had reverted Jung's correction
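
The corrected identity can be checked exactly for small values. The following is a minimal sketch, not part of the article: the function names `beta_integral` and `closed_form` are made up for illustration, and the integral is evaluated by expanding (1-p)^{n-s} with the binomial theorem and integrating term by term.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(s, n):
    """Exact value of the integral of p^s (1-p)^(n-s) over [0, 1].

    Expands (1-p)^(n-s) binomially and integrates each power of p.
    """
    f = n - s
    return sum(Fraction((-1) ** k * comb(f, k), s + k + 1) for k in range(f + 1))

def closed_form(s, n):
    """The corrected right-hand side: s! (n-s)! / (n+1)!."""
    return Fraction(factorial(s) * factorial(n - s), factorial(n + 1))

# The two agree for every 0 <= s <= n checked here.
for n in range(8):
    for s in range(n + 1):
        assert beta_integral(s, n) == closed_form(s, n)
```

The case s = n = 0 already separates the two candidates: the integral is 1, which matches 0!0!/1!, whereas the earlier formula would give 1!1!/2! = 1/2.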

## Notation

I'm studying this stuff in school and I didn't know the Cartesian Product symbol... Couldn't we put a side-bar in that lists the mathematical notation used and link it to the "Table of Mathematical Symbols" or something along those lines?

I think it would be helpful for math pages in general. Many other articles for different subjects have sidebars like that.

## Why does Laplace use the expected value (mean) of the posterior?

Why does Laplace use the expected value (mean) of the posterior? This is not obvious to me, but it's crucial. Could someone who knows add something to the entry about the justification for this? Alex Holcombe 11:13, 14 October 2007 (UTC)

When the parameter whose posterior probability distribution we're finding is itself a probability, then its expected value is the marginal probability, conditional on current data, of the event in question---in this case, tomorrow's sunrise. That is the content of the law of total probability. Michael Hardy 16:06, 14 October 2007 (UTC)
I think I'm interested in a different kind of answer. As I understand it, the expected value minimizes the *average* difference between the estimate and the actual unknown parameter value. But in other contexts, Bayesians may minimize the average sum of squared difference, and in maximum likelihood the mode of the posterior distribution is used. Did Laplace have any particular reason for choosing the mean? Is this somehow embedded in the law of total probability? Sorry if this is a naive question. Alex Holcombe 10:23, 15 October 2007 (UTC)
No, it does not. The expected value does not minimize that average distance. The median does. The expected value minimizes the average square of the distance. Besides, it's not clear that estimation is what's being done here. Michael Hardy (talk) 19:07, 7 January 2008 (UTC)
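
The distinction between the two loss functions can be illustrated numerically. This is only a sketch with made-up data and a brute-force grid search (both are assumptions chosen for illustration), not anything taken from the article.

```python
from statistics import mean, median

# Hypothetical sample, deliberately skewed so that mean and median differ.
data = [0.1, 0.2, 0.25, 0.3, 0.9]

# Candidate estimates on a grid over [0, 1].
grid = [i / 1000 for i in range(1001)]

def avg_abs(c):
    """Average absolute distance from the estimate c to the data."""
    return mean(abs(x - c) for x in data)

def avg_sq(c):
    """Average squared distance from the estimate c to the data."""
    return mean((x - c) ** 2 for x in data)

best_abs = min(grid, key=avg_abs)  # lands on the median
best_sq = min(grid, key=avg_sq)    # lands on the mean

print(best_abs, median(data))
print(best_sq, mean(data))
```

On this sample the grid point minimising the average absolute distance coincides with the median, and the one minimising the average squared distance coincides with the mean.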

(203.173.13.181 (talk) 14:08, 22 January 2011 (UTC)) I know this is a two year old question, but just to clarify Laplace does not set out to use the expected value of the posterior, rather he sets out to find the posterior predictive probability of a success in the next trial. It is just the mathematics of the problem that this calculation happens to yield the posterior expected value of p. The line straight after the equation

"....Since the conditional probability for success in the next experiment, given the value of p, is just p,...."

makes this clear, although this may have been inserted after you asked this question. (203.173.13.181 (talk) 14:08, 22 January 2011 (UTC))
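
For reference, the calculation described here can be written out as follows. This is a sketch assuming the uniform prior on p and s successes in n trials; the label X_{n+1} for the outcome of the next trial and the notation f(p | s, n) for the posterior density are introduced here only for illustration.

$P(X_{n+1}=1 \mid s, n)=\int_0^1 P(X_{n+1}=1 \mid p)\,f(p \mid s, n)\,dp=\int_0^1 p\,f(p \mid s, n)\,dp=E[p \mid s, n]$

$E[p \mid s, n]={\int_0^1 p^{s+1}(1-p)^{n-s}\,dp \over \int_0^1 p^{s}(1-p)^{n-s}\,dp}={(s+1)!\,(n-s)! \over (n+2)!}\cdot{(n+1)! \over s!\,(n-s)!}={s+1 \over n+2}$

The first line is the law of total probability; the posterior mean appears only because the conditional probability of success given p is p itself.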

## The probability that the sun will rise tomorrow

I want to drop that part of the article. It is already contained in the article about the sunrise problem, and it only furthers misunderstandings of the rule of succession - including the reference to "Bayesian philosophy".

The rule of succession is a mathematical theorem. Problems arise only when the conditions are ignored under which the theorem holds or in some cases, if certain issues should be treated as probabilities. -- Zz (talk) 13:50, 7 January 2008 (UTC)

Reverted.
Saying "the rule of succession is a theorem" and, therefore, that "the article should contain nothing but the theorem" is a cop-out. The rule of succession is a calculation. The important question is whether (or when) the calculation is appropriate.
It is useful to include the sunrise question here, because (i) it helps make the discussion much more concrete; (ii) it is the original connection in which Laplace presented the calculation.
Of course, the rule of succession has many other applications -- eg in symbol probability estimates for information coding, and/or more widely in probabilistic modelling on the back of rather limited quantities of data.
But the general principle, and the general controversy (or at least debatability of the appropriateness of such estimates/calculations) is well reflected in the sunrise problem.
If there is a separate article on the sunrise problem, then arguably it ought to be turned into a redirect here. Jheald (talk) 16:46, 15 January 2008 (UTC)
I wrote here. No objection came. I changed the article accordingly. You revert without waiting for the discussion of your objection. That is not the best style.
The rule of succession is a theorem. It is a mathematical fact. It can be proved. Laplace did not grab it from thin air. As every theorem, it holds only if the conditions are fulfilled. For a reasonable discussion of the objections raised, see for instance Jaynes' book of probability theory. The examples there are illuminating.
As for the "sunrise problem", there is no deeper connection other than it was an example, and a famously misleading one at that. Laplace saw no sunrise problem in the first place. Instead, he gave an example with a twofold purpose: 1. if it were probability, you would calculate it this way, 2. seeing the causal connection is much more rewarding than using the approach of probability. And most people missed the point. Anyway, if you think that the articles should be one, propose to merge them. -- Zz (talk) 13:31, 18 January 2008 (UTC)

## The prior if we had not known from the start that both success and failure are possible

What exactly does "if we had not known from the start that both success and failure are possible" mean? They are both surely always logical possibilities when executing a Boolean test procedure (along with the possibility of never halting with a decision, in general). Assuming a value is always produced, it seems to mean that at the start, as a prior, we believed that one or neither value may have exactly 0 probability - but surely that is taken as read, albeit no exact probability could ever be proved by any finite number of samples. So, instead, if worth saying and considering, it presumably means that the prior assumption is that all samples will have the same value - unless proven otherwise. But even if my final interpretation is correct - and especially otherwise - the intended meaning ought to be clarified.

Anyway, such an (initial) prior term, p, "when not even known whether success or failure is possible", seems dubious - to say the least. As stated in the article, it leads to p|x[n]|..|x[1] = s/n. However, consider the extreme experimental case where only successes or only failures occurred (after a finite number of samples and at least 1 sample). The resultant probability for future observations would then be 1 for that observed value and 0 for the other. But that would violate Cromwell's rule, which would be silly - as described on its Wiki page. No finite amount of new evidence could change such a conclusion (i.e., given 0 or 1 as a prior in Bayes' rule)! What if further samples showed other values too? OK, the s/n formula allows them to be incorporated, but how can that formula be valid when its derivation is based on Bayes' rule (at least in the proof in the article), in which a 0 prior can never change (or at best becomes undefined) as further samples are observed? Surely each logical possibility should have a non-zero prior, however small, and only ever approach zero, as more samples are tested. Furthermore, unlike (s+1)/(n+2), s/n is undefined before any samples have been made, which seems to state that the prior is undefined (0/0, not just "fuzzy"), thereby invalidating any probabilities derived from it, as samples are actually observed!

Conversely, if the number of different values that can actually occur is absolutely unknown, the assumption at each step (term in the product) should surely be the number of different result values observed so far (in the previous samples) plus any different values assumed in the prior. In this extreme case, with no upper limit to the number of different values, the prior assumption should surely be in the category "novelty", i.e., some value that has not yet been observed. As in Cromwell's rule: I beseech you, in the bowels of Christ, think it possible that you may be mistaken [that you have seen all possible values in the series]. If we then treat (map) the observation of a "novelty" as "success", we can still use the (s+1)/(n+2) formula as the probability that a new sample will produce a "novelty". In this extreme case, the prior probability of a "novelty" would then be (0+1)/(0+2) = 1/2, i.e., initially, a "novelty" would be assumed as likely as not in any new sample, which, naively, seems to be the correct, maximum entropy, assumption, given 2 possibilities and no other knowledge. (All possible "novelty" values are considered equivalent at this stage. So the first sample may (50%) turn out to be a "novelty"; fine. Otherwise (50%) it must be something that you had not thought of, but that, too, would surely be classed as a "novelty"! Plainly, whatever it is, it must be a "novelty"! Each later sample value would either have been previously observed, or would be straightforwardly a "novelty" - all possibilities now covered!) As with any other definition of "success", the probability of a "novelty" in the next sample in the given sequence would become ever more accurately determined as more samples are observed.

If there were prior knowledge that the maximum number of "novelty" values is two (labelled "success" and "failure", say), then once a sample had been observed, the future possibilities would be the same or a "novelty". If a second value were eventually observed, a "novelty" would no longer be a possibility in a future sample. Either way, whenever considering the next sample, the (s+1)/(n+2) formula, rather than the s/n formula, would surely apply.

However you look at it, the s/n formula seems wrong! If not, further explanation is surely needed, to address my points above. Regrettably, these are only apparently "common sense" deductions from what I have read in Wiki; I know of no published source, so I cannot put the above argument in the article! I may well be wrong - either way, someone please clarify the article.

John Newbury (talk) 20:04, 4 July 2010 (UTC)

On reflection, more accurately, if the number of possibilities is totally unknown, and assuming an observation will always return a value, the prior probability before any observations is simply 1 for "novelty" - we know that there is no other possibility at that stage! Only after the first observation are there 2 possibilities, "novelty" or not-"novelty", i.e., a value already seen. There is no reason to favour one possibility over the other, so the probability for future observations then becomes 0.5 for "novelty" and 0.5 for not-"novelty". (Not certainty for "novelty", even though the number of novel possibilities is infinite, whereas there is only 1 non-"novelty" so far: we are always only concerned with 2 categories after there has been at least 1 observation.) So the probability after 1 observation is then just like the prior probability after 0 observations when it is known that there are exactly 2 possibilities. So the formula for a "novelty" when the number of possibilities is unknown becomes s/(n+1), i.e., we move the origin by 1, so count 1 fewer "novelties" and 1 fewer observations, than with formula (s+1)/(n+2).

This analysis avoids the dubious idea of "novelty" or "something not yet thought of" being equivalent to "novelty" when there had been 0 observations! However, there is now a discontinuity: the new formula says the probability of "novelty" when s=n=0 is 0, when we know it is 1. The reason for the discontinuity in the complete function is that before there were any observations, the only possibility was "novelty"; whereas thereafter, not-"novelty" is also a possibility.

Nevertheless, the article still seems wrong to use (s/n), albeit I now accept that (s+1)/(n+2) is no longer applicable in this case!

John Newbury (talk) 22:27, 10 July 2010 (UTC)

I think the author was intending to describe the prior when p can be anything in [0, 1] inclusive, as opposed to (0, 1) (exclusive). However in the continuous case these are exactly the same thing because the probability under a continuous pdf of any particular exact value is 0 anyway.
Anyway the formula s/n is definitely wrong; you shouldn't be able to get a more constrained posterior result (s/n = 0/n = 0) from a less constrained prior ("p can be anything in [0, 1]"). I suspect the 1/(p(1-p)) prior it's derived from. It can't be (as the article says) the prior of maximum ignorance because the only constraint is that p is definitely in [0, 1], which the uniform(0, 1) prior also satisfies, and the uniform prior has higher entropy.
So I'm inclined to just delete the section there - I'm sure the s/n formula is incorrect, and the prior it's derived from is extremely dubious.
203.28.244.254 (talk) 09:16, 21 October 2010 (UTC)

(203.173.13.181 (talk) 02:37, 19 January 2011 (UTC)) I can cite a reference saying that the Haldane prior (the improper "Beta(0,0)" prior) is the appropriate way to express "ignorance" in that we are not even sure that it is possible for both success and failure to occur (this is because the "odds" of success is a genuine scale parameter, for which everyone agrees that ignorance is expressed by the "Jeffreys prior" p(O)=1/O, and the Haldane prior is the appropriate transform of this prior). The link to the article is here http://www.stats.org.uk/priors/noninformative/Smith.pdf. But what happens is that the POSTERIOR distribution is not proper until at least one success and at least one failure has been observed (this gives somewhat a meaning to saying "the exception that proves the rule"). So the Expectation of the probability DOES NOT EXIST if only 1 event has been observed. I think this makes sense (the improper posterior) "logically speaking" because without knowledge that both success and failure are possible, we are running the risk of specifying a "logical contradiction" in that we have said a variable can have 2 values, while at the same time saying it can only have 1 value.

If we are sure that success and failure are possible, this can be seen as observing one success and one failure using the Haldane prior, to give the uniform distribution! This was expressed in Jaynes's book "Probability Theory: The Logic of Science" (203.173.13.181 (talk) 02:37, 19 January 2011 (UTC))

(203.173.13.181 (talk) 13:49, 22 January 2011 (UTC)) I will adjust the text under the use of this prior to make it clear that the use of the improper Haldane prior (1/(p(1-p)) or Beta(0,0)) gives an improper posterior when s=n or s=0. I will save a copy of the text prior to changing it, and paste it here, so that the changes can be reversed if people so desire.

PREVIOUS TEXT

However, it is not clear that this problem or result are meaningful. It certainly cannot apply to any realistic situation, because it violates Cromwell's rule. As soon as the first observation has been made, (n=1, s=0 or 1), the formula declares that the observed value is certain, so the other value is impossible, whereas the other value is surely still a possibility, by problem definition (as well as common sense), since we have no prior knowledge that either value is not possible. We can never know that for certain after any finite number of observations, although we might eventually know that one or both are possible. Then again, the problem explicitly asserts that we do not know that either value is possible: a contradiction! Of course, if neither value were possible (albeit we would not know it), no observation could be completed, but we would wait for ever for a result, as in Turing's halting problem. Alternatively, the observation may eventually time-out, say, and indicate "no value", but then we have a three possibilities, contrary to the analysis that produced the formula!

MY REPLACEMENT

Thus, with the prior specifying total ignorance, the probability of success is governed by the observed frequency of success. However, the posterior distribution which led to this result is the Beta(s,n-s) distribution, which is not proper when s=n or s=0 (i.e. the normalisation constant is infinite when s=0 or s=n). This means that we cannot use the posterior distribution to calculate the probability of the next observation being a success when s=0 or s=n. However, we can use this improper posterior as an improper prior to be updated by future observations (although the posterior will continue to be improper until at least one success and at least one failure have been observed). This puts the information contained in the rule of succession in a clearer light: it can be thought of as expressing the prior assumption that if sampling was continued indefinitely, we would eventually observe at least one success, and at least one failure in the sample. The prior expressing total ignorance does not assume this knowledge. (203.173.13.181 (talk) 13:49, 22 January 2011 (UTC))
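
The difference between the two priors can be made concrete with a short calculation. This is a minimal sketch under the assumption of a Beta(a, b) prior with integer parameters; the function name `predictive_success` is hypothetical and chosen only for this illustration.

```python
from fractions import Fraction

def predictive_success(s, n, a, b):
    """Posterior mean of p after s successes in n trials, Beta(a, b) prior.

    The posterior is Beta(s + a, n - s + b). It is proper only when both
    parameters are strictly positive; otherwise no mean exists.
    """
    alpha, beta = s + a, n - s + b
    if alpha <= 0 or beta <= 0:
        raise ValueError("improper posterior: no predictive probability")
    return Fraction(alpha, alpha + beta)

print(predictive_success(3, 10, 0, 0))  # Haldane prior Beta(0,0): s/n
print(predictive_success(3, 10, 1, 1))  # uniform prior Beta(1,1): (s+1)/(n+2)

try:
    predictive_success(0, 5, 0, 0)      # only failures seen under Beta(0,0)
except ValueError as err:
    print(err)
```

With Beta(0,0) the result is s/n whenever 0 < s < n, and the function refuses to answer when s=0 or s=n, which is the case the replacement text describes. Starting from Beta(0,0) and adding one success and one failure reproduces the uniform-prior result.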

(58.6.95.232 (talk) 12:15, 7 February 2011 (UTC)) On having a closer read of the "further analysis" section, it has a lot of "waffle" and unnecessary remarks. While they are interesting remarks about setting prior probabilities, they are not relevant to the analysis of the rule of succession. So I have largely removed these. The old text is pasted below

OLD VERSION

In the absence of prior knowledge to the contrary, the sum of the pseudocounts should be two, however many possibilities exist, even if there is prior knowledge about their relative values (i.e., prior probabilities) – see generalizing to any number of possibilities, above. Therefore, even if there is some prior knowledge, the sum of any estimated pseudocounts must never be less than two, otherwise the weight given to the prior knowledge would actually decrease, which would be nonsense. This is a useful plausibility check for pseudocounts.

Unless there is prior knowledge that suggests the prior probabilities should be given extra weight in the face of contrary evidence from actual observations, then the sum of the pseudocounts should always be two. For instance, the total weight of prior knowledge should be unaffected by discovering that "success" is due to throwing a six on a die toss, rather than a head on a coin toss; this should only affect the expected relative probability of success versus failure. Also, no extra weight is implied by prior knowledge of how it may be usefully described or modelled, which boils down to how its possibilities may usefully be split or grouped, depending on what we are interested in, such as whether we include the angle of the die face, or the position on the ground (normally quantized for simplicity). Although the prior probabilities of "high-level" possibilities that we may be interested in, such as "success", may depend on the number of their component possibilities to which we feel indifferent (i.e., have no reason to expect one more than another, by principle of indifference), the total weight of the prior knowledge (sum of pseudocounts) compared to that of actual observations (one per observation) should obviously be unchanged by any change of how we choose to group the possibilities or otherwise represent or model it.

However, inasmuch as there is also prior knowledge that the system should have extra weight compared to actual observations, then the sum of the pseudocounts should always exceed two, scaled up (never down) according to how much weight we feel, often subjectively, should be given to the prior knowledge, compared to actual observations. This should be due to knowledge, however vague and subjective, of earlier observations of similar systems, prepared in a similar manner, as opposed to a theoretical analysis of a model, whether and however accurate and appropriate. For instance, consider that we know that "success" is equivalent to throwing a head on a random coin, obtained from a bank in the normal way, with little chance of tampering by anyone who may know or suspect the intended use of the coin. Our prior knowledge of such coins would indicate that very high, equal pseudocounts should be assigned to head and tail, thereby ensuring that even after a run of ten heads and no tails, say, we would still estimate the probability of the next throw being a head to be just over 1/2, rather than 11/12 if we knew nothing of how the system had been prepared, but only the possibilities, "success" and "failure". (Certainly, if we can use such a word, not now more probably a tail, according to the fallacious law of averages.) "Preparation" includes not only the supposed physical model (in this case, including a thin uniform disc, with sides effectively labeled "head" and "tail"), but also the expectation that the model is realistic. Our prior knowledge, especially our choice of model and the expectation that it is appropriate, may be quite different if, instead, the coin were obtained from a man in a pub who had offered to bet on it, let alone if from a conjuror.
Note that the expectation of head after those observations should not change if we now choose to divide head and tail into many component possibilities, such as angle of rotation or position on the ground - mere arbitrary nominal possibilities, which should not further increase the total weight of the pseudocounts compared to those of actual observations! The sum of the pseudocounts should not be too high, however, since there is always some non-zero probability of bias, which sufficient observations should be allowed to show.
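
The coin example in the old text can be worked through numerically. This is a minimal sketch: the function name `predictive_heads` is hypothetical, and the pseudocount of 1000 per side is an arbitrary stand-in for "very high, equal pseudocounts", not a value taken from the article.

```python
from fractions import Fraction

def predictive_heads(heads, tails, pseudo_heads, pseudo_tails):
    """Probability of heads on the next throw, given observed counts
    and prior pseudocounts for each side."""
    total = heads + tails + pseudo_heads + pseudo_tails
    return Fraction(heads + pseudo_heads, total)

# Ten heads, no tails, pseudocounts summing to two (one per side).
ignorant = predictive_heads(10, 0, 1, 1)

# Same observations, but a coin from the bank: large equal pseudocounts
# (1000 each is an assumed figure for illustration).
bank_coin = predictive_heads(10, 0, 1000, 1000)

print(ignorant)          # 11/12
print(float(bank_coin))  # just over 1/2
```

The run of ten heads moves the estimate to 11/12 when the pseudocounts sum to two, but barely moves it from 1/2 when the prior weight is large.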

In principle (see Cromwell's rule), no possibility should have its probability (or its pseudocount) set to zero, since nothing in the physical world should be assumed to be strictly impossible (though it may be) – even if it would be contrary to all observations to date and current theories. Indeed, Bayes rule takes absolutely no account of an observation that was previously believed to have zero probability — it is still declared impossible. However, most possibilities (physical or nominal), in most physical systems, must be ignored, to be tractable, to avoid an explosion of possibilities (see the frame problem). In which case, there should normally be a "miscellaneous", catch-all possibility, which should have a non-zero pseudocount. This could be sub-divided later, if need be. Alternatively, if all such possibilities would be uninteresting, the problem could be redefined to classify them as spurious outliers, to be ignored – such as a coin landing on its edge. Systems may sometimes usefully be studied where the miscellaneous or ignored outlier possibilities are expected to be the most common occurrence – such as the expected state of any needles that may be found (rather than the expected success in finding any), when searching a proverbial haystack, ignoring any hay that might be found.

However, it is sometimes debateable whether apparently prior knowledge should only affect the relative probabilities, or also the total weight of the prior knowledge compared to actual observations. In principle, knowledge only of the internal structure of a system should not increase the total weight due to prior knowledge, only the relative prior probabilities, such as knowing that "success" is due to a six on a die toss, rather than a head on a coin toss. But consider the probability of a draw in chess, which is too complex to fully analyse, given bounded rationality, of the analyst and the players concerned. Initially, we might naively assume that the prior probability is 1/2, since the only other possibility is non-draw. However, further analysis might show that the majority of final-position possibilities were non-draw, thereby increasing its relative prior probability in proportion. Yet more analysis might show that the majority of game-lines lead to draws (depending on the assumed ability of the players). Similarly, maybe more possible (or plausible; worth considering) plays on an average turn lead to a draw (again depending on the assumed ability of the players). Finally, actual games may have yet different probabilities. Even though the game is well defined, the prior probabilities cannot be determined, in practice, without some real-world knowledge: the expected "abilities" of the players – itself a vague concept. In the absence of other knowledge, the best assumption is probably that the players would have similar abilities to the analyst, which is effectively knowledge of a sample of one player (not one game), however vaguely relevant. So, in this case, a sum of pseudocounts greater than two would be appropriate, further amplified by any real-world knowledge of other games.

Similarly, if physically difficult to analyse, such as if a die were a known cuboid, rather than a cube. There may be a continuum of possibilities and observations. Often, at least in practice, a numerical method, such as a Monte Carlo method, may be required. In which case, we are back to processing discrete samples of discrete possibilities, albeit of the model, which will generally only be an approximation of the actual system. To allow for likely mismatch between model and reality, we should therefore limit the sum of the prior pseudocounts, no matter how many simulated samples were observed.

(58.6.95.232 (talk) 12:15, 7 February 2011 (UTC))

## Generalization to any number of possibilities

The logic behind this generalisation is not satisfactory, and there are no appropriate references where a more formal argument is made. I can provide a reference to a more formal generalisation given in PROBABILITY THEORY: THE LOGIC OF SCIENCE. Pages 563-568 discuss the original rule of succession, and pages 568-571 discuss the generalisation to m categories.

One simple way to check the logical consistency of the results, is that if we observe nothing, the rule of succession reverts back to the principle of indifference (i.e. initial probability = 1/2 before any event is observed). Therefore, this principle should be maintained when generalising in the way described, but it clearly does not (unless m=2) for we get a prior probability for success of

$P(s=0, n=0, m)={0+{2 \over m} \over 0+2}={1 \over m}$

But this is clearly not expressing prior ignorance about a success unless m=2, for if m=4, say, this means that, prior to observing any data, we assume that failure is 3 times as likely as success (Pr(success)=1/4). This easily becomes absurdly informative if the number of categories is huge (say m=10^{500} categories; then this method assumes that the probability of an arbitrary half of these categories occurring is 10^{-500}≈0, which is hardly "...assuming absolutely no prior knowledge about them, except the number of them, and that any possibility being considered is from that set...." as stated at the beginning of the section).
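
The prior implied by the generalisation under dispute can be computed directly. This is a minimal sketch; the function name `disputed_rule` is made up for illustration, and it implements only the formula (s + 2/m)/(n + 2) quoted above.

```python
from fractions import Fraction

def disputed_rule(s, n, m):
    """The generalisation being questioned: (s + 2/m) / (n + 2)."""
    return (s + Fraction(2, m)) / (n + 2)

# Before any observation (s = n = 0) the rule returns 1/m.
print(disputed_rule(0, 0, 2))  # 1/2, the principle of indifference
print(disputed_rule(0, 0, 4))  # 1/4, i.e. failure taken as 3 times as likely

# With a huge number of categories the prior for "success" is tiny.
tiny = disputed_rule(0, 0, 10 ** 500)
print(tiny == Fraction(1, 10 ** 500))
```

Only m=2 recovers a prior of 1/2; for any larger m the rule starts out favouring failure, which is the objection raised here.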

The way to proceed from here is very carefully, and to re-derive the results from first principles (this is what Jaynes and Bretthorst do in their book) rather than to introduce an intuitively sensible generalisation.

This is done by setting a uniform prior over the initial m categories and using the multinomial distribution (the multivariate generalisation of the binomial).

Let A_{i} denote the event that the next observation is in category i (i=1,...,m), let n_{i} denote the number of times category i was observed, and let n=n_{1}+...+n_{m} be the total number of observations made. The result, in terms of the original m categories, is:

$P(A_{i} | n_{1},...,n_{m}, m)={n_{i} + 1 \over n + m}$

Because the propositions or events A_{i} are mutually exclusive, to collapse to two categories you simply add up the probabilities of the categories i which correspond to "success" to get the probability of success. Suppose that you do this, aggregating c categories as "success" and m-c categories as "failure". Let s denote the sum of the n_{i} values for the categories which have been termed "success". The probability of "success" at the next trial is then:

$P(success| n_{1},...,n_{m}, m, c)={s + c \over n + m}$

This differs from the intuitive rule (s+2/m)/(n+2) unless m=2 and c=1 (so that 2/m=c), which just gives the binary rule of succession.
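
The two formulas above can be checked against each other with a few lines of code. This is a minimal sketch with made-up counts; the function names `category_prob` and `success_prob` are hypothetical and serve only this illustration.

```python
from fractions import Fraction

def category_prob(counts, i):
    """(n_i + 1) / (n + m) for category i, given the list of counts."""
    n, m = sum(counts), len(counts)
    return Fraction(counts[i] + 1, n + m)

def success_prob(counts, success_idx):
    """Collapse to two categories by summing over the 'success' indices."""
    return sum(category_prob(counts, i) for i in success_idx)

# Hypothetical data: m = 4 categories, n = 10 observations.
counts = [3, 0, 2, 5]

# Categories 0 and 1 count as "success": c = 2, s = 3 + 0 = 3.
p = success_prob(counts, [0, 1])
print(p)  # equals (s + c) / (n + m) = 5/14

# With m = 2 and c = 1 the binary rule (s + 1) / (n + 2) is recovered.
print(success_prob([7, 3], [0]))  # 8/12 = 2/3
```

Summing the per-category probabilities gives (s + c)/(n + m), and the per-category probabilities themselves sum to one over all m categories.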

This indicates that mere knowledge of more than two outcomes which we are sure are possible (i.e. we are sure they would occur if we sampled indefinitely) is relevant information when collapsing these categories down into just two. In "shrinkage" terminology, the more categories that are enumerated (i.e. the larger m is), the more the probability of success is "shrunk" towards the value c/m (which is 1/2 in the example you give).

I believe something analogous to a "law of large numbers" style process is occurring here, in that as you aggregate a larger and larger number of categories, the variability of the aggregation tends to decrease, and thus a priori one is more confident that the true proportion will actually be c/m. -- Preceding unsigned comment added by 124.168.210.217 (talk) 16:20, 22 January 2011 (UTC)

(124.168.210.217 (talk) 16:29, 22 January 2011 (UTC))

Apologies, I forgot to sign the above dispute; my name is Peter, for those who wish to address me by name. I wanted to propose that the calculations above replace the existing generalisation. I will replace them, unless I hear from someone questioning the validity of the proposed arguments about the inferiority of the current generalisation and the appropriateness of the new one.

Cheers, Peter (124.168.210.217 (talk) 16:29, 22 January 2011 (UTC))

(58.6.92.1 (talk) 14:12, 6 February 2011 (UTC)) In changing the generalisation, I must also change the "further analysis" section, which talks a lot about why the pseudocounts in the denominator must be equal to 2 (which they need not be). I will try to keep the general tone of this section, but change it so that it remains consistent with the rest of the article. I will paste the text here, so that it can be changed back if desired. My main change in tone when speaking of generalisations will be to emphasise the subtlety of describing what prior information you actually have, and how this can affect the conclusions. I will also give a brief discussion of how the prior information could be slightly modified to give a slightly different generalisation (which includes the previous $\frac{2}{m}$ as a special case). (58.6.92.1 (talk) 14:12, 6 February 2011 (UTC))