Fisher's noncentral hypergeometric distribution

Figure: Probability mass function of Fisher's noncentral hypergeometric distribution for different values of the odds ratio ω (m1 = 80, m2 = 60, n = 100, ω = 0.01, ..., 1000).

In probability theory and statistics, Fisher's noncentral hypergeometric distribution is a generalization of the hypergeometric distribution in which sampling probabilities are modified by weight factors. It can also be defined as the conditional distribution of two or more binomially distributed variables, conditional on their fixed sum.

The distribution may be illustrated by the following urn model. Assume, for example, that an urn contains m1 red balls and m2 white balls, totalling N = m1 + m2 balls. Each red ball has the weight ω1 and each white ball has the weight ω2. We will say that the odds ratio is ω = ω1 / ω2. Balls are now taken at random in such a way that the probability of taking a particular ball is proportional to its weight, but independent of what happens to the other balls. The number of balls taken of a particular color then follows a binomial distribution. If the total number n of balls taken is known, then the conditional distribution of the number of red balls taken, given n, is Fisher's noncentral hypergeometric distribution. To generate this distribution experimentally, we have to repeat the experiment until it happens to yield exactly n balls.
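The repeat-until-n experiment can be sketched as a small rejection sampler. This is only an illustration: the function names are hypothetical, and it assumes that a ball of weight w is taken with probability w / (1 + w), so that the odds of taking a ball equal its weight and the odds ratio between colors is exactly ω1 / ω2.

```python
import random


def sample_fisher(m1, m2, n, w1, w2, rng):
    """Return the number of red balls from one accepted run of the experiment.

    Assumption (not from the text): a ball of weight w is taken with
    probability w / (1 + w), i.e. its odds of being taken equal its weight.
    """
    p1 = w1 / (1.0 + w1)
    p2 = w2 / (1.0 + w2)
    while True:
        # Every ball is taken or left independently of all the others.
        red = sum(rng.random() < p1 for _ in range(m1))
        white = sum(rng.random() < p2 for _ in range(m2))
        # Keep the run only if it happens to give exactly n balls.
        if red + white == n:
            return red


def simulate_mean(m1, m2, n, w1, w2, draws=5000, seed=1):
    """Average red-ball count over `draws` accepted runs."""
    rng = random.Random(seed)
    total = sum(sample_fisher(m1, m2, n, w1, w2, rng) for _ in range(draws))
    return total / draws


if __name__ == "__main__":
    print(simulate_mean(5, 4, 5, 2.0, 1.0))
```

The rejection step is what makes the scheme slow in practice: runs whose total differs from n are discarded, so the expected number of repetitions grows as the event "exactly n balls" becomes less likely.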

If we want to fix the value of n prior to the experiment then we have to take the balls one by one until we have n balls. The balls are therefore no longer independent. This gives a slightly different distribution known as Wallenius' noncentral hypergeometric distribution. It is far from obvious why these two distributions are different. See the entry for noncentral hypergeometric distributions for an explanation of the difference between these two distributions and a discussion of which distribution to use in various situations.

The two distributions are both equal to the (central) hypergeometric distribution when the odds ratio is 1.

Unfortunately, both distributions are known in the literature as "the" noncentral hypergeometric distribution. It is important to be specific about which distribution is meant when using this name.

Fisher's noncentral hypergeometric distribution was first given the name extended hypergeometric distribution (Harkness, 1965), and some authors still use this name today.

Univariate distribution

Univariate Fisher's noncentral hypergeometric distribution
Parameters: m_1, m_2 \in \mathbb{N}; N = m_1 + m_2; n \in [0,N); \omega \in \mathbb{R}_+
Support: x \in [x_\min,x_\max], where x_\min=\max(0,n-m_2) and x_\max=\min(n,m_1)
pmf: \frac{\binom{m_1}{x} \binom{m_2}{n-x} \omega^x}{P_0}, where P_0 = \sum_{y=x_\min}^{x_\max} \binom{m_1}{y} \binom{m_2}{n-y} \omega^y
Mean: \frac{P_1}{P_0}, where P_k = \sum_{y=x_\min}^{x_\max} \binom{m_1}{y} \binom{m_2}{n-y} \omega^y\, y^k
Mode: \left\lfloor \frac{-2C}{B - \sqrt{B^2-4AC}} \right\rfloor, where A=\omega-1, B = m_1 + n - N -(m_1+n+2)\omega, C = (m_1+1)(n+1)\omega
Variance: \frac{P_2}{P_0} - \left( \frac{P_1}{P_0} \right)^2, where P_k is given above

The probability function, mean and variance are given in the table above.
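As a minimal sketch, the tabulated probability function, mean and variance can be evaluated by direct summation. The function names are hypothetical; the argument order (n, m1, N, ω) follows the fnchypg notation used in this article.

```python
from math import comb


def fnchypg_table(n, m1, N, omega):
    """Return (xs, probs): the support and the probability of each point."""
    m2 = N - m1
    x_min, x_max = max(0, n - m2), min(n, m1)
    xs = list(range(x_min, x_max + 1))
    # Unnormalised terms binom(m1, x) * binom(m2, n - x) * omega**x.
    terms = [comb(m1, x) * comb(m2, n - x) * omega ** x for x in xs]
    p0 = sum(terms)  # normalising constant P_0
    return xs, [t / p0 for t in terms]


def fnchypg_mean_var(n, m1, N, omega):
    """Return (mean, variance) computed as P_1/P_0 and P_2/P_0 - (P_1/P_0)**2."""
    xs, probs = fnchypg_table(n, m1, N, omega)
    mean = sum(x * p for x, p in zip(xs, probs))
    second_moment = sum(x * x * p for x, p in zip(xs, probs))
    return mean, second_moment - mean ** 2


if __name__ == "__main__":
    print(fnchypg_table(5, 5, 9, 2.0))
    print(fnchypg_mean_var(5, 5, 9, 2.0))
```

Direct summation is adequate for small parameters. For large m1 or extreme ω, the powers of ω can overflow floating point, which is one reason to prefer a recursive evaluation.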

An alternative expression of the distribution has both the number of balls taken of each color and the number of balls not taken as random variables, whereby the expression for the probability becomes symmetric.

The calculation time for the probability function can be high when the sum in P0 has many terms. The calculation time can be reduced by calculating the terms in the sum recursively relative to the term for y = x and ignoring negligible terms in the tails (Liao and Rosen, 2001).
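The idea of summing terms relative to the term at y = x can be sketched as follows. This is an illustration only, not the algorithm of Liao and Rosen (2001) itself; the function name, the tolerance eps and the stopping rule are assumptions made for the example.

```python
def fnchypg_pmf_recursive(x, n, m1, N, omega, eps=1e-16):
    """P(X = x), with every term of P_0 expressed relative to the term at y = x.

    The term at y = x is scaled to 1 and its neighbours follow from the
    ratio of consecutive terms. A tail is cut once its terms are decreasing
    and smaller than eps times the running sum (a simple rule chosen here).
    """
    m2 = N - m1
    x_min, x_max = max(0, n - m2), min(n, m1)
    if not x_min <= x <= x_max:
        return 0.0

    def ratio(y):
        # term(y) / term(y - 1), valid for x_min < y <= x_max
        return (m1 - y + 1) * (n - y + 1) / (y * (m2 - n + y)) * omega

    total = 1.0
    term = 1.0
    for y in range(x + 1, x_max + 1):   # upper tail
        r = ratio(y)
        term *= r
        total += term
        if r < 1.0 and term < eps * total:
            break
    term = 1.0
    for y in range(x, x_min, -1):       # lower tail: term(y - 1) = term(y) / ratio(y)
        r = ratio(y)
        term /= r
        total += term
        if r > 1.0 and term < eps * total:
            break
    # The scaled term at x is 1, so the probability is 1 / (scaled P_0).
    return 1.0 / total


if __name__ == "__main__":
    print(fnchypg_pmf_recursive(3, 5, 5, 9, 2.0))
```

No binomial coefficient or power of ω is ever formed explicitly, so the intermediate values stay moderate whenever x is not far out in a tail.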

The mean can be approximated by:

\mu \approx \frac{-2c}{b - \sqrt{b^2-4ac}} \, ,

where a=\omega-1, b=m_1 + n - N -(m_1+n)\omega, c=m_1 n \omega.
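A short derivation may make the coefficients easier to remember. As a sketch, assume that the approximate mean is defined by requiring the 2 × 2 table of expected counts (taken and not taken, red and white) to have odds ratio ω. This appears to be the approximation that Levin (1984) calls Cornfield's, although that attribution is an assumption here:

\frac{\mu\,(\mu + m_2 - n)}{(m_1-\mu)(n-\mu)} = \omega
\quad\Longrightarrow\quad
(\omega-1)\,\mu^2 + \bigl(m_1 + n - N - (m_1+n)\omega\bigr)\,\mu + m_1 n \omega = 0 \, ,

which is a\mu^2 + b\mu + c = 0 with the coefficients given above, using N = m_1 + m_2. Multiplying the numerator and denominator of the root by -b+\sqrt{b^2-4ac} gives the stated form:

\mu = \frac{-b-\sqrt{b^2-4ac}}{2a} = \frac{-2c}{b - \sqrt{b^2-4ac}} \, .

The second form remains usable when \omega = 1, where a = 0 and b = -N, and it then reduces to the mean m_1 n / N of the central hypergeometric distribution.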

The variance can be approximated by:

\sigma^2 \approx \frac{N}{N-1} \bigg/ \left( \frac{1}{\mu}+ \frac{1}{m_1-\mu}+ \frac{1}{n-\mu}+ \frac{1}{\mu+m_2-n} \right) .
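Both approximations are straightforward to evaluate. The following is a minimal sketch with hypothetical function names; it assumes n > 0, m1 > 0 and an approximate mean strictly inside the support, since otherwise one of the reciprocals in the variance formula is undefined.

```python
from math import sqrt


def approx_mean(n, m1, N, omega):
    """Approximate mean from the quadratic with coefficients a, b, c."""
    a = omega - 1.0
    b = m1 + n - N - (m1 + n) * omega
    c = m1 * n * omega
    return -2.0 * c / (b - sqrt(b * b - 4.0 * a * c))


def approx_variance(n, m1, N, omega):
    """Approximate variance built from the approximate mean."""
    m2 = N - m1
    mu = approx_mean(n, m1, N, omega)
    # Sum of reciprocals of the four approximate cell counts.
    inv = 1.0 / mu + 1.0 / (m1 - mu) + 1.0 / (n - mu) + 1.0 / (mu + m2 - n)
    return N / (N - 1) / inv


if __name__ == "__main__":
    print(approx_mean(100, 80, 140, 10.0), approx_variance(100, 80, 140, 10.0))
```

For ω = 1 both expressions reduce to the exact mean and variance of the central hypergeometric distribution, which gives a quick sanity check.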

Better approximations to the mean and variance are given by Levin (1984, 1990), McCullagh and Nelder (1989), Liao (1992), and Eisinga and Pelzer (2011). The saddlepoint methods for approximating the mean and the variance suggested by Eisinga and Pelzer (2011) offer extremely accurate results.

Properties

The following symmetry relations apply:

\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(n-x;n,m_2,N,1/\omega)\,.
\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(x;m_1,n,N,\omega)\,.
\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(m_1-x;N-n,m_1,N,1/\omega)\,.

Recurrence relation:

\operatorname{fnchypg}(x;n,m_1,N,\omega) = \operatorname{fnchypg}(x-1;n,m_1,N,\omega) \frac{(m_1-x+1)(n-x+1)}{x(m_2-n+x)}\omega\,.
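The symmetry and recurrence relations can be checked numerically. The sketch below uses a direct-summation probability function with the same argument order as the fnchypg notation. The helper names are hypothetical, and the recurrence check assumes that x lies strictly above the lower end of the support.

```python
from math import comb, isclose


def fnchypg(x, n, m1, N, omega):
    """Probability function by direct summation (0.0 outside the support)."""
    m2 = N - m1
    lo, hi = max(0, n - m2), min(n, m1)
    if not lo <= x <= hi:
        return 0.0
    p0 = sum(comb(m1, y) * comb(m2, n - y) * omega ** y for y in range(lo, hi + 1))
    return comb(m1, x) * comb(m2, n - x) * omega ** x / p0


def check_properties(x, n, m1, N, omega):
    """True if the three symmetries and the recurrence hold at x (x > x_min)."""
    m2 = N - m1
    f = fnchypg(x, n, m1, N, omega)
    swap_colors = fnchypg(n - x, n, m2, N, 1 / omega)       # count white instead of red
    swap_roles = fnchypg(x, m1, n, N, omega)                # exchange n and m1
    not_taken = fnchypg(m1 - x, N - n, m1, N, 1 / omega)    # count the balls left behind
    step = (fnchypg(x - 1, n, m1, N, omega)
            * (m1 - x + 1) * (n - x + 1) / (x * (m2 - n + x)) * omega)
    return all(isclose(f, v, rel_tol=1e-9)
               for v in (swap_colors, swap_roles, not_taken, step))


if __name__ == "__main__":
    print(check_properties(2, 4, 5, 9, 2.5))
```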


Recurrence relation

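This section uses a different set of symbols. A Fisher hypergeometric distribution gives the distribution of the number of successes in n independent draws from a population of size n_{\text{tot}} containing n_{\text{succ}} successes, with odds ratio w. In the notation of the preceding sections, n_{\text{succ}} = m_1, n_{\text{tot}} = N and w = \omega. The probability function f satisfies the recurrence below; the initial value f(0) as written assumes x_\min = 0, that is, n \le n_{\text{tot}} - n_{\text{succ}}.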


\left\{ w\, f(x)\, (x-n)\, (n_{\text{succ}}-x) - (x+1)\, f(x+1)\, (n+n_{\text{succ}}-n_{\text{tot}}-x-1) = 0, \quad f(0)=\frac{1}{{}_2F_1(-n,-n_{\text{succ}};\,-n-n_{\text{succ}}+n_{\text{tot}}+1;\,w)} \right\}
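The initial value can be related to the normalizing constant P_0 of the probability function. As a sketch, write m_1 for the number of successes, m_2 for the number of non-successes and \omega for the odds ratio w, and assume n \le m_2 so that x = 0 lies in the support. With the Pochhammer symbol (q)_y = q(q+1)\cdots(q+y-1), each term of P_0 factors as

\binom{m_1}{y}\binom{m_2}{n-y} = \binom{m_2}{n}\,\frac{(-n)_y\,(-m_1)_y}{(m_2-n+1)_y\; y!} \, ,

so that

P_0 = \sum_{y} \binom{m_1}{y}\binom{m_2}{n-y}\,\omega^y = \binom{m_2}{n}\; {}_2F_1(-n,-m_1;\,m_2-n+1;\,\omega) \, ,
\qquad
f(0) = \frac{\binom{m_2}{n}}{P_0} = \frac{1}{{}_2F_1(-n,-m_1;\,m_2-n+1;\,\omega)} \, .

The hypergeometric series terminates because -n and -m_1 are non-positive integers.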

Multivariate distribution

Multivariate Fisher's noncentral hypergeometric distribution
Parameters: c \in \mathbb{N}; \mathbf{m}=(m_1,\ldots,m_c) \in \mathbb{N}^c; N = \sum_{i=1}^c m_i; n \in [0,N); \boldsymbol{\omega} = (\omega_1,\ldots,\omega_c) \in \mathbb{R}_+^c
Support: \mathrm{S} = \left\{ \mathbf{x} \in \mathbb{Z}_{0+}^c \, : \, \sum_{i=1}^{c} x_i = n \right\}
pmf: \frac{1}{P_0}\prod_{i=1}^{c} \binom{m_i}{x_i}\omega_i^{x_i}, where P_0 = \sum_{(y_1,\ldots,y_c)\in \mathrm{S}}\prod_{i=1}^{c} \binom{m_i}{y_i}\omega_i^{y_i}
Mean: the mean \mu_i of x_i can be approximated by \mu_i = \frac{m_i r \omega_i}{r \omega_i + 1}, where r is the unique positive solution to \sum_{i=1}^{c}\mu_i = n

The distribution can be expanded to any number of colors c of balls in the urn. The multivariate distribution is used when there are more than two colors.

The probability function and a simple approximation to the mean are given in the table above. Better approximations to the mean and variance are given by McCullagh and Nelder (1989).
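A minimal sketch of both quantities follows. The function names are hypothetical. The probability function is evaluated by brute-force enumeration of the support, which is only practical for small urns. The choice of bisection to solve for r is an assumption of this example, justified by the fact that the sum of the μ_i increases monotonically in r.

```python
from itertools import product
from math import comb, prod


def mfnchypg_table(n, m, omega):
    """Map every support point x (a tuple) to its probability, by enumeration."""
    weights = {}
    for x in product(*(range(min(mi, n) + 1) for mi in m)):
        if sum(x) == n:
            weights[x] = prod(comb(mi, xi) * wi ** xi
                              for mi, xi, wi in zip(m, x, omega))
    p0 = sum(weights.values())  # normalising constant P_0
    return {x: w / p0 for x, w in weights.items()}


def approx_means(n, m, omega, iterations=200):
    """Approximate means mu_i, with r found by bisection on sum(mu_i) = n."""
    if not 0 <= n < sum(m):
        raise ValueError("need 0 <= n < N")

    def mu(r):
        return [mi * r * wi / (r * wi + 1.0) for mi, wi in zip(m, omega)]

    lo, hi = 0.0, 1.0
    while sum(mu(hi)) < n:      # grow the bracket until it contains the root
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(mu(mid)) < n:
            lo = mid
        else:
            hi = mid
    return mu(0.5 * (lo + hi))


if __name__ == "__main__":
    table = mfnchypg_table(4, (3, 4, 2), (2.0, 1.0, 0.5))
    exact = [sum(x[i] * p for x, p in table.items()) for i in range(3)]
    print(exact, approx_means(4, (3, 4, 2), (2.0, 1.0, 0.5)))
```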

Properties

The order of the colors is arbitrary so that any colors can be swapped.

The weights can be arbitrarily scaled:

\operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m}, \boldsymbol{\omega}) = \operatorname{mfnchypg}(\mathbf{x};n,\mathbf{m}, r\boldsymbol{\omega})\,\, for all r \in \mathbb{R}_+.

Colors with zero number (mi = 0) or zero weight (ωi = 0) can be omitted from the equations.

Colors with the same weight can be joined:


\begin{align}
& {} \operatorname{mfnchypg}\left(\mathbf{x};n,\mathbf{m}, (\omega_1,\ldots,\omega_{c-1},\omega_{c-1})\right) \\
& {} = \operatorname{mfnchypg}\left((x_1,\ldots,x_{c-1}+x_c); n,(m_1,\ldots,m_{c-1}+m_c), (\omega_1,\ldots,\omega_{c-1})\right)\, \cdot \\
& \qquad \operatorname{hypg}(x_c; x_{c-1}+x_c, m_c, m_{c-1}+m_c)
\end{align}

where \operatorname{hypg}(x;n,m,N) is the (univariate, central) hypergeometric distribution probability.
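The joining rule can be verified on a small example. The sketch below uses hypothetical helper names, evaluates the multivariate probability by enumeration, and assumes that the last two colors carry the same weight.

```python
from itertools import product
from math import comb, isclose, prod


def mfnchypg(x, n, m, omega):
    """Multivariate probability of x by brute-force enumeration of the support."""
    def weight(y):
        return prod(comb(mi, yi) * wi ** yi for mi, yi, wi in zip(m, y, omega))

    if sum(x) != n:
        return 0.0
    p0 = sum(weight(y)
             for y in product(*(range(mi + 1) for mi in m))
             if sum(y) == n)
    return weight(x) / p0


def hypg(x, n, m, N):
    """Central hypergeometric probability of x items of one kind among n drawn."""
    return comb(m, x) * comb(N - m, n - x) / comb(N, n)


def joined(x, n, m, omega):
    """Right-hand side of the joining rule; the last two weights must be equal."""
    *x_head, x_a, x_b = x
    *m_head, m_a, m_b = m
    merged = mfnchypg((*x_head, x_a + x_b), n, (*m_head, m_a + m_b), omega[:-1])
    return merged * hypg(x_b, x_a + x_b, m_b, m_a + m_b)


if __name__ == "__main__":
    args = ((1, 2, 1), 4, (3, 4, 2), (2.0, 0.5, 0.5))
    print(isclose(mfnchypg(*args), joined(*args)))
```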


Applications

Fisher's noncentral hypergeometric distribution is useful for models of biased sampling or biased selection where the individual items are sampled independently of each other with no competition. The bias or odds can be estimated from an experimental value of the mean. Use Wallenius' noncentral hypergeometric distribution instead if items are sampled one by one with competition.
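One way to estimate the odds from an observed mean is to invert the exact mean numerically. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the search bracket for ω is arbitrary, and bisection is used because the exact mean increases monotonically with ω. It relies on direct summation and is therefore only suitable for small parameters.

```python
from math import comb, exp, log


def exact_mean(n, m1, N, omega):
    """Exact mean P_1 / P_0 by direct summation."""
    m2 = N - m1
    xs = range(max(0, n - m2), min(n, m1) + 1)
    terms = [comb(m1, x) * comb(m2, n - x) * omega ** x for x in xs]
    return sum(x * t for x, t in zip(xs, terms)) / sum(terms)


def odds_from_mean(mean, n, m1, N, lo=1e-6, hi=1e6, iterations=100):
    """Find omega with exact_mean(omega) == mean by bisection on log(omega).

    The bracket [lo, hi] is an assumption of this sketch; `mean` must lie
    strictly inside the support for a solution to exist.
    """
    a, b = log(lo), log(hi)
    for _ in range(iterations):
        mid = 0.5 * (a + b)
        if exact_mean(n, m1, N, exp(mid)) < mean:
            a = mid
        else:
            b = mid
    return exp(0.5 * (a + b))


if __name__ == "__main__":
    print(odds_from_mean(3.2, 5, 5, 9))
```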

Fisher's noncentral hypergeometric distribution is used mostly for tests in contingency tables where a conditional distribution for fixed margins is desired. This can be useful, for example, for testing or measuring the effect of a medicine. See McCullagh and Nelder (1989).
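As a sketch of the contingency-table use, the following computes a one-sided conditional p-value for a 2 × 2 table with fixed margins. The function name and the mapping of the table onto (m1, m2, n) are conventions chosen for this example: rows play the role of colors and the first column counts the items taken. With a null odds ratio of 1 the calculation coincides with the one-sided Fisher's exact test.

```python
from math import comb


def conditional_p_value(table, omega0=1.0):
    """One-sided P(X >= a) for the table [[a, b], [c, d]] under odds ratio omega0."""
    (a, b), (c, d) = table
    m1, m2, n = a + b, c + d, a + c   # row totals and first-column total
    xs = range(max(0, n - m2), min(n, m1) + 1)
    terms = {x: comb(m1, x) * comb(m2, n - x) * omega0 ** x for x in xs}
    upper_tail = sum(t for x, t in terms.items() if x >= a)
    return upper_tail / sum(terms.values())


if __name__ == "__main__":
    # Hypothetical data: 3 of 4 treated and 1 of 4 untreated subjects respond.
    print(conditional_p_value([[3, 1], [1, 3]]))
```

Passing a value of omega0 other than 1 tests a non-null odds ratio against the same fixed margins.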

References

Breslow, N. E.; Day, N. E. (1980), Statistical Methods in Cancer Research, Lyon: International Agency for Research on Cancer.

Eisinga, R.; Pelzer, B. (2011), Saddlepoint approximations to the mean and variance of the extended hypergeometric distribution, Statistica Neerlandica 65 (1): 22–31, doi:10.1111/j.1467-9574.2010.00468.x.

Fog, A. (2007), Random number theory.

Fog, A. (2008), Sampling Methods for Wallenius' and Fisher's Noncentral Hypergeometric Distributions, Communications in Statistics, Simulation and Computation 37 (2): 241–257, doi:10.1080/03610910701790236.

Johnson, N. L.; Kemp, A. W.; Kotz, S. (2005), Univariate Discrete Distributions, Hoboken, New Jersey: Wiley and Sons.

Levin, B. (1984), Simple Improvements on Cornfield's approximation to the mean of a noncentral Hypergeometric random variable, Biometrika 71 (3): 630–632, doi:10.1093/biomet/71.3.630.

Levin, B. (1990), The saddlepoint correction in conditional logistic likelihood analysis, Biometrika 77 (2): 275–285, JSTOR 2336805.

Liao, J. (1992), An Algorithm for the Mean and Variance of the Noncentral Hypergeometric Distribution, Biometrics 48 (3): 889–892, doi:10.2307/2532354, JSTOR 2532354.

Liao, J. G.; Rosen, O. (2001), Fast and Stable Algorithms for Computing and Sampling from the Noncentral Hypergeometric Distribution, The American Statistician 55 (4): 366–369, doi:10.1198/000313001753272547.

McCullagh, P.; Nelder, J. A. (1989), Generalized Linear Models, 2nd ed., London: Chapman and Hall.