= Qualitative variation =

An index of qualitative variation (IQV) is a measure of statistical dispersion in nominal distributions. Examples include the variation ratio or the information entropy.

==Properties==

There are several types of indices used for the analysis of nominal data. Several are standard statistics that are used elsewhere - range, standard deviation, variance, mean deviation, coefficient of variation, median absolute deviation, interquartile range and quartile deviation.

In addition to these, several statistics have been developed with nominal data in mind. A number have been summarized and devised by Wilcox, who requires the following standardization properties to be satisfied:
- Variation varies between 0 and 1.
- Variation is 0 if and only if all cases belong to a single category.
- Variation is 1 if and only if cases are evenly divided across all categories.

In particular, the value of these standardized indices does not depend on the number of categories or number of samples.

For any index, the closer to uniform the distribution, the larger the variation, and the larger the differences in frequencies across categories, the smaller the variation.

Indices of qualitative variation are then analogous to information entropy, which is minimized when all cases belong to a single category and maximized in a uniform distribution. Indeed, information entropy can be used as an index of qualitative variation.

One characterization of a particular index of qualitative variation (IQV) is as a ratio of observed differences to maximum differences.

==Wilcox's indexes==

Wilcox gives a number of formulae for various indices of QV. The first, which he designates DM for "Deviation from the Mode", is a standardized form of the variation ratio, and is analogous to variance as deviation from the mean.

===ModVR===

The formula for the variation around the mode (ModVR) is derived as follows:

$M = \sum_{i = 1}^K ( f_m - f_i )$

where f_{m} is the modal frequency, K is the number of categories and f_{i} is the frequency of the i^{th} group.

This can be simplified to

$M = Kf_m - N$

where N is the total size of the sample.

Freeman's index (or variation ratio) is

 $v = 1 - \frac{ f_m }{ N }$

This is related to M as follows:

$\frac{ ( \frac{ f_m }{ N } ) - \frac 1 K }{ \frac N K \frac{ ( K - 1 )} N } = \frac M { N( K - 1 ) }$

The ModVR is defined as

$\operatorname{ModVR} = 1 - \frac{ Kf_m - N }{ N( K - 1 ) } = \frac{ K( N - f_m ) }{ N ( K - 1 ) } = \frac{ K v }{ K - 1 }$

where v is Freeman's index.

Low values of ModVR correspond to small amounts of variation and high values to larger amounts of variation.

When K is large, ModVR is approximately equal to Freeman's index v.
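As a rough illustration, the variation ratio and ModVR can be computed directly from a list of category frequencies. The function name `mod_vr` and the list-of-counts input are assumptions made for this sketch.

```python
def mod_vr(counts):
    """Return ModVR = K * v / (K - 1), where v = 1 - f_m / N."""
    k = len(counts)   # K: number of categories
    n = sum(counts)   # N: total sample size
    if k < 2 or n <= 0:
        raise ValueError("need at least two categories and a non-empty sample")
    f_m = max(counts)      # modal frequency
    v = 1 - f_m / n        # Freeman's index (variation ratio)
    return k * v / (k - 1)
```

Here `mod_vr([5, 5, 5, 5])` gives 1 (cases evenly divided) and `mod_vr([20, 0, 0, 0])` gives 0 (all cases in one category).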

===RanVR===

This is based on the range around the mode. It is defined to be

 $\operatorname{RanVR} = 1 - \frac{ f_m - f_l }{ f_m } = \frac{ f_l }{ f_m }$

where f_{m} is the modal frequency and f_{l} is the lowest frequency.

===AvDev===

This is an analog of the mean deviation. It is defined as the arithmetic mean of the absolute differences of each value from the mean.

 $\operatorname{AvDev} = 1 - \frac 1 {2N} \frac K {K - 1} \sum^K_{i = 1} \left| f_i - \frac N K \right|$

===MNDif===

This is an analog of the mean difference - the average of the differences of all the possible pairs of variate values, taken regardless of sign. The mean difference differs from the mean and standard deviation because it is dependent on the spread of the variate values among themselves and not on the deviations from some central value.

$\operatorname{MNDif} = 1 - \frac 1 { N( K - 1 ) } \sum_{i = 1}^{K - 1} \sum_{ j = i + 1 }^K | f_i - f_j |$

where f_{i} and f_{j} are the i^{th} and j^{th} frequencies respectively.

The MNDif is the Gini coefficient applied to qualitative data.
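The AvDev and MNDif formulas above translate almost line for line into code. The following is a minimal sketch; the function names `av_dev` and `mn_dif` are hypothetical, and the input is assumed to be a list of category frequencies with at least two categories.

```python
from itertools import combinations

def av_dev(counts):
    """AvDev = 1 - (1 / 2N) * (K / (K - 1)) * sum |f_i - N/K|."""
    k, n = len(counts), sum(counts)
    total = sum(abs(f - n / k) for f in counts)
    return 1 - (1 / (2 * n)) * (k / (k - 1)) * total

def mn_dif(counts):
    """MNDif = 1 - sum_{i<j} |f_i - f_j| / (N * (K - 1))."""
    k, n = len(counts), sum(counts)
    # combinations(counts, 2) yields each unordered pair (i < j) exactly once
    total = sum(abs(fi - fj) for fi, fj in combinations(counts, 2))
    return 1 - total / (n * (k - 1))
```

Both functions return 1 for evenly divided counts and 0 when every case falls in one category.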

===VarNC===

This is an analog of the variance.

$\operatorname{VarNC} = 1 - \frac 1 {N^2} \frac K {K - 1} \sum \left( f_i - \frac N K \right)^2$

It is the same index as Mueller and Schussler's Index of Qualitative Variation and Gibbs' M2 index.

It is distributed as a chi square variable with K – 1 degrees of freedom.

===StDev===

Wilcox has suggested two versions of this statistic.

The first is based on AvDev.

 $\operatorname{StDev}_1 = 1 - \sqrt{ \frac{ \sum_{ i = 1 }^K \left( f_i - \frac N K \right)^2 }{ \left( N - \frac N K \right)^2 + ( K - 1 ) \left( \frac N K \right)^2 } }$

The second is based on MNDif

 $\operatorname{StDev}_2 = 1 - \sqrt{ \frac{ \sum^{K - 1}_{i = 1} \sum^K_{j = i + 1 } ( f_i - f_j )^2 }{ N^2 ( K - 1 )} }$

===HRel===

This index was originally developed by Claude Shannon for use in specifying the properties of communication channels.

 $\operatorname{HRel} = \frac{ - \sum p_i \log_2 p_i }{ \log_2 K }$

where p_{i} = f_{i} / N.

This is equivalent to information entropy divided by $\log_2(K)$ and is useful for comparing relative variation between frequency tables of different sizes.
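A short sketch of the HRel computation, assuming the data arrive as a list of category frequencies (the name `h_rel` is hypothetical). Empty categories are skipped in the sum, following the usual convention that 0 log 0 is taken as 0, but they still count towards K.

```python
import math

def h_rel(counts):
    """Relative entropy: Shannon entropy (base 2) divided by log2(K)."""
    k, n = len(counts), sum(counts)
    if k < 2 or n <= 0:
        raise ValueError("need at least two categories and a non-empty sample")
    # p_i = f_i / N; categories with f_i = 0 contribute nothing
    h = -sum((f / n) * math.log2(f / n) for f in counts if f > 0)
    return h / math.log2(k)
```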

===B index===

Wilcox adapted a proposal of Kaiser based on the geometric mean and created the B index. The B index is defined as

 $B = 1 - \sqrt{ 1 - \left[ \sqrt[K] { \prod_{ i = 1 }^K \frac{ f_i K }{ N } } \, \right]^2 }$

===R packages===

Several of these indices have been implemented in the R language.

==Gibbs' indices and related formulae==

Gibbs and Poston proposed six indexes.

===M1===

The unstandardized index (M1) is

 $M1 = 1 - \sum_{ i = 1 }^K p_i^2$

where K is the number of categories and $p_i = f_i / N$ is the proportion of observations that fall in a given category i.

M1 can be interpreted as one minus the likelihood that a random pair of samples will belong to the same category, so this formula for IQV is a standardized likelihood of a random pair falling in the same category. This index has also been referred to as the index of differentiation, the index of sustenance differentiation and the geographical differentiation index, depending on the context in which it has been used.

===M2===

A second index, M2, is:

 $M2 = \frac{ K }{ K - 1 } \left( 1 - \sum_{ i = 1 }^K p_i^2 \right)$

where K is the number of categories and $p_i = f_i / N$ is the proportion of observations that fall in a given category i. The factor of $\frac{ K }{ K - 1 }$ is for standardization.

M1 and M2 can be interpreted in terms of variance of a multinomial distribution (there called an "expanded binomial model"). M1 is the variance of the multinomial distribution and M2 is the ratio of the variance of the multinomial distribution to the variance of a binomial distribution.
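The following sketch computes M1 and M2 from category frequencies and also evaluates Wilcox's VarNC formula, which the text identifies with M2. The function names are assumptions made for illustration.

```python
def m1(counts):
    """M1 = 1 - sum p_i^2, with p_i = f_i / N."""
    n = sum(counts)
    return 1 - sum((f / n) ** 2 for f in counts)

def m2(counts):
    """M2 = K / (K - 1) * (1 - sum p_i^2)."""
    k = len(counts)
    return k / (k - 1) * m1(counts)

def var_nc(counts):
    """VarNC = 1 - (1 / N^2) * (K / (K - 1)) * sum (f_i - N/K)^2."""
    k, n = len(counts), sum(counts)
    ss = sum((f - n / k) ** 2 for f in counts)
    return 1 - (1 / n ** 2) * (k / (k - 1)) * ss
```

For the counts `[3, 1]`, M1 is 0.375 and both `m2` and `var_nc` give 0.75, which is consistent with the two formulas being algebraically the same index.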

===M4===

The M4 index is

 $M4 = \frac{ \sum_{ i = 1 }^K | X_i - m | }{ 2 \sum_{ i = 1 }^K X_i }$

where m is the mean.

===M6===

The formula for M6 is

 $M6 = K \left[ 1 - \frac{ \sum_{ i = 1 }^K | X_i - m | }{ 2 N } \right]$

where K is the number of categories, X_{i} is the number of data points in the i^{th} category, N is the total number of data points, || is the absolute value (modulus) and

 $m = \frac{ \sum_{ i = 1 }^K X_i }{ K }$

This formula can be simplified

 $M6 = K\left[ 1 - \frac{ \sum_{ i = 1 }^K \left| p_i - \frac 1 K \right| } 2 \right]$

where p_{i} is the proportion of the sample in the i^{th} category.

In practice M1 and M6 tend to be highly correlated, which militates against their combined use.

===Related indices===

The sum

 $\sum_{ i = 1 }^K p_i^2$

has also found application. This is known as the Simpson index in ecology and as the Herfindahl index or the Herfindahl-Hirschman index (HHI) in economics. A variant of this is known as the Hunter–Gaston index in microbiology.

In linguistics and cryptanalysis this sum is known as the repeat rate. The index of coincidence (IC) is an unbiased estimator of this statistic:

 $\operatorname{IC} = \sum \frac{ f_i ( f_i - 1 ) }{ n ( n - 1 ) }$

where f_{i} is the count of the i^{th} grapheme in the text and n is the total number of graphemes in the text.
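A minimal sketch of the IC computation for a piece of text. Treating each alphabetic character as a grapheme and folding case are simplifying assumptions of this example, and `index_of_coincidence` is a hypothetical name.

```python
from collections import Counter

def index_of_coincidence(text):
    """IC = sum f_i (f_i - 1) / (n (n - 1)) over the graphemes of the text."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    n = len(letters)
    if n < 2:
        raise ValueError("need at least two graphemes")
    counts = Counter(letters)   # f_i for each distinct grapheme
    return sum(f * (f - 1) for f in counts.values()) / (n * (n - 1))
```

A text in which no grapheme repeats has IC equal to 0, and a text made of one repeated grapheme has IC equal to 1.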

;M1

The M1 statistic defined above has been proposed several times in a number of different settings under a variety of names. These include Gini's index of mutability, Simpson's measure of diversity, Bachi's index of linguistic homogeneity, Mueller and Schuessler's index of qualitative variation, Gibbs and Martin's index of industry diversification, Lieberson's index, and Blau's index in sociology, psychology and management studies. The formulations of all these indices are identical.

Simpson's D is defined as

 $D = 1 - \sum_{i = 1}^K { \frac{ n_i ( n_i - 1 ) }{ n( n - 1 ) } }$

where n is the total sample size and n_{i} is the number of items in the i^{th} category.

For large n we have

 $D \sim 1 - \sum_{ i = 1 }^K p_i^2$

Another statistic that has been proposed is the coefficient of unalikeability, which ranges between 0 and 1:

 $u = \frac{ \sum_{ i \neq j } c( x_i, x_j ) }{ n^2 - n }$

where n is the sample size and c(x_{i}, x_{j}) = 1 if x_{i} and x_{j} are unalike and 0 otherwise.

For large n we have

 $u \sim 1 - \sum_{ i = 1 }^K p_i^2$

where K is the number of categories.

Another related statistic is the quadratic entropy

 $H^2 = 2 \left( 1 - \sum_{ i = 1 }^K p_i^2 \right)$

which is itself related to the Gini index.

;M2

Greenberg's monolingual non weighted index of linguistic diversity is the M2 statistic defined above.

;M7

Another index – the M7 – was created based on the M4 index:

$M7 = \frac{ \sum_{ i = 1 }^K \sum_{ j = 1 }^L | R_{ ij } - R | }{ 2 \sum_{ i = 1 }^K \sum_{ j = 1 }^L R_{ ij } }$

where

$R_{ ij } = \frac{ O_{ ij } } { E_{ ij } } = \frac{ O_{ ij } }{ n_i p_j }$

and

$R = \frac{ \sum_{ i = 1 }^K \sum_{ j = 1 }^L R_{ ij } }{ \sum_{ i = 1 }^K n_i }$

where K is the number of categories, L is the number of subtypes, O_{ij} and E_{ij} are the number observed and expected respectively of subtype j in the i^{th} category, n_{i} is the number in the i^{th} category and p_{j} is the proportion of subtype j in the complete sample.

Note: This index was designed to measure women's participation in the work place: the two subtypes it was developed for were male and female.

==Other single sample indices==

These indices are summary statistics of the variation within the sample.

===Berger–Parker index===

The Berger–Parker index, named after Wolfgang H. Berger and Frances Lawrence Parker, equals the maximum $p_i$ value in the dataset, i.e. the proportional abundance of the most abundant type. This corresponds to the weighted generalized mean of the $p_i$ values when q approaches infinity, and hence equals the inverse of true diversity of order infinity ($1/{}^\infty D$).

===Brillouin index of diversity===

This index is strictly applicable only to entire populations rather than to finite samples. It is defined as

 $I_B = \frac{ \log( N! ) - \sum_{ i = 1 }^K ( \log( n_i! ) ) }{ N }$

where N is the total number of individuals in the population, n_{i} is the number of individuals in the i^{th} category and N! is the factorial of N.

Brillouin's index of evenness is defined as

$E_B = I_B / I_{B( \max )}$

where I_{B(max)} is the maximum value of I_{B}.

===Hill's diversity numbers===

Hill suggested a family of diversity numbers

$N_a = \frac{1}{ \left[ \sum_{ i = 1 }^K p_i^a \right]^{ 1 / ( a - 1 ) } }$

For given values of a, several of the other indices can be computed

- a = 0: N_{a} = species richness
- a = 1: N_{a} = exponential of Shannon's index (taken as the limit a → 1)
- a = 2: N_{a} = 1/Simpson's index (without the small sample correction)
- a = ∞: N_{a} = 1/Berger–Parker index

Hill also suggested a family of evenness measures

 $E_{ a, b } = \frac{ N_a }{ N_b }$

where a > b.

Hill's E_{4} is

 $E_4 = \frac{ N_2 } { N_1 }$

Hill's E_{5} is

 $E_5 = \frac{ N_2 - 1 } { N_1 - 1 }$
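The sketch below evaluates Hill's numbers in the form $N_a = ( \sum p_i^a )^{ 1 / ( 1 - a ) }$, handling a = 1 and a = ∞ as limiting cases (the exponential of Shannon's entropy and the reciprocal of the largest proportion, respectively). The function names and the choice to pass proportions rather than counts are assumptions of this example.

```python
import math

def hill_number(p, a):
    """Hill's diversity number N_a for proportions p (summing to 1)."""
    p = [x for x in p if x > 0]   # absent categories do not contribute
    if a == 1:
        # limit a -> 1: exponential of Shannon's entropy
        return math.exp(-sum(x * math.log(x) for x in p))
    if math.isinf(a):
        # limit a -> infinity: reciprocal of the Berger-Parker index
        return 1 / max(p)
    return sum(x ** a for x in p) ** (1 / (1 - a))

def hill_evenness(p, a, b):
    """Hill's evenness ratio E_{a,b} = N_a / N_b."""
    return hill_number(p, a) / hill_number(p, b)
```

For a uniform distribution over K categories every N_{a} equals K, so every evenness ratio equals 1.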

===Margalef's index===

 $I_\text{Marg} = \frac{ S - 1 } { \log_e N}$

where S is the number of data types in the sample and N is the total size of the sample.

===Menhinick's index===

$I_\mathrm{Men} = \frac{ S }{ \sqrt{ N } }$

where S is the number of data types in the sample and N is the total size of the sample.

In linguistics this index is identical to the Kuraszkiewicz index (Guiard index), where S is the number of distinct words (types) and N is the total number of words (tokens) in the text being examined. This index can be derived as a special case of the Generalised Torquist function.

===Q statistic===

This is a statistic invented by Kempton and Taylor that involves the quartiles of the sample. It is defined as

$Q = \frac{ \frac{ 1 }{ 2 } ( n_{ R1 } + n_{ R2 } ) + \sum_{ j = R_1 + 1 }^{ R_2 - 1 } n_j } { \log( R_2 / R_1 ) }$

where R_{1} and R_{2} are the 25% and 75% quartiles respectively on the cumulative species curve, n_{j} is the number of species in the j^{th} category, and n_{Ri} is the number of species in the class where R_{i} falls (i = 1 or 2).

===Shannon–Wiener index===

This is taken from information theory

 $H = \log_e N - \frac{ 1 }{ N } \sum n_i \log_e( n_i ) = - \sum p_i \log_e( p_i )$

where N is the total number in the sample, n_{i} is the number in the i^{th} category and p_{i} = n_{i} / N is the proportion in the i^{th} category.

In ecology where this index is commonly used, H usually lies between 1.5 and 3.5 and only rarely exceeds 4.0.

An approximate formula for the standard deviation (SD) of H is

 $\operatorname{SD}( H ) = \sqrt{ \frac{ 1 }{ N } \left[ \sum p_i [ \log_e( p_i ) ]^2 - H^2 \right] }$

where p_{i} is the proportion made up by the i^{th} category and N is the total in the sample.

A more accurate approximate value of the variance of H, var(H), is given by

 $\operatorname{var}(H) = \frac{ \sum p_i [\log(p_i)]^2 - \left[ \sum p_i \log( p_i ) \right]^2 } N + \frac{K - 1}{2N^2} + \frac{ -1 + \sum p_i^2 - \sum p_i^{-1} \log(p_i) + \sum p_i^{-1} \sum p_i \log(p_i) }{6N^3}$

where N is the sample size and K is the number of categories.

A related index is the Pielou J defined as

$J = \frac{ H } {\log_e( S ) }$

One difficulty with this index is that S is unknown for a finite sample. In practice S is usually set to the maximum present in any category in the sample.
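As an illustration, the Shannon–Wiener index can be computed either from proportions, as $- \sum p_i \log_e p_i$, or from raw counts, as $\log_e N - \frac{1}{N} \sum n_i \log_e n_i$; the two forms agree. In this sketch the function names are hypothetical, and the default choice of S in `pielou_j` (the number of non-empty categories) is an assumption rather than a recommendation from the text.

```python
import math

def shannon(counts):
    """H = -sum p_i ln p_i, with p_i = n_i / N."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)

def shannon_from_counts(counts):
    """Equivalent form: H = ln N - (1/N) sum n_i ln n_i."""
    n = sum(counts)
    return math.log(n) - sum(c * math.log(c) for c in counts if c > 0) / n

def pielou_j(counts, s=None):
    """Pielou's J = H / ln S; S defaults to the number of non-empty categories."""
    if s is None:
        s = sum(1 for c in counts if c > 0)   # assumed default for S
    return shannon(counts) / math.log(s)
```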

===Rényi entropy===

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

${}^qH = \frac{ 1 }{ 1 - q } \; \ln\left ( \sum_{ i = 1 }^K p_i^q \right )$

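which equals

${}^qH = \ln\left ( { 1 \over \sqrt[ q - 1 ]{ \sum_{ i = 1 }^K p_i^q } } \right )$

For a pair of samples with totals X and Y, the entropies $H_\min$, $H_\max$ and $H_\mathrm{obs}$ are defined through the following quantities:

 $X = \sum x_{ ij }$

 $Y = \sum x_{ kj }$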

 $H( X ) = \sum \frac{ x_{ ij } }{ X } \log \frac{ X }{ x_{ ij } }$

 $H( Y ) = \sum \frac{ x_{ kj } }{ Y } \log \frac{ Y }{ x_{ kj } }$

 $H_\min = \frac{ X }{ X + Y } H( X ) + \frac{ Y }{ X + Y } H( Y )$

 $H_\max = \sum \left( \frac{ x_{ ij } }{ X + Y } \log \frac{ X + Y }{ x_{ ij } } + \frac{ x_{ kj }}{ X + Y } \log \frac{ X + Y }{ x_{ kj } } \right)$

 $H_\mathrm{obs} = \sum \frac{ x_{ ij } + x_{ kj } }{ X + Y } \log \frac{ X + Y }{ x_{ ij } + x_{ kj } }$

In these equations x_{ij} and x_{kj} are the number of times the j^{th} data type appears in the i^{th} or k^{th} sample respectively.

===Rarefaction index===

In a rarefied sample a random subsample of n items is chosen from the total N items. Some groups may be absent from this subsample. Let $X_n$ be the number of groups still present in the subsample of n items. $X_n$ is less than K, the number of categories, whenever at least one group is missing from this subsample.

The rarefaction curve, $f_n$, is defined as:

 $f_n = \operatorname E[ X_n ] = K - \binom{ N }{ n }^{ -1 } \sum_{ i = 1 }^K \binom{ N - N_i }{ n }$

where $N_i$ is the number of items in the i^{th} group. Note that 0 ≤ $f_n$ ≤ K.

Furthermore,

 $f_0 = 0,\ f_1 = 1,\ f_N = K.$

Despite being defined at discrete values of n, these curves are most frequently displayed as continuous functions.

This index is discussed further in Rarefaction (ecology).
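The expected number of groups in a subsample can be evaluated exactly with integer binomial coefficients. This is a minimal sketch: the name `rarefaction` and the argument `group_sizes` (the list of group sizes $N_i$, all assumed positive) are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def rarefaction(group_sizes, n):
    """Expected number of groups present in a random subsample of n items."""
    total = sum(group_sizes)   # N: total number of items
    k = len(group_sizes)       # K: number of groups
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and N")
    # comb(N - N_i, n) counts subsamples that miss group i entirely;
    # it is 0 whenever n > N - N_i
    missing = sum(comb(total - size, n) for size in group_sizes)
    return k - missing / comb(total, n)
```

With group sizes `[3, 2, 1]`, the function returns 0 for n = 0, 1 for n = 1 and 3 for n = 6, matching the boundary values of the curve.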

===Caswell's V===

This is a z type statistic based on Shannon's entropy.

 $V = \frac{ H - \operatorname E( H ) }{ \operatorname{SD}( H ) }$

where H is the Shannon entropy, E(H) is the expected Shannon entropy for a neutral model of distribution and SD(H) is the standard deviation of the entropy. The standard deviation is estimated from the formula derived by Pielou:

 $SD( H ) = \sqrt{ \frac{ 1 }{ N } \left[ \sum p_i [ \log_e( p_i ) ]^2 - H^2 \right] }$

where p_{i} is the proportion made up by the i^{th} category and N is the total in the sample.

===Lloyd & Ghelardi's index===

This is

 $I_{ LG } = \frac{ K }{ K' }$

where K is the number of categories and K' is the number of categories according to MacArthur's broken stick model yielding the observed diversity.

===Average taxonomic distinctness index===

This index is used to compare the relationship between hosts and their parasites. It incorporates information about the phylogenetic relationship amongst the host species.

 $S_{TD} = 2 \frac{ \sum \sum_{ i < j } \omega_{ ij } }{ s( s - 1 ) }$

where s is the number of host species used by a parasite and ω_{ij} is the taxonomic distinctness between host species i and j.

===Index of qualitative variation===

Several indices with this name have been proposed.

One of these is

 $IQV = \frac{ K ( 100^2 - \sum_{ i = 1 }^K p_i^2 ) }{ 100^2 ( K - 1 ) } = \frac{ K }{ K - 1 } ( 1 - \sum_{ i = 1 }^K ( p_i / 100 )^2 )$

where K is the number of categories and p_{i} is the percentage of the sample that lies in the i^{th} category.

===Theil's H===

This index is also known as the multigroup entropy index or the information theory index. It was proposed by Theil in 1972. The index is a weighted average of the samples' entropies.

Let

 $E_a = \sum_{i = 1}^a p_i \log ( p_i )$

and

$H = \sum_{ i = 1 }^r \frac{ n_i ( E - E_i ) }{ NE }$

where p_{i} is the proportion of type i in the a^{th} sample, r is the total number of samples, n_{i} is the size of the i^{th} sample, N is the size of the population from which the samples were obtained and E is the entropy of the population.

==Indices for comparison of two or more data types within a single sample==

Several of these indexes have been developed to document the degree to which different data types of interest may coexist within a geographic area.

===Index of dissimilarity===

Let A and B be two types of data item. Then the index of dissimilarity is

$D = \frac{ 1 }{ 2 } \sum_{ i = 1 }^K \left| \frac{ A_i }{ A } - \frac{ B_i }{ B } \right|$

where

$A = \sum_{ i = 1 }^K A_i$

$B = \sum_{ i = 1 }^K B_i$

A_{i} is the number of data type A at sample site i, B_{i} is the number of data type B at sample site i, K is the number of sites sampled and || is the absolute value.

This index, usually denoted D, is closely related to the Gini index.

This index is biased, as its expectation under a uniform distribution is greater than 0.

A modification of this index has been proposed by Gorard and Taylor. Their index (GT) is

 $GT = D \left( 1 - \frac{ A }{ A + B } \right)$
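A brief sketch of the index of dissimilarity and the Gorard and Taylor modification, assuming two equal-length lists holding the counts of types A and B at each site. The function names are hypothetical.

```python
def dissimilarity(a, b):
    """D = 1/2 * sum |A_i / A - B_i / B| over the K sites."""
    total_a, total_b = sum(a), sum(b)
    return 0.5 * sum(abs(ai / total_a - bi / total_b) for ai, bi in zip(a, b))

def gorard_taylor(a, b):
    """GT = D * (1 - A / (A + B))."""
    total_a, total_b = sum(a), sum(b)
    return dissimilarity(a, b) * (1 - total_a / (total_a + total_b))
```

When the two types never share a site, D is 1; when they are spread in the same proportions across sites, D is 0.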

===Index of segregation===

The index of segregation (IS) is

$IS = \frac{ 1 }{ 2 }\sum_{ i = 1 }^K \left| \frac{ A_i }{ A } - \frac{ t_i - A_i }{ T - A } \right|$

where

$A = \sum_{ i = 1 }^K A_i$

$T = \sum_{ i = 1 }^K t_i$

and K is the number of units, while A_{i} and t_{i} are the number of data type A in unit i and the total number of all data types in unit i, respectively.

===Hutchen's square root index===

This index (H) is defined as

 $H = 1 - \sum_{ i = 1}^K \sum_{ j = 1 }^i \sqrt{ p_i p_j }$

where p_{i} is the proportion of the sample composed of the i^{th} variate.

===Lieberson's isolation index===

This index ( L_{xy} ) was invented by Lieberson in 1981.

 $L_{xy} = \frac 1 N \sum_{i = 1}^K \frac{ X_i Y_i }{ X_\mathrm{tot} }$

where X_{i} and Y_{i} are the variables of interest at the i^{th} site, K is the number of sites examined and X_{tot} is the total number of variate of type X in the study.

===Bell's index===

This index is defined as

 $I_R = \frac{ p_{ xx } - p_x } { 1 - p_x }$

where p_{x} is the proportion of the sample made up of variates of type X and

 $p_{xx} = \frac{ \sum_{ i = 1 }^K x_i p_i }{ N_x }$

where N_{x} is the total number of variates of type X in the study, K is the number of samples in the study and x_{i} and p_{i} are the number of variates and the proportion of variates of type X respectively in the i^{th} sample.

===Index of isolation===

The index of isolation is

$II = \sum_{ i = 1 }^K \frac{ A_i }{ A } \frac{ A_i }{ t_i }$

where K is the number of units in the study, and A_{i} and t_{i} are the number of units of type A and the number of all units in the i^{th} sample, respectively.

A modified index of isolation has also been proposed

$MII = \frac{ II - \frac{ A }{ T } }{ 1 - \frac{ A }{ T } }$

The MII lies between 0 and 1.
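The index of isolation and its modified form can be sketched as follows, taking A as the sum of the A_{i} and T as the sum of the t_{i}. The function names are assumptions, and skipping units with t_{i} = 0 is a choice made here to avoid division by zero.

```python
def isolation_index(a, t):
    """II = sum (A_i / A) * (A_i / t_i) over the K units."""
    total_a = sum(a)
    # units with no members at all (t_i = 0) are skipped
    return sum((ai / total_a) * (ai / ti) for ai, ti in zip(a, t) if ti > 0)

def modified_isolation_index(a, t):
    """MII = (II - A/T) / (1 - A/T)."""
    share = sum(a) / sum(t)   # A / T: overall share of type A
    return (isolation_index(a, t) - share) / (1 - share)
```

If type A makes up the same share of every unit, II equals A/T and MII is 0; if type A occupies units on its own, II and MII are both 1.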

===Gorard's index of segregation===

This index (GS) is defined as

$GS = \frac 1 2 \sum_{i = 1}^K \left| \frac{A_i} A - \frac{t_i} T \right|$

where

$A = \sum_{ i = 1 }^K A_i$

$T = \sum_{ i = 1 }^K t_i$

and A_{i} and t_{i} are the number of data items of type A and the total number of items in the i^{th} sample.

===Index of exposure===

This index is defined as

 $IE = \sum_{ i = 1 }^K \frac{ A_i }{ A } \frac{ B_i }{ t_i }$

where

$A = \sum_{ i = 1 }^K A_i$

and A_{i} and B_{i} are the number of types A and B in the i^{th} category and t_{i} is the total number of data points in the i^{th} category.

===Ochiai index===

This is a binary form of the cosine index. It is used to compare presence/absence data of two data types (here A and B). It is defined as

 $O = \frac{ a }{ \sqrt{ ( a + b )( a + c ) } }$

where a is the number of sample units where both A and B are found, b is the number of sample units where A but not B occurs and c is the number of sample units where type B is present but not type A.

===Kulczyński's coefficient===

This coefficient was invented by Stanisław Kulczyński in 1927 and is an index of association between two types (here A and B). It varies in value between 0 and 1. It is defined as

 $K = \frac{ a }{ 2 } \left( \frac{ 1 }{ a + b } + \frac{ 1 }{ a + c } \right)$

where a is the number of sample units where type A and type B are present, b is the number of sample units where type A but not type B is present and c is the number of sample units where type B is present but not type A.

===Yule's Q===

This index was invented by Yule in 1900. It concerns the association of two different types (here A and B). It is defined as

 $Q = \frac{ ad - bc }{ ad + bc }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. Q varies in value between -1 and +1. In the ordinal case Q is known as the Goodman-Kruskal γ.

Because the denominator potentially may be zero, Leinhert and Sporer have recommended adding +1 to a, b, c and d.

===Yule's Y===

This index is defined as

 $Y = \frac{ \sqrt{ ad } - \sqrt{ bc } }{ \sqrt{ ad } + \sqrt{ bc } }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
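Yule's Q and Y can both be computed from the four cell counts of the presence/absence table. In this sketch the function names and the `add_one` flag are hypothetical; the flag applies the suggestion, mentioned above for Q, of adding 1 to each of a, b, c and d so that the denominator cannot be zero.

```python
import math

def yule_q(a, b, c, d, add_one=False):
    """Yule's Q = (ad - bc) / (ad + bc)."""
    if add_one:
        a, b, c, d = a + 1, b + 1, c + 1, d + 1
    return (a * d - b * c) / (a * d + b * c)

def yule_y(a, b, c, d):
    """Yule's Y = (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))."""
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return (root_ad - root_bc) / (root_ad + root_bc)
```

For a = 3, b = 1, c = 1, d = 3 this gives Q = 0.8 and Y = 0.5.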

===Baroni–Urbani–Buser coefficient===

This index was invented by Baroni-Urbani and Buser in 1976. It varies between 0 and 1 in value. It is defined as

$BUB = \frac{ \sqrt{ ad } + a }{ \sqrt{ ad } + a + b + c } = \frac{ \sqrt{ ad } + a }{ N + \sqrt{ ad } - d } = 1 - \frac{ N - ( a + d ) }{ N + \sqrt{ ad } - d }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

When d = 0, this index is identical to the Jaccard index.
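The relationship with the Jaccard index is easy to check numerically. The sketch below uses the first form of the coefficient, $( \sqrt{ad} + a ) / ( \sqrt{ad} + a + b + c )$, and the usual Jaccard form a / (a + b + c); both function names are hypothetical.

```python
import math

def baroni_urbani_buser(a, b, c, d):
    """BUB = (sqrt(ad) + a) / (sqrt(ad) + a + b + c)."""
    root = math.sqrt(a * d)
    return (root + a) / (root + a + b + c)

def jaccard(a, b, c):
    """Jaccard index = a / (a + b + c)."""
    return a / (a + b + c)
```

With d = 0 the square-root term vanishes and the two functions return the same value.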

===Hamman coefficient===

This coefficient is defined as

 $H = \frac{ ( a + d ) - ( b + c ) }{ a + b + c + d } = \frac{ ( a + d ) - ( b + c ) }{ N }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Rogers–Tanimoto coefficient===

This coefficient is defined as

 $RT = \frac{ a + d }{ a + 2( b + c ) + d } = \frac{ a + d }{ N + b + c }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Sokal–Sneath coefficient===

This coefficient is defined as

 $SS = \frac{ 2( a + d ) }{ 2( a + d ) + b + c } = \frac{ 2( a + d ) }{ N + a + d }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Sokal's binary distance===

This coefficient is defined as

 $SBD = \sqrt{ \frac{ b + c }{ a + b + c + d } } = \sqrt{ \frac{ b + c }{ N } }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Russel–Rao coefficient===

This coefficient is defined as

 $RR = \frac{ a }{ a + b + c + d } = \frac{ a }{ N }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Phi coefficient===

This coefficient is defined as

 $\varphi = \frac{ ad - bc }{ \sqrt{ ( a + b ) ( a + c ) ( b + d ) ( c + d ) } }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.
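A minimal sketch of the phi coefficient, written with the product of the four marginal totals (a + b), (a + c), (b + d) and (c + d) under the square root. The function name `phi` is an assumption, and the sketch does not guard against a zero marginal total.

```python
import math

def phi(a, b, c, d):
    """Phi coefficient of a 2x2 presence/absence table."""
    margins = (a + b) * (a + c) * (b + d) * (c + d)
    return (a * d - b * c) / math.sqrt(margins)
```

Perfect agreement (b = c = 0) gives 1, and perfect disagreement (a = d = 0) gives -1.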

===Soergel's coefficient===

This coefficient is defined as

 $S = \frac{ b + c }{ b + c + d } = \frac{ b + c }{ N - a }$

where b is the number of samples where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Simpson's coefficient===

This coefficient is defined as

 $S = \frac{ a }{ a + \min( b, c ) }$

where a is the number of samples where types A and B are both present, b is the number of samples where type A is present but not type B and c is the number of samples where type B is present but not type A.

===Dennis' coefficient===

This coefficient is defined as

 $D = \frac{ ad - bc }{ \sqrt{ ( a + b + c + d ) ( a + b ) ( a + c ) } } = \frac{ ad - bc }{ \sqrt{ N ( a + b ) ( a + c ) } }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Forbes' coefficient===

This coefficient was proposed by Stephen Alfred Forbes in 1907. It is defined as

 $F = \frac{ a N }{ ( a + b ) ( a + c ) }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size (N = a + b + c + d).

A modification of this coefficient which does not require the knowledge of d has been proposed by Alroy

 $F_{ A } = \frac{ a ( n + \sqrt{ n } ) } { a ( n + \sqrt{ n } ) + \frac{ 3 }{ 2 } bc } = 1 - \frac{ 3 bc } { 2 a ( n + \sqrt{ n } ) + 3 bc }$

where n = a + b + c.

===Simple match coefficient===

This coefficient is defined as

 $SM = \frac{ a + d }{ a + b + c + d } = \frac{ a + d }{ N }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Fossum's coefficient===

This coefficient is defined as

 $F = \frac{ ( a + b + c + d ) ( a - 0.5 )^2 }{ ( a + b ) ( a + c ) } = \frac{ N ( a - 0.5 )^2 }{ ( a + b ) ( a + c ) }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Stile's coefficient===

This coefficient is defined as

 $S = \log \left[ \frac{ n ( | ad - bc | - \frac{ n }{ 2 } )^2 }{ ( a + b ) ( a + c ) ( b + d )( c + d ) } \right]$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A, d is the sample count where neither type A nor type B are present, n equals a + b + c + d and || is the modulus (absolute value) of the difference.

===Michael's coefficient===

This coefficient is defined as

 $M = \frac{ 4 ( ad - bc ) }{ ( a + d )^2 + ( b + c )^2 }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

===Peirce's coefficient===

In 1884 Charles Peirce suggested the following coefficient

 $P = \frac{ ab + bc }{ ab + 2bc + cd }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present.

===Hawkin–Dotson coefficient===

In 1975 Hawkin and Dotson proposed the following coefficient

 $HD = \frac{ 1 }{ 2 } \left( \frac{ a }{ a + b + c } + \frac{ d }{ b + c + d } \right) = \frac{ 1 }{ 2 } \left( \frac{ a }{ N - d } + \frac{ d }{ N - a } \right)$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Benini coefficient===

In 1901 Benini proposed the following coefficient

 $B = \frac{ a - ( a + b )( a + c ) } { a + \min( b, c ) - ( a + b )( a + c ) }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A. Min(b, c) is the minimum of b and c.

===Gilbert coefficient===

Gilbert proposed the following coefficient

 $G = \frac{ a - ( a + b )( a + c ) } { a + b + c - ( a + b )( a + c ) } = \frac{ a - ( a + b )( a + c ) } { N - ( a + b )( a + c ) - d }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the sample count where neither type A nor type B are present. N is the sample size.

===Gini index===

The Gini index is

 $G = \frac{ a - ( a + b )( a + c ) } { \sqrt{ ( 1 - ( a + b )^2 ) ( 1 - ( a + c )^2 ) } }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A.

===Modified Gini index===

The modified Gini index is

 $G_M = \frac{ a - ( a + b )( a + c ) } { 1 - \frac{ | b - c | }{ 2 } - ( a + b )( a + c ) }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B and c is the number of samples where type B is present but not type A.

===Kuhn's index===

Kuhn proposed the following coefficient in 1965

 $I = \frac{ 2 ( ad - bc ) }{ K ( 2a + b + c ) } = \frac{ 2 ( ad - bc ) }{ K ( N + a - d ) }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where neither type A nor type B is present. K is a normalizing parameter. N is the sample size.

This index is also known as the coefficient of arithmetic means.

===Eyraud index===

Eyraud proposed the following coefficient in 1936

 $I = \frac{ a - ( a + b )( a + c ) }{ ( a + c )( a + d )( b + d )( c + d ) }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present.

===Soergel distance===

This is defined as

 $\operatorname{SD} = \frac{ b + c }{ b + c + d } = \frac{ b + c }{ N - a }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present. N is the sample size.

===Tanimoto index===

This is defined as

 $TI = 1 - \frac{ a }{ b + c + d } = 1 - \frac{ a }{ N - a }= \frac{ N - 2a }{ N - a }$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A and d is the number of samples where both A and B are not present. N is the sample size.

===Piatetsky–Shapiro's index===

This is defined as

 $PSI = a - bc$

where a is the number of samples where types A and B are both present, b is where type A is present but not type B, c is the number of samples where type B is present but not type A.

==Indices for comparison between two or more samples==

===Czekanowski's quantitative index===

This is also known as the Bray–Curtis index, Schoener's index, least common percentage index, index of affinity or proportional similarity. It is related to the Sørensen similarity index.

 $CZI = \frac{ \sum \min( x_i, x_j ) }{ \sum ( x_i + x_j ) }$

where x_{i} and x_{j} are the number of species in sites i and j respectively and the minimum is taken over the number of species in common between the two sites.

===Canberra metric===

The Canberra distance is a weighted version of the L_{1} metric. It was introduced in 1966 and refined in 1967 by G. N. Lance and W. T. Williams. It is used to define a distance between two vectors – here two sites with K categories within each site.

The Canberra distance d between vectors p and q in a K-dimensional real vector space is

$d ( \mathbf{ p }, \mathbf{ q } ) = \sum_{ i = 1 }^K \frac{ |p_i - q_i |}{ | p_i| + |q_i | }$

where p_{i} and q_{i} are the values of the i^{th} category of the two vectors.
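
A short sketch of this sum in Python follows. The formula leaves a category that is zero in both vectors undefined (0/0); the code assumes such a term contributes nothing, which is a common convention but an assumption here. The function name and site vectors are illustrative.

```python
from typing import Sequence


def canberra_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Sum of |p_i - q_i| / (|p_i| + |q_i|) over all categories."""
    if len(p) != len(q):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        denom = abs(p_i) + abs(q_i)
        if denom > 0:  # assumption: a 0/0 term is skipped
            total += abs(p_i - q_i) / denom
    return total


# Two hypothetical sites with four categories each.
site_1 = [10, 0, 5, 0]
site_2 = [5, 5, 5, 0]
d = canberra_distance(site_1, site_2)  # terms: 1/3, 1, 0, skipped
```

Each term is bounded by 1, so a category present at only one site contributes as much as is possible regardless of its abundance.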

===Sorensen's coefficient of community===

This is used to measure similarities between communities.

 $CC = \frac{ 2c } { s_1 + s_2 }$

where s_{1} and s_{2} are the number of species in community 1 and 2 respectively and c is the number of species common to both areas.

===Jaccard's index===

This is a measure of the similarity between two samples:

 $J = \frac A { A + B + C }$

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.

This index was invented in 1902 by the Swiss botanist Paul Jaccard.

Under a random distribution the expected value of J is

 $J = \frac 1 A \left( \frac 1 { A + B + C } \right)$

The standard error of this index with the assumption of a random distribution is

$SE( J ) = \sqrt{ \frac{ A ( B + C ) } { N ( A + B + C )^3 } }$

where N is the total size of the sample.

===Dice's index===

This is a measure of the similarity between two samples:

 $D = \frac{ 2A }{ 2A + B + C }$

where A is the number of data points shared between the two samples and B and C are the data points found only in the first and second samples respectively.
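
For illustration, both Jaccard's and Dice's indices can be computed from two Python sets, with A, B and C obtained from set operations. This is a sketch; the function name and the example species lists are assumptions.

```python
from typing import Set, Tuple


def jaccard_and_dice(first: Set[str], second: Set[str]) -> Tuple[float, float]:
    """Return (J, D) with J = A/(A+B+C) and D = 2A/(2A+B+C)."""
    shared = len(first & second)        # A: items in both samples
    only_first = len(first - second)    # B: items only in the first sample
    only_second = len(second - first)   # C: items only in the second sample
    if shared + only_first + only_second == 0:
        raise ValueError("both samples are empty")
    jaccard = shared / (shared + only_first + only_second)
    dice = 2 * shared / (2 * shared + only_first + only_second)
    return jaccard, dice


j, d = jaccard_and_dice({"oak", "ash", "elm", "yew"}, {"oak", "ash", "fir"})
# Here A = 2, B = 2, C = 1, so J = 2/5 and D = 4/7.
```

The two indices are monotonically related by D = 2J / (1 + J), so they always rank pairs of samples in the same order.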

===Match coefficient===

This is a measure of the similarity between two samples:

 $M = \frac{ N - B - C }{ N } = 1 - \frac{ B + C }{ N }$

where N is the number of data points in the two samples and B and C are the data points found only in the first and second samples respectively.

===Morisita's index===

Masaaki Morisita's index of dispersion ( I_{m} ) is the scaled probability that two points chosen at random from the whole population are in the same sample. Higher values indicate a more clumped distribution.

 $I_m = \frac { n \sum x ( x - 1 ) } { n m ( n m - 1 ) }$

An alternative formulation is

 $I_m = n \frac{ \sum x^2 - \sum x } { \left( \sum x \right)^2 - \sum x }$

where n is the total sample size, m is the sample mean and x are the individual values with the sum taken over the whole sample. It is also equal to

 $I_m = \frac { n\ IMC } { nm - 1 }$

where IMC is Lloyd's index of crowding.

This index is relatively independent of the population density but is affected by the sample size.

Morisita showed that the statistic

 $I_m \left( \sum x - 1 \right) + n - \sum x$

is distributed as a chi-squared variable with n − 1 degrees of freedom.
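
The following sketch computes the index from the alternative formulation together with the chi-squared statistic. It assumes that n is the number of sample units (for example quadrats) and that x are the counts per unit; the function name and data are illustrative.

```python
from typing import Sequence, Tuple


def morisita_index(counts: Sequence[int]) -> Tuple[float, float]:
    """Return (I_m, chi-squared statistic) for counts per sample unit."""
    n = len(counts)                          # number of sample units
    total = sum(counts)                      # sum of x
    total_sq = sum(x * x for x in counts)    # sum of x^2
    if total < 2:
        raise ValueError("need at least two individuals in total")
    i_m = n * (total_sq - total) / (total * total - total)
    chi_sq = i_m * (total - 1) + n - total   # compare with n - 1 d.f.
    return i_m, chi_sq


# All eight individuals in one of four units: strongly clumped.
clumped = morisita_index([8, 0, 0, 0])
# Two individuals in every unit: perfectly even.
even = morisita_index([2, 2, 2, 2])
```

For the clumped example the index equals the number of units (4), while the even example gives a value below 1 and a chi-squared statistic of zero.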

An alternative significance test for this index has been developed for large samples.

 $z = \frac { I_m - 1 } { 2 / n m^2 }$

where m is the overall sample mean, n is the number of sample units and z is the normal distribution abscissa. Significance is tested by comparing the value of z against the values of the normal distribution.

===Morisita's overlap index===

Morisita's overlap index is used to compare overlap among samples. The index is based on the assumption that increasing the size of the samples will increase the diversity because it will include different habitats.

 $C_D = \frac{ 2 \sum_{ i = 1 }^S x_i y_i } { ( D_x + D_y ) XY }$

where

- x_{i} is the number of times species i is represented in the total X from one sample,
- y_{i} is the number of times species i is represented in the total Y from another sample,
- D_{x} and D_{y} are the Simpson's index values for the x and y samples respectively, and
- S is the number of unique species.

C_{D} = 0 if the two samples do not overlap in terms of species, and C_{D} = 1 if the species occur in the same proportions in both samples.

Horn introduced a modification of the index:

$C_H = \frac{ 2 \sum_{ i = 1 }^S x_i y_i }{ \left( { \sum_{ i = 1}^S x_i^2 \over X^2 } + {\sum_{ i = 1 }^S y_i^2 \over Y^2 } \right) X Y }$
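
Horn's modification can be sketched in a few lines of Python. The code assumes the two samples are given as abundance vectors over the same ordered list of S species; the function name and example vectors are illustrative.

```python
from typing import Sequence


def horn_overlap(x: Sequence[float], y: Sequence[float]) -> float:
    """C_H = 2 * sum(x_i * y_i) / ((sum(x_i^2)/X^2 + sum(y_i^2)/Y^2) * X * Y)."""
    if len(x) != len(y):
        raise ValueError("both samples must cover the same species list")
    X, Y = sum(x), sum(y)
    if X == 0 or Y == 0:
        raise ValueError("each sample needs at least one individual")
    cross = sum(x_i * y_i for x_i, y_i in zip(x, y))
    d_x = sum(x_i * x_i for x_i in x) / X ** 2
    d_y = sum(y_i * y_i for y_i in y) / Y ** 2
    return 2 * cross / ((d_x + d_y) * X * Y)


same_proportions = horn_overlap([2, 4, 6], [1, 2, 3])   # identical proportions
no_overlap = horn_overlap([5, 5, 0, 0], [0, 0, 3, 7])   # no shared species
```

The two calls reproduce the limiting cases described above: identical proportions give 1 and disjoint samples give 0.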

===Standardised Morisita's index===

Smith-Gill developed a statistic based on Morisita's index which is independent of both sample size and population density and bounded by −1 and +1. This statistic is calculated as follows

First determine Morisita's index ( I_{d} ) in the usual fashion. Then let k be the number of units the population was sampled from. Calculate the two critical values

 $M_u = \frac { \chi^2_{ 0.975 } - k + \sum x } { \sum x - 1 }$

 $M_c = \frac { \chi^2_{ 0.025 } - k + \sum x } { \sum x - 1 }$

where χ^{2} is the chi square value for k − 1 degrees of freedom at the 97.5% and 2.5% levels of confidence.

The standardised index ( I_{p} ) is then calculated from one of the formulae below

When I_{d} ≥ M_{c} > 1

 $I_p = 0.5 + 0.5 \left( \frac { I_d - M_c } { k - M_c } \right)$

When M_{c} > I_{d} ≥ 1

 $I_p = 0.5 \left( \frac { I_d - 1 } { M_c - 1 } \right)$

When 1 > I_{d} ≥ M_{u}

 $I_p = -0.5 \left( \frac { I_d - 1 } { M_u - 1 } \right)$

When 1 > M_{u} > I_{d}

 $I_p = -0.5 + 0.5 \left( \frac { I_d - M_u } { M_u } \right)$

I_{p} ranges between +1 and −1 with 95% confidence intervals of ±0.5. I_{p} has the value of 0 if the pattern is random; if the pattern is uniform, I_{p} < 0 and if the pattern shows aggregation, I_{p} > 0.

===Peet's evenness indices===

These indices are a measure of evenness between samples.

 $E_1 = \frac{ I - I_\min }{ I_\max - I_\min }$

 $E_2 = \frac{ I }{ I_\max }$

where I is an index of diversity, I_{max} and I_{min} are the maximum and minimum values of I between the samples being compared.

===Loevinger's coefficient===

Loevinger has suggested a coefficient H defined as follows:

 $H = \sqrt{ \frac{ p_{\max} ( 1- p_{\min} ) } { p_{\min} (1-p_{\max} ) } }$

where p_{max} and p_{min} are the maximum and minimum proportions in the sample.

===Tversky index===

The Tversky index is an asymmetric measure that lies between 0 and 1.

For samples A and B the Tversky index (S) is

 $S = \frac{ | A \cap B | }{ | A \cap B | + \alpha | A - B | + \beta | B - A | }$

The values of α and β are arbitrary. Setting both α and β to 0.5 gives Dice's coefficient. Setting both to 1 gives Tanimoto's coefficient.
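
A minimal sketch with Python sets illustrates the role of α and β. The function name and the example sets are assumptions made for illustration.

```python
from typing import Set


def tversky_index(a: Set[int], b: Set[int], alpha: float, beta: float) -> float:
    """S = |A ∩ B| / (|A ∩ B| + alpha * |A - B| + beta * |B - A|)."""
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    if denom == 0:
        raise ValueError("index undefined for these sets and weights")
    return common / denom


A = {1, 2, 3, 4}
B = {1, 2, 5}
dice_like = tversky_index(A, B, 0.5, 0.5)    # 2 / (2 + 1 + 0.5) = 4/7
ratio_like = tversky_index(A, B, 1.0, 1.0)   # 2 / (2 + 2 + 1) = 2/5
a_weighted = tversky_index(A, B, 1.0, 0.0)   # only A's extra items count
b_weighted = tversky_index(A, B, 0.0, 1.0)   # only B's extra items count
```

With unequal weights the result depends on which sample is treated as the reference, which is the asymmetry referred to above.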

A symmetrical variant of this index has also been proposed.

 $S_1 = \frac{ | A \cap B | }{ | A \cap B |+ \beta \left( \alpha a + ( 1 - \alpha )b \right) }$

where

 $a = \min \left( | A - B |, | B - A |\right )$

 $b = \max \left( | A - B |, | B - A | \right)$

Several similar indices have been proposed.

Monostori et al. proposed the SymmetricSimilarity index

 $SS( A, B ) = \frac{ | d( A ) \cap d( B ) | }{ | d( A ) + d( B ) | }$

where d(X) is some measure derived from X.

Bernstein and Zobel have proposed the S2 and S3 indexes

 $S2 = \frac{ | d( A ) \cap d( B ) | }{ \min ( | d( A ) |, | d( B ) | ) }$

 $S3 = \frac{ 2 | d( A ) \cap d( B ) | }{ | d( A ) + d( B ) | }$

S3 is simply twice the SymmetricSimilarity index. Both are related to Dice's coefficient.

==Metrics used==

A number of metrics (distances between samples) have been proposed.

===Euclidean distance===

While this is usually used in quantitative work it may also be used in qualitative work. This is defined as

 $d_{ jk } = \sqrt { \sum_{ i = 1 }^N ( x_{ ij } - x_{ ik } )^2 }$

where d_{jk} is the distance between x_{ij} and x_{ik}.

===Gower's distance===

This is defined as

 $GD = \frac{ \sum_{ i = 1 }^n w_i d_i }{ \sum_{ i = 1 }^n w_i }$

where d_{i} is the distance between the i^{th} samples and w_{i} is the weighting given to the i^{th} distance.

===Manhattan distance===

While this is more commonly used in quantitative work it may also be used in qualitative work. This is defined as

 $d_{ jk } = \sum_{ i = 1 }^N | x_{ ij } - x_{ ik } |$

where d_{jk} is the distance between x_{ij} and x_{ik} and |·| denotes the absolute value of the difference between x_{ij} and x_{ik}.

A modified version of the Manhattan distance can be used to find a zero (root) of a polynomial of any degree using Lill's method.

===Prevosti's distance===

This is related to the Manhattan distance. It was described by Prevosti et al. and was used to compare differences between chromosomes. Let P and Q be two collections of r finite probability distributions. Let these distributions have values that are divided into k categories. Then the distance D_{PQ} is

 $D_{PQ} = \frac{ 1 }{ r } \sum_{ j = 1 }^r \sum_{ i = 1 }^k | p_{ ji } - q_{ ji } |$

where r is the number of discrete probability distributions in each population, k is the number of categories in distributions P_{j} and Q_{j} and p_{ji} (respectively q_{ji}) is the theoretical probability of category i in distribution P_{j} (Q_{j}) in population P (Q).

Its statistical properties were examined by Sanchez et al. who recommended a bootstrap procedure to estimate confidence intervals when testing for differences between samples.

===Other metrics===

Let

 $A = \sum x_{ ij }$

 $B = \sum x_{ ik }$

 $J = \sum \min ( x_{ ij }, x_{ ik } )$

where min(x,y) is the lesser value of the pair x and y.

Then

 $d_{ jk } = A + B - 2J$

is the Manhattan distance,

 $d_{ jk } = \frac{ A + B - 2J }{ A + B }$

is the Bray−Curtis distance,

 $d_{ jk } = \frac{ A + B - 2J }{ A + B - J }$

is the Jaccard (or Ruzicka) distance and

 $d_{ jk } = 1 - \frac{ 1 }{ 2 } \left( \frac{ J }{ A } + \frac{ J }{ B } \right)$

is the Kulczynski distance.
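
The four distances can be computed together from A, B and J, as in the sketch below. It assumes non-negative abundances listed over the same categories for samples j and k; the function name and example data are illustrative.

```python
from typing import Dict, Sequence


def abundance_distances(x_j: Sequence[float], x_k: Sequence[float]) -> Dict[str, float]:
    """Distances between samples j and k built from the totals A, B and J."""
    if len(x_j) != len(x_k):
        raise ValueError("samples must list the same categories")
    A = sum(x_j)
    B = sum(x_k)
    J = sum(min(u, v) for u, v in zip(x_j, x_k))
    if A == 0 or B == 0:
        raise ValueError("each sample needs a positive total")
    return {
        "manhattan": A + B - 2 * J,
        "bray_curtis": (A + B - 2 * J) / (A + B),
        "jaccard": (A + B - 2 * J) / (A + B - J),
        "kulczynski": 1 - 0.5 * (J / A + J / B),
    }


# Hypothetical abundances: A = 10, B = 10, J = 2 + 0 + 5 = 7.
dist = abundance_distances([4, 0, 6], [2, 3, 5])
```

The Manhattan entry agrees with the direct sum of absolute differences because |x − y| = x + y − 2 min(x, y) for each category.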

===Similarities between texts===

HaCohen-Kerner et al. have proposed a variety of metrics for comparing two or more texts.

==Ordinal data==

If the categories are at least ordinal then a number of other indices may be computed.

===Leik's D===

Leik's measure of dispersion (D) is one such index. Let there be K categories and let p_{i} be f_{i}/N where f_{i} is the number in the i^{th} category and let the categories be arranged in ascending order. Let

$c_a = \sum^a_{ i = 1 } p_i$

where a ≤ K. Let d_{a} = c_{a} if c_{a} ≤ 0.5 and d_{a} = 1 − c_{a} otherwise. Then

 $D = 2 \sum_{ a = 1 }^K \frac{ d_a }{ K - 1 }$
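
A sketch of the calculation in Python follows. It assumes d_{a} equals c_{a} when c_{a} ≤ 0.5 and 1 − c_{a} otherwise, and it works with integer cumulative counts so that the comparison with 0.5 is exact. The function name and example frequencies are illustrative.

```python
from typing import Sequence


def leik_d(frequencies: Sequence[int]) -> float:
    """Leik's D for category frequencies given in ascending category order."""
    K = len(frequencies)
    N = sum(frequencies)
    if K < 2 or N == 0:
        raise ValueError("need at least two categories and one case")
    cumulative = 0   # running count, so c_a = cumulative / N
    total = 0        # running sum of N * d_a
    for f in frequencies:
        cumulative += f
        # d_a = c_a if c_a <= 0.5, else 1 - c_a (scaled by N here)
        total += cumulative if 2 * cumulative <= N else N - cumulative
    return 2 * total / (N * (K - 1))


consensus = leik_d([10, 0, 0])      # every case in one category
polarised = leik_d([5, 0, 5])       # cases split between the two extremes
uniform = leik_d([2, 2, 2, 2, 2])   # cases spread evenly
```

Under this reading D is 0 when all cases agree and 1 when they are split evenly between the two end categories, with the uniform example falling in between.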

===Normalised Herfindahl measure===

This is the square of the coefficient of variation divided by N − 1 where N is the sample size.

$H = \frac{ 1 }{ N - 1 } \frac{ s^2 }{ m^2 }$

where m is the mean and s is the standard deviation.

===Potential-for-conflict Index===

The potential-for-conflict Index (PCI) describes the ratio of scoring on either side of a rating scale's centre point. This index requires at least ordinal data. This ratio is often displayed as a bubble graph.

The PCI uses an ordinal scale with an odd number of rating points (−n to +n) centred at 0. It is calculated as follows

 $PCI = \frac{ X_t }{ Z } \left[ 1 - \left| \frac{ \sum_{ i = 1 }^{ r_+ } X_+ }{ X_t } - \frac{ \sum _{ i = 1 }^{ r_- } X_-} { X_t } \right| \right]$

where Z = 2n, |·| is the absolute value (modulus), r_{+} is the number of responses in the positive side of the scale, r_{−} is the number of responses in the negative side of the scale, X_{+} are the responses on the positive side of the scale, X_{−} are the responses on the negative side of the scale and

$X_t = \sum_{ i = 1 }^{ r_+ } | X_+ | + \sum_{ i = 1 }^{ r_- } | X_- |$

Theoretical difficulties are known to exist with the PCI. The PCI can be computed only for scales with a neutral center point and an equal number of response options on either side of it. Also a uniform distribution of responses does not always yield the midpoint of the PCI statistic but rather varies with the number of possible responses or values in the scale. For example, five-, seven- and nine-point scales with a uniform distribution of responses give PCIs of 0.60, 0.57 and 0.50 respectively.

The first of these problems is relatively minor as most ordinal scales with an even number of responses can be extended (or reduced) by a single value to give an odd number of possible responses. Scales can usually be recentred if this is required. The second problem is more difficult to resolve and may limit the PCI's applicability.

The PCI has been extended as follows:

 $PCI_2 = \frac{ \sum_{ i = 1 }^K \sum_{ j = 1 }^i k_i k_j d_{ ij } }{ \delta }$

where K is the number of categories, k_{i} is the number in the i^{th} category, d_{ij} is the distance between the i^{th} and j^{th} categories, and δ is the maximum distance on the scale multiplied by the number of times it can occur in the sample. For a sample with an even number of data points

$\delta = \frac{ N^2 }{ 2 } d_\max$

and for a sample with an odd number of data points

$\delta = \frac{ N^2 - 1 }{ 2 } d_\max$

where N is the number of data points in the sample and d_{max} is the maximum distance between points on the scale.

Vaske et al. suggest a number of possible distance measures for use with this index.

 $D_1: d_{ ij } = | r_i - r_j | - 1$

if the signs (+ or −) of r_{i} and r_{j} differ. If the signs are the same d_{ij} = 0.

 $D_2: d_{ ij } = | r_i - r_j |$

 $D_3: d_{ ij } = | r_i - r_j |^p$

where p is an arbitrary real number > 0.

 $Dp_{ ij }: d_{ ij } = [ | r_i - r_j | - ( m - 1 ) ]^p$

if sign(r_{i} ) ≠ sign(r_{j} ) and p is a real number > 0. If the signs are the same then d_{ij} = 0. m is D_{1}, D_{2} or D_{3}.

The difference between D_{1} and D_{2} is that the first does not include neutrals in the distance while the latter does. For example, respondents scoring −2 and +1 would have a distance of 2 under D_{1} and 3 under D_{2}.
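
The first three distance measures are simple enough to sketch directly. The code assumes that two scores have different signs only when one is strictly negative and the other strictly positive, so a neutral score of 0 never counts as differing in sign; the function names are illustrative.

```python
def d1(r_i: int, r_j: int) -> int:
    """|r_i - r_j| - 1 when the signs differ, otherwise 0."""
    # Assumption: signs differ only if the product of the scores is negative.
    return abs(r_i - r_j) - 1 if r_i * r_j < 0 else 0


def d2(r_i: int, r_j: int) -> int:
    """Plain absolute difference, neutral point included."""
    return abs(r_i - r_j)


def d3(r_i: int, r_j: int, p: float) -> float:
    """Absolute difference raised to an arbitrary power p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return abs(r_i - r_j) ** p


# Respondents scoring -2 and +1, as in the example above.
example = (d1(-2, 1), d2(-2, 1), d3(-2, 1, 2))
```

Choosing p = 2 in `d3` stretches the same pair to a distance of 9, which illustrates how a power greater than 1 emphasises widely separated responses.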

The use of a power (p) in the distances allows for the rescaling of extreme responses. These differences can be highlighted with p > 1 or diminished with p < 1.

In simulations with variates drawn from a uniform distribution the PCI_{2} has a symmetric unimodal distribution. The tails of its distribution are larger than those of a normal distribution.

Vaske et al. suggest the use of a t test to compare the values of the PCI between samples if the PCIs are approximately normally distributed.

===van der Eijk's A===

This measure is a weighted average of the degree of agreement in the frequency distribution. A ranges from −1 (perfect bimodality) to +1 (perfect unimodality). It is defined as

 $A = U \left( 1 - \frac{ S - 1 }{ K - 1 } \right)$

where U is the unimodality of the distribution, S the number of categories that have nonzero frequencies and K the total number of categories.

The value of U is 1 if the distribution has any of the three following characteristics:

- all responses are in a single category
- the responses are evenly distributed among all the categories
- the responses are evenly distributed among two or more contiguous categories, with the other categories with zero responses

With distributions other than these the data must be divided into 'layers'. Within a layer the responses are either equal or zero. The categories do not have to be contiguous. A value for A for each layer (A_{i}) is calculated and a weighted average for the distribution is determined. The weights (w_{i}) for each layer are the number of responses in that layer. In symbols

$A_\mathrm{overall} = \sum w_i A_i$

A uniform distribution has A = 0: when all the responses fall into one category A = +1.

One theoretical problem with this index is that it assumes that the intervals are equally spaced. This may limit its applicability.

==Related statistics==

===Birthday problem===

If there are n units in the sample and they are randomly distributed into k categories (n ≤ k), this can be considered a variant of the birthday problem. The probability (p) that no category contains more than one unit is

 $p = \prod_{ i = 1 }^{ n - 1 } \left( 1 - \frac{ i }{ k } \right)$

If k is large and n is small compared with k^{2/3} then to a good approximation

 $p = \exp\left( \frac{ -n^2 } { 2k } \right)$

This approximation follows from the exact formula as follows:

 $\log_e \left( 1 - \frac{ i }{ k } \right) \approx - \frac{ i }{ k }$

;Sample size estimates

For p = 0.5 and p = 0.05 respectively the following estimates of n may be useful

 $n = 1.2 \sqrt{ k }$

 $n = 2.448 \sqrt{ k } \approx 2.5 \sqrt{ k }$
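
The two constants come from solving the approximation p = exp(−n²/2k) for n, which gives n = √(−2k ln p). The sketch below does this numerically; the function name and the choice k = 365 are illustrative assumptions.

```python
import math


def sample_size(k: float, p: float) -> float:
    """Solve exp(-n^2 / (2k)) = p for n, i.e. n = sqrt(-2 k ln p)."""
    if k <= 0 or not 0 < p < 1:
        raise ValueError("need k > 0 and 0 < p < 1")
    return math.sqrt(-2.0 * k * math.log(p))


coef_half = sample_size(1, 0.5)    # multiplier of sqrt(k) for p = 0.5
coef_small = sample_size(1, 0.05)  # multiplier of sqrt(k) for p = 0.05
n_days = sample_size(365, 0.5)     # classic birthday setting, k = 365
```

Setting k = 1 isolates the multipliers √(2 ln 2) ≈ 1.18 and √(2 ln 20) ≈ 2.448, which are the values rounded to 1.2 and 2.5 in the estimates above.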

This analysis can be extended to multiple categories. For p = 0.5 and p = 0.05 we have respectively

 $n = 1.2 \sqrt{ \frac{ 1 }{ \sum_{ i = 1 }^k \frac{ 1 }{ c_i } } }$

 $n \approx 2.5 \sqrt{ \frac{ 1 }{ \sum_{ i = 1 }^k \frac{ 1 }{ c_i } } }$

where c_{i} is the size of the i^{th} category. This analysis assumes that the categories are independent.

If the data are ordered in some fashion, then for at least one pair of events to fall in two categories lying within j categories of each other with a probability of 0.5 or 0.05, the required sample size (n) is respectively

$n = 1.2 \sqrt { \frac{ k }{ 2j + 1 } }$

$n \approx 2.5 \sqrt { \frac{ k }{ 2j + 1 } }$

where k is the number of categories.

===Birthday-death day problem===

Whether or not there is a relation between birthdays and death days has been investigated with the statistic

$- \log_{10} \left( \frac{ 1 + 2 d }{ 365 } \right),$

where d is the number of days in the year between the birthday and the death day.

===Rand index===

The Rand index is used to test whether two or more classification systems agree on a data set.

Given a set of $n$ elements $S = \{o_1, \ldots, o_n\}$ and two partitions of $S$ to compare, $X = \{X _1, \ldots, X_r \}$, a partition of S into r subsets, and $Y = \{ Y_1, \ldots, Y_s \}$, a partition of S into s subsets, define the following:

- $a$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in the same subset in $Y$
- $b$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in different subsets in $Y$
- $c$, the number of pairs of elements in $S$ that are in the same subset in $X$ and in different subsets in $Y$
- $d$, the number of pairs of elements in $S$ that are in different subsets in $X$ and in the same subset in $Y$

The Rand index - $R$ - is defined as

$R = \frac{ a + b }{ a + b + c + d } = \frac{ a+ b }{ { n \choose 2 } }$

Intuitively, $a + b$ can be considered as the number of agreements between $X$ and $Y$ and $c + d$ as the number of disagreements between $X$ and $Y$.
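
A direct, if quadratic-time, sketch counts agreements over all pairs of elements. It assumes each partition is given as a list of labels, one per element; the function name and the example labels are illustrative.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_x: Sequence[Hashable], labels_y: Sequence[Hashable]) -> float:
    """Fraction of element pairs on which the two partitions agree."""
    n = len(labels_x)
    if n != len(labels_y) or n < 2:
        raise ValueError("need two labelings of the same n >= 2 elements")
    agreements = 0   # a + b
    pairs = 0        # n choose 2
    for i, j in combinations(range(n), 2):
        same_x = labels_x[i] == labels_x[j]
        same_y = labels_y[i] == labels_y[j]
        if same_x == same_y:  # together in both, or apart in both
            agreements += 1
        pairs += 1
    return agreements / pairs


# Four elements: Y splits the second cluster of X in two.
r = rand_index([0, 0, 1, 1], [0, 0, 1, 2])  # a = 1, b = 4, c = 1, d = 0
```

Only the grouping matters, not the label values, so relabelling the clusters of either partition leaves the index unchanged.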

===Adjusted Rand index===
The adjusted Rand index is the corrected-for-chance version of the Rand index. Though the Rand Index may only yield a value between 0 and +1, the adjusted Rand index can yield negative values if the index is less than the expected index.

====The contingency table====
Given a set $S$ of $n$ elements, and two groupings or partitions (e.g. clusterings) of these points, namely $X = \{ X_1, X_2, \ldots , X_r \}$ and $Y = \{ Y_1, Y_2, \ldots , Y_s \}$, the overlap between $X$ and $Y$ can be summarized in a contingency table $\left[ n_{ ij } \right]$ where each entry $n_{ ij }$ denotes the number of objects in common between $X_i$ and $Y_j$ : $n_{ ij }= |X_i \cap Y_j |$.

| X\Y | $Y_1$ | $Y_2$ | $\ldots$ | $Y_s$ | Sums |
| $X_1$ | $n_{ 11 }$ | $n_{ 12 }$ | $\ldots$ | $n_{ 1s }$ | $a_1$ |
| $X_2$ | $n_{ 21 }$ | $n_{ 22 }$ | $\ldots$ | $n_{ 2s }$ | $a_2$ |
| $\vdots$ | $\vdots$ | $\vdots$ | $\ddots$ | $\vdots$ | $\vdots$ |
| $X_r$ | $n_{ r1 }$ | $n_{ r2 }$ | $\ldots$ | $n_{ rs }$ | $a_r$ |
| Sums | $b_1$ | $b_2$ | $\ldots$ | $b_s$ | |

====Definition====

The adjusted form of the Rand Index, the Adjusted Rand Index, is

 $\text{AdjustedIndex} = \frac{ \text{Index} - \text{ExpectedIndex} }{ \text{MaxIndex} - \text{ExpectedIndex}},$

more specifically

 $\text{ARI} = \frac{ \sum_{ij} \binom{n_{ij}} 2 - \left. \left[ \sum_i \binom{a_i} 2 \sum_j \binom{b_j} 2\right] \right/ \binom n 2} {\frac 1 2 \left[ \sum_i \binom{a_i}2 + \sum_j \binom{b_j} 2 \right] - \left. \left[ \sum_i \binom{a_i} 2 \sum_j \binom{b_j} 2 \right] \right/ \binom n 2}$

where $n_{ij}, a_i, b_j$ are values from the contingency table.
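
The formula can be evaluated from label lists by building the contingency counts with `collections.Counter`. This is a sketch under the assumption that each partition is supplied as one label per element; the function name is illustrative.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence


def adjusted_rand_index(labels_x: Sequence[Hashable], labels_y: Sequence[Hashable]) -> float:
    """ARI computed from the contingency table of two labelings."""
    n = len(labels_x)
    if n != len(labels_y) or n < 2:
        raise ValueError("need two labelings of the same n >= 2 elements")
    n_ij = Counter(zip(labels_x, labels_y))   # cell counts
    a_i = Counter(labels_x)                   # row sums
    b_j = Counter(labels_y)                   # column sums
    sum_ij = sum(comb(v, 2) for v in n_ij.values())
    sum_a = sum(comb(v, 2) for v in a_i.values())
    sum_b = sum(comb(v, 2) for v in b_j.values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = 0.5 * (sum_a + sum_b)
    if maximum == expected:
        raise ValueError("ARI undefined for these degenerate partitions")
    return (sum_ij - expected) / (maximum - expected)


partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

In the second call the partitions cut across each other, the index falls below its expected value, and the adjusted Rand index is negative.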

Since the denominator is the total number of pairs, the Rand index represents the frequency of occurrence of agreements over the total pairs, or the probability that $X$ and $Y$ will agree on a randomly chosen pair.

==Evaluation of indices==

Different indices give different values of variation, and may be used for different purposes: several are used and critiqued in the sociology literature especially.

If one wishes to simply make ordinal comparisons between samples (is one sample more or less varied than another), the choice of IQV is relatively less important, as they will often give the same ordering.

Where the data is ordinal a method that may be of use in comparing samples is ORDANOVA.

In some cases it is useful to not standardize an index to run from 0 to 1, regardless of number of categories or samples, but one generally so standardizes it.

== See also ==
- ANOSIM
- Categorical data
- Diversity index
- Fowlkes–Mallows index
- Goodman and Kruskal's gamma
- Information entropy
- Logarithmic distribution
- PERMANOVA
- Robinson–Foulds metric
- Statistical dispersion
- Variation ratio
- Whipple's index
