Effective sample size

In statistics, effective sample size is a notion defined for a sample from a distribution when the observations in the sample are correlated or weighted. In 1965, Leslie Kish defined it as the original sample size divided by the design effect, to reflect how the variance under the sampling design actually used compares to what it would be if the sample were a simple random sample.[1][2]: 162, 259

Correlated observations

Suppose a sample of several independent identically distributed observations ${\displaystyle Y_{1},\dots ,Y_{n}}$ is drawn from a distribution with mean ${\displaystyle \mu }$ and standard deviation ${\displaystyle \sigma }$. Then the mean of this distribution is estimated by the mean of the sample:

${\displaystyle {\hat {\mu }}={\frac {1}{n}}\sum _{i=1}^{n}Y_{i}.}$

In that case, the variance of ${\displaystyle {\hat {\mu }}}$ is given by

${\displaystyle \operatorname {Var} ({\hat {\mu }})={\frac {\sigma ^{2}}{n}}.}$

However, if the observations in the sample are correlated (in the intraclass correlation sense), then ${\displaystyle \operatorname {Var} ({\hat {\mu }})}$ is somewhat higher. For instance, if all observations in the sample are completely correlated (${\displaystyle \rho _{(i,j)}=1}$), then ${\displaystyle \operatorname {Var} ({\hat {\mu }})=\sigma ^{2}}$ regardless of ${\displaystyle n}$.

The effective sample size ${\displaystyle n_{\text{eff}}}$ is the unique value (not necessarily an integer) such that

${\displaystyle \operatorname {Var} ({\hat {\mu }})={\frac {\sigma ^{2}}{n_{\text{eff}}}}.}$

${\displaystyle n_{\text{eff}}}$ is a function of the correlation between observations in the sample.

Suppose that all the (non-trivial) correlations are the same and greater than ${\displaystyle -1/(n-1)}$, i.e. if ${\displaystyle i\neq j}$, then ${\displaystyle \rho _{(i,j)}=\rho >-1/(n-1)}$. Then

{\displaystyle {\begin{aligned}\operatorname {Var} ({\hat {\mu }})&=\operatorname {Var} \left({\frac {1}{n}}Y_{1}+{\frac {1}{n}}Y_{2}+\cdots +{\frac {1}{n}}Y_{n}\right)\\[5pt]&=\sum _{i=1}^{n}{\frac {1}{n^{2}}}\operatorname {Var} (Y_{i})+\sum _{i=1}^{n}\sum _{j=1,j\neq i}^{n}{\frac {1}{n^{2}}}\operatorname {Cov} (Y_{i},Y_{j})\\[5pt]&=n{\frac {\sigma ^{2}}{n^{2}}}+n(n-1){\frac {\sigma ^{2}\rho }{n^{2}}}\\[5pt]&=\sigma ^{2}{\frac {1+(n-1)\rho }{n}}.\end{aligned}}}
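This identity can be checked numerically: build the equicorrelated covariance matrix ${\displaystyle \Sigma }$ with ${\displaystyle \sigma ^{2}}$ on the diagonal and ${\displaystyle \rho \sigma ^{2}}$ off it, and compute ${\displaystyle \operatorname {Var} ({\hat {\mu }})}$ directly as ${\displaystyle \mathbf {1} ^{\top }\Sigma \mathbf {1} /n^{2}}$. A sketch using NumPy; the values of `n`, `sigma`, and `rho` are arbitrary illustrations:

```python
import numpy as np

n, sigma, rho = 8, 2.0, 0.3  # arbitrary illustrative values

# Equicorrelated covariance matrix: sigma^2 on the diagonal, rho*sigma^2 off it
Sigma = sigma**2 * (rho * np.ones((n, n)) + (1 - rho) * np.eye(n))

# Var(mean) = (1/n^2) * 1' Sigma 1, computed directly from the matrix
ones = np.ones(n)
var_direct = ones @ Sigma @ ones / n**2

# Closed form from the derivation above
var_formula = sigma**2 * (1 + (n - 1) * rho) / n

print(var_direct, var_formula)  # the two agree
```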

Therefore

${\displaystyle n_{\text{eff}}={\frac {n}{1+(n-1)\rho }}.}$

If ${\displaystyle \rho =0}$, then ${\displaystyle n_{\text{eff}}=n}$; if ${\displaystyle \rho =1}$, then ${\displaystyle n_{\text{eff}}=1}$; and if ${\displaystyle -1/(n-1)<\rho <0}$, then ${\displaystyle n_{\text{eff}}>n}$.
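These special cases can be illustrated with a one-line helper (a sketch; the function name and the sample values are illustrative):

```python
def effective_sample_size(n, rho):
    """n_eff = n / (1 + (n - 1) * rho) for equicorrelated observations."""
    return n / (1 + (n - 1) * rho)

print(effective_sample_size(100, 0.0))     # rho = 0: n_eff = n = 100
print(effective_sample_size(100, 1.0))     # rho = 1: n_eff = 1
print(effective_sample_size(100, -0.005))  # small negative rho: n_eff > n
```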

The case where the correlations are not uniform is somewhat more complicated. As noted above, if the correlation is negative, the effective sample size may be larger than the actual sample size. If we allow the more general estimator ${\displaystyle {\hat {\mu }}=\sum _{i=1}^{n}a_{i}Y_{i}}$ (where ${\displaystyle \sum _{i=1}^{n}a_{i}=1}$), then it is possible to construct correlation matrices for which ${\displaystyle n_{\text{eff}}>n}$ even when all correlations are positive. Intuitively, the maximal value of ${\displaystyle n_{\text{eff}}}$ over all choices of the coefficients ${\displaystyle a_{i}}$ may be thought of as the information content of the observed data.
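For the generalized estimator, minimizing ${\displaystyle \mathbf {a} ^{\top }R\,\mathbf {a} }$ subject to ${\displaystyle \sum _{i}a_{i}=1}$ gives the maximal effective sample size ${\displaystyle n_{\text{eff}}=\mathbf {1} ^{\top }R^{-1}\mathbf {1} }$, where ${\displaystyle R}$ is the correlation matrix; this identity follows from the variance formula above, though it is not stated explicitly here. A small numerical check, using an illustrative correlation matrix whose off-diagonal entries are all positive:

```python
import numpy as np

# 3x3 correlation matrix with strictly positive off-diagonal entries
# (positive definite: its eigenvalues are all greater than zero)
R = np.array([[1.0, 0.8, 0.3],
              [0.8, 1.0, 0.8],
              [0.3, 0.8, 1.0]])

ones = np.ones(3)
# Maximal n_eff over weight vectors a with sum(a) = 1 is 1' R^{-1} 1
n_eff = ones @ np.linalg.solve(R, ones)

print(n_eff)  # 5.0 > n = 3, despite all correlations being positive
```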

Weighted samples

If the data have been weighted (the weights need not be normalized, i.e. need not sum to 1, ${\displaystyle n}$, or some other constant), then several of the observations composing the sample have effectively been drawn from the distribution with 100% correlation with some previous observation. In this case, the resulting quantity is known as Kish's effective sample size:[3][2]: 162, 259

${\displaystyle n_{\text{eff}}={\frac {n}{D_{\text{eff}}}}={\frac {n}{\frac {\overline {w^{2}}}{{\overline {w}}^{2}}}}={\frac {n}{\frac {{\frac {1}{n}}\sum _{i=1}^{n}w_{i}^{2}}{\left({\frac {1}{n}}\sum _{i=1}^{n}w_{i}\right)^{2}}}}={\frac {n}{\frac {n\sum _{i=1}^{n}w_{i}^{2}}{(\sum _{i=1}^{n}w_{i})^{2}}}}={\frac {(\sum _{i=1}^{n}w_{i})^{2}}{\sum _{i=1}^{n}w_{i}^{2}}}}$
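As the chain of equalities shows, Kish's formula reduces to ${\displaystyle (\sum _{i}w_{i})^{2}/\sum _{i}w_{i}^{2}}$, which is invariant to rescaling the weights. A sketch; the weight vectors are illustrative:

```python
def kish_neff(weights):
    """Kish's effective sample size: (sum w)^2 / sum(w^2)."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

print(kish_neff([1, 1, 1, 1]))   # equal weights: n_eff = n = 4
print(kish_neff([5, 5, 5, 5]))   # scale-invariant: still 4
print(kish_neff([10, 1, 1, 1]))  # one dominant weight pulls n_eff toward 1
```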

References

1. ^ Tom Leinster (December 18, 2014). "Effective Sample Size".
2. ^ a b Kish, Leslie (1965). Survey Sampling. New York: John Wiley & Sons. ISBN 0-471-10949-5.
3. ^