# Hopkins statistic

The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set. It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test where the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed. A value close to 1 tends to indicate that the data are highly clustered; randomly distributed data tend to give values around 0.5; and regularly spaced data tend to give values close to 0 [citation needed].

## Preliminaries

A typical formulation of the Hopkins statistic follows.

- Let $X$ be the set of $n$ data points.
- Draw a random sample (without replacement) of $m\ll n$ data points from $X$, with members $x_{i}$.
- Generate a set $Y$ of $m$ uniformly randomly distributed data points, with members $y_{i}$.
- Define two distance measures:
  - $u_{i}$, the distance of $y_{i}\in Y$ from its nearest neighbour in $X$, and
  - $w_{i}$, the distance of the sampled point $x_{i}\in X$ from its nearest neighbour in $X$ other than itself.
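
The procedure above can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `hopkins_statistic` and its parameters are hypothetical, distances are taken to be Euclidean, and $Y$ is drawn uniformly from the axis-aligned bounding box of $X$ (the formulation above does not fix the sampling region). The sketch combines the distances into the statistic $H$ given in the Definition section.

```python
import numpy as np


def hopkins_statistic(X, m, rng=None):
    """Estimate the Hopkins statistic H for an (n, d) array of points X.

    Hypothetical helper for illustration; assumes Euclidean distance.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")

    # Random sample of m data points x_i, drawn without replacement.
    idx = rng.choice(n, size=m, replace=False)

    # Assumption: Y is uniform over the bounding box of X.
    lo, hi = X.min(axis=0), X.max(axis=0)
    Y = rng.uniform(lo, hi, size=(m, d))

    # u_i: distance from y_i to its nearest neighbour in X.
    dist_y = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=2)
    u = dist_y.min(axis=1)

    # w_i: distance from the sampled x_i to its nearest neighbour in X,
    # excluding x_i itself (otherwise every w_i would be zero).
    dist_x = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    dist_x[np.arange(m), idx] = np.inf
    w = dist_x.min(axis=1)

    # Combine the d-th powers of the distances into H.
    u_sum, w_sum = (u**d).sum(), (w**d).sum()
    return u_sum / (u_sum + w_sum)
```

For example, calling `hopkins_statistic(points, m=25, rng=0)` on an array `points` of shape `(n, 2)` returns a single value between 0 and 1. The brute-force distance matrices cost $O(mn)$ memory, which is acceptable because $m\ll n$; a spatial index would be a natural substitute for large data sets.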

## Definition

With the above notation, if the data are $d$-dimensional, then the Hopkins statistic is defined as:

$H={\frac {\sum _{i=1}^{m}{u_{i}^{d}}}{\sum _{i=1}^{m}{u_{i}^{d}}+\sum _{i=1}^{m}{w_{i}^{d}}}}$

## Notes and references

1. Hopkins, Brian; Skellam, John Gordon (1954). "A new method for determining the type of distribution of plant individuals". Annals of Botany. Annals Botany Co. 18 (2): 213–227.
2. Banerjee, A. (2004). "Validating clusters using the Hopkins statistic". IEEE International Conference on Fuzzy Systems: 149–153. doi:10.1109/FUZZY.2004.1375706.