= Sample entropy =

Sample entropy (SampEn; more appropriately $K_2$ entropy or Takens–Grassberger–Procaccia correlation entropy) is a modification of approximate entropy (ApEn; more appropriately "Procaccia–Cohen entropy") used to assess the complexity of physiological and other time-series signals, for example to diagnose diseased states. SampEn has two advantages over ApEn: independence of data length and a relatively trouble-free implementation. There is also a small computational difference. In ApEn, the comparison between the template vector (see below) and the rest of the vectors also includes a comparison of the template with itself. This guarantees that the probabilities $C_{i}'^{m}(r)$ are never zero, so it is always possible to take their logarithm. Because these self-matches lower ApEn values, signals are interpreted as more regular than they actually are. Self-matches are not included in SampEn. However, since SampEn makes direct use of the correlation integrals, it is not a real measure of information but an approximation. A tutorial covering its foundations, its differences from ApEn, and its step-by-step application is available.

SampEn is indeed identical to the "correlation entropy" $K_2$ of Grassberger & Procaccia, except that the latter suggest taking certain limits in order to achieve a result that is invariant under changes of variables. No such limits and no invariance properties are considered in SampEn.

There is a multiscale version of SampEn as well, suggested by Costa and others. SampEn can be used in biomedical and biomechanical research, for example to evaluate postural control.

== Definition ==
Like approximate entropy (ApEn), sample entropy (SampEn) is a measure of complexity, but unlike ApEn it does not count self-matches. For a given embedding dimension $m$, tolerance $r$ and number of data points $N$, SampEn is the negative natural logarithm of the probability that two sets of simultaneous data points of length $m$ that have distance $< r$ still have distance $< r$ when extended to length $m+1$. It is denoted $SampEn(m,r,N)$ (or $SampEn(m,r,\tau,N)$ when the sampling time $\tau$ is included).

Now assume we have a time-series data set of length $N$, $\{ x_1 , x_2 , x_3 , \ldots , x_N \}$, with a constant time interval $\tau$. We define a template vector of length $m$, $X_m (i)=\{ x_i , x_{i+1} , x_{i+2} , \ldots , x_{i+m-1} \}$, and take the distance function $d[X_m(i),X_m(j)]$ ($i \neq j$) to be the Chebyshev distance (although any distance function could be used, including the Euclidean distance). We define the sample entropy to be
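The template vectors and the Chebyshev distance can be illustrated with a minimal sketch. The helper names template and chebyshev and the sample values are assumptions made for this example, and indices are 0-based, unlike the 1-based notation in the text.

<syntaxhighlight lang="python">
def template(x, i, m):
    """Template vector of length m starting at 0-based index i."""
    return x[i : i + m]


def chebyshev(u, v):
    """Chebyshev distance: the largest element-wise absolute difference."""
    return max(abs(a - b) for a, b in zip(u, v))


x = [1.0, 2.0, 4.0, 7.0, 11.0]
u = template(x, 0, 2)  # [1.0, 2.0]
v = template(x, 2, 2)  # [4.0, 7.0]
d = chebyshev(u, v)    # max(|1 - 4|, |2 - 7|) = 5.0
</syntaxhighlight>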

$SampEn=-\ln {A \over B}$

where

$A$ = number of template vector pairs having $d[X_{m+1}(i),X_{m+1}(j)] < r$

$B$ = number of template vector pairs having $d[X_m(i),X_m(j)] < r$

With the Chebyshev distance, every pair of templates of length $m+1$ that lies within distance $r$ also matches on its first $m$ components, so $A$ is always less than or equal to $B$. Therefore, $SampEn(m,r,\tau)$ is always either zero or positive. A smaller value of $SampEn$ indicates more self-similarity in the data set or less noise.
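As a hand-checkable illustration of the counting, the following minimal sketch evaluates $A$ and $B$ for a short alternating series. The function name count_matches and the sample values are assumptions made for this example, and every available template of each length is used.

<syntaxhighlight lang="python">
from itertools import combinations
from math import log


def count_matches(x, m, r):
    """Count pairs i < j of length-m templates with Chebyshev distance < r."""
    templates = [x[i : i + m] for i in range(len(x) - m + 1)]
    return sum(
        1
        for u, v in combinations(templates, 2)
        if max(abs(a - b) for a, b in zip(u, v)) < r
    )


x = [1, 2, 1, 2, 1, 2]
B = count_matches(x, 2, 0.5)  # 4 matching pairs of length-2 templates
A = count_matches(x, 3, 0.5)  # 2 matching pairs of length-3 templates
samp_en = -log(A / B)         # -ln(2/4) = ln 2
</syntaxhighlight>

Here $A \le B$ holds as expected, and the result is non-negative.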

The value of $m$ is generally taken to be $2$ and the value of $r$ to be $0.2 \times std$, where std stands for the standard deviation, which should be taken over a very large dataset. For instance, an $r$ value of 6 ms is appropriate for sample entropy calculations of heart rate intervals, since this corresponds to $0.2 \times std$ for a very large population.
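The rule $r = 0.2 \times std$ can be sketched as follows. The function name default_tolerance is hypothetical, and the sketch assumes that the population standard deviation is estimated from the supplied sample itself, which is only a stand-in for the very large dataset recommended above.

<syntaxhighlight lang="python">
from statistics import pstdev


def default_tolerance(x, factor=0.2):
    """Tolerance r as a fraction of the population standard deviation of x."""
    return factor * pstdev(x)


x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
r = default_tolerance(x)  # standard deviation is 2.0, so r is 0.4
</syntaxhighlight>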

== Multiscale SampEn ==
The definition above is a special case of multiscale SampEn with $\delta=1$, where $\delta$ is called the skipping parameter. In multiscale SampEn, template vectors are defined with a fixed interval between their elements, specified by the value of $\delta$. The modified template vector is defined as

$X_{m,\delta}(i)=\{ x_i,x_{i+\delta},x_{i+2\times\delta},\ldots,x_{i+(m-1)\times\delta} \}$

and SampEn can be written as

$SampEn \left ( m,r,\delta \right )=-\ln { A_\delta \over B_\delta }$

where $A_\delta$ and $B_\delta$ are calculated as before.
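The skipped template vectors can be sketched as below. The names delta_templates, count_matches and sample_entropy_delta are assumptions made for this example; the sketch uses every template that fits in the series and 0-based indices.

<syntaxhighlight lang="python">
from itertools import combinations
from math import log


def delta_templates(x, m, delta):
    """Templates [x[i], x[i + delta], ..., x[i + (m - 1) * delta]]."""
    span = (m - 1) * delta  # index offset between first and last element
    return [x[i : i + span + 1 : delta] for i in range(len(x) - span)]


def count_matches(templates, r):
    """Count pairs of distinct templates with Chebyshev distance < r."""
    return sum(
        1
        for u, v in combinations(templates, 2)
        if max(abs(a - b) for a, b in zip(u, v)) < r
    )


def sample_entropy_delta(x, m, r, delta=1):
    """-ln(A_delta / B_delta); delta = 1 gives the ordinary definition."""
    B = count_matches(delta_templates(x, m, delta), r)
    A = count_matches(delta_templates(x, m + 1, delta), r)
    return -log(A / B)


first = delta_templates(list(range(10)), 3, 2)[0]  # [0, 2, 4]
</syntaxhighlight>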

== Implementation ==
Sample entropy can be implemented easily in many different programming languages. Below is an example written in Python.
<syntaxhighlight lang="python">
from itertools import combinations
from math import log

def construct_templates(timeseries_data: list, m: int = 2) -> list:
    # All overlapping windows of length m.
    num_windows = len(timeseries_data) - m + 1
    return [timeseries_data[x : x + m] for x in range(0, num_windows)]

def get_matches(templates: list, r: float) -> int:
    # Count unordered pairs of distinct templates that match; self-matches are excluded.
    return sum(1 for t1, t2 in combinations(templates, 2) if is_match(t1, t2, r))

def is_match(template_1: list, template_2: list, r: float) -> bool:
    # Chebyshev distance < r: every element-wise difference is below r.
    return all(abs(x - y) < r for (x, y) in zip(template_1, template_2))

def sample_entropy(timeseries_data: list, window_size: int, r: float) -> float:
    # B counts matches of length m; A counts matches of length m + 1.
    # Raises an error when A or B is zero, where SampEn is undefined.
    B = get_matches(construct_templates(timeseries_data, window_size), r)
    A = get_matches(construct_templates(timeseries_data, window_size + 1), r)
    return -log(A / B)
</syntaxhighlight>

An equivalent example using NumPy (Numerical Python):
<syntaxhighlight lang="numpy">
import numpy

def construct_templates(timeseries_data, m):
    # Stack every overlapping window of length m into a 2-D array.
    num_windows = len(timeseries_data) - m + 1
    return numpy.array([timeseries_data[x : x + m] for x in range(0, num_windows)])

def get_matches(templates, r) -> int:
    # Count unordered pairs of distinct templates that match.
    return len(
        list(filter(lambda x: is_match(x[0], x[1], r), combinations(templates)))
    )

def combinations(x):
    # Index pairs (i, j) with i < j from the strict upper triangle;
    # the result has shape (num_pairs, 2, m).
    idx = numpy.stack(numpy.triu_indices(len(x), k=1), axis=-1)
    return x[idx]

def is_match(template_1, template_2, r) -> bool:
    # Chebyshev distance < r: the largest element-wise difference is below r.
    return bool(numpy.max(numpy.abs(template_1 - template_2)) < r)

def sample_entropy(timeseries_data, window_size, r):
    # B counts matches of length m; A counts matches of length m + 1.
    B = get_matches(construct_templates(timeseries_data, window_size), r)
    A = get_matches(construct_templates(timeseries_data, window_size + 1), r)
    return -numpy.log(A / B)
</syntaxhighlight>
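The implementations above fail or return an infinite value when $A$ or $B$ is zero. The following self-contained sketch shows one possible way to handle those cases and compares a regular series with a pseudo-random one. The name safe_sample_entropy, the return conventions (nan when $B = 0$, infinity when $A = 0$) and the default $r = 0.2 \times std$ computed from the series itself are assumptions of this example, not part of the definition.

<syntaxhighlight lang="python">
import random
from itertools import combinations
from math import inf, log, nan
from statistics import pstdev


def count_matches(x, m, r):
    """Count pairs of distinct length-m templates with Chebyshev distance < r."""
    templates = [x[i : i + m] for i in range(len(x) - m + 1)]
    return sum(
        1
        for u, v in combinations(templates, 2)
        if max(abs(a - b) for a, b in zip(u, v)) < r
    )


def safe_sample_entropy(x, m=2, r=None):
    """SampEn with explicit handling of empty match counts."""
    if r is None:
        r = 0.2 * pstdev(x)  # assumption: std estimated from x itself
    B = count_matches(x, m, r)
    A = count_matches(x, m + 1, r)
    if B == 0:
        return nan  # no matches of length m: the ratio A / B is undefined
    if A == 0:
        return inf  # -ln(0) diverges
    return -log(A / B)


regular = [i % 2 for i in range(300)]           # strictly alternating series
rng = random.Random(0)                          # fixed seed for repeatability
noisy = [rng.random() for _ in range(300)]      # uniform pseudo-random series

regular_en = safe_sample_entropy(regular)
noisy_en = safe_sample_entropy(noisy)
</syntaxhighlight>

The alternating series yields a value close to zero, while the pseudo-random series yields a larger one, in line with the interpretation that smaller values indicate more self-similarity.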

Examples written in other languages can be found:
- Matlab
- R
- Rust

== See also ==

- Kolmogorov complexity
- Approximate entropy
