# Isolation forest

Isolation forest is an unsupervised learning algorithm for anomaly detection that works on the principle of isolating anomalies,[1] rather than the more common approach of profiling normal points.[2]

Fig. 1 - example web traffic with potentially anomalous points.

In statistics, an anomaly (a.k.a. outlier) is an observation or event that deviates so much from other events as to arouse suspicion that it was generated by a different mechanism. For example, the graph in Fig. 1 represents ingress traffic to a web server, expressed as the number of requests in 3-hour intervals, for a period of one month. Simply looking at the picture, it is evident that some points (marked with a red circle) are unusually high, to the point of raising the suspicion that the web server might have been under attack at that time. On the other hand, the flat segment indicated by the red arrow also seems unusual and might be a sign that the server was down during that time period.

Anomalies in a large dataset may follow very complicated patterns, which in the great majority of cases are difficult to detect “by eye”. This is why the field of anomaly detection is well suited to the application of machine learning techniques.

The most common techniques employed for anomaly detection are based on the construction of a profile of what is “normal”: anomalies are reported as those instances in the dataset that do not conform to the normal profile.[2] Isolation Forest uses a different approach: instead of trying to build a model of normal instances, it explicitly isolates anomalous points in the dataset. The main advantage of this approach is that it can exploit sampling techniques to an extent that profile-based methods do not allow, resulting in a very fast algorithm with a low memory demand.[1][3][4]

## History

The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008.[1] The authors took advantage of two quantitative properties of anomalous data points in a sample:

1. Few - they are the minority consisting of fewer instances and
2. Different - they have attribute-values that are very different from those of normal instances

Since anomalies are "few and different", they are easier to “isolate” compared to normal points. Isolation Forest builds an ensemble of “Isolation Trees” (iTrees) for the data set, and anomalies are the points that have shorter average path lengths on the iTrees.

In a later paper, published in 2012,[2] the same authors described a set of experiments showing that iForest:

• has a low linear time complexity and a small memory requirement
• is able to deal with high dimensional data with irrelevant attributes
• can be trained with or without anomalies in the training set
• can provide detection results with different levels of granularity without re-training

In 2013, Zhiguo Ding and Minrui Fei proposed a framework based on iForest to address the problem of detecting anomalies in streaming data.[5] Further applications of iForest to streaming data are described in papers by Tan et al.,[4] Susto et al.[6] and Weng et al.[7]

One of the main problems in applying iForest to anomaly detection lay not in the model itself, but in the way the “anomaly score” was computed. This problem was highlighted by Sahand Hariri, Matias Carrasco Kind and Robert J. Brunner in a 2018 paper,[8] wherein they proposed an improved iForest model named Extended Isolation Forest (EIF). In the same paper the authors describe the improvements made to the original model and how they enhance the consistency and reliability of the anomaly score produced for a given data point.

## Algorithm

Fig. 2 - an example of isolating a non-anomalous point in a 2D Gaussian distribution.

The Isolation Forest algorithm rests on the tendency of anomalous instances in a dataset to be easier to separate from the rest of the sample (isolate) than normal points. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value for that attribute, between the minimum and maximum values allowed for it.

Fig. 3 - an example of isolating an anomalous point in a 2D Gaussian distribution.

An example of random partitioning in a 2D dataset of normally distributed points is given in Fig. 2 for a non-anomalous point and Fig. 3 for a point that's more likely to be an anomaly. The pictures show that anomalies require fewer random partitions to be isolated than normal points.
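The partitioning procedure can be tried out with a minimal Python sketch. It is an illustration under stated assumptions rather than the authors' implementation: the function name `partitions_to_isolate`, the synthetic data, the outlier at (6, 6) and the seed are all hypothetical, and the split attribute is drawn only from attributes on which the remaining points still differ.

```python
import random

def partitions_to_isolate(points, target, rng):
    """Count the random axis-parallel splits made until `target` is alone."""
    current = list(points)
    splits = 0
    while len(current) > 1:
        # Attributes on which the remaining points still take different values.
        dims = [d for d in range(len(target))
                if min(p[d] for p in current) < max(p[d] for p in current)]
        if not dims:
            break  # only duplicates of the target remain; no split can separate them
        q = rng.choice(dims)                      # random attribute
        lo = min(p[q] for p in current)
        hi = max(p[q] for p in current)
        split = rng.uniform(lo, hi)               # random split value between min and max
        # Keep only the side of the split that contains the target.
        if target[q] < split:
            current = [p for p in current if p[q] < split]
        else:
            current = [p for p in current if p[q] >= split]
        splits += 1
    return splits

rng = random.Random(0)                            # fixed seed, chosen arbitrarily
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
outlier = (6.0, 6.0)                              # hypothetical anomalous point
inlier = min(cloud, key=lambda p: p[0] ** 2 + p[1] ** 2)  # point closest to the centre
data = cloud + [outlier]

def mean_splits(target, trials=200):
    """Average number of splits over repeated random partitionings."""
    return sum(partitions_to_isolate(data, target, rng) for _ in range(trials)) / trials

avg_inlier = mean_splits(inlier)
avg_outlier = mean_splits(outlier)
print("outlier:", avg_outlier, "inlier:", avg_inlier)
```

Averaged over many random partitionings, the point far from the cloud should need noticeably fewer splits than the central one, which is the behaviour illustrated by Fig. 2 and Fig. 3.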

From a mathematical point of view, recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point ${\displaystyle x_{i}}$ in Fig. 2 is greater than the path length of ${\displaystyle x_{j}}$ in Fig. 3.

More formally, let ${\displaystyle X=\{x_{1},\dots ,x_{n}\}}$ be a set of d-dimensional points and ${\displaystyle X'\subset X}$. An Isolation Tree (iTree) is defined as a data structure with the following properties:

1. for each node ${\displaystyle T}$ in the Tree, ${\displaystyle T}$ is either an external-node with no child, or an internal-node with one “test” and exactly two daughter nodes (${\displaystyle T_{l}}$ and ${\displaystyle T_{r}}$)
2. a test at node ${\displaystyle T}$ consists of an attribute ${\displaystyle q}$ and a split value ${\displaystyle p}$ such that the test ${\displaystyle q<p}$ determines the traversal of a data point to either ${\displaystyle T_{l}}$ or ${\displaystyle T_{r}}$.

In order to build an iTree, the algorithm recursively divides ${\displaystyle X'}$ by randomly selecting an attribute ${\displaystyle q}$ and a split value ${\displaystyle p}$, until either

1. the node has only one instance, or
2. all data at the node have the same values.

When the iTree is fully grown, each point in ${\displaystyle X}$ is isolated at one of the external nodes. Intuitively, the anomalous points are those that are easier to isolate and hence have the smaller path lengths in the tree, where the path length ${\displaystyle h(x_{i})}$ of point ${\displaystyle x_{i}\in X}$ is defined as the number of edges ${\displaystyle x_{i}}$ traverses from the root node to get to an external node.
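The definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the dictionary-based node layout and the names `build_itree` and `path_length` are assumptions made here, the tree is grown fully (no height limit), and the split attribute is chosen only among attributes that still vary at the node.

```python
import random

def build_itree(points, rng):
    """Grow an isolation tree over a list of equal-length numeric tuples."""
    # Stop when the node has only one instance or all data have the same values.
    if len(points) <= 1 or all(pt == points[0] for pt in points):
        return {"size": len(points)}                      # external node
    # Attributes on which the points at this node still differ.
    candidates = [d for d in range(len(points[0]))
                  if min(pt[d] for pt in points) < max(pt[d] for pt in points)]
    q = rng.choice(candidates)                            # random attribute q
    lo = min(pt[q] for pt in points)
    hi = max(pt[q] for pt in points)
    p = rng.uniform(lo, hi)                               # random split value p
    while p == lo:                                        # would leave the left child empty
        p = rng.uniform(lo, hi)
    return {
        "attribute": q,
        "split": p,
        "left": build_itree([pt for pt in points if pt[q] < p], rng),    # test q < p holds
        "right": build_itree([pt for pt in points if pt[q] >= p], rng),  # test q < p fails
    }

def path_length(tree, x):
    """Number of edges x traverses from the root to an external node."""
    edges = 0
    while "attribute" in tree:                            # internal node
        tree = tree["left"] if x[tree["attribute"]] < tree["split"] else tree["right"]
        edges += 1
    return edges

rng = random.Random(1)                                    # fixed seed, chosen arbitrarily
outlier = (8.0, 8.0)                                      # hypothetical anomalous point
sample = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(64)] + [outlier]
inlier = min(sample, key=lambda pt: pt[0] ** 2 + pt[1] ** 2)
trees = [build_itree(sample, rng) for _ in range(100)]

def mean_depth(x):
    """Average path length of x over the collection of trees."""
    return sum(path_length(t, x) for t in trees) / len(trees)

depth_outlier = mean_depth(outlier)
depth_inlier = mean_depth(inlier)
print("outlier:", depth_outlier, "inlier:", depth_inlier)
```

Averaging the path length over many trees, as done in `mean_depth`, is what turns the single-tree intuition into a usable signal: any one random tree can isolate a normal point early by chance.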

A probabilistic explanation of iTree is provided in the iForest original paper.[1]

## Properties of isolation forest

• Sub-sampling: since iForest does not need to isolate all of the normal instances, it can frequently ignore the large majority of the training sample. As a consequence, iForest works very well when the sampling size is kept small, in contrast with the great majority of existing methods, where a large sampling size is usually desirable.[1][2]
• Swamping: when normal instances are too close to anomalies, the number of partitions required to separate anomalies increases, a phenomenon known as swamping, which makes it more difficult for iForest to discriminate between anomalies and normal points. One of the main reasons for swamping is the presence of too much data for the purpose of anomaly detection, which implies that one possible solution to the problem is sub-sampling. Since iForest responds very well to sub-sampling in terms of performance, reducing the number of points in the sample is also a good way to reduce the effect of swamping.[1]
• Masking: when the number of anomalies is high, some of them may aggregate in a dense and large cluster, making it more difficult to separate the single anomalies and, in turn, to detect such points as anomalous. Like swamping, this phenomenon (known as “masking”) is more likely when the number of points in the sample is large, and can be alleviated through sub-sampling.[1]
• High Dimensional Data: one of the main limitations of standard, distance-based methods is their inefficiency in dealing with high dimensional datasets.[9] The main reason is that in a high dimensional space every point is equally sparse, so a distance-based measure of separation is largely ineffective. Unfortunately, high-dimensional data also affects the detection performance of iForest, but the performance can be vastly improved by adding a feature selection test, such as Kurtosis, to reduce the dimensionality of the sample space.[1][3]
• Normal Instances Only: iForest performs well even if the training set does not contain any anomalous point,[3] the reason being that iForest describes data distributions in such a way that high values of the path length ${\displaystyle h(x_{i})}$ correspond to the presence of data points. As a consequence, the presence of anomalies is largely irrelevant to iForest's detection performance.

## Anomaly detection with isolation forest

Anomaly detection with Isolation Forest is a process composed of two main stages:[3]

1. in the first stage, a training dataset is used to build iTrees as described in previous sections.
2. in the second stage, each instance in the test set is passed through the iTrees built in the previous stage, and a proper “anomaly score” is assigned to the instance using the algorithm described below.

Once all the instances in the test set have been assigned an anomaly score, it is possible to mark as “anomaly” any point whose score is greater than a predefined threshold, which depends on the domain the analysis is being applied to.

### Anomaly score

The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST.[3] As a consequence, the estimation of average ${\displaystyle h(x)}$ for external node terminations is the same as that of the unsuccessful searches in BST, that is[10]

${\displaystyle c(m)={\begin{cases}2H(m-1)-{\frac {2(m-1)}{n}}&{\text{for }}m>2\\1&{\text{for }}m=2\\0&{\text{otherwise}}\end{cases}}}$

where ${\displaystyle n}$ is the testing data size, ${\displaystyle m}$ is the size of the sample set and ${\displaystyle H}$ is the harmonic number, which can be estimated by ${\displaystyle H(i)=\ln(i)+\gamma }$, where ${\displaystyle \gamma =0.5772156649}$ is the Euler-Mascheroni constant.
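As a rough illustration, the piecewise definition of ${\displaystyle c(m)}$ translates directly into Python. The function names are hypothetical, and the usage line assumes ${\displaystyle n=m=256}$ purely for the sake of the example; the text does not prescribe those values.

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant, as quoted in the text

def harmonic_estimate(i):
    """Estimate of the harmonic number H(i) as ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA

def c(m, n):
    """Average path length of unsuccessful BST searches, following the formula above."""
    if m > 2:
        return 2.0 * harmonic_estimate(m - 1) - 2.0 * (m - 1) / n
    if m == 2:
        return 1.0
    return 0.0

c_256 = c(256, 256)  # assumption for this example: n = m = 256
print(round(c_256, 4))
```

Because the harmonic number grows like a logarithm, ${\displaystyle c(m)}$ grows slowly with the sample size, so the normalising constant stays small even for large samples.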

The value of ${\displaystyle c(m)}$ above represents the average of ${\displaystyle h(x)}$ given ${\displaystyle m}$, so we can use it to normalise ${\displaystyle h(x)}$ and get an estimation of the anomaly score for a given instance ${\displaystyle x}$:

${\displaystyle s(x,m)=2^{\frac {-E(h(x))}{c(m)}}}$

where ${\displaystyle E(h(x))}$ is the average value of ${\displaystyle h(x)}$ from a collection of iTrees. It is interesting to note that for any given instance ${\displaystyle x}$:

• if ${\displaystyle s}$ is close to ${\displaystyle 1}$ then ${\displaystyle x}$ is very likely to be an anomaly
• if ${\displaystyle s}$ is smaller than ${\displaystyle 0.5}$ then ${\displaystyle x}$ is likely to be a normal value
• if for a given sample all instances are assigned an anomaly score of around ${\displaystyle 0.5}$, then it is safe to assume that the sample doesn't have any anomaly
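The score and its interpretation can be illustrated with a short Python sketch. Everything numeric here is an assumption made for the example: the normaliser `C_M` corresponds to ${\displaystyle c(m)}$ for ${\displaystyle m=256}$ with ${\displaystyle n=m}$, the average path lengths are invented, and the threshold of 0.6 is hypothetical, since the appropriate value depends on the domain.

```python
def anomaly_score(mean_path_length, c_m):
    """s(x, m) = 2 ** (-E(h(x)) / c(m))."""
    return 2.0 ** (-mean_path_length / c_m)

C_M = 10.2448      # assumed normaliser: c(m) for m = 256, taking n = m
THRESHOLD = 0.6    # hypothetical cut-off; the right value is domain-dependent

# Hypothetical average path lengths E(h(x)) for three instances.
mean_paths = {"a": 2.0, "b": C_M, "c": 16.0}

scores = {name: anomaly_score(h, C_M) for name, h in mean_paths.items()}
flagged = [name for name, s in scores.items() if s > THRESHOLD]
print(scores)
print(flagged)
```

An instance whose average path length equals ${\displaystyle c(m)}$ scores exactly 0.5; shorter paths push the score towards 1 and longer paths push it towards 0, which matches the three cases listed above.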

## Extended isolation forest

As described in the previous sections, the Isolation Forest algorithm performs very well from both the computational and the memory consumption points of view. The main problem with the original algorithm is that the way the branching of trees takes place introduces a bias, which is likely to reduce the reliability of the anomaly scores for ranking the data. This is the main motivation behind the introduction of the Extended Isolation Forest (EIF) algorithm by Hariri et al.[8]

Fig. 4 - two dimensional normally distributed points with zero mean and unity covariance matrix

In order to understand why the original Isolation Forest suffers from that bias, the authors provide a practical example based on a random dataset taken from a 2-D normal distribution with zero mean and covariance given by the identity matrix. An example of such a dataset is shown in Fig. 4.

It is easy to understand by looking at the picture that points falling close to (0, 0) are likely to be normal points, while a point that lies far away from (0, 0) is likely to be anomalous. As a consequence, the anomaly score of a point should increase with an almost circular and symmetric pattern as the point moves radially outward from the “centre” of the distribution. This is not the case in practice, as the authors demonstrate by generating the anomaly score map produced for the distribution by the Isolation Forest algorithm. Although the anomaly scores correctly increase as the points move radially outward, they also generate rectangular regions of lower anomaly score in the x and y directions, compared to other points that fall roughly at the same radial distance from the centre.

Fig. 5 - random partitioning with EIF

It is possible to demonstrate that these unexpected rectangular regions in the anomaly score map are indeed an artifact introduced by the algorithm and are mainly due to the fact that the decision boundaries of Isolation Forest are limited to be either vertical or horizontal (see Fig. 2 and Fig. 3).[8]

This is the reason why in their paper, Hariri et al. propose to improve the original Isolation Forest in the following way: rather than selecting a random feature and value within the range of data, they select a branch cut that has a random “slope”. An example of random partitioning with EIF is shown in Fig. 5.

The authors show how the new approach is able to overcome the limits of the original Isolation Forest, eventually leading to an improved anomaly score map.
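One way to picture a branch cut with a random slope is as a hyperplane defined by a random normal vector and a random intercept point, with a data point sent to one side or the other according to the sign of a dot product. The Python sketch below follows that reading. It is an illustrative assumption rather than the EIF authors' code: the names `random_slope_cut` and `goes_left`, the Gaussian draw for the normal vector and the uniform draw for the intercept are choices made here.

```python
import random

def random_slope_cut(points, rng):
    """Draw a hyperplane with a random orientation through the data's bounding box."""
    dim = len(points[0])
    # Random direction: independent standard normal components (assumed here).
    normal = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    # Random intercept: uniform within the range of the data on each attribute.
    intercept = [rng.uniform(min(p[d] for p in points), max(p[d] for p in points))
                 for d in range(dim)]
    return normal, intercept

def goes_left(x, normal, intercept):
    """Send x to the left branch when (x - intercept) . normal <= 0."""
    return sum((xi - pi) * ni for xi, pi, ni in zip(x, intercept, normal)) <= 0.0

rng = random.Random(2)                     # fixed seed, chosen arbitrarily
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
normal, intercept = random_slope_cut(points, rng)
left = [p for p in points if goes_left(p, normal, intercept)]
right = [p for p in points if not goes_left(p, normal, intercept)]
print(len(left), len(right))

# An axis-parallel cut is the special case where the normal has one non-zero component:
# with normal (1, 0) and intercept (1, 0) the test reduces to x[0] <= 1.
axis_left = goes_left((0.5, 9.0), [1.0, 0.0], [1.0, 0.0])
axis_right = goes_left((1.5, -9.0), [1.0, 0.0], [1.0, 0.0])
```

Seen this way, the original Isolation Forest is the restricted case in which every cut is perpendicular to one axis, which is where the rectangular artifacts in the score map come from.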