Local outlier factor

In anomaly detection, the local outlier factor (LOF) is an algorithm proposed by Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng and Jörg Sander in 2000 for finding anomalous data points by measuring the local deviation of a given data point with respect to its neighbors.[1]

LOF shares some concepts with DBSCAN and OPTICS, such as "core distance" and "reachability distance", which are used for local density estimation.[2]

Basic idea

Basic idea of LOF: comparing the local density of a point with the densities of its neighbors. A has a much lower density than its neighbors.

As indicated by the name, the local outlier factor is based on the concept of local density, where locality is given by the k nearest neighbors, whose distance is used to estimate the density. By comparing the local density of an object to the local densities of its neighbors, one can identify regions of similar density, and points that have a substantially lower density than their neighbors. These are considered to be outliers.

The local density is estimated by the typical distance at which a point can be "reached" from its neighbors. The definition of "reachability distance" used in LOF is an additional measure to produce more stable results within clusters.

Formal

Let <math>k\text{-distance}(A)</math> be the distance of the object <math>A</math> to its <math>k</math>-th nearest neighbor. Note that the set of the k nearest neighbors includes all objects at this distance, which can in the case of a "tie" be more than k objects. We denote the set of k nearest neighbors of <math>A</math> as <math>N_k(A)</math>.
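The following is a minimal sketch (not from the original paper) of how the k-distance and the tie-inclusive neighbor set could be computed for a small data set; the helper name k_distance_and_neighbors and the use of Euclidean distance are illustrative assumptions:

import numpy as np

def k_distance_and_neighbors(X, i, k):
    """Return the k-distance of point i and the indices of its k-nearest-neighbor
    set, including all points tied at that distance (so possibly more than k).
    X is an (n, d) array of points; Euclidean distance is assumed."""
    d = np.linalg.norm(X - X[i], axis=1)     # distances from point i to all points
    d[i] = np.inf                            # exclude the point itself
    k_dist = np.sort(d)[k - 1]               # distance to the k-th nearest neighbor
    neighbors = np.flatnonzero(d <= k_dist)  # ties can yield more than k neighbors
    return k_dist, neighbors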

Illustration of the reachability distance. Objects B and C have the same reachability distance (k = 3), while D is not among the k nearest neighbors.

This distance is used to define what is called the reachability distance:

<math>\text{reachability-distance}_k(A, B) = \max\{k\text{-distance}(B),\, d(A, B)\}</math>

In words, the reachability distance of an object <math>A</math> from <math>B</math> is the true distance between the two objects, but at least the <math>k\text{-distance}</math> of <math>B</math>. Objects that belong to the k nearest neighbors of <math>B</math> (the "core" of <math>B</math>, see DBSCAN cluster analysis) are considered to be equally distant. The reason for this distance is to obtain more stable results. Note that this is not a distance in the mathematical sense, since it is not symmetric.
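As a hedged illustration of this definition, building on the k_distance_and_neighbors sketch above (again an illustrative assumption, not part of the original formulation):

def reachability_distance(X, a, b, k):
    """reachability-distance_k(a, b) = max(k-distance(b), d(a, b)).
    Note the asymmetry: the k-distance of b, not of a, is used."""
    k_dist_b, _ = k_distance_and_neighbors(X, b, k)
    d_ab = float(np.linalg.norm(X[a] - X[b]))
    return max(k_dist_b, d_ab)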

The local reachability density of an object <math>A</math> is defined by

<math>\text{lrd}(A) := 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \text{reachability-distance}_k(A, B)}{|N_k(A)|} \right)</math>

which is the inverse of the average reachability distance of the object <math>A</math> from its neighbors. Note that it is not the average reachability of the neighbors from <math>A</math> (which by definition would be the <math>k\text{-distance}(A)</math>), but the distance at which <math>A</math> can be "reached" from its neighbors. With duplicate points, this value can become infinite.
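Continuing the sketch above, the local reachability density could be computed as follows (duplicate points can make the value infinite, as noted):

def local_reachability_density(X, a, k):
    """lrd(a): inverse of the average reachability distance of a from its neighbors."""
    _, neighbors = k_distance_and_neighbors(X, a, k)
    reach = [reachability_distance(X, a, b, k) for b in neighbors]
    mean_reach = sum(reach) / len(reach)
    return np.inf if mean_reach == 0.0 else 1.0 / mean_reach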

The local reachability densities are then compared with those of the neighbors using

<math>\text{LOF}_k(A) := \frac{\sum_{B \in N_k(A)} \frac{\text{lrd}(B)}{\text{lrd}(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} \text{lrd}(B)}{|N_k(A)| \cdot \text{lrd}(A)}</math>

which is the average local reachability density of the neighbors divided by the object's own local reachability density. A value of approximately 1 indicates that the object is comparable to its neighbors (and thus not an outlier). A value below 1 indicates a denser region (which would be an inlier), while values significantly larger than 1 indicate outliers.
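Tying the previous sketches together, a naive (quadratic-time, illustration-only) LOF computation and a toy usage example might look as follows; the behaviour matches the interpretation above, with the isolated point receiving a score much larger than 1:

def local_outlier_factor(X, a, k):
    """LOF_k(a): average lrd of the k nearest neighbors of a, divided by lrd(a)."""
    _, neighbors = k_distance_and_neighbors(X, a, k)
    lrd_a = local_reachability_density(X, a, k)
    lrd_neighbors = [local_reachability_density(X, b, k) for b in neighbors]
    return (sum(lrd_neighbors) / len(lrd_neighbors)) / lrd_a

# Toy example: four points forming a dense cluster and one isolated point.
X = np.array([[0.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.1, 0.1], [3.0, 3.0]])
scores = [local_outlier_factor(X, i, k=3) for i in range(len(X))]
# scores[0..3] are close to 1 (inliers); scores[4] is much larger than 1 (outlier).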

Advantages

LOF scores as visualized by ELKI. Although the upper right cluster has a density comparable to that of the outliers close to the bottom left cluster, those outliers are detected correctly.

Due to the local approach, LOF is able to identify outliers in a data set that would not be outliers in another area of the data set. For example, a point at a "small" distance to a very dense cluster is an outlier, while a point within a sparse cluster might exhibit similar distances to its neighbors.

While the geometric intuition of LOF is only applicable to low-dimensional vector spaces, the algorithm can be applied in any context in which a dissimilarity function can be defined. It has experimentally been shown to work very well in numerous setups, often outperforming competing methods, for example in network intrusion detection.[3]
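As a usage illustration under the assumption that a library implementation is available (here scikit-learn's sklearn.neighbors.LocalOutlierFactor; the API shown is that library's, not part of the original algorithm description):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.3, size=(100, 2)),   # dense cluster
    rng.normal(5.0, 2.0, size=(100, 2)),   # sparse cluster
    [[2.5, 2.5]],                          # isolated point between the clusters
])

lof = LocalOutlierFactor(n_neighbors=20)   # k = 20 neighbors
labels = lof.fit_predict(X)                # -1 for detected outliers, 1 for inliers
scores = -lof.negative_outlier_factor_     # LOF scores; larger values = more outlying
# A precomputed dissimilarity matrix can be used instead via metric="precomputed".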

Disadvantages and Extensions

The resulting values are quotient values and hard to interpret. A value of 1 or even less indicates a clear inlier, but there is no clear rule for when a point is an outlier. In one data set, a value of 1.1 may already be an outlier; in another data set and parameterization (with strong local fluctuations), a value of 2 could still be an inlier. These differences can also occur within a data set due to the locality of the method. There exist extensions of LOF that try to improve over LOF in these aspects:

  • Feature Bagging for Outlier Detection[4] runs LOF on multiple projections and combines the results for improved detection quality in high dimensions.
  • Local Outlier Probability (LoOP)[5] is a method derived from LOF that uses inexpensive local statistics to become less sensitive to the choice of the parameter k. In addition, the resulting values are scaled to the value range [0:1].
  • Interpreting and Unifying Outlier Scores[6] proposes a normalization of the LOF outlier scores to the interval [0:1] using statistical scaling to increase usability, and can be seen as an improved version of the LoOP ideas (a sketch of such a scaling follows this list).
  • On Evaluation of Outlier Rankings and Outlier Scores[7] proposes methods for measuring the similarity and diversity of outlier detection methods in order to build advanced outlier detection ensembles using LOF variants and other algorithms, improving on the Feature Bagging approach discussed above.
  • Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection[8] discusses the general pattern in various local outlier detection methods (including, e.g., LOF, a simplified version of LOF, and LoOP) and abstracts from this into a general framework. This framework is then applied, e.g., to detecting outliers in geographic data, video streams, and authorship networks.
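As referenced in the list above, the following is a minimal sketch of the general idea of statistically scaling raw LOF scores into the interval [0:1] (here a simple Gaussian scaling via the error function; this is an illustration of the idea only, not the exact procedure of the cited papers):

import numpy as np
from math import erf, sqrt

def gaussian_scaled_scores(lof_scores):
    """Map raw LOF scores to [0, 1]: scores at or below the sample mean map to 0,
    scores far above the mean approach 1. Illustrative statistical scaling only."""
    s = np.asarray(lof_scores, dtype=float)
    mu, sigma = s.mean(), s.std()
    if sigma == 0.0:                       # degenerate case: all scores identical
        return np.zeros_like(s)
    return np.array([max(0.0, erf((x - mu) / (sigma * sqrt(2.0)))) for x in s])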

References

  1. ^ Breunig, Markus M.; Kriegel, Hans-Peter; Ng, Raymond T.; Sander, Jörg (2000). "LOF: Identifying Density-Based Local Outliers". Proc. ACM SIGMOD International Conference on Management of Data. doi:10.1145/335191.335388.
  2. ^ Breunig, Markus M.; Kriegel, Hans-Peter; Ng, Raymond T.; Sander, Jörg (1999). "OPTICS-OF: Identifying Local Outliers". Principles of Data Mining and Knowledge Discovery (PKDD 1999). doi:10.1007/978-3-540-48247-5_28.
  3. ^ "A comparative study of anomaly detection schemes in network intrusion detection" (PDF). Proc. 3rd SIAM International Conference on Data Mining: 25–36. 2003.
  4. ^ Lazarevic, Aleksandar; Kumar, Vipin (2005). "Feature Bagging for Outlier Detection". Proc. 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. doi:10.1145/1081870.1081891.
  5. ^ Kriegel, Hans-Peter; Kröger, Peer; Schubert, Erich; Zimek, Arthur (2009). "LoOP: Local Outlier Probabilities". Proc. 18th ACM Conference on Information and Knowledge Management (CIKM). doi:10.1145/1645953.1646195.
  6. ^ Kriegel, Hans-Peter; Kröger, Peer; Schubert, Erich; Zimek, Arthur (2011). "Interpreting and Unifying Outlier Scores" (PDF). Proc. 11th SIAM International Conference on Data Mining.
  7. ^ Schubert, Erich; Wojdanowski, Remigius; Kriegel, Hans-Peter; Zimek, Arthur (2012). "On Evaluation of Outlier Rankings and Outlier Scores" (PDF). Proc. 12th SIAM International Conference on Data Mining.
  8. ^ Schubert, Erich; Zimek, Arthur; Kriegel, Hans-Peter (2012). "Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection". Data Mining and Knowledge Discovery. doi:10.1007/s10618-012-0300-z.