Detecting outliers, which are grossly different from or inconsistent with the remaining dataset, is a major challenge in real-world KDD applications. Existing outlier detection methods are ineffective on scattered real-world datasets due to implicit data patterns and parameter setting issues. We define a novel "Local Distance-based Outlier Factor" (LDOF) to measure the outlier-ness of objects in scattered datasets which addresses these issues. LDOF uses the relative location of an object to its neighbours to determine the degree to which the object deviates from its neighbourhood. Properties of LDOF are theoretically analysed, including LDOF's lower bound and its false-detection probability, as well as parameter settings. In order to facilitate parameter settings in real-world applications, we employ a top-n technique in our outlier detection approach, where only the objects with the highest LDOF values are regarded as outliers. Compared to conventional approaches (such as top-n KNN and top-n LOF), our method, top-n LDOF, is more effective at detecting outliers in scattered data. It is also easier to set parameters, since its performance is relatively stable over a large range of parameter values, as illustrated by experimental results on both real-world and synthetic datasets.
Of all the data mining techniques that are in vogue, outlier detection comes closest to the metaphor of mining for nuggets of information in real-world data. It is concerned with discovering the exceptional behaviour of certain objects [TCFC02]. Outlier detection techniques have been widely applied in medicine (e.g. adverse reaction analysis), finance (e.g. financial fraud detection), security (e.g. counter-terrorism), information security (e.g. intrusion detection) and so on. In recent decades, many outlier detection approaches have been proposed, which can be broadly classified into several categories: distribution-based [Bar94], depth-based [Tuk77], distance-based (e.g. KNN) [KN98], cluster-based (e.g. DBSCAN) [EKSX96] and density-based (e.g. LOF) [BKNS00] methods.
However, these methods are often unsuitable in real-world applications for a number of reasons. Firstly, real-world data usually have a scattered distribution, where objects are loosely distributed in the domain feature space. That is, from a 'local' point of view, these objects cannot represent explicit patterns (e.g. clusters) to indicate normal data 'behaviour'. However, from a 'global' point of view, scattered objects constitute several mini-clusters, each representing the pattern of a subset of objects. Only the objects which do not belong to any other object groups are genuine outliers. Unfortunately, existing outlier definitions depend on the assumption that most objects are crowded into a few main clusters. They are incapable of dealing with scattered datasets, because the mini-clusters in such data evoke a high false-detection rate (i.e. low precision).
Secondly, it is difficult in current outlier detection approaches to set accurate parameters for real-world datasets. Most outlier algorithms must be tuned through trial and error [FZFW06]. This is impractical, because real-world data usually do not contain labels for anomalous objects. In addition, it is hard to evaluate detection performance without the confirmation of domain experts. Therefore, the detection result will be uncontrollable if parameters are not properly chosen.
To alleviate the parameter setting problem, researchers have proposed top-n style outlier detection methods. Instead of a binary outlier indicator, top-n outlier methods provide a ranked list of objects representing the degree of 'outlier-ness' of each object. The users (domain experts) can re-examine the selected top-n (where n is typically far smaller than the cardinality of the dataset) anomalous objects to locate real outliers. Since this detection procedure provides good interaction between data mining experts and users, top-n outlier detection methods have become popular in real-world applications.
The distance-based, top-n k-th nearest neighbour distance method [RRS00] is a typical top-n style outlier detection approach. In order to distinguish it from the original distance-based outlier detection method in [KN98], we denote the k-th nearest neighbour distance outlier as top-n KNN in this paper. In top-n KNN outlier, the distance from an object to its k-th nearest neighbour (denoted as k-distance for short) indicates the outlier-ness of the object. Intuitively, the larger the k-distance is, the higher the outlier-ness of the object. Top-n KNN outlier regards the n objects with the highest values of k-distance as outliers [RRS00].
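The ranking just described can be sketched in a few lines. Below is a minimal NumPy illustration (the function name and the toy data are ours, not from [RRS00]): compute each object's k-distance and return the n objects with the largest values.

```python
import numpy as np

def top_n_knn_outliers(X, k, n):
    """Rank objects by k-distance (distance to the k-th nearest
    neighbour) and return the indices of the top-n candidates."""
    # Pairwise Euclidean distance matrix.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Sort each row; column 0 is the distance of a point to itself (0),
    # so the k-th nearest neighbour sits at column k.
    k_dist = np.sort(dist, axis=1)[:, k]
    # Larger k-distance means higher outlier-ness.
    return np.argsort(k_dist)[::-1][:n]

# Tiny example: one point far away from a small cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
print(top_n_knn_outliers(X, k=2, n=1))  # -> [3]
```

The isolated point at (5, 5) has by far the largest k-distance, so it tops the ranking.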
A density-based outlier method, the Local Outlier Factor (LOF) [BKNS00], was proposed in the same year as top-n KNN. In LOF, an outlier factor is assigned to each object w.r.t. its surrounding neighbourhood. The outlier factor depends on how closely the data object is packed in its locally reachable neighbourhood [FZFW06]. Since LOF uses a threshold to differentiate outliers from normal objects [BKNS00], the same problem of parameter setting arises. A low outlier-ness threshold will produce a high false-detection rate, while a high threshold will result in missing genuine outliers. In recent real-world applications, researchers have found it more reliable to use LOF in a top-n manner [TCFC02], i.e. only the objects with the highest LOF values are considered outliers. Hereafter, we call this top-n LOF.
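As a hedged illustration of this top-n usage (scikit-learn is our choice of implementation here, not something the cited works prescribe), one can rank all objects by their LOF score and inspect only the top-n, instead of picking a threshold:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A dense cluster of 50 points plus one isolated point (index 50).
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=10)
lof.fit(X)
# negative_outlier_factor_ holds -LOF, so the smallest (most negative)
# values correspond to the highest LOF, i.e. the strongest candidates.
n = 3
top_n = np.argsort(lof.negative_outlier_factor_)[:n]
print(top_n)  # the isolated point (index 50) should rank first
```

The point is that n (how many candidates a domain expert will re-examine) is far easier to choose than a raw LOF threshold.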
Besides top-n KNN and top-n LOF, researchers have proposed other methods to deal with real-world data, such as the connectivity-based outlier factor (COF) [TCFC02] and the resolution cluster-based outlier (RB-outlier) [FZFW06]. Although existing top-n style outlier detection techniques alleviate the difficulty of parameter setting, the detection precision of these methods (in this paper, we take top-n KNN and top-n LOF as typical examples) is low on scattered data. In Section 2, we discuss further problems of top-n KNN and top-n LOF.
In this paper we propose a new outlier detection definition, named the Local Distance-based Outlier Factor (LDOF), which is sensitive to outliers in scattered datasets. LDOF uses the relative distance from an object to its neighbours to measure how much the object deviates from its scattered neighbourhood.
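The formal definition of LDOF appears later in the paper; purely as a hypothetical sketch of what a "relative distance" factor can look like (the function name and the ratio used here are our illustration, not the paper's definition), consider comparing the average distance from an object to its k nearest neighbours against the average pairwise distance among those neighbours:

```python
import numpy as np

def relative_distance_factor(X, i, k):
    """Hypothetical sketch: ratio of the mean distance from object i to
    its k nearest neighbours over the mean pairwise distance among those
    neighbours. Values well above 1 suggest object i lies outside the
    region its own neighbourhood occupies."""
    dist = np.linalg.norm(X - X[i], axis=1)
    neighbours = np.argsort(dist)[1:k + 1]      # skip the point itself
    knn_dist = dist[neighbours].mean()          # object-to-neighbour distance
    pair = X[neighbours]
    inner = np.linalg.norm(pair[:, None] - pair[None, :], axis=-1)
    inner_dist = inner.sum() / (k * (k - 1))    # mean over ordered pairs, excl. diagonal
    return knn_dist / inner_dist

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0]])
print(relative_distance_factor(X, 4, k=3))  # well above 1: point 4 deviates
```

Because the factor is relative to the neighbourhood's own spread rather than to an absolute density, it remains meaningful even when the neighbourhood is only a loose mini-cluster.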