Title: An ensemble approach for feature selection of Cyber Attack Dataset
ArXiv ID: 0912.1014
Date: 2009-12-08
Authors: Shailendra Singh, Sanjay Silakari
📝 Abstract
Feature selection is an indispensable preprocessing step when mining huge datasets, and it can significantly improve overall system performance. In this paper we therefore focus on a hybrid approach to feature selection. The method falls into two phases. The filter phase selects the features with the highest information gain and guides the initialization of the search process for the wrapper phase, whose output is the final feature subset. The final feature subsets are passed through the K-nearest neighbor classifier for classification of attacks. The effectiveness of this algorithm is demonstrated on the DARPA KDDCUP99 cyber attack dataset.
📄 Full Content
(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 6, No. 2, 2009
An ensemble approach for feature selection of Cyber
Attack Dataset
Shailendra Singh
Department of Information Technology
Rajiv Gandhi Technological University
Bhopal, India
e-mail:shailendrasingh@rgtu.net
Sanjay Silakari
Department of Computer Science & Engineering
Rajiv Gandhi Technological University
Bhopal, India
e-mail:ssilakari@rgtu.net
Abstract— Feature selection is an indispensable pre-processing
step when mining huge datasets that can significantly improve
the overall system performance. Therefore in this paper we focus
on a hybrid approach of feature selection. This method falls into
two phases. The filter phase selects the features with the highest
information gain and guides the initialization of the search process
for the wrapper phase, whose output is the final feature subset. The
final feature subsets are passed through the K-nearest neighbor
classifier for classification of attacks. The effectiveness of this
algorithm is demonstrated on DARPA KDDCUP99 cyber attack
dataset.
Keywords-Filter, Wrapper, Information gain, K-nearest neighbor, KDDCUP99
I. INTRODUCTION
Feature selection aims to choose an optimal subset of
features that are necessary and sufficient to describe the target
concept. It has proven effective, in both theory and practice, in
enhancing learning efficiency, increasing predictive
accuracy and reducing the complexity of learned results [1][2]. Optimal
feature selection requires an exponentially large search space (O(2^N)),
where N is the number of features [3]. So it may be too costly
and impractical. Many feature selection methods have been
proposed in recent years. The survey paper [4] gives the
complete scenario of different approaches used in cyber attack
detection systems. Feature selection methods fall into two approaches: filter
and wrapper [5]. The difference between the filter model and
wrapper model is whether feature selection relies on any
learning algorithm. The filter model is independent of any
learning algorithm, and its advantages lie in better generality
and low computational cost [6]. The wrapper model relies on
a learning algorithm and can be expected to achieve high classification
performance, but it is computationally expensive, especially
when dealing with large-scale data sets [7] like KDDCUP99.
This paper combines the two models to make use of their
advantages. We adopt a two-phase feature selection method.
The filter phase selects features and uses the feature estimation
as the heuristic information to guide the wrapper algorithm. We
adopt information gain [8] as the uncertainty measure to obtain the feature
estimation. The second phase is a data mining algorithm which
is used to estimate the accuracy of cyber attack detection. We
use a K-nearest neighbor based wrapper selector. The feature
estimation obtained from the first phase is used for building
the initialization of the search process. The effectiveness of
this method is demonstrated through empirical study on
KDDCUP99 datasets [9].
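The two-phase idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes discrete feature values, a Hamming distance for the K-nearest neighbor step, leave-one-out accuracy as the wrapper's evaluation, and a greedy forward search seeded by the information gain ranking. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Entropy reduction obtained by splitting on one discrete feature."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[feature], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g)
                    for g in groups.values())
    return entropy(labels) - remainder


def knn_accuracy(rows, labels, features, k=1):
    """Leave-one-out accuracy of k-NN restricted to the given features."""
    correct = 0
    for i, query in enumerate(rows):
        # Hamming distance to every other row (assumes discrete features).
        neighbours = sorted(
            (sum(query[f] != other[f] for f in features), j)
            for j, other in enumerate(rows) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        correct += votes.most_common(1)[0][0] == labels[i]
    return correct / len(rows)


def hybrid_select(rows, labels, k=1):
    """Filter phase ranks features; wrapper phase adds them greedily."""
    # Filter phase: rank features by information gain, highest first.
    ranked = sorted(range(len(rows[0])),
                    key=lambda f: information_gain(rows, labels, f),
                    reverse=True)
    # Wrapper phase: start from the top-ranked feature and keep a
    # candidate only if it improves the k-NN accuracy estimate.
    selected = [ranked[0]]
    best = knn_accuracy(rows, labels, selected, k)
    for f in ranked[1:]:
        score = knn_accuracy(rows, labels, selected + [f], k)
        if score > best:
            selected, best = selected + [f], score
    return selected, best


# Hypothetical toy records: (protocol, noisy flag, constant column).
rows = [("tcp", "a", 0), ("tcp", "b", 0), ("tcp", "a", 0),
        ("udp", "b", 0), ("udp", "a", 0), ("udp", "b", 0)]
labels = ["attack", "attack", "attack", "normal", "normal", "normal"]

selected, accuracy = hybrid_select(rows, labels)
print(selected, accuracy)
```

On this toy data the first column fully determines the label, so the filter ranks it first and the wrapper finds no further feature that improves the accuracy estimate. Seeding the wrapper with the filter ranking keeps the search far smaller than examining every possible feature subset.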
II. THE KDDCUP99 DATASET
In the 1998 DARPA cyber attack detection evaluation
program, an environment [9] [10] was set up to acquire raw
TCP/IP dump data for a network by simulating a typical U.S.
Air Force LAN. The LAN was operated like a true
environment, but was blasted with multiple attacks. For each
TCP/IP connection, 41 quantitative (continuous data
type) and qualitative (discrete data type) features were
extracted. Among the 41 features, 34 are numeric and 7
are symbolic. The data contains 24 attack types that
can be classified into four main categories:
• DOS: Denial Of Service attack.
• R2L: Remote to Local (User) attack.
• U2R: User to Root attack.
• Probing: Surveillance and other probing.
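The grouping of raw attack labels into these four categories can be sketched as a simple lookup. The label names below are an illustrative subset taken from the publicly documented KDDCUP99 label set, not a table reproduced from this paper, and the helper name `categorize` is hypothetical.

```python
# Illustrative subset of KDDCUP99 attack labels per category.
# Assumption: names follow the public dataset documentation; the list
# is deliberately incomplete and is not taken from the paper's tables.
ATTACK_CATEGORY = {
    "DOS": ["back", "neptune", "smurf", "teardrop"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "phf"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "Probing": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table so each raw label points at its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, names in ATTACK_CATEGORY.items()
    for label in names
}


def categorize(raw_label):
    """Map a raw label such as 'smurf.' to one of the four categories."""
    # Labels in the raw files carry a trailing period, so strip it.
    label = raw_label.strip().rstrip(".")
    if label == "normal":
        return "normal"
    # Labels outside the illustrative subset are reported as unknown.
    return LABEL_TO_CATEGORY.get(label, "unknown")


for raw in ["smurf.", "nmap.", "normal.", "rootkit."]:
    print(raw, "->", categorize(raw))
```

Collapsing the individual attack names into a handful of categories in this way turns the detection task into a five-class problem (four attack categories plus normal traffic).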
A. Denial of service Attack (DOS)
Denial of service (DOS) is a class of attacks where an attacker
makes a computing or memory resource too busy or too full to
handle legitimate requests, thus denying legitimate users access
to a machine.
B. Remote to Local (User) Attacks
A remote to local (R2L) attack is a class of attacks where
an attacker sends packets to a machine over a network, then
exploits the machine’s vulnerability to illegally gain local
access to that machine.
C. User to Root Attacks
User to root (U2R) attacks are a class of attacks where an
attacker starts with access to a normal user account on the
system and is able to exploit a vulnerability to gain root access to
the system.
D. Probing
Probing is a class of attacks where an attacker scans a
network to gather information or find known vulnerabilities.
An attacker with a map of the machines and services that are
available on a network can use the information to look for
exploits.
http://sites.google.com/site/ijcsis/
ISSN 1947-5500
TABLE I. CLASS LABELS THAT APPEAR IN THE 10% DATA SET