COMPARATIVE STUDY OF DATA MINING QUERY LANGUAGES
Mohamed Anis Bach Tobji
LARODEC Laboratory – Institut Supérieur de Gestion de Tunis
41 rue de la liberté Bouchoucha, 2000, Tunis - Tunisia
ABSTRACT
Since the formulation of the Inductive Database (IDB) problem, several Data Mining (DM) languages have been proposed,
confirming that the KDD process can be supported by answering inductive queries (IQ). This paper reviews the existing
DM languages. We present the important primitives of a DM language and classify the languages according to how well
they satisfy these primitives. In addition, we present each language's syntax and apply each language to a sample
database to test a set of KDD operations. This study highlights the languages' capabilities and limits, which is useful
for future work and perspectives.
KEYWORDS
Knowledge Discovery from Databases, Inductive Database, Data Mining Languages.
1. INTRODUCTION
The IDB is a new generation of database introduced in (Imielinski and Mannila, 1996) as a framework for
KDD (Fayyad et al, 1996). An inductive database contains both data and the patterns extracted from that data.
Conventional databases are generally queried through SQL, whereas IDBs are supported by a DM query
language, which allows KDD operations (mainly data selection, data pre-processing, pattern mining and
pattern post-processing).
The development of a theoretical framework is of interest and has been the subject of much research
(Boulicaut et al, 1999), (De Raedt, 2003), (Dan Lee and De Raedt, 2003), (De Raedt et al, 2004). However,
there is still no clear definition or formalization, such as an algebra, that could serve as the basis for a
standard DM query language. In fact, the KDD community would like to reproduce the success of SQL, which is
based on Codd's algebra (Codd, 1970).
In this paper we study the existing DM languages in order to identify their advantages and limits. The paper is
organized as follows. In section 2 we present the essential primitives of a DM query language. In section 3 we
compare six existing DM query languages using a taxonomy based on how well they satisfy these primitives. In
section 4 we show the languages in action, i.e., we give a small database and perform some data mining operations
with queries written in each language. Finally, in section 5 we discuss the study and give some perspectives related
to the weaknesses of the existing languages.
2. INDUCTIVE QUERY LANGUAGE PRIMITIVES
Defining the primitives of a data mining query language is a basic problem. Once the primitives are defined,
designing a good DM query language becomes easier. In this section we give the primitives as defined in
(Han and Kamber, 2000), (Botta et al, 2004), and in the papers introducing the languages: (Imielinski and Virmani,
1999), (Meo et al, 2002), (Han et al, 1996), (Morzy and Zakrzewicz, 1997), (Netz et al, 2000) and (Elfeky et al, 2000).
A data mining query language must offer:
- Data selection: the language must provide a data selection query. This is naturally satisfied if the language nests SQL.
- Pre-processing task: the language must provide pre-processing operations (sampling, discretization, data cleaning, etc.).
- Specification of the data mining task: the language must allow mining of several kinds of patterns (decision trees, sequential rules, association rules, etc.).
- Specification of background knowledge: background knowledge is information about the application field. This
primitive gives the data miner the opportunity to specify domain knowledge, which improves the quality of the mined
knowledge. The concept hierarchy is the most widely used form of background knowledge (Han and Kamber, 2000).
- Specification of mining constraints: the user specifies a set of constraints that the mined patterns must satisfy.
- Closure property: the result of a data mining query can itself be queried, as is the case for SQL.
- Post-processing task: the user should be able to query the extracted patterns, cross patterns with data, etc.
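To make three of these primitives concrete (data selection, mining constraints and the closure property), the following is a minimal Python sketch, not the syntax of any of the reviewed languages. The toy transactions, the function names `select` and `frequent_itemsets`, and the 0.5 support threshold are all assumptions introduced for illustration. The point of the sketch is that patterns are returned in the same row-like form as the data, so the selection operator can be applied to them again.

```python
from itertools import combinations

# Hypothetical toy data: each transaction is a set of items (not from the paper).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

def select(rows, predicate):
    """Data selection: keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def frequent_itemsets(rows, min_support):
    """Mining task under a constraint: itemsets with relative support >= min_support."""
    if not rows:
        return []
    items = sorted(set().union(*rows))
    patterns = []
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Relative support: fraction of rows containing every item of the candidate.
            support = sum(set(candidate) <= row for row in rows) / len(rows)
            if support >= min_support:
                patterns.append({"itemset": frozenset(candidate), "support": support})
    return patterns

# Selection, then mining with a minimum-support constraint.
selected = select(transactions, lambda t: "milk" in t or "bread" in t)
patterns = frequent_itemsets(selected, min_support=0.5)

# Closure: the mined patterns are rows too, so they can be queried with select().
pairs = select(patterns, lambda p: len(p["itemset"]) == 2)
```

The exhaustive enumeration of candidates is only suitable for toy inputs; it is used here to keep the sketch short, not as a recommended mining algorithm.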
3. THE COMPARATIVE TABLE
In this section, we study six DM query languages. We present these languages according to a set of
properties corresponding to the primitives defined in the previous section. We classify the languages in
a table whose rows correspond to properties and whose columns correspond to languages. Each cell (at the
intersection of a property Pi and a language Lj) gives the degree to which the language Lj satisfies the property Pi
(see table 1).
Table 1 contains two parts. In the first part, each language is described in general terms (language authors,
design, year, etc.). In the second part, we present the functionalities provided by each language, as explained
above.
4. DATA MINING QUERY LANGUAGES IN ACTION
In this section, we explore the capabilities of the DM query languages and present the syntax of each language.
In addition, we define an example database of supermarket sales (see table 2) and write some queries
that perform the KDD operations surrounding the DM step. We focus mainly on extracting association rules, since
mining them is supported by all the languages.
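As a reference point for the queries in this section, the sketch below shows in plain Python what an association rule query computes: rules of the form body => head together with their support and confidence, filtered by minimum thresholds. It is an illustration under stated assumptions, not the syntax of any of the six languages. The `sales` data is hypothetical and is not the content of table 2, and the names `support` and `simple_rules` as well as the thresholds are introduced here only for the example. Rules are restricted to a single item on each side to keep the sketch short.

```python
from itertools import permutations

# Hypothetical supermarket sales: transaction id -> set of purchased items.
# This is NOT the paper's table 2.
sales = {
    1: {"bread", "milk"},
    2: {"bread", "butter", "milk"},
    3: {"beer", "chips"},
    4: {"bread", "chips", "milk"},
}

def support(itemset, baskets):
    """Fraction of baskets that contain every item of the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def simple_rules(sales, min_support, min_confidence):
    """Return (body, head, support, confidence) for one-item => one-item rules."""
    baskets = list(sales.values())
    items = sorted(set().union(*baskets))
    rules = []
    for body, head in permutations(items, 2):
        rule_support = support({body, head}, baskets)
        if rule_support < min_support:
            continue
        # Confidence: support of the whole rule divided by support of the body.
        confidence = rule_support / support({body}, baskets)
        if confidence >= min_confidence:
            rules.append((body, head, rule_support, confidence))
    return rules

rules = simple_rules(sales, min_support=0.5, min_confidence=0.9)
for body, head, s, c in rules:
    print(f"{body} => {head} (support={s:.2f}, confidence={c:.2f})")
```

On this toy data only the pair bread and milk reaches the support threshold, so the two rules between these items are the only ones returned.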
4.1. MSQL
MSQL has four main queries:
- Create
…(Full text truncated)…