Supporting Knowledge Discovery in Data Stream Management Systems
p>
« back to results for ""
Below is a cache of http://www.cs.ucla.edu/~hthakkar/proposal.pdf. It's a snapshot of the page taken as our search engine crawled the Web.
The web site itself may have changed. You can check the current page or check for previous versions at the Internet Archive.
Yahoo! is not affiliated with the authors of this page or responsible for its content.
Supporting Knowledge Discovery in Data Stream Management Systems
Supporting Knowledge Discovery in Data Stream
Management Systems
Hetal Thakkar
Advisor: Carlo Zaniolo
UCLA
hthakkar@cs.ucla.edu
April, 2007
Abstract
Data mining represents an exciting and vibrant area of research. In particular,
on-line mining has gained signicant momentum in recent years. The changing
data characteristics and real-time response constraints of streaming data preclude
the use of existing mining algorithms that were designed for stored datasets. There-
fore, researchers are proposing new fast and light algorithms for on-line mining
tasks, such as classication, clustering, frequent itemsets, pattern matching, and
many others. Beyond the interesting research problems posed by the design of in-
dividual algorithms for different mining methods, there remains the issue that these
must be integrated into an Inductive Data Stream Mining System that supports (i)
libraries of interoperable mining methods, (ii) all essential functions of data stream
management systems, such as continuous query optimization, load shedding, syn-
optic constructs, and non-stop computing, and (iii) ease-of-use and extensibility.
The issue of Inductive Data Stream Mining System has received little attention in
the past, and thus offers a pristine opportunity for research contributions in my
thesis work. Furthermore, there will be an opportunity for advancing the state of
the art in mining algorithms, particularly in terms of extensibility, interoperabil-
ity, and genericity of data representation. In this prospectus, I briey discuss data
stream management systems, data mining methods and systems, and other areas
closely related to the topic of this thesis. Then, I present recently obtained prelimi-
nary results, including data representations for achieving generic implementations
for many on-line mining algorithms. Furthermore, I discuss advanced techniques
such as ensemble-based bagging and boosting, for which generic implementations
were also devised. Finally, preliminary experiments are presented to verify the ef-
ciency of the proposed approach. Future work will focus on more experiments
and ne-tuning of mining algorithms. Furthermore, support for high-level mining
languages also represents a promising topic for future research.
Contents
1
Introduction
2
2
State-of-the-art: Inductive Database Management Systems (IDBMSs)
4
3
State-of-the-art: Data Stream Management Systems (DSMSs)
7
3.1
ESL and its Expressive Power
. . . . . . . . . . . . . . . . . . .
8
4
On-line Mining in Stream Mill
12
4.1
On-line Mining Algorithms in Stream Mill . . . . . . . . . . . . .
13
4.1.1
On-line Classiers . . . . . . . . . . . . . . . . . . . . .
13
4.1.2
On-line Clustering . . . . . . . . . . . . . . . . . . . . .
16
4.1.3
Association Rules
. . . . . . . . . . . . . . . . . . . . .
17
4.1.4
Sequence Querying . . . . . . . . . . . . . . . . . . . . .
18
4.2
Generic UDAs . . . . . . . . . . . . . . . . . . . . . . . . . . . .
19
4.3
Concept-Drifts and Shifts . . . . . . . . . . . . . . . . . . . . . .
21
4.3.1
Ensemble Based Weighted Bagging . . . . . . . . . . . .
21
4.3.2
Adaptive Boosting . . . . . . . . . . . . . . . . . . . . .
23
4.4
Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
24
4.5
High-Level Mining Language (Future Work) . . . . . . . . . . . .
26
5
Conclusion and Future Work
29
1
Chapter 1
Introduction
Data mining has received much attention in database community over the last
decade [7, 14, 15, 20, 21, 45, 26]. Similarly, research on data streams has also
received considerable interest [11, 13, 35, 38, 36, 46, 44, 9]. At the conuence of
these to research areas is on-line data stream mining, which has started to become
the focus of a few research projects [14, 21, 15, 47, 19, 26]. This interest is largely
due to the growing set of streaming applications where mining plays a critical role,
these include network trafc monitoring, web click-stream analysis, highway traf-
c congestion analysis, market basket data mining, credit card fraud detection, etc.
Furthermore, applications that require on-line data mining are ever increasing due
to the vast amount of data being generated by business applications.
On-line data stream mining raises many new problems that were not encoun-
tered in static data mining. For instance, changing data characteristics represent
the rst such problem. Since the data is continuously arriving the data values
may experience a change in the distribution, for example mean and variance for
a temperature reading may change as seasons change. Furthermore, the underly-
ing concepts that generate the data may also change, for example in credit card
fraud detection, the types of frauds may evolve over time. Thus, the changing
data characteristics call for an active learning mechanism, i.e. a classier learned
over stale credit card transactions may not be accurate in predicting behavior of
current transactions. Thus, we must continuously/actively learn the data mining
model over the latest training data. Second problem is the real-time response con-
straints of streaming data that prohibit application of complex mining algorithms
requiring multiple passes of the data. In fact, these applications may sacrice the
accuracy of the mining models in order to support the real-time constraints. These
new problems present two research challenges.
New algorithms that suit the requirements of the on-line applications, are
2
needed.
Systems that efciently support such algorithms are also required.
Recent years have seen emergence of many on-line mining algorithms [14, 21,
15, 47, 19, 26], however the research in this area is not complete by any means,
i.e. algorithms for frequent itemsets mining over sliding windows are still required
(discussed more later). Where as the second challenge above, nding systems
that can efciently support a wide variety of these algorithms, has not received
much research attention. In addition to standard data stream management system
(DSMS) features, load shedding, synopses, approximate query answering, query
scheduling, such a system would have to efciently support an assortment of on-
line mining algorithms. Furthermore, many tools and advanced techniques for on-
line mining also have to be supported. This prospectus precisely addresses this sec-
ond challenge by proposing a system that effectively supports on-line data mining.
Furthermore, the proposed system is extensible and allows easy implementation of
new mining algorithms.
The rest of this prospectus is organized as follows. Chapter 2 overviews ex-
isting Inductive Database Management Systems (IDBMSs). Where as Chapter 3
discusses existing Data Stream Management Systems (DSMSs). In particular we
focus on the Stream Mill system, which provides unique opportunities for support-
ing on-line mining. The next Chapter (Chapter 4), discusses integration of many
on-line mining algorithms in Stream Mill and the many challenges that need to be
solved to achieve such an integration. This Chapter also discusses genericity of
mining algorithms and advanced techniques for on-line mining. The Chapter also
proposes support for a high level mining query language, which is an important
future work direction for my thesis. Finally, Chapter 5 concludes this thesis with
discussion of other future research directions.
Chapter 2
State-of-the-art: Inductive
Database Management Systems
(IDBMSs)
Data mining reveals hidden patterns in the underlying data, which results in accu-
rate business process modeling and decision making. For instance, in a retail store
scenario, association rule mining allows strategic placement of items. Agrawal et al
[8] present the apriori algorithm to solve this problem efciently. Since then many
improvements have been suggested to expedite this process. Similarly, many data
mining algorithms have been proposed for other static mining tasks [8, 31, 30, 20].
Introduction of these static data mining algorithms presented a research chal-
lenge similar to the one mentioned above for on-line mining algorithms, i.e. sys-
tems that can efciently support these algorithms were required. An obvious so-
lution in the case of static data mining was supporting data mining algorithms in
a DBMS. While the problem drew much interest from vendors and researchers in
mid-90s, effective solutions did not come quickly. Indeed performing data min-
ing tasks using DBMS-provided constructs and functions proved to be exceedingly
difcult [42]. Sarawagi et al [42], attempted to implement frequent itemset mining
in DB2 without any