Extending DSMS for Data Stream Mining

CS240B Notes by Carlo Zaniolo

UCLA CSD

Data Streams

Continuous, unbounded, rapid, time-varying streams of data elements

Occur in a variety of modern applications:
  Network monitoring and traffic engineering
  Sensor networks, RFID tags
  Telecom call records
  Financial applications
  Web logs and click-streams
  Manufacturing processes

DSMS = Data Stream Management System

Many Research Projects …

Amazon/Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet DBs & XML
OpenCQ (Georgia): triggers, view maintenance
Stream (Stanford): general-purpose DSMS
Tapestry (Xerox): publish/subscribe filtering
Telegraph (Berkeley): adaptive engine for sensors
Gigascope (AT&T Labs): network monitoring
Stream Mill (UCLA): power & extensibility

Technology Challenges

Data Models
  Relational streams, but XML streams are important too
  Tuple time-stamping
  Order is important
  Windows

Query Languages: extensions of SQL or XQuery
  To support continuous (i.e., persistent) queries on transient data: a reversal of roles
  Blocking operators excluded

Query Plans
  New execution models (main-memory oriented)
  Optimized scheduling for response time or memory

Quality of Service (QoS) & Approximation
  Synopses
  Sampling
  Load shedding

Commercial Developments

Several startups
  StreamBase, Coral8, Apama, and Truviso

Oracle and DBMS companies
  Publish/subscribe
  Complex Event Processing (CEP)

Limitations:
  Only simple applications, e.g., continuous queries expressed in SQL
  No support for data stream mining queries

Data Stream Mining

Many applications: click stream analysis, intrusion detection, ...

Many fast & light algorithms developed for stream mining
  Ensembles, Moment, SWIM, etc.

Analyst should be able to focus on high-level mining tasks
  Leaving QoS and lower-level issues to the system

Integration of mining methods into Data Stream Management Systems (DSMS) is required
  Many research challenges

Stream Mill Miner (SMM) is the first DSMS designed for that.

Data Stream Management Systems (DSMS)

Data stream mining applications have so far been ignored by DSMS, although:

A. DSMS technology is required for data stream mining
   QoS, query scheduling, synopses, sampling, windows, ...

B. But supporting DM applications is difficult, since current DSMS support only simple query languages based on SQL.

Conclusion: either a shotgun wedding ... or a research breakthrough is needed here!

A Difficult Problem: the Inductive DBMS Experience

Initial attempts to support mining queries in relational DBMS: unsuccessful
  OR-DBMS do not fare much better [Sarawagi’ 98].

In 1996, the ‘high-road’ approach by Imielinski & Mannila called for a quantum leap in functionality:
  High-level declarative languages for DM.
  Extensions for query processing and optimization.

The research area of Inductive DBMS was thus born
  Inspired DMQL, Mine Rule, MSQL, etc.
  These suffer from limited generality and performance issues.

DBMS Vendors

Vendors have taken a ‘low-road’ approach.
  A library of mining functions using a cache-mining approach
    IBM DB2 Intelligent Miner
    Oracle Data Miner
    MS OLE DB for DM: mining models

Closed systems, lacking in coverage and user-extensibility.
  Not as popular as dedicated, stand-alone mining systems, such as Weka

Weka

A comprehensive set of mining algorithms and tools.

Generic algorithms over arbitrary data sets.
  Independent of the number of columns in tables.

Open and extensible system based on Java.

These are the features that we want in our SMM, starting from SQL rather than Java!

Not an easy task ... why?

SMM Contributions

Build on Stream Mill DSMS and its SQL-based continuous query language and enabling technology.

Language and System Extensions: Genericity, Extensibility, and Performance

A suite of stream mining algorithms.
  Existing ones, and ones newly developed in this project, e.g., SWIM.

High-level mining model for better
  Usability
  Control of the mining process

From SQL to Online Mining in SMM: step by step

Naïve Bayesian Classifier (NBC)
  Important and frequently used.

Schema-specific NBC
  Simple to express in SQL, using count and sum aggregates.
  But a generic NBC is still preferable.

Genericity: one function, independent of the number of columns involved.
  Schema independence in SQL?

Genericity

Weka
  Arrays of type real.

SMM
  Verticalization: similar arrays, but in tables.
  Built-in table function to reduce any table to this form.

Thus, generic UDAs work with this schema.

Further improvements are also supported in SMM.
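As an illustration of verticalization, here is a minimal Python sketch (not SMM code). It assumes the vertical form is a sequence of (row id, column name, value) triples; the function name `verticalize` mirrors the SMM table function, but its signature here is hypothetical.

```python
from typing import Any, Dict, Iterator, Tuple


def verticalize(row_id: int, row: Dict[str, Any]) -> Iterator[Tuple[int, str, Any]]:
    """Turn one wide tuple into (row_id, column, value) triples.

    The triple layout is an assumption made for this sketch; the slides only
    say that any table can be reduced to a column/value form.
    """
    for column, value in row.items():
        yield (row_id, column, value)


# A wide tuple with four attributes becomes four narrow tuples, so a
# consumer never needs to know how many columns the source table has.
row = {"Outlook": "sunny", "Temp": "hot", "Humidity": "high", "Wind": "weak"}
triples = list(verticalize(1, row))
```

Because every table reduces to the same three-field shape, a single aggregate written against that shape can serve tables of any width.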

Extensibility?

Most mining tasks cannot be implemented in SQL.

Solution: define complex functions by User Defined Aggregates (UDAs)
  Complex mining tasks can be viewed as aggregates

UDAs natively defined in SQL make the language computationally complete [Wang’ 04]
  Turing-complete over static data
  Non-blocking complete over data streams

Natural extensions to support windows and delta computations for data streams [Bai’ 06]

UDAs can also be defined in a PL, for better performance

Windowed UDA Example: Continuous Sum

WINDOW AGGREGATE sum(val REAL) : REAL {
    TABLE state (tot REAL);
    INITIALIZE: {
        INSERT INTO state VALUES (val);
    }
    ITERATE: {
        UPDATE state SET tot = tot + val;
    }
    EXPIRE: {
        /* subtract the value that just left the window */
        UPDATE state SET tot = tot - oldest().val;
    }
    /* No TERMINATE state */
}

The EXPIRE state allows efficient differential computation.
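The same differential idea can be sketched in Python. This is an analogy rather than ESL semantics: the class name `WindowedSum` and the count-based window are assumptions, and the comments map each step to the UDA states on the slide.

```python
from collections import deque


class WindowedSum:
    """Running sum over the last `size` values, updated differentially."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.window = deque()  # values currently inside the window
        self.tot = 0.0         # plays the role of the `state` table

    def add(self, val: float) -> float:
        # EXPIRE: subtract the oldest value instead of re-summing the window.
        if len(self.window) == self.size:
            self.tot -= self.window.popleft()
        # INITIALIZE / ITERATE: fold the new value into the running total.
        self.window.append(val)
        self.tot += val
        return self.tot


agg = WindowedSum(size=3)
results = [agg.add(v) for v in [1.0, 2.0, 3.0, 4.0, 5.0]]
```

Each arrival costs one addition and at most one subtraction, regardless of the window size.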

Online Mining in SMM

UDAs are invoked with the standard SQL:2003 syntax of OLAP functions.

SELECT learn(ts.Column, ts.Value, t.dec)
       OVER (ROWS 1000 PRECEDING)
FROM trainingstream AS t,
     TABLE (verticalize(Outlook, Temp, Humidity, Wind)) AS ts

Powerful framework:
  Concept drifts/shifts
  Association rule mining
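To make the role of a generic `learn` aggregate concrete, the following Python sketch trains a Naive Bayesian classifier from (column, value, decision) tuples, the shape that the query above feeds to `learn`. It is an illustration only: the class `GenericNaiveBayes`, the Laplace smoothing, and the toy data are assumptions, and the 1000-row window is not modeled.

```python
import math
from collections import Counter, defaultdict


class GenericNaiveBayes:
    """Naive Bayes that only ever sees (column, value, label) tuples."""

    def __init__(self) -> None:
        self.class_counts = Counter()        # label -> number of rows
        self.value_counts = Counter()        # (column, value, label) -> count
        self.values_seen = defaultdict(set)  # column -> distinct values

    def learn(self, column, value, label) -> None:
        # One call per verticalized tuple; independent of the table's width.
        self.value_counts[(column, value, label)] += 1
        self.values_seen[column].add(value)

    def learn_row(self, row: dict, label) -> None:
        self.class_counts[label] += 1
        for column, value in row.items():
            self.learn(column, value, label)

    def classify(self, row: dict):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for column, value in row.items():
                count = self.value_counts[(column, value, label)]
                k = len(self.values_seen[column])
                # Laplace smoothing (an assumption, not stated on the slides).
                score += math.log((count + 1) / (n + k))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


training = [
    ({"Outlook": "sunny", "Wind": "weak"}, "no"),
    ({"Outlook": "sunny", "Wind": "strong"}, "no"),
    ({"Outlook": "rain", "Wind": "weak"}, "yes"),
    ({"Outlook": "overcast", "Wind": "weak"}, "yes"),
]
nbc = GenericNaiveBayes()
for row, label in training:
    nbc.learn_row(row, label)
prediction = nbc.classify({"Outlook": "sunny", "Wind": "strong"})
```

The learner keeps only counts keyed by column and value, which is why the same code works unchanged when attributes are added or removed.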

The Slide Construct

A window can be divided into panes (each called a slide)

Tumbling windows result when the size of the slide is equal to or larger than that of the window

The slide/window combination is great for data stream mining.
  A simple construct was added to support slides in UDAs
  It allowed us to build a flexible and efficient library of data stream mining UDAs
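A small Python sketch of the window/slide combination, under the assumptions that both sizes are counted in tuples and that the window size is a multiple of the slide size. The function name `sliding_sums` is hypothetical; it keeps one subtotal per pane and emits a result only when a pane completes.

```python
from collections import deque


def sliding_sums(values, window: int, slide: int):
    """Return the window sum each time a full slide (pane) has arrived."""
    panes_per_window = window // slide
    panes = deque()   # one subtotal per pane currently in the window
    current = 0.0     # subtotal of the pane being filled
    filled = 0
    out = []
    for v in values:
        current += v
        filled += 1
        if filled == slide:
            panes.append(current)
            if len(panes) > panes_per_window:
                panes.popleft()  # the oldest pane expires as a unit
            out.append(sum(panes))
            current, filled = 0.0, 0
    return out


sums = sliding_sums([1, 2, 3, 4, 5, 6, 7, 8], window=4, slide=2)
tumbling = sliding_sums([1, 2, 3, 4], window=2, slide=2)
```

With `slide` equal to `window`, each result covers a disjoint block of tuples, which is the tumbling-window case described above.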


Association Rule Mining

SWIM [Mozafari’ 08]: maintaining frequent patterns over large windows with slides.

Differentially computes frequent patterns as slides enter (or expire out of) the window.

Uses efficient ‘verifiers’ based on conditional counting.

Trade-off between delay and performance

Performance gain over existing algorithms.

SWIM (Sliding Window Incremental Miner)

If pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep a union of the frequent patterns of all slides (PT).

[Figure: slides S4, S5, S6, S7. Window W4 covers S4 to S6 and window W5 covers S5 to S7, so S4 is the expired slide and S7 is the new slide.]

Before the move: PT = F4 U F5 U F6
New slide S7: count/update frequencies of the PT patterns; mine S7 with the mining algorithm and add F7 to PT
Expired slide S4: count/update frequencies; prune PT
After the move: PT = F5 U F6 U F7
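The pattern-table idea can be sketched in Python as follows. This is a deliberately simplified illustration and not the SWIM algorithm: patterns are limited to itemsets of size one or two, each slide is mined by brute force, and frequencies are recounted over the whole window instead of being updated differentially with verifiers. All names (`SlidingPatternTable`, `mine_slide`, and so on) are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def patterns_of(transaction, max_len=2):
    """Enumerate itemsets of size 1..max_len contained in a transaction."""
    items = sorted(set(transaction))
    for k in range(1, max_len + 1):
        yield from combinations(items, k)


def mine_slide(slide, min_support):
    """Brute-force stand-in for the mining algorithm run on one slide."""
    counts = Counter(p for t in slide for p in patterns_of(t))
    return {p for p, c in counts.items() if c >= min_support * len(slide)}


def count_in(slide, pattern_set):
    """Count occurrences in `slide` of the patterns in `pattern_set`."""
    return Counter(p for t in slide for p in patterns_of(t) if p in pattern_set)


class SlidingPatternTable:
    def __init__(self, slides_per_window, min_support):
        self.n = slides_per_window
        self.min_support = min_support
        self.slides = deque()  # (transactions, frequent patterns F_i)

    def add_slide(self, slide):
        # New slide: mine it and remember its frequent patterns.
        self.slides.append((slide, mine_slide(slide, self.min_support)))
        # Expired slide: drop it, which also prunes its patterns from PT.
        if len(self.slides) > self.n:
            self.slides.popleft()
        # PT is the union of the frequent patterns of the current slides.
        pt = set().union(*(f for _, f in self.slides))
        # Simplification: recount PT over the whole window.
        totals, size = Counter(), 0
        for transactions, _ in self.slides:
            totals.update(count_in(transactions, pt))
            size += len(transactions)
        return {p for p in pt if totals[p] >= self.min_support * size}


s1 = [["a", "b"], ["a", "b"], ["a", "c"]]
s2 = [["a", "c"], ["a", "c"], ["b"]]
table = SlidingPatternTable(slides_per_window=2, min_support=0.6)
first = table.add_slide(s1)
second = table.add_slide(s2)
```

Only patterns in PT are ever counted at window level, which is safe because a pattern that is frequent in the window must be frequent in at least one slide.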

Concept Drifts/Shifts: Complex Processes

Ensemble-based methods
  Weighted bagging [Wang’ 03], adaptive boosting [Chu’ 04], inductive transfer [Forman’ 06].

Generic support, e.g., for adaptive boosting.

Built-in Online Mining Algorithms in SMM

Online classifiers
  Naïve Bayesian
  Decision Tree
  K-nearest Neighbor

Online clustering
  DBScan [Ester’ 96]
  IncDBScan
  Windowed K-means*
  DenStream* [Cao’ 06]
  CluStream

Association rule mining
  Approximate frequent items
  SWIM [Mozafari’ 08]
  Moment [Chi’ 04]
  AFPIM

Time series/sequence queries
  SQL-TS [Sadri’ 01]

Many more …

Legend: Already supported / To be supported


Usability?

Complex SQL queries are needed to invoke built-in and user-defined mining algorithms.
  An open and extensible system

Most analysts would prefer a high-level mining language that
  supports uniform invocation of built-in and user-defined mining algorithms (no SQL required)
  describes the workflow of the mining process
  is also open and extensible, to incorporate newly defined mining algorithms.

Example: Defining a Mining Model

CREATE MODEL TYPE NaiveBayesianClassifier {
    SHAREDTABLES (DescriptorTbl),
    Learn (
        UDA LearnNaiveBayesian,
        WINDOW TRUE,
        PARTABLES(),   % names of param tables required by the method
        PARAMETERS()   % additional parameters to be specified for input
    ),
    Classify (
        UDA ClassifyNaiveBayesian,
        WINDOW TRUE,
        PARTABLES(),
        PARAMETERS()
    )
};

Example: Using a Mining Model

Creating an instance:

CREATE MODEL INSTANCE NaiveBayesianInstance
    AS NaiveBayesianClassifier;

Uniform invocation of mining tasks:

RUN NaiveBayesianInstance.Learn WITH TrainingSet;
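The model-type / model-instance split can be mimicked in Python to show what uniform invocation buys. This is a sketch under stated assumptions, not SMM's implementation: `ModelType`, `ModelInstance`, and `run` are hypothetical names, and a trivial majority-class learner stands in for the Naive Bayesian UDAs to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class ModelType:
    """Maps task names (Learn, Classify, ...) to the functions implementing them."""
    name: str
    tasks: Dict[str, Callable[[Dict[str, Any], Any], Any]]


@dataclass
class ModelInstance:
    model_type: ModelType
    # Plays the role of SHAREDTABLES: state shared by all tasks of one instance.
    shared: Dict[str, Any] = field(default_factory=dict)

    def run(self, task: str, data: Any) -> Any:
        # Uniform invocation: the caller names a task, never the function behind it.
        return self.model_type.tasks[task](self.shared, data)


def learn_majority(shared, training_set):
    # Placeholder learner: counts class labels into the shared table.
    counts = shared.setdefault("DescriptorTbl", {})
    for _, label in training_set:
        counts[label] = counts.get(label, 0) + 1
    return counts


def classify_majority(shared, rows):
    # Placeholder classifier: predicts the most frequent training label.
    counts = shared["DescriptorTbl"]
    majority = max(counts, key=counts.get)
    return [majority for _ in rows]


MajorityClassifier = ModelType(
    "MajorityClassifier",
    {"Learn": learn_majority, "Classify": classify_majority},
)
instance = ModelInstance(MajorityClassifier)
instance.run("Learn", [
    ({"Outlook": "sunny"}, "no"),
    ({"Outlook": "rain"}, "yes"),
    ({"Outlook": "rain"}, "yes"),
])
labels = instance.run("Classify", [{"Outlook": "overcast"}])
```

Registering a new algorithm means adding another `ModelType`; the calling code that runs `Learn` and `Classify` stays the same.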

Performance

SMM vs. Weka
  NBC and decision tree classifier
  Datasets [UCI-MLR]
    Iris: 5 attributes
    Heart disease: 13 attributes

Overhead of integrating algorithms into SMM
  The SWIM algorithm: standalone vs. integrated
  Dataset [IBM Quest]
    Transaction length 20, pattern length 5, 50K tuples

Comparison with Weka: NBC-Iris

Comparison with Weka: NBC-HD

Comparison with Weka: Decision Tree - Iris

Integration Overhead: Integrated SWIM vs. Standalone SWIM

The Stream Mill System

One server, multiple clients
  Server (on Linux): hosts the ESL language and manages storage and continuous queries
  Client (Java-based GUI): allows the user to specify streams, queries, etc.

Conclusion

SMM integrates new solutions for several difficult problems:
  Usability, by high-level mining models
  Extensibility, by user-defined mining models that call on UDAs with windows
  Suite of built-in data stream mining UDAs
  Generic mining UDAs, by verticalization & other techniques
  Performance

SMM is the first of its kind: more and better systems will follow in its footsteps.

Future Work

Faster & lighter mining algorithms
  E.g., online algorithms for clustering

Integration of other mining algorithms

Data flow in mining models

Similar solution for databases

Thank you!

References

[Arasu’ 04] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336–347, 2004.

[Babcock’ 02] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.

[Bai’ 06] Yijian Bai, Hetal Thakkar, Chang Luo, Haixun Wang, and Carlo Zaniolo. A data stream language and system designed for power and extensibility. In CIKM, pages 337–346, 2006.

[Cao’ 06] F. Cao, M. Ester, W. Qian, and A. Zhou. Density-based clustering over an evolving data stream with noise. In SIAM International Conference on Data Mining (SDM), 2006.

[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In Proceedings of the 2004 IEEE International Conference on Data Mining (ICDM’04), November 2004.

[Chu’ 04] F. Chu and C. Zaniolo. Fast and light boosting for adaptive mining of data streams. In PAKDD, volume 3056, 2004.

[Ester’ 96] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, 1996.

[Forman’ 06] George Forman. Tackling concept drift by temporal inductive transfer. In SIGIR, pages 252–259, 2006.

[Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.

[Law’ 04] Yan-Nei Law, Haixun Wang, and Carlo Zaniolo. Data models and query language for data streams. In VLDB, pages 492–503, 2004.

[Mozafari’ 08] Barzan Mozafari, Hetal Thakkar, and Carlo Zaniolo. Verifying and mining frequent patterns from large windows over data streams. In International Conference on Data Engineering (ICDE), 2008.

[Sadri’ 01] Reza Sadri, Carlo Zaniolo, Amir Zarkesh, and Jafar Adibi. Optimization of sequence queries in database systems. In PODS, Santa Barbara, CA, May 2001.

[Sarawagi’ 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.

[UCI-MLR] http://archive.ics.uci.edu/ml/datasets.html

[Wang’ 03] H. Wang, W. Fan, P. S. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In SIGKDD, 2003.