
Post on 20-Dec-2015


ARGUS: A Prototype Stream Anomaly Monitoring System

Thesis Proposal

Chun Jin

Thesis Committee
Jaime Carbonell (Chair)
Christopher Olston
Jamie Callan
Phil Hayes, DYNAMiX Technologies

Chun Jin Carnegie Mellon 2

Thesis Statement

Stream Anomaly Monitoring Systems (SAMS) are an important sub-class of stream applications. The difficulty comes from the very large data volume and the large number of queries the system must handle.

Propose an approach for SAMS that implements incremental evaluation schemes with an adapted Rete algorithm upon a traditional DBMS platform and exploits SAMS characteristics to optimize query evaluation.

Demonstrate how the approach and the improvements can lead to a simple and fast implementation of an effective and efficient SAMS.

Outline

Motivation
My ARGUS Approach
Current Work Status
    Current System
    Preliminary Results
Proposed Work and Timeline

Stream Processing

Stream Processing Applications
    Network Traffic Analysis and Router Configuration
    Internet Services
    Sensor Data Analysis
    Anomaly Detection

Stream Processing Projects
    STREAM, TelegraphCQ, Aurora
    NiagaraCQ, OpenCQ, WebCQ
    Gigascope, Tribeca
    Tapestry, Alert, Tukwila, etc.


Stream Anomaly Monitoring Systems (SAMS)

SAMS monitors structured data streams for anomalies or potential hazards.

Continuous queries may number in thousands or tens of thousands.

Daily stream volumes may exceed millions of records.

Satisfaction of a SAMS query is often rare (very-high-selectivity).

SAMS Dataflow

[Diagram: Data Streams (e.g., FedWire Money Transfers, Patient Records) feed the Stream Anomaly Monitoring System, which is connected to Storage; the Analyst sends Queries to the system and receives Alerts.]

Query Example 4

Suppose that for every big transaction of type code 1000, the analyst wants to check whether the money stayed in the bank or left within ten days. An additional sign of possible fraud is that the transactions involve at least one intermediate bank. The query generates an alarm whenever the receiver of a large transaction (over $1,000,000) transfers at least half of the money further, within ten days of this transaction, using an intermediate bank.

SQL Query for Example 4

    SELECT *  -- the select list is missing from the transcript; * is a placeholder
    FROM transaction r1, transaction r2, transaction r3
    WHERE r2.type_code = 1000 AND
          r3.type_code = 1000 AND
          r1.type_code = 1000 AND
          r1.amount > 1000000 AND
          r1.rbank_aba = r2.sbank_aba AND
          r1.benef_account = r2.orig_account AND
          r2.amount > 0.5 * r1.amount AND
          r1.tran_date <= r2.tran_date AND
          r2.tran_date <= r1.tran_date + 10 AND
          r2.rbank_aba = r3.sbank_aba AND
          r2.benef_account = r3.orig_account AND
          r2.amount = r3.amount AND
          r2.tran_date <= r3.tran_date AND
          r3.tran_date <= r2.tran_date + 10;
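To make the query concrete, the following is a minimal sketch that runs it with Python's built-in sqlite3 module on four made-up rows. The table layout, the tran_id column, the use of integer day numbers for tran_date, and the sample data are all assumptions for illustration; the slides do not give the schema.

```python
import sqlite3

# Hypothetical minimal schema: only the columns the query references,
# plus an assumed tran_id so the matches can be reported.
# tran_date is an integer day number, so "+ 10" means ten days.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE "transaction" (
        tran_id INTEGER PRIMARY KEY,
        type_code INTEGER, amount REAL, tran_date INTEGER,
        sbank_aba TEXT, orig_account TEXT,
        rbank_aba TEXT, benef_account TEXT)
""")
rows = [
    # id, type, amount, day, sender bank/account, receiver bank/account
    (1, 1000, 2_000_000, 1, "A", "a1", "B", "b1"),  # large transfer A -> B
    (2, 1000, 1_500_000, 4, "B", "b1", "C", "c1"),  # B forwards more than half to C
    (3, 1000, 1_500_000, 6, "C", "c1", "D", "d1"),  # C passes the same amount to D
    (4, 1000, 3_000_000, 2, "E", "e1", "F", "f1"),  # large transfer, no follow-up
]
conn.executemany('INSERT INTO "transaction" VALUES (?,?,?,?,?,?,?,?)', rows)

QUERY = """
SELECT r1.tran_id, r2.tran_id, r3.tran_id
FROM "transaction" r1, "transaction" r2, "transaction" r3
WHERE r1.type_code = 1000 AND r2.type_code = 1000 AND r3.type_code = 1000
  AND r1.amount > 1000000
  AND r1.rbank_aba = r2.sbank_aba AND r1.benef_account = r2.orig_account
  AND r2.amount > 0.5 * r1.amount
  AND r1.tran_date <= r2.tran_date AND r2.tran_date <= r1.tran_date + 10
  AND r2.rbank_aba = r3.sbank_aba AND r2.benef_account = r3.orig_account
  AND r2.amount = r3.amount
  AND r2.tran_date <= r3.tran_date AND r3.tran_date <= r2.tran_date + 10
"""
alerts = conn.execute(QUERY).fetchall()
print(alerts)
```

Only the chain 1 -> 2 -> 3 satisfies every predicate; transaction 4 is large but has no follow-up transfers, so it raises no alarm.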

ARGUS as a Prototype SAMS

Implement the Adapted Rete Algorithm upon a traditional DBMS platform.
    Rete (Forgy 1982): incremental evaluation based on materialized intermediate results.
The SAMS assumption of very-high-selectivity queries over very-large-volume data justifies the use of Rete and necessitates some unique improvements:
    Transitivity Inference (Ono/Lohman VLDB90; Pirahesh/Leung/Hasan ICDE97)
    Predicate Set Evaluation and Materialization
    Partial Rete (materialization skipping)
    Complex Common Computation Identification for Sharing
    Intermingled Sharing and Optimization processing

ARGUS System Architecture

[Diagram. Components shown: Analyst, Query, Rete Network Generator, Rete Networks, Query Table, Scheduler, Do_queries, Data Streams, Data Tables, Intermediate Tables, Identified Threats. The run-time part is labeled Stream Anomaly Monitoring.]

ReteGenerator Architecture

[Diagram. Components shown: SQL Queries, ReteGen Manager, Query Rewriter, Topology Checker, Transitivity Inference, History-based Rete Optimizer, History-based Cost Estimating, Sharing, System Catalog, Topology Table, Counter Table. Labeled actions: Check Topology, Register Rete Networks, Update Tables. The whole is labeled ReteGenerator.]

Selected ARGUS Topics

Adapted Rete Algorithm
    ReteGenerator translates a query into a Rete network that is wrapped as a stored procedure.
    The procedure implements the Adapted Rete Algorithm, which performs the incremental evaluation.
Transitivity Inference
Rete Optimization
Computation Sharing

Adapted Rete Algorithm (Selection)

n and m are old data sets.
Δn and Δm are the new, much smaller incremental data sets.
Selection σ:
    σ(n + Δn) = σ(n) + σ(Δn)
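The selection identity can be sketched in a few lines of Python. This is only an illustration with made-up rows and a hypothetical is_large predicate, not the stored-procedure code that ARGUS generates: the result over the old data is materialized once, and each increment requires scanning only the new rows.

```python
def select(predicate, rows):
    """Plain selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def is_large(row):
    # Hypothetical predicate, modeled on "amount > 1000000" from Example 4.
    return row["amount"] > 1_000_000

old_rows = [{"id": 1, "amount": 2_000_000}, {"id": 2, "amount": 500}]    # n
new_rows = [{"id": 3, "amount": 1_200_000}, {"id": 4, "amount": 80}]     # delta n

# Materialized once over the old data n.
materialized = select(is_large, old_rows)

# When the increment arrives, only delta n is scanned; the old result is reused.
delta_result = select(is_large, new_rows)
materialized = materialized + delta_result

# Same answer as re-evaluating the selection over n + delta n from scratch.
full_result = select(is_large, old_rows + new_rows)
```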

Adapted Rete Algorithm (Join)

Join ⋈:
    (n + Δn) ⋈ (m + Δm) = n ⋈ m + Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm
Old results: n ⋈ m. New incremental results: Δn ⋈ m + n ⋈ Δm + Δn ⋈ Δm.
When Δn and Δm are very small compared to n and m, the time complexity of the incremental join is O(n+m).
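A minimal sketch of the join identity follows, assuming an equi-join on one column and using an in-memory hash join as a stand-in for the joins that ARGUS runs inside the DBMS. The row layout and the helper names hash_join and incremental_join are hypothetical.

```python
from collections import defaultdict

def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dicts; returns (left_row, right_row) pairs."""
    index = defaultdict(list)
    for row in right:
        index[row[right_key]].append(row)
    return [(l, r) for l in left for r in index.get(l[left_key], [])]

def incremental_join(n, m, dn, dm, left_key, right_key):
    """New results only: (dn join m) + (n join dm) + (dn join dm)."""
    return (hash_join(dn, m, left_key, right_key)
            + hash_join(n, dm, left_key, right_key)
            + hash_join(dn, dm, left_key, right_key))

def ids(pairs):
    """Comparable summary of a join result."""
    return sorted((l["id"], r["id"]) for l, r in pairs)

n = [{"id": 1, "benef": "b1"}, {"id": 2, "benef": "b2"}]     # old left input
m = [{"id": 10, "orig": "b1"}]                               # old right input
dn = [{"id": 3, "benef": "b3"}]                              # left increment
dm = [{"id": 11, "orig": "b2"}, {"id": 12, "orig": "b3"}]    # right increment

old_result = hash_join(n, m, "benef", "orig")                # materialized earlier
new_result = incremental_join(n, m, dn, dm, "benef", "orig")
full_result = hash_join(n + dn, m + dm, "benef", "orig")     # recomputed from scratch
```

The old result plus the three incremental terms equals the full join, and none of the incremental terms touches the n ⋈ m term, which is the expensive one when the increments are small.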

Incremental Evaluation in Rete: Example 4

[Diagram: Rete network for Example 4 over DataTable (aliases r1, r2, r3)]
Selection nodes:
    r1: type_code = 1000, amount > 1000000
    r2: type_code = 1000
    r3: type_code = 1000
Join node for r1 and r2:
    r1.rbank_aba = r2.sbank_aba
    r1.benef_account = r2.orig_account
    r2.amount > r1.amount * 0.5
    r1.tran_date <= r2.tran_date
    r2.tran_date <= r1.tran_date + 10
Join node for r2 and r3:
    r2.rbank_aba = r3.sbank_aba
    r2.benef_account = r3.orig_account
    r2.amount = r3.amount
    r2.tran_date <= r3.tran_date
    r3.tran_date <= r2.tran_date + 10

Complex Queries

A continuous query may contain multiple SQL statements, and a single SQL statement may contain unions of multiple SQL terms.
Each SQL term is mapped to a sub-Rete network.
These sub-Rete networks are then connected to form the statement-level sub-networks.
The statement-level sub-networks are further connected, based on the view references, to form the final query-level Rete network.

Transitivity Inference

Explores the transitivity properties of comparison operators to derive hidden, highly selective selection predicates.
Highly selective selection predicates can significantly improve performance because they may produce very small intermediate results. Subsequent joins can then be performed very quickly on the materialized intermediate results.

Transitivity Inference Example

Given:
    r1.amount > 1000000 and
    r2.amount > r1.amount * 0.5 and
    r3.amount = r2.amount
r1.amount > 1000000 is highly selective on r1.
We can infer highly selective predicates:
    r2.amount > 500000
    r3.amount > 500000
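The inference in this example can be sketched as a small bound-propagation loop. The encoding of predicates as (target, operator, factor, source) tuples and the function name infer_lower_bounds are assumptions made for this sketch; it handles only positive scaling factors and acyclic chains of predicates, and is not the inference module of ARGUS.

```python
# Known strict lower bounds: column -> value, meaning "column > value".
lower = {"r1.amount": 1_000_000}

# Hypothetical encoding of "target OP factor * source".
relations = [
    ("r2.amount", ">", 0.5, "r1.amount"),   # r2.amount > r1.amount * 0.5
    ("r3.amount", "=", 1.0, "r2.amount"),   # r3.amount = r2.amount
]

def infer_lower_bounds(lower, relations):
    """Propagate strict lower bounds until no new bound is derived.

    If source > b and target (>, >= or =) factor * source with factor > 0,
    then target > factor * b. Assumes the relations contain no cycles.
    """
    bounds = dict(lower)
    changed = True
    while changed:
        changed = False
        for target, op, factor, source in relations:
            if op in (">", ">=", "=") and factor > 0 and source in bounds:
                candidate = factor * bounds[source]
                if candidate > bounds.get(target, float("-inf")):
                    bounds[target] = candidate
                    changed = True
    return bounds

inferred = infer_lower_bounds(lower, relations)
print(inferred)
```

The derived bounds on r2.amount and r3.amount match the two inferred predicates on the slide, so both inputs can be filtered by a selection before any join is evaluated.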

Rete Optimization

[Diagram of the History-based Rete Optimizer. Components shown: SQL Query, Join Graph, Structure Builder, Active List, Join Enumerator, History-based Cost Estimator, DB, Update Tables, Rete network.]

Join Graph Example

[Diagram: join graph with nodes 1, 2, 3, 4 and join-predicate edges P(1,2), P(2,3), P(1,3), P(3,4); a further node is labeled "1,2".]
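One possible reading of the figure is that a join enumerator keeps an active list of nodes and that the node "1,2" is the result of joining nodes 1 and 2. The sketch below follows that reading; the data structures and the function names candidate_joins and merge are assumptions, since the slides do not describe the enumeration procedure.

```python
from itertools import combinations

# Join predicates of the example graph, keyed by the pair of tables they connect.
PREDICATES = {
    frozenset({1, 2}): "P(1,2)",
    frozenset({2, 3}): "P(2,3)",
    frozenset({1, 3}): "P(1,3)",
    frozenset({3, 4}): "P(3,4)",
}

def connecting_predicates(a, b):
    """Predicates with one endpoint in node a and the other in node b."""
    return sorted(name for pair, name in PREDICATES.items()
                  if len(pair & a) == 1 and len(pair & b) == 1)

def candidate_joins(active):
    """All pairs of active nodes linked by at least one join predicate."""
    result = []
    for a, b in combinations(active, 2):
        preds = connecting_predicates(a, b)
        if preds:
            result.append((a, b, preds))
    return result

def merge(active, a, b):
    """Replace nodes a and b in the active list with their join."""
    return [node for node in active if node not in (a, b)] + [a | b]

# Start with one node per table, then join tables 1 and 2.
active = [frozenset({t}) for t in (1, 2, 3, 4)]
active = merge(active, frozenset({1}), frozenset({2}))
candidates = candidate_joins(active)
```

After the merge, the node {1, 2} can be joined with node 3 using both P(1,3) and P(2,3), while node 4 is reachable only through node 3.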

History-based Cost Estimator

Run sub-plans on historical data to estimate the costs of sub-plans on future data.
Assume the same data distribution in the past and the future.
Apply heuristic functions to avoid estimating extremely high-cost sub-plans.
Justification for the History-based Cost Estimator:
    Queries are compiled and optimized once, and executed multiple times.
    It is tolerable to spend more time on the one-time optimization.
    Accurate cost estimates pay off as queries run more and more times.
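The idea of estimating costs by running sub-plans on historical data can be sketched as follows. The cost measure (number of join results), the comparison budget that stands in for the heuristic cut-off, and the sample rows are all assumptions; the slides do not specify the actual cost functions or heuristics.

```python
def estimate_cost(left, right, predicate, budget):
    """Run a candidate join on historical rows and count its results.

    Returns None when more than `budget` comparisons would be needed, as a
    stand-in for a heuristic that skips extremely expensive sub-plans.
    """
    comparisons = 0
    matches = 0
    for l in left:
        for r in right:
            comparisons += 1
            if comparisons > budget:
                return None
            if predicate(l, r):
                matches += 1
    return matches

# Hypothetical historical data.
history = [
    {"id": 1, "amount": 2_000_000, "orig": "a1", "benef": "b1"},
    {"id": 2, "amount": 1_500_000, "orig": "b1", "benef": "c1"},
    {"id": 3, "amount": 900, "orig": "x1", "benef": "y1"},
    {"id": 4, "amount": 700, "orig": "y1", "benef": "z1"},
]

def forwards(l, r):
    """True when the beneficiary of l is the originator of r."""
    return l["benef"] == r["orig"]

large = [t for t in history if t["amount"] > 1_000_000]

cost_filtered = estimate_cost(large, history, forwards, budget=100)
cost_unfiltered = estimate_cost(history, history, forwards, budget=100)
cost_skipped = estimate_cost(history, history, forwards, budget=5)
```

An optimizer built this way would prefer the sub-plan with the smaller estimated intermediate result, on the stated assumption that future data is distributed like the historical data.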

Computation Sharing

Predicate Indexing
Extended predicate set operations
Sharing Algorithm

Predicate Indexing

Concepts:
    Equivalent Predicates: p1 ≡ p2 iff ∀D, p1(D) = p2(D)
    Equivalent Predicate Class
    Canonical Predicate Form
Predicates are converted into their canonical forms and stored as records in tables.
Searching for a predicate becomes data retrieval from tables.
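As a rough illustration of predicate indexing, the sketch below normalizes simple comparison predicates and stores them in a SQLite table, so that finding an existing predicate is a table lookup. The canonical form used here (constant on the right, column pairs in sorted order), the predicate_index table, and the node names are assumptions; the slides do not define the canonical form that ARGUS uses, and this purely syntactic normalization catches only some equivalent predicates.

```python
import sqlite3

# Operator to use when the two operands are swapped.
FLIP = {"<": ">", ">": "<", "<=": ">=", ">=": "<=", "=": "=", "<>": "<>"}

def canonical(left, op, right):
    """Put constants on the right and sort column pairs; flip op when swapping.

    Columns are strings, constants are numbers; at least one operand is a column.
    """
    left_is_const = isinstance(left, (int, float))
    right_is_const = isinstance(right, (int, float))
    swap = left_is_const or (not right_is_const and right < left)
    if swap:
        left, op, right = right, FLIP[op], left
    return (str(left), op, str(right))

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE predicate_index (
    lhs TEXT, op TEXT, rhs TEXT, node TEXT,
    PRIMARY KEY (lhs, op, rhs))""")

def register(pred, node):
    """Store the canonical form of pred together with the node that computes it."""
    conn.execute("INSERT INTO predicate_index VALUES (?,?,?,?)",
                 (*canonical(*pred), node))

def lookup(pred):
    """Return the node registered for an equivalent predicate, or None."""
    row = conn.execute(
        "SELECT node FROM predicate_index WHERE lhs=? AND op=? AND rhs=?",
        canonical(*pred)).fetchone()
    return row[0] if row else None

register(("r1.amount", ">", 1000000), "node_7")
register(("r2.sbank_aba", "=", "r1.rbank_aba"), "node_8")

found_flipped = lookup((1000000, "<", "r1.amount"))
found_swapped = lookup(("r1.rbank_aba", "=", "r2.sbank_aba"))
not_found = lookup(("r1.amount", ">", 500000))
```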

Relationship between Predicate Sets and Their Result Tuple Sets

Predicate Set: a set of conjunctive predicates.
Its Result Tuple Set: the set of database tuples that satisfy all the predicates of the Predicate Set.
For a fixed database state D, there is a mapping from a predicate set P to its result tuple set S_D(P): S_D: P ---> S_D(P)
Predicate sets and their result tuple sets are complementary:
    Predicates are filters of data items.
    The more predicates, the fewer result tuples.

Extending Predicate Set Operations

Defined on predicate sets.
The definitions are justified by the relationships among the corresponding result tuple sets.
Important for common computation identification.

Semantic Subset ⊆≡

Given two predicate sets P1 and P2, we say that P1 is a semantic subset of P2, denoted P1 ⊆≡ P2, if for any database state D we have S_D(P1) ⊇ S_D(P2).

Semantic Subset Example

p1: t1.a > 1, p2: t1.a > 2
P1 = {p1}, P2 = {p2}
S(P1) ⊇ S(P2), so P1 ⊆≡ P2.
Why? P2 ≡≡ {p1, p2}
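The example can be checked on a sample database state with a short sketch. The function result_set plays the role of S_D(P), and each tuple is reduced to its t1.a value; both are simplifications made for this illustration. Checking one state D does not prove the relation for every D. The general argument is that t1.a > 2 implies t1.a > 1, so adding p1 to P2 never removes a tuple.

```python
def result_set(predicates, database):
    """S_D(P): the tuples of `database` that satisfy every predicate in P."""
    return {t for t in database if all(p(t) for p in predicates)}

def p1(a):
    return a > 1   # t1.a > 1

def p2(a):
    return a > 2   # t1.a > 2

# Sample database state D: each tuple is represented by its t1.a value.
D = {0, 1, 2, 3, 4}

S_P1 = result_set([p1], D)        # fewer predicates, larger result
S_P2 = result_set([p2], D)
S_P12 = result_set([p1, p2], D)   # {p1, p2} filters exactly like {p2}
```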

Sharing Types

[Figure: diagrams of the sharing types for a join of tables T1 and T2. Panel labels: Non-change, Add-only, Reconstruction, Selection. Node labels include POT1, POT2, POJ, PFJ, POJ-PFJ, PNT1-POT1, PNT2-POT2, PNJ, PNJ-POJ. The predicate-set conditions attached to each panel are not recoverable from the transcript.]

Sharing Algorithm Overview

Non-change sharing.
Add-only sharing.
Optimizing the remaining query.
Reconstruction and selection sharing.
Constructing the remaining Rete network based on the optimized plan with possible sharing.

Current Work Status

A preliminary system
    Database
    A preliminary ReteGenerator
        With the Adapted Rete and Transitivity Inference
        Will be expanded to incorporate optimization, computation sharing, incremental aggregation, etc.
A preliminary evaluation
    A full evaluation will be conducted on the complete system in the future

Preliminary Evaluation: Queries and Data

7 queries on a synthesized FedWire money transfer database of 320006 records.
Two data conditions:
    Data1: Old: first 300000 records. New: remaining 20006 records. Alert.
    Data2: Old: first 300000 records. New: next 20000 records. No alert.

Preliminary Results

[Chart: Rete with Transitivity Inference. X axis: queries Q1 to Q7. Y axis: Execution Time (s), 0 to 50. Series: Rete Data1, SQL Data1, Rete Data2, SQL Data2.]

Transitivity Inference

[Charts: two panels, Q2 and Q4. X axis: Data1, Data2. Y axis: Execution Time (s). Series: Rete TI, Rete Non-TI, SQL Non-TI, SQL TI.]

Partial Rete Generation

[Chart: Q4, assuming Transitivity Inference is not applicable. X axis: Data1, Data2. Y axis: Execution Time (s), 0 to 50. Series: Partial Rete, Rete, SQL.]

Proposed Work

System Design and Implementation
System Evaluation

System Design and Implementation

Rete Optimization (in progress) (05–08/2004)
Computation Sharing (planned) (07–11/2004)
Incremental Aggregation (planned) (12/2004–02/2005)
Constraint Exploiting (optional) (04–05/2005)
Transitivity Inference Enhancements (optional) (06–08/2005)
Automatic Index Selection (optional) (09–12/2005)

System Evaluation

Data Collection (12/2004 – 01/2005)
Query Generation (12/2004 – 01/2005)
Simulation and Evaluation (02 – 05/2005)
    Single SQL vs. Single Rete; Multiple SQL vs. Multiple Shared Optimized Rete
    Single Non-optimized Rete vs. Single Optimized Rete
    Multiple Non-shared Optimized Rete vs. Multiple Shared Optimized Rete
    Non-incremental Aggregation vs. Incremental Aggregation

Evaluation: Data Collection

FedWire Money Transfer Transactions
    Synthesized, 0.5M records. Plan to generate 0.5M more.
    23 attributes/record
Massachusetts Medical Data
    Real, 1.6M records (sanitized)
    70 attributes/record
    In-patient admission and discharge records
    Expand to 10M.

Evaluation: Queries

Now: 7 queries on FedWire, 3 queries on Medical.
Plan to extend to 20-40 queries for each domain.
Further extend the query sets:
    Similar predicates matching different constants
    Join predicate sets with non-empty intersections
    Same where_clauses but different groupby_clauses
    Same where_clauses and groupby_clauses but different aggregation operators

Timeline

System Design and Implementation (Required): 03/2004 – 02/2005
System Implementation (Optional): 04/2005 – 12/2005
Evaluation on Required Parts: 12/2004 – 05/2005
Thesis Writing and Defense: 06/2005 – 03/2006
    Thesis Writing: 06 – 12/2005
    Thesis Finalizing: 01 – 03/2006
    Defense: 02 or 03/2006

ARGUS Summary

Implement the incremental evaluation schemes with the Adapted Rete Algorithm upon a traditional DBMS platform.
To deal with very-large-volume data, exploit the very-high-selectivity query property for optimization:
    Transitivity Inference
    Predicate Set Evaluation and Materialization
    Partial Rete (materialization skipping)
    Complex Common Computation Identification for Sharing
    Intermingled Sharing and Optimization processing


Thank you!

Questions and Comments?