Synthesis of Streaming Data from Multiple Sensors via Embedded Data Extraction
Project Report, April 15th, 2004
Magdiel Galán
CSE591: Data Mining, Dr. Huan Liu, Spring 2004
http://www.public.asu.edu/~mgalan/StreamProjApr15.ppt



Outline
  Problem/Project Description
  Sampling
  Smoothing
  Clustering
  Current Status
  Plans

Project Description
  Synthesis of Streaming Data from Multiple Sensors (~100s) via Embedded Data Extraction for mission-critical applications
  Work in conjunction with Motorola’s Human Interface Lab (ongoing project)
  Simulation Environment

Project Description
  Goal: Develop a driver assistance system that provides feedback, but not control, during unsafe instances, for example those arising from distractions caused by cellphones, PDAs, and email.
  Why: Targets a government initiative to create a safer car environment amid the information-age explosion.
  How: Develop an intelligent system by mining streaming data from multiple automotive sensors.
  Development work is being done on a driving simulator with projection screens and up to 400 parameters/sensors, including video links for eye gaze and foot-pedal movement.

Sample Cases
  Case Scenario #1: Passing slow traffic
    which slowed down due to an accident,
    at which you are also rubbernecking,
    while fidgeting with your radio
  Case Scenario #2: Making a left turn
    while hearing directions from MapTracker,
    while checking the time because you are late,
    while reaching for the cellphone with an incoming call

Simulation Environment
[Figure: driving simulator, labeled "150 Simulated View" and "Driving Experience". Signals shown: Gas, Engine Temp, Batt, Oil, PDA, Gear Shift, CD, Cell Phone, A/C, Air Bag, Acceleration, Lateral Acc., Sonar Proximity Sensor, Wheel Rotation, Brake Pressure, RPMs, GPS, Internet, Driver]

Motivation
  Primary interest: Robotics
    Merging of sensors (sensor fusion)
      optical proximity (IR, sonar, radar)
      location (GPS, visual maps)
      movement (actuators, rotations)
      system (battery, temperature, bump switches)
    Problem: decide the agent’s next best action vs. a goal
    Not too dissimilar from an automobile environment
  Other applications: Manufacturing environment
    Increase yields and productivity and reduce defects using daily quality-control monitoring data (100’s parameters, 1K’s)
    Pentium example: oxide thickness, poly width, boron implant density, plasma etch eVs, litho PM, diffuser RPMs, etc.

Stream Data Properties
  Numerical/Continuous
    Speed
    Steering/Heading
    Acceleration (forward/lateral)
    Distance (lane edge, vehicle in front)
  Categorical
    Lane position
    Gear: P/R/D/OD/L1/L2
    Headlights on/off
    Radio/CD on
    Incoming call
  Sampling rate: 60 Hz

Critical/Special Conditions
  Left/right turn
  Passing/changing lanes
  U-turn
  Reverse
  Tailgating
  Not on road

Some Warning Signs
  Lane drifting
  Erratic behavior
    droopy eyes
    eyes not facing the road
    foot/pedal movement does not correspond with road conditions
  Incoming call while performing a critical maneuver

Goal
  Identify instances outside normal patterns as an indication of an abnormal situation.
    Hence the need to draw the driver’s attention to the impending situation.
  Ultimate goal: Develop a bootstrapping mechanism that combines driving-situation classifiers (e.g., LeftTurn/Passing) with instance selection methods in active learning.
    Bootstrapping: selecting high-utility data for re-training

Instance Selection Properties [LM01]
  Instances are representative
  Instance selection reduces rows
  Ideal outcome of instance selection: choose a data subset that achieves the same result as the whole data with little or no deterioration in performance $P$
  Should be model independent: for any two mining algorithms $M_i$ and $M_j$, $\Delta P(M_i) \doteq \Delta P(M_j)$

Problem #1: Sampling
  Initial step towards instance selection: select a representative subset.
  Divide the population into a collection of elements that must cover the whole population without overlapping [GHL01].
  These are called sampling units.
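The "cover the whole population without overlapping" requirement can be illustrated with a minimal sketch. It assumes the sampling units are simply consecutive, fixed-size chunks of one sensor stream; the function name `sampling_units` and the `unit_size` parameter are hypothetical and do not come from the project.

```python
from typing import List, Sequence


def sampling_units(stream: Sequence[float], unit_size: int) -> List[Sequence[float]]:
    """Split a stream into consecutive, non-overlapping units that cover it.

    ASSUMPTION: units are fixed-size chunks; the slides do not say how the
    project actually formed its sampling units.
    """
    if unit_size <= 0:
        raise ValueError("unit_size must be positive")
    return [stream[i:i + unit_size] for i in range(0, len(stream), unit_size)]


readings = list(range(10))          # stand-in for ten sensor readings
units = sampling_units(readings, 4)

# Every reading lands in exactly one unit: the units cover the population
# and do not overlap. The last unit may be shorter than unit_size.
flattened = [value for unit in units for value in unit]
```

A representative subset could then be drawn by picking one or more readings from each unit, which is one common way to use such a partition.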

Sampling Results
[Figure: sampling at 10 ms; x-axis: signal duration; y-axis: count]

Problem #2: Smoothing
  Reduce/filter out noise and outliers.
  Smoothing techniques used:
    Bin median/rolling average [LM01]/[D03]
      Median preferred over mean since it is less sensitive to outliers
    Thresholding/bin boundaries [LM01]/[HK01]
      10% offset threshold
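As a rough sketch of the two smoothing ideas listed above: the first function replaces every value in a bin by the bin median. The second is only one possible reading of the "10% offset threshold", which the slides do not define: it holds the last kept value until a new reading differs from it by more than 10%. Both function names, the bin size, and the sample data are assumptions for illustration.

```python
from statistics import median


def smooth_by_bin_median(values, bin_size):
    """Replace each value by the median of its fixed-size bin."""
    smoothed = []
    for start in range(0, len(values), bin_size):
        chunk = values[start:start + bin_size]
        smoothed.extend([median(chunk)] * len(chunk))
    return smoothed


def smooth_by_threshold(values, offset=0.10):
    """ASSUMED interpretation of a 10% offset threshold.

    Keep emitting the last kept value; accept a new reading only when it
    differs from the kept value by more than `offset` (relative).
    """
    smoothed = []
    kept = None
    for value in values:
        if kept is None or abs(value - kept) > offset * abs(kept):
            kept = value
        smoothed.append(kept)
    return smoothed


# Hypothetical speed readings in km/h with one spike (120).
speed = [60, 61, 59, 120, 60, 62, 61, 63, 80, 81]
by_median = smooth_by_bin_median(speed, bin_size=5)
by_threshold = smooth_by_threshold([100, 105, 109, 115, 116])
```

On this sample the bin median absorbs the spike at 120, which matches the slide's point that the median is less sensitive to outliers than the mean.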

Pre-Smoothing: Raw Data
[Figure: x-axis: driving time elapsed in minutes; y-axis: speed (km/h), steering (degrees), heading (degrees)]

Raw Data Map/Course
[Figure: route map, starting point at (0,0)]

Smoothing Results: Median
[Figure: x-axis: driving time elapsed in minutes; y-axis: speed (km/h), steering (degrees), heading (degrees)]

Smoothing Results: Median (continued)

Smoothing Results: Threshold

Smoothing Results: Threshold (continued)

Dr. Liu’s Incremental Instance Selection Algorithm
  Given: data streams with instances I
  Output: indicative instances
  For each data stream, do the following incrementally:
    Create a profile P for I
    Check each new instance i against P
    if i is an outlier of P
      Return i
    else
      Update P with i
  End do
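The loop above can be sketched in a few lines. The slides do not say what the profile P is, so this sketch assumes a running mean and standard deviation (Welford's update) over a single numeric stream, and calls an instance an outlier when it lies more than `k` standard deviations from the mean. The class name `RunningProfile` and the parameters `k` and `warmup` are hypothetical.

```python
import math


class RunningProfile:
    """ASSUMED profile P: running mean/variance of one numeric stream."""

    def __init__(self, k=3.0, warmup=5):
        self.k = k              # outlier cutoff in standard deviations
        self.warmup = warmup    # instances absorbed before any outlier test
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0           # sum of squared deviations from the mean

    def is_outlier(self, x):
        if self.n < self.warmup:
            return False
        std = math.sqrt(self.m2 / (self.n - 1))
        return abs(x - self.mean) > self.k * std

    def update(self, x):
        # Welford's online update of mean and squared deviations.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)


def select_indicative(stream, profile):
    """Yield outliers of the profile; otherwise update the profile."""
    for x in stream:
        if profile.is_outlier(x):
            yield x
        else:
            profile.update(x)


# Hypothetical speed stream in km/h with one abnormal reading.
speeds = [60, 61, 59, 60, 62, 61, 120, 60, 61]
indicative = list(select_indicative(speeds, RunningProfile()))
```

As in the pseudocode, an outlier is returned without being folded into the profile, so a single abnormal reading does not shift what counts as normal.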

Outliers

Problem #3: Clustering
  Why?
    Data is unclassified
    Previous results used numerical data on the most significant key parameters
    Develop clusters exemplifying ALL attributes
    Select instances that do not belong to a cluster as a triggering mechanism

Stream Clustering Challenges
  Large “unclassified” database
  Fast online resolution within a small window (0.5 to 2 or 3 seconds)
  One-pass-only restriction (needs fast I/O)
  Mix of numerical and categorical data
    Traditional algorithms do not work well for categorical attributes (remember P/R/D/OD/L1/L2, or CD on)
      Centroid approach cannot be used
      Hard to reflect the properties of the neighborhood of the points
  Memory constraints
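One way to picture the triggering idea on mixed data is sketched below. This is not the project's clustering method, which was still under development. It assumes each cluster is summarized by a representative instance rather than a centroid, and it uses a Gower-style dissimilarity: range-normalized difference for numerical attributes, simple mismatch for categorical ones. All attribute names, ranges, and the 0.25 cutoff are made up for illustration.

```python
def mixed_dissimilarity(a, b, numeric_ranges):
    """Average per-attribute dissimilarity in [0, 1] (Gower-style)."""
    total = 0.0
    for key in a:
        if key in numeric_ranges:
            # Numerical attribute: absolute difference scaled by its range.
            total += min(abs(a[key] - b[key]) / numeric_ranges[key], 1.0)
        else:
            # Categorical attribute: 0 if equal, 1 otherwise.
            total += 0.0 if a[key] == b[key] else 1.0
    return total / len(a)


def is_trigger(instance, representatives, numeric_ranges, max_dissimilarity=0.25):
    """True when the instance is not close to any cluster representative."""
    return all(
        mixed_dissimilarity(instance, rep, numeric_ranges) > max_dissimilarity
        for rep in representatives
    )


# Hypothetical attribute ranges and cluster representatives.
ranges = {"speed_kmh": 200.0, "steering_deg": 90.0}
representatives = [
    {"speed_kmh": 100, "steering_deg": 0, "gear": "D", "incoming_call": False},
    {"speed_kmh": 20, "steering_deg": -45, "gear": "D", "incoming_call": False},
]

cruising = {"speed_kmh": 95, "steering_deg": 2, "gear": "D", "incoming_call": False}
turn_with_call = {"speed_kmh": 25, "steering_deg": -40, "gear": "D", "incoming_call": True}

cruising_flag = is_trigger(cruising, representatives, ranges)
call_flag = is_trigger(turn_with_call, representatives, ranges)
```

Here the left turn with an incoming call falls outside both clusters and would raise the trigger, while ordinary cruising would not. Comparing against representatives instead of centroids sidesteps the problem that a mean of gear positions is undefined.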

Clustering Techniques vs. Streaming Data
  SVM
    Good at handling multidimensional data
    Not good: needs classified data, lots of I/O, data in memory
  BIRCH
    Good at handling multidimensional data and large databases; single scan, linear I/O time
    Not good: predominantly for “numerical” attributes; order dependent

Clustering Techniques vs. Streaming Data (2)
  CURE (Clustering Using REpresentatives) [D03]
    Good at handling outliers; hierarchical
    Not good: random sampling (won’t fit streaming)
  ROCK (RObust Clustering using linKs) [D03]
    Good at hierarchical clustering for categorical attributes
    Not good: random sampling for scale-up

My 1st Clustering Attempt
[Figure; annotation: Move in Reverse]

My 1st Clustering Attempt (2)
[Figure; annotation: Zoom Next Page]

My 1st Clustering Attempt (3)
[Figure; annotation: Move in Reverse]

Current Status/Plans
  This is an ONGOING project
  Cluster technique development
    Evolve from known methods?
  Generalization of the technique
    Not just automobile streaming data

References

[LM01] H. Liu, H. Motoda. “Data Reduction via Instance Selection”. Instance Selection and Construction for Data Mining. KAP, 2001. ASU Library

[GHL01] B. Gu, F. Hu, H. Liu. “Sampling: Knowing Whole from Its Part”. Instance Selection and Construction for Data Mining. KAP, 2001. ASU Library

[HK01] J. Han, M. Kamber. Data Mining: Concepts and Techniques. Chps. 3, 8: Data Cleaning, Clustering. Morgan Kaufmann. ASU Library

[D03] M. Dunham. Data Mining: Introductory and Advanced Topics. Chps. 3-5: Mining Techniques, Classification, Clustering. Prentice Hall. ASU Library