1 CIDR’03 AIMS: An Immersidata Management System Cyrus Shahabi Computer Science Department & Integrated Media Systems Center University of Southern California Los Angeles, CA 90089-0781 [email protected] http:// infolab.usc.edu


1CIDR’03

AIMS: An Immersidata Management System

Cyrus Shahabi

Computer Science Department & Integrated Media Systems Center

University of Southern California

Los Angeles, CA 90089-0781

[email protected]

http://infolab.usc.edu

2CIDR’03

Outline

Definitions and Motivating Applications

Immersive Data Types (focus: immersidata)

AIMS Architecture

Subsystems: Acquisition, Storage & Querying

Current Status (demo, if time permits)

Conclusion and Future Work

3CIDR’03

Immersive Environments

Immersive Environments allow a user to become immersed within an augmented or virtual reality environment in order to interact with people, objects, places, and databases.

Examples:
• Office of the Future (UNC)
• Fire Fighter Training System (Georgia Tech)
• Planetary Exploration (JPL)
• Physical/Occupational Therapy System (Haifa Univ.)
• Virtual Classroom and Office (USC IMSC)
• Haptic Museum (USC IMSC)
• MRE: Mission Rehearsal Exercise (USC ICT)

4CIDR’03

Thesis (1)

It is absolutely critical to understand the data generated by and for immersive environments.

For example, from the data acquired from a user’s interactions with an immersive environment (i.e., immersidata), we can learn about the user’s behavior to:
• Study human factor issues
• Measure the effectiveness of the environment
• Customize the information delivery
• Identify pitfalls in the system
• Better understand the user’s intentions
• Improve the system performance

For the immersive and multimedia community!

For the database community: immersive sensors are the user interfaces of the future; as a research community we should study the data they generate, or we will miss the boat.

5CIDR’03

Example: Immersive Sensor Data Streams

<Si, x, y, z, t, v>

6CIDR’03

Application (1) : Immersive Sensor Pattern Recognition

On-Line Query & Analysis

[Figure: an immersive environment, a recognition system, and a DB of labeled patterns; the output is a command. Labels in the figure: Play, Run, Stop, Zoom-In, Zoom-Out; values 0.72, 0.15, 0.63, 0.92, 0.25.]

7CIDR’03

Application (1) : American Sign Language (ASL) as well-defined patterns

1. User makes ASL signs w/ a glove
2. Sensor values sampled over time
3. Semantic description of hand
4. ASL signs recognized (C, E, F in the figure)

Recognition modules:
• SVD
• Bayesian classifiers
• Neural net

[Figure components: Acquisition Module, Immersidata Database, Spatio-Temporal (moving sensors) Query Evaluation.]

8CIDR’03

Application (1) : ASL On-Line Q&A …

On-line query and analysis challenges:
• A hand sign is composed of a sequence of data samples across multiple sensor streams
• A sequence for one sign has no fixed length (i.e., we can’t tell when one sign ends and the next starts!)

Two problems with interdependent solutions (a chicken-and-egg problem) should be addressed:
• Isolate signs
• Recognize the isolated sign

An example statement in American Sign Language (ASL): “I like yellow shoes”

9CIDR’03

Application (2) : Immersive Classroom

Off-Line Query & Analysis

Study attention performance for Normal & ADHD-Diagnosed Children

A classroom as a virtual environment (virtual students, a virtual teacher, desks, a blackboard, a window to the playground, doors)

Presence of distracters:
• Paper airplane
• Ambient classroom noise
• Students walking
• Cars passing outside, visible through the window

10CIDR’03

Application (2) : IC Off-Line Q&A …

• User, wearing an HMD, is immersed in the class
• Trackers monitor body movements and stream data to the database
• Task: pressing a button when a particular letter pattern is seen on the virtual blackboard (e.g., AX)

[Figure: data streamed to the DB: head sensor data, arm sensor data, leg sensor data, mouse clicks, displayed characters, distracters.]

11CIDR’03

Application (2) : IC Off-Line Q&A …

Off-line query and analysis:

Range-sum queries

• Sum of body movements

• Average reaction time to the patterns

• Number of correct hits

Classification and clustering

• Use a classification technique to differentiate between normal and ADHD-diagnosed subjects (e.g., SVM)

Distinguishing hyperactive kids from normal by automatically analyzing tracker data: major impact in psychotherapy, able to discriminate and specify diagnosis in a manner not possible using existing traditional methods
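The off-line aggregates listed above (sum of body movements, average reaction time, number of correct hits) can be illustrated with a minimal Python sketch. The record layouts `TrackerSample` and `Trial` and every value below are hypothetical, invented only for illustration; the slides do not specify a schema.

```python
from dataclasses import dataclass
from math import dist
from typing import Optional, Tuple

@dataclass
class TrackerSample:               # hypothetical layout: one tracker reading
    t: float                       # timestamp in seconds
    pos: Tuple[float, float, float]  # (x, y, z) position

@dataclass
class Trial:                       # hypothetical layout: one displayed pattern
    shown_at: float                # when the pattern appeared
    clicked_at: Optional[float]    # when the button was pressed, if at all
    is_target: bool                # whether the user was supposed to react

def total_movement(samples, t0, t1):
    """Sum of distances between consecutive samples with t0 <= t <= t1."""
    window = [s for s in samples if t0 <= s.t <= t1]
    return sum(dist(a.pos, b.pos) for a, b in zip(window, window[1:]))

def correct_hits(trials):
    """Number of target patterns that the user reacted to."""
    return sum(1 for tr in trials if tr.is_target and tr.clicked_at is not None)

def average_reaction_time(trials):
    """Mean delay between a target appearing and the button press."""
    delays = [tr.clicked_at - tr.shown_at
              for tr in trials if tr.is_target and tr.clicked_at is not None]
    return sum(delays) / len(delays) if delays else None

# Toy data, made up for this example.
samples = [TrackerSample(0.0, (0, 0, 0)),
           TrackerSample(0.5, (3, 4, 0)),
           TrackerSample(1.0, (3, 4, 12))]
trials = [Trial(1.0, 1.4, True), Trial(2.0, None, True),
          Trial(3.0, 3.6, True), Trial(4.0, 4.2, False)]

movement = total_movement(samples, 0.0, 1.0)
hits = correct_hits(trials)
reaction = average_reaction_time(trials)
```

Each function scans the raw records directly; it only shows what the queries compute, not how AIMS evaluates them.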

Video Clip

12CIDR’03

Thesis (2)

Immersive applications in training and simulation domains share common data storage and analysis requirements (i.e., dealing w/ sensor data streams, aka immersidata).

Hence, instead of building customized systems for the “acquisition, storage and querying” needs of each immersive application, one can design a general-purpose system addressing many of the shared requirements.

14CIDR’03

Focus: Immersidata [MIS’99]

Data acquired from a user’s interaction with the immersive environment:
• Subject body positions
• Subject recognized gestures

Can be analyzed to learn about the user’s behavior

Specifications:
• Multidimensional: <si, x, y, z, t, v>
• Spatio-temporal
• Continuous Data Streams (CDS)
• Potentially large in size and bandwidth requirements
• Noisy

…, <sn, xn, yn, zn, hn, pn, rn, tn>, …, <s1, x1, y1, z1, h1, p1, r1, t1>, …
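The tuple layout above can be sketched as a small record type. This is only an illustration: the slides name the fields s, x, y, z, h, p, r, t but do not define them, so reading h, p, r as heading, pitch, and roll is an assumption, and the comma-separated text encoding parsed by `parse_stream` is hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple

class ImmersidataSample(NamedTuple):
    sensor_id: str  # s: which sensor produced the reading
    x: float        # position
    y: float
    z: float
    h: float        # assumed: heading
    p: float        # assumed: pitch
    r: float        # assumed: roll
    t: float        # timestamp

def parse_stream(lines: Iterable[str]) -> Iterator[ImmersidataSample]:
    """Parse hypothetical 's,x,y,z,h,p,r,t' text lines, skipping blank ones."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        sensor_id, *numbers = line.split(",")
        if len(numbers) != 7:
            raise ValueError(f"expected 8 fields, got {len(numbers) + 1}")
        yield ImmersidataSample(sensor_id.strip(), *map(float, numbers))

parsed = list(parse_stream(["s1,0,1,2,10,20,30,0.01", "", "s2,5,5,5,0,0,0,0.02"]))
```

A generator fits the continuous-stream setting because samples are handled one at a time rather than loaded as a whole.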

15CIDR’03

AIMS: An Immersidata Management System

1. Acquisition module
• DWPT basis selection for each dimension
• Transformation

2. Storage module
• Wavelets packing into disk blocks or DB BLOBs
• Immersidata storage (file system + OR-DBMS)

3. User interaction module
• Pattern isolation heuristic
• Pattern matching: SVD-based measure

4. Query & analysis module
• Application-specific GUI
• ProPolyne [web] services

[Other labels in the architecture figure: users’ states and contexts; sensor data streams.]

16CIDR’03

Challenges of AIMS Subsystems

Acquisition [SIGMETRICS’01, ICME’02]
• Data should be filtered and transformed (similar to signals)
• Database-friendly signal processing techniques are required

Storage [SIGMOD’03?]
• The physical level of the storage system should be designed to store transformed data (e.g., wavelet coefficients)
• Block allocation strategies considering query patterns

Offline Query and Analysis [EDBT’02, PODS’02]
• Approximate, progressive, and efficient polynomial analytical queries on large amounts of multidimensional data

Online Query and Analysis [MMM’03]
• Common challenges with querying continuous data streams
• Real-time pattern recognition on an aggregation of multiple data streams that are incrementally completing
• Data from all streams together form the meaningful data

17CIDR’03

1. Acquisition Module

• INPUT: Multidimensional streams
• OUTPUT: Wavelet coefficients

Approaches:
• Receives multidimensional sensor streams
• In real time, selects a different basis per dimension (optimally) from the DWPT (Discrete Wavelet Packet Transform) library
• Applies a multidimensional transformation to the data (generates multi-resolution representations of the data)
• NOTE: no compression is applied, so no data is lost by this process

18CIDR’03

2. Storage Module

• INPUT: Wavelet coefficients
• OUTPUT: Disk blocks, metadata records

Approaches:
• Optimally packs related wavelet coefficients into disk blocks (to reduce future I/O cost) and stores them in the file system or within an OR-DBMS
• Records the corresponding disk-block information in the DBMS (Database Management System) for future queries

20CIDR’03

Optimal Disk Placement for Wavelet Data

Tiling - Blocking (Haar wavelets)

21CIDR’03

3. User Interaction Module

• INPUT: Camera/speech/tracker/immersive-sensor data
• OUTPUT: Application commands and queries; user profile/state and application context

Approaches:
• Receives data from various input devices (beyond keyboard and mouse) used by the user (e.g., for data visualization purposes)
• Understands the set of requested actions (SVD + mutual information)
• Translates actions into application-specific commands and/or database queries (taking user profile & context into account)
• Also stores a history of users’ interactions, to be mined off-line and/or on-line to extract user state/behavior and application context, which facilitates future interactions by the same user (e.g., personalization/customization)

22CIDR’03

4. Query & Analysis Module

• INPUT: Range and point queries
• OUTPUT: Aggregate values/integrated events

Approaches:
• Transforms queries into the same wavelet domain as the data
• Performs queries efficiently (and perhaps approximately or progressively) in the wavelet domain
• Displays the correct resolution/granularity of aggregate value(s) and/or events to the user, based on the user profile (e.g., tolerable latency) and/or system requirements and/or data availability
• An event is tagged with space (e.g., latitude, longitude and altitude), time, and a bag of attributes

23CIDR’03

AIMS Main Theme: Data Manipulation, Query & Analysis in the WAVELET Domain

Main idea/distinction: storage is cheap and queries are ad hoc; let’s keep all the wavelet coefficients! (no data compression)

Intuition: at data population time, we don’t know which coefficients are more or less important
• This differs from the signal-processing objective of reconstructing the entire signal as well as possible
• This has been observed by [Garofalakis & Gibbons, SIGMOD’02], but they proposed other ways to drop coefficients, assuming a uniform workload

Opportunity: at query time, however, we know what is important to the pending query

24CIDR’03

AIMS Main Theme: Q&A of Wavelets

• Define a range-sum query as the dot product of a query vector and a data vector (also observed by [Gilbert et al., VLDB’2001], but without query transformation)
• Offline: multidimensional wavelet transform of the data
• At query time: “lazy” wavelet transform of the query vector (very fast)
• Dot product of query and data vectors in the transformed domain → exact result
• Choose high-energy query coefficients only → fast approximate result (90% accuracy by retrieving < 10% of data)
• Choose query coefficients in order of energy → progressive result
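The evaluation strategy on this slide can be sketched in one dimension. This is a minimal illustration, not ProPolyne itself: it uses an orthonormal Haar transform (an assumption, chosen so that dot products are preserved), transforms the query with the ordinary transform rather than the faster "lazy" one, and runs on made-up data. The names `haar_dwt` and `progressive_range_sum` are hypothetical.

```python
import math

def haar_dwt(values):
    """Orthonormal Haar transform of a list whose length is a power of two.

    Returns [overall summary, coarsest detail, ..., finest details].
    """
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    s = 1 / math.sqrt(2)
    details = []
    while len(a) > 1:
        detail = [(a[i] - a[i + 1]) * s for i in range(0, len(a), 2)]
        a = [(a[i] + a[i + 1]) * s for i in range(0, len(a), 2)]
        details = detail + details  # coarser details go in front
    return a + details

def progressive_range_sum(data, query):
    """Yield successive estimates of <query, data> in the wavelet domain,
    consuming query coefficients in decreasing order of magnitude."""
    d_hat = haar_dwt(data)    # computed offline in the slides' setting
    q_hat = haar_dwt(query)   # the slides use a "lazy" transform here
    order = sorted((i for i, q in enumerate(q_hat) if q != 0.0),
                   key=lambda i: -abs(q_hat[i]))
    estimate = 0.0
    for i in order:
        estimate += q_hat[i] * d_hat[i]  # one data coefficient retrieved
        yield estimate

# Made-up data; the query is SUM over positions 5..12.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
query = [1.0 if 5 <= i <= 12 else 0.0 for i in range(16)]
estimates = list(progressive_range_sum(data, query))
exact = sum(data[5:13])
```

Stopping the loop early gives the approximate result; running it to the end gives the exact one, because an orthonormal transform preserves the dot product. The toy run does not reproduce the accuracy figures quoted on the slide.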

26CIDR’03

Current Status: ProPolyne Demonstration

27CIDR’03

AIMS with a Twist!

Input: remote sensor data streams <x, y, z, t, value>, e.g., <lat, long, altitude, t, temperature>

1. Acquisition module
• DWPT basis selection for each dimension
• Transformation

2. Storage module
• Wavelets packing into disk blocks or DB BLOBs
• Sensor data storage (file system + DBMS)

3. User interaction module
• Pattern isolation heuristic
• Pattern matching: SVD-based measure

4. Query & analysis module
• Application-specific GUI
• ProPolyne [web] services

[Other label in the architecture figure: users’ states and contexts.]

28CIDR’03

Conclusion and Future Work

• A new application domain, immersive applications, and one of its data sets, immersidata, were introduced
• Database challenges involved in managing immersidata were discussed:
  • Some direct adoption of typical database research techniques (e.g., OLAP)
  • Some modifications/extensions of current research contributions (e.g., in the area of data streams) that are not immediately applicable
• The design of AIMS, an innovative data systems architecture, was reported

Future Work
• I/O-efficient ways for wavelet transformation and incremental update
• Hybrid sorting of both data and query coefficients
• Prototypical implementation of an end-to-end application using AIMS
• Performance evaluation

29CIDR’03

Application (3) – Physical/Occupational Therapy

Both On-Line and Off-Line Q&A

• Rehabilitation research using virtual environments and gaming technologies
• Enables individuals with severe physical disabilities to use their residual motor abilities in more efficient and less fatiguing ways
• Patient watches her video projected onto a 2-D virtual environment
• Video cameras track body movements
• Animated target characters are manipulated within the environment
• Patient is asked to hit the targets to score more points

Potential data analysis tasks:
• Offline analysis of user performance in order to find specific motor disabilities
• Online analysis of body movements to add more targets in the directions that need more exercise

30CIDR’03

Thanks!

31CIDR’03

Haptic Data Acquisition [SIGMETRICS’01]

Temporal aspect: at what rate should the sensor values be sampled?
• Trade-off between accuracy & bandwidth utilization

Fixed Sampling:
• Sampling at a constant rate; the max value of the speed is a function of system speed and/or the haptic glove

Group Sampling:
• Intuitive grouping of sensors; a different sampling rate for each group

Adaptive Sampling:
• Dynamic sampling; within a window of the session, every sensor is sampled at its individual optimal rate

32CIDR’03

ProPolyne Features

• The “measure” can be any polynomial on any combination of attributes
  • Can support COUNT, SUM, AVERAGE
  • Also supports covariance, kurtosis, etc.
  • All using one set of pre-computed aggregates
• Independent of how well the data set can be compressed/approximated by wavelets
  • Because: we show that “range-sum queries” can always be approximated well by wavelets (not always Haar, though!)
• Low update cost: $O(\log^d N)$
• Can be used for exact, approximate and progressive range-sum query evaluation

33CIDR’03

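Polynomial Range-Sum Queries

Polynomial range-sum queries: Q(R, f, I)
• I is a finite instance of schema F
• $R \subseteq \mathrm{Dom}(F)$ is the range
• $f : \mathrm{Dom}(F) \to \mathbb{R}$ is a polynomial of degree $\delta$

$$Q(R, f, I) = \sum_{x \in R \cap I} f(x)$$

Example: F = (Age, Salary), R: (25 < age < 40) & (55k ≤ salary < 150k)

Instance I:

Age   Salary
25    $50k
28    $55k
30    $58k
50    $100k
55    $130k
57    $120k

COUNT: $f(x) \equiv 1(x) = 1$

$$Q(R, 1, I) = \sum_{x \in R \cap I} 1(x) = 1(28, 55k) + 1(30, 58k) = 2$$

SUM: $f(x) \equiv \mathrm{salary}(x)$

$$Q(R, \mathrm{salary}, I) = \sum_{x \in R \cap I} \mathrm{salary}(x) = \mathrm{salary}(28, 55k) + \mathrm{salary}(30, 58k) = 113k$$

Covariance:

$$Q(R, \mathrm{salary} \times \mathrm{age}, I) = \sum_{x \in R \cap I} \mathrm{salary}(x) \times \mathrm{age}(x) = f(28, 55k) + f(30, 58k) = 3280k$$

$$\mathrm{Cov}(\mathrm{age}, \mathrm{salary}) = \frac{Q(R, \mathrm{salary} \times \mathrm{age}, I)}{Q(R, 1, I)} - \frac{Q(R, \mathrm{age}, I)\, Q(R, \mathrm{salary}, I)}{\left(Q(R, 1, I)\right)^2}$$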
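The worked example on this slide can be checked with a direct, unoptimized evaluation of Q(R, f, I). The sketch evaluates the definition by scanning the instance and does not use wavelets. The helper names `range_sum` and `in_R` are hypothetical, salaries are in thousands of dollars, and the lower salary bound is treated as inclusive because the slide's sums include the record (28, $55k).

```python
def range_sum(instance, in_range, f):
    """Q(R, f, I): sum f(x) over the records x of I that fall in R."""
    return sum(f(x) for x in instance if in_range(x))

# Instance I from the slide as (age, salary in $k) pairs.
I = [(25, 50), (28, 55), (30, 58), (50, 100), (55, 130), (57, 120)]

def in_R(x):
    age, salary = x
    # Assumption: the salary lower bound is inclusive (see lead-in).
    return 25 < age < 40 and 55 <= salary < 150

count = range_sum(I, in_R, lambda x: 1)                  # Q(R, 1, I)
sum_salary = range_sum(I, in_R, lambda x: x[1])          # Q(R, salary, I)
sum_age = range_sum(I, in_R, lambda x: x[0])             # Q(R, age, I)
sum_product = range_sum(I, in_R, lambda x: x[0] * x[1])  # Q(R, salary*age, I)

# Covariance assembled from the four polynomial range-sums.
covariance = sum_product / count - (sum_age * sum_salary) / count ** 2
```

All four aggregates have the same form, a polynomial measure summed over a range, which is why one evaluation mechanism can serve COUNT, SUM, and covariance alike.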

34CIDR’03

Polynomial Range-Sum Queries as “Vector Queries”

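The data frequency distribution of I is the function $\Delta_I : \mathrm{Dom}(F) \to \mathbb{Z}$ that maps a point x to the number of times it occurs in I

Example (instance I below): $\Delta_I(25, 50) = \Delta_I(28, 55) = \ldots = \Delta_I(57, 120) = 1$ and $\Delta_I(x) = 0$ otherwise.

Instance I:

Age   Salary
25    $50k
28    $55k
30    $58k
50    $100k
55    $130k
57    $120k

To emphasize the fact that a query is an operator on the data frequency distribution, we write

$$Q(R, f, I) = Q(R, f, \Delta_I)$$

Hence:

$$Q(R, f, \Delta_I) = \sum_{x \in \mathrm{Dom}(F)} f(x)\, \chi_R(x)\, \Delta_I(x)$$

where $\chi_R(x) = 1$ if $x \in R$ and $\chi_R(x) = 0$ if $x \notin R$.

Or, as a vector query (query · data):

$$Q(R, f, \Delta_I) = \langle f \chi_R, \Delta_I \rangle$$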

35CIDR’03

Overview of Wavelets

H operator: computes a local average of array a at every other point to produce an array of summary coefficients: Ha
• Example (Haar): h = [1/2, 1/2]

G operator: measures how much the values in array a vary inside each of the summarized blocks to compute an array of detail coefficients: Ga
• Example (Haar): g = [1/2, -1/2]

[Figure: the DWT of a as a cascade of H and G.
• a[i]’s, $0 \le i < 2^j$
• Ha[i]’s and Ga[i]’s, $0 \le i < 2^{j-1}$
• H²a[i]’s and GHa[i]’s, $0 \le i < 2^{j-2}$
• H³a[i]’s and GH²a[i]’s, $0 \le i < 2^{j-3}$
Callouts: summary coefficients of a at level 2; detail coefficients of a at level 2, aka wavelet coefficients of a.]

$$\sum_i a[i]\, b[i] = \sum_\eta \hat{a}[\eta]\, \hat{b}[\eta]$$
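The H and G operators described above can be sketched with the Haar filters given on the slide, h = [1/2, 1/2] and g = [1/2, -1/2]. The function names and the input array are illustrative assumptions.

```python
def H(a):
    """Summary coefficients: average of each adjacent pair (h = [1/2, 1/2])."""
    return [(a[i] + a[i + 1]) / 2 for i in range(0, len(a) - 1, 2)]

def G(a):
    """Detail coefficients: half-difference of each pair (g = [1/2, -1/2])."""
    return [(a[i] - a[i + 1]) / 2 for i in range(0, len(a) - 1, 2)]

def cascade(a):
    """Apply G and H repeatedly to the summaries.

    Returns (details, summary): details[k] holds G H^k a, and summary is
    the single remaining summary coefficient H^j a.
    """
    details = []
    current = list(a)
    while len(current) > 1:
        details.append(G(current))
        current = H(current)
    return details, current[0]

a = [2, 4, 6, 8, 10, 12, 14, 16]   # made-up input of length 2^3
details, summary = cascade(a)
```

With these averaging filters the transform is not orthonormal, so it does not preserve inner products as written; scaling both filters by √2 (giving 1/√2 factors) yields the orthonormal Haar transform for which the identity above holds.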

36CIDR’03

Naive Evaluation of Vector Queries Using Wavelets

Hence, vector queries can be computed in the wavelet-transformed space as:

$$Q(R, f, \Delta_I) = \langle \widehat{f \chi_R}, \hat{\Delta}_I \rangle = \sum_{\eta_0, \ldots, \eta_{d-1} = 0}^{N-1} \widehat{f \chi_R}(\eta_0, \ldots, \eta_{d-1})\, \hat{\Delta}_I(\eta_0, \ldots, \eta_{d-1})$$

Algorithm:
• Off-line transformation of the data vector (or “data distribution function”, i.e., $\Delta_I$, to be exact)
  • $O(|I|\, l^d \log^d N)$ for sparse data, $O(|I|) = N^d$ for dense data
• Transform the query vector at submission
  • $O(N^d)$ !
• Sum up the products of the corresponding elements of the data and query vectors
  • Retrieving elements of the data vector: $O(N^d)$ !

37CIDR’03

Fast Evaluation of Vector Queries Using Wavelets

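Main intuitions:
• The “query vector” can be transformed quickly because most of the coefficients are known in advance
• The “transformed query vector” has a large number of negligible (e.g., zero) values (independent of how well the data can be approximated by wavelets)

Example: Haar filter & COUNT function on R = [5, 12] on the domain of integers from 0 to 15:

$$\chi_R = \{0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0\}$$

$$\hat{\chi}_R = \left\{2,\ -\tfrac{1}{2},\ -\tfrac{3}{2\sqrt{2}},\ \tfrac{3}{2\sqrt{2}},\ 0,\ -\tfrac{1}{2},\ 0,\ \tfrac{1}{2},\ 0,\ 0,\ -\tfrac{1}{\sqrt{2}},\ 0,\ 0,\ 0,\ \tfrac{1}{\sqrt{2}},\ 0\right\}$$

Coefficients are listed in the order H⁴a, GH³a, GH²a, GHa, Ga, with each detail taken as (first − second)/√2.

At each step, you know the zeros.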
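The sparsity claimed in this example can be checked numerically. The sketch below uses a plain orthonormal Haar transform, not the lazy algorithm from the slides, and counts the nonzero detail coefficients at each level; `haar_levels` is a hypothetical helper name.

```python
import math

def haar_levels(values):
    """Orthonormal Haar cascade on a power-of-two-length list.

    Returns (summary, details) where details[0] is the finest level (Ga)
    and details[-1] the coarsest.
    """
    s = 1 / math.sqrt(2)
    a = list(values)
    details = []
    while len(a) > 1:
        details.append([(a[i] - a[i + 1]) * s for i in range(0, len(a), 2)])
        a = [(a[i] + a[i + 1]) * s for i in range(0, len(a), 2)]
    return a[0], details

# Query vector for COUNT on R = [5, 12] over the integers 0..15.
chi_R = [1.0 if 5 <= x <= 12 else 0.0 for x in range(16)]
summary, details = haar_levels(chi_R)

# Nonzero details per level, finest first (tolerance absorbs rounding).
nonzeros_per_level = [sum(1 for c in level if abs(c) > 1e-12)
                      for level in details]
```

Only the pairs that straddle a boundary of R produce a nonzero detail, so each level contributes at most two. This is the structure a lazy transform can exploit by touching only the boundary coefficients.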

38CIDR’03

Exact Evaluation of Vector Queries

Query: SUM(salary) when (25 < age < 40) & (55k < salary < 150k)

# of Wavelet Coefficients: 837

# of Nonzero Coordinates: 4380

39CIDR’03

Approximate Evaluation of Vector Queries

40CIDR’03

Optimal Disk Placement for Wavelet Data

• The goal is to store wavelet coefficients efficiently
• “Efficiently” means fast access to stored data, low I/O complexity, and few disk accesses
• How to achieve this: create a principle of locality of reference
• Designed for wavelet overlap queries, but can be extended to polynomial range-sum queries over multidimensional data

41CIDR’03

Optimal Disk Placement for Wavelet Data

Discrete Wavelet Transform

[Figure: time-domain samples x0, x1, x2, x3, x4, x5, x6, x7 are mapped by the DWT to wavelet-domain coefficients indexed 0, 1, 2, 3, 4, 5, 6, 7.]

42CIDR’03

SVD Background

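The idea of SVD is based on the following theorem of linear algebra: if matrix $X \in \mathbb{R}^{m \times n}$, then there exist column-orthonormal matrices U and V such that $X = U A V^T$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$, and $A \in \mathbb{R}^{r \times r}$ is a diagonal matrix $A = \mathrm{diag}(a_1, a_2, \ldots, a_p)$ such that $a_1 \ge a_2 \ge \ldots \ge a_p$.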

43CIDR’03

Weighted-Sum SVD

• Each data sequence can be represented as a matrix, where the columns (r) are the sensors, and hence their number is fixed
• The similarity metric of two data sequences is defined on the “square” matrices
  • Squaring (i.e., multiplying the matrix by its transpose) eliminates the effect of the two matrices having different numbers of rows (i.e., the time dimension)
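The squaring step can be illustrated with a small sketch. It assumes the product is taken as transpose times matrix, which is the order that yields an r × r result when the r sensors are the columns; the function name `square` and the two sequences are made up for illustration.

```python
def square(seq):
    """Return seq^T * seq for a sequence given as a list of rows.

    Each row is one time sample with r sensor values, so the result is
    r x r regardless of how many rows (time steps) the sequence has.
    """
    r = len(seq[0])
    return [[sum(row[i] * row[j] for row in seq) for j in range(r)]
            for i in range(r)]

short_seq = [[1, 2], [3, 4]]          # 2 time steps, r = 2 sensors
long_seq = [[1, 0], [0, 1], [1, 1]]   # 3 time steps, r = 2 sensors

short_sq = square(short_seq)
long_sq = square(long_seq)
```

Both results are 2 × 2 and symmetric, so sequences of different lengths become directly comparable before the SVD is applied.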

44CIDR’03

Weighted-Sum SVD

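Problem: Obtain the similarity of the input sequence and the pattern

[Figure: each of the two sequences is squared into an r × r matrix (entries q11 … qrr and p11 … prr), and each square matrix is then SVD-decomposed.
• First matrix: vectors e1, e2, …, er with diagonal values c1, c2, …, cr; weights cw1, cw2, …, cwr with cw1 + cw2 + … + cwr = 1
• Second matrix: vectors f1, f2, …, fr with diagonal values d1, d2, …, dr; weights dw1, dw2, …, dwr with dw1 + dw2 + … + dwr = 1]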

45CIDR’03

Weighted-Sum SVD

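Problem: Obtain the similarity of the input sequence and the pattern

[Figure: the vectors e1, e2, …, er with weights cw1, cw2, …, cwr and the vectors f1, f2, …, fr with weights dw1, dw2, …, dwr are combined as follows.]

$$\Theta_1 = \sum_{i=1}^{r} cw_i\, (e_i \cdot f_i), \qquad \Theta_2 = \sum_{i=1}^{r} dw_i\, (e_i \cdot f_i)$$

The similarity of the input sequence and the pattern = min(Θ1, Θ2)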

46CIDR’03

The Ridge-Climbing Heuristic

Procedure:
• Compute the accumulated similarity values (ASVs) between the input sequence and all vocabulary sequences
• Keep track of all ASVs
• For each vocabulary sequence, check whether the ASV is monotonically increasing, and whether a maximum is reached
  • Yes: put this vocabulary sequence into the candidates pool
• Choose the vocabulary sequence from the candidates pool with the biggest maximal value
• Isolate the recognized stream

47CIDR’03

The Ridge-Climbing Heuristic

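Assume the database has only three vocabulary sequences: like, yellow, and I.

[Figure: three plots of ASVs over time, one each for the vocabulary sequences like, yellow, and I, computed against the input sequence. Callout: “Maximum is reached! Isolate! Reset the ASVs”. Label below the callout: like.]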