uci database group quality aware sensor database (quasar) project** sharad mehrotra department of...
Post on 13-Dec-2015
217 Views
Preview:
TRANSCRIPT
UCI Database Group
Quality Aware Sensor Database (QUASAR) Project**
Sharad Mehrotra
Department of Information and Computer Science
University of California, Irvine
**Supported in part by a collaborative NSF ITR grant entitled “real-time data capture, analysis, and querying of dynamic spatio-temporal events” in collaboration with UCLA, U. Maryland, U. Chicago
UCI Database Group
Talk Outline
• Quasar Project– motivation and background– data collection and archival components – query processing – tracking application using QUASAR framework– challenges and ongoing work
• Brief overview of other research projects– MARS Project - incorporating similarity retrieval and refinement
over structured and semi-structured data to aid interactive data analysis/mining
– Database as a Service (DAS) Project - supporting the application service provider model for data management
UCI Database Group
Emerging Computing Infrastructure…
Instrumented wide-area spaces
In-body, in-cell, in-vitro spaces • Generational advances to computing infrastructure– sensors will be everywhere
• Emerging applications with limitless possibilities– real-time monitoring and control,
analysis
• New challenges – limited bandwidth & energy – highly dynamic systems
• System architectures are due for an overhaul– at all levels of the system OS,
middleware, databases, applications
Immediate vicinity area boundary(single-hop)
Roadside Base station
To the fixed Infrastructure ( Internet)
Ad hoc (802.11) link
Cellular (CDPD?) link
Immediate vicinity area boundary(single-hop)
Roadside Base station
To the fixed Infrastructure ( Internet)
Ad hoc (802.11) link
Cellular (CDPD?) link
UCI Database Group
Impact to Data Management …
• Traditional data management– client-server architecture– efficient approaches to data storage & querying – query shipping versus data shipping– data changes with explicit update
• Emerging Challenge– data producers must be considered as “first class”
entities• sensors generate continuously changing highly dynamic
data• sensors may store, process, and communicate data
Data/query request
Data/query result clientserverData producers
UCI Database Group
Data Management Architecture Issues
• Where to store data?– Do not store -- stream model
• not suitable if we wish to archive data for future analysis or if data is too important to lose
– at the producers• limited storage, network, compute resources
– at the servers• server may not be able to cope with high data production
rates. May lead to data staleness and/or wasted resources
• Where to compute?– At the client, server, data producers
Data/query request
Data/query result client
serverData producers
producer cache
UCI Database Group
Quasar Architecture
• Hierarchical architecture– data flows from producers to
server to clients periodically– queries flow the other way:
• If client cache does not suffices, then
• query routed to appropriate server• If server cache does not suffice, then
access current data at producer
– This is a logical architecture-- producers could also be clients.
server
clientClient cache
Server cache & archive
producer cache
dat
a fl
ow
Qu
ery
flow
producer
UCI Database Group
Quasar: Observations & Approach
• Applications can tolerate errors in sensor data– applications may not require exact answers:
• small errors in location during tracking or error in answer to query result may be OK
– data cannot be precise due to measurement errors, transmission delays, etc.
• Communication is the dominant cost – limited wireless bandwidth, source of major energy drain
• Quasar Approach– exploit application error tolerance to reduce communication between
producer and server
– Two approaches • Minimize resource usage given quality constraints • Maximize quality given resource constraints
UCI Database Group
Quality-based Data Collection Problem
• Let P = < p[1], p[2], …, p[n] > be a sequence of environmental
measurements (time series) generated by the producer, where
n = now
• Let S = <s[1], s[2], …, s[n]> be the server side representation of
the sequence
• A within- quality data collection protocol guarantees that
for all i error(p[i], s[i]) <
is derived from application quality tolerance
Sensor time series…p[n], p[n-1], …, p[1]
UCI Database Group
Simple Data Collection Protocol
• sensor Logic (at time step n)
Let p’ = last value sent to server
if error(p[n], p’) >
send p[n] to server
• server logic (at time step n)
If new update p[n] received at step n
s[n] = p[n]
Else
s[n] = last update sent by sensor
– guarantees maximum error at server less than equal to
Sensor time series…p[n], p[n-1], …, p[1]
UCI Database Group
Exploiting Prediction Models
• Producer and server agree upon a prediction model (M, )
• Let spred[i] be the predicted value at time i based on (M, )
• sensor Logic (at time step n)
if error(p[n], spred[n] ) >
send p[n] to server
• server logic (at time step n)
• If new update p[n] received at step n
s[n] = p[n]
Else
s[n] = spred[n] based on model (M, )
UCI Database Group
Challenges in Prediction
• Simple versus complex models?• Complex and more accurate models require more parameters (that will
need to be transmitted).
• Goal is to minimize communication not necessarily best prediction
• How is a model M generated?• static -- one out of a fixed set of models
• dynamic -- dynamically learn a model from data
• When should a model M or parameters be changed?
• immediately on model violation:
– too aggressive -- violation may be a temporary phenomena
• never changed:
– too conservative -- data rarely follows a single model
UCI Database Group
Challenges in Prediction (cont.)
• who does the model update?
• Server
– Long-haul prediction models possible, since server maintains history
– might not predict recent behavior well since server does not know exact
S sequence; server has only samples
– extra communication to inform the producer
• Producer
– better knowledge of recent history
– long haul models not feasible since producer does not have history
– producers share computation load
• Both
– server looks for new models, sensor performs parameter fitting given
existing models.
UCI Database Group
Archiving Sensor Data
• Often sensor-based applications are built with only the real-time
utility of time series data.
– Values at time instants <<n are discarded.
• Archiving such data consists of maintaining the entire S sequence,
or an approximation thereof.
• Importance of archiving:
– Discovering large-scale patterns
– Once-only phenomena, e.g., earthquakes
– Discovering “events” detected post facto by “rewinding” the time series
– Future usage of data which may be not known while it is being collected
UCI Database Group
Problem Formulation
• Let P = < p[1], p[2], …, p[n] > be the sensor time series
• Let S = < s[1], s[2], …, s[n] > be the server side representation
• A within archive quality data archival protocol guarantees that
error(p[i], s[i]) < archive
• Trivial Solution: modify collection protocol to collect data at quality guarantee of min(archive , collect)
– then prediction model by itself will provide a archive quality data stream that can be archived.
• Better solutions possible since – archived data not needed for immediate access by real-time or forecasting applications
(such as monitoring, tracking) – compression can be used to reduce data transfer
UCI Database Group
Data Archival Protocol
• Sensors compresses observed time series p[1:n] and sends a lossy compression to the server
• At time n :
– p[1:n-nlag] is at the server in compressed form s’ [1:n-nlag] within-
archive
– s[n-nlag+1:n] is estimated via a predictive model (M, )
• collection protocol guarantees that this remains within- collect
– s[n+1:] can be predicted but its quality is not guaranteed
(because it is in the future and thus the sensor has not observed
these values)
…p[n], p[n-1], .. compress
Sensor memory buffer
Sensor updates for data collection
Compressed representation for archiving
processing at sensor exploited to reduce communication cost and hence battery drain
UCI Database Group
Piecewise Constant Approximation (PCA)
• Given a time series Sn = s[1:n] a piecewise constant approximation
of it is a sequence
PCA(Sn) = < (ci, ei) >
that allows us to estimate s[j] as:
scapt [j] = ci if j in [ei-1+1, ei]
= c1 if j<e1
Time
Value
e1 e2 e3 e4
c1
c2
c3
c4
UCI Database Group
Online Compression using PCA
• Goal: Given stream of sensor values, generate a within-archive PCA
representation of a time series
• Approach (PMC-midrange)
– Maintain m, M as the minimum/maximum values of observed samples
since last segment
– On processing p[n], update m and M if needed
• if M - m > 2archive , output a segment ((m+M )/2, n)
Time
Value
Example: archive = 1.5
1 2 3 4 5
23
4
2.5
6
UCI Database Group
Online Compression using PCA
• PMC-MR …
– guarantees that each segment compresses the corresponding
time series segment to within-archive
– requires O(1) storage
– is instance optimal
• no other PCA representation with fewer segments can meet the
within-archive constraint
• Variant of PMC-MR
– PMC-MEAN, which takes the mean of the samples seen thus far instead
of mid range.
UCI Database Group
Improving PMC using Prediction• Observation: Prediction models guarantee a within- collect version of the time
series at server even before the compressed time series arrives from the
producer.
• Can the prediction model be exploited to reduce the overhead of compression.
– If archive> collect no additional effort is required for archival --> simply archive the
predicted model.
• Approach:
– Define an error time series E[i] = p[i]-spred[i]
– Compress E[1:n] to within-archive instead of compressing p[1:n]
– The archive contains the prediction parameters and the compressed error time
series
– Within-archive of E[I] + (M, ) can be used to reconstruct a within- archive version of p
UCI Database Group
Combing Compression and Prediction (Example)
-5
0
5
10
15
20
25
30
0 10 20 30 40 50 60
Predicted Time Series
Actual Time Series
-5
0
5
10
15
20
25
0 10 20 30 40 50 60
Actual Time Series
Compressed Time Series
(7 segments)
-4
-3.5
-3
-2.5
-2
-1.5
-1
-0.5
0
0.5
1
Compressed Error
(2 segments)
Error =
Actual – Predicted
UCI Database Group
Estimating Time Series Values
• Historical samples (before n-nlag) is maintained at the server within-archive
• Recent samples (between n-nlag+1 and n) is maintained by the sensor
and predicted at the server.
• If an application requires q precision, then:
– if q collect then it must wait for time in case a parameter refresh is en route
– if q archive but q < collect then it may probe the sensor or wait for a
compressed segment
– Otherwise only probing meets precision
• For future samples (after n) immediate probing not available as an
option
UCI Database Group
Experiments
• Data sets:
– Synthetic Random-Walk
• x[1] = 0 and x[i]=x[i-1]+sn where sn drawn uniformly from [-1,1]
– Oceanographic Buoy Data
• Environmental attributes (temperature, salinity, wind-speed, etc.) sampled at 10min intervals
from a buoy in the Pacific Ocean (Tropical Atmosphere Ocean Project, Pacific Marine
Environment Laboratory)
– GPS data collected using IPAQs
• Experiments to test:
– Compression Performance of PMC
– Benefits of Model Selection
– Query Accuracy over Compressed Data
– Benefits of Prediction/Compression Combination
UCI Database Group
Compression Performance
K/n ratio: number of segments/number of samples
UCI Database Group
Query Performance Over Compressed Data
“How many sensors have values >v?” (Mean selectivity = 50)
UCI Database Group
Impact of Model Selection
K/n ratio: number of segments/number of samples. pred is the localization tolerance in meters
• Objects moved at approximately constant speed (+ measurement noise)
•Three models used:• loc[n] = c• loc[n] = c+vt• loc[n] = c+vt+0.5at2
•Parameters v, a were estimated at sensor over moving-window of 5 samples
UCI Database Group
Combining Prediction with Compression
K/n ratio: number of segments/number of samples
UCI Database Group
QUASAR ClientTime Series
GPS Mobility Data from Mobile Clients (iPAQs)
Latitude Time Series: 1800 samples
Compressed Time Series (PMC-MR, ICDE 2003)Accuracy of ~100 m130 segments
UCI Database Group
Query Processing in Quasar
• Problem Definition– Given
• sensor time series with quality-guarantees captured at the server• A query with a specified quality-tolerance
– Return• query results incurring least cost
• Techniques depend upon – nature of queries – Cost measures
• resource consumption -- energy, communication, I/O• query response time
UCI Database Group
Aggregate Queries
9
6
3
8
27
Q
S
minQ = 2
maxQ = 7
countQ = 3
sumQ = 2+7+6 = 15
avgQ = 15/3 = 5
UCI Database Group
Processing Aggregate Queries (minimize producer probe)
• MIN Query– c = minj(si.high)
– b = c - query
– Probe all sensors where sj.low < b• only s1 and s3 will be probed
• Sum Query – select a minimal subset S’ S such that
si in S’ (jpred) >= si in S(j
pred)- query
– If query = 15, only s1 will be probed
Let S = <s1,s2, …,sn> be set of sensors that meet the query criteria
si.high = sipred[t] + j
pred
sj.low = sipred[t] - j
pred
a b c
s1s2
sn
s3
s1s2
s5
s3 s4
10
5
25
3
UCI Database Group
Minimizing Cost at Server
• Error tolerance of queries can be exploited to reduce processing at server.
• Key Idea– Use a multi-resolution index structure (MRA-tree) for
processing aggregate queries at server.– An MRA-Tree is a modified multi-dimensional index trees (R-
Tree, quadtree, Hybrid tree, etc.)– A non-leaf node contains (for each of its subtrees) four
aggregates {MIN,MAX,COUNT,SUM} – A leaf node contains the actual data points (sensor models)
UCI Database Group
MRA Tree Data Structure
S1 S2 S3
S4
S5
S6
S7
S8
Spatial View
A
B
C
D
E
F
G
S7S2 S3 S4 S5 S6S1 S8
A
B C
D E F G
Tree Structure View
UCI Database Group
min
max
count
sum
2
3
9
4
4
2
9
5
1
4
4
2
6
1
6
6
Non-Leaf Node
Disk Page Pointers(each costs 1 I/O)
Leaf Node
M1
1
M2
2
M3
3
Probe “Pointers”(each costs 2 messages)
MRA-Tree Node Structure
UCI Database Group
QN
disjoint
contains
QNQN
is contained
QN
partially overlaps
• Two sets of nodes: – NP (partial contribution to the query)– NC (complete contribution)
Node Classification
UCI Database Group
Aggregate Queries using MRA Tree
• Initialize NP with the root• At each iteration: Remove one node N from NP and for each
Nchild of its children
– discard, if Nchild disjoint with Q
– insert into NP if Q is contained or partially overlaps with Nchild
– “insert” into NC if Q contains Nchild (we only need to maintain aggNC)
– compute the best estimate based on contents of NP and NC
QN
UCI Database Group
MIN (and MAX)
3
9
4
5
Interval
minNC = min { 4, 5 } = 4
minNP = min { 3, 9 } = 3
L = min {minNC, minNP} = 3
H = minNC = 4
hence, I = [3, 4]
Estimate
Lower bound:
E(minQ) = L = 3
Traversal
Choose N NP:
minN = minNP
UCI Database Group
MRA Tree Traversal
S7S2 S3 S4 S5 S6S1 S8
A
B C
D E F G
• Progressive answer refinement until NP is exhausted• Greedy priority-based local decision for next node to be explored based on:
– Cost (1 I/O or 2 messages)– Benefit (Expected Reduction in answer uncertainty)
UCI Database Group
Adaptive Tracking of mobile objects in sensor networks
Track visualization
Base station 1
Base station 2 Base station 3
ServerShow me the approximate track of the object with precision
Wireless Sensor Grid
object
Wireless link
Tracking Architecture A network of wireless acoustic sensors arranged as a grid transmitting via a base station to server
A track of the mobile object generated at the base station or server
Objective
Track a mobile object at the server such that the track deviates from the real trajectory within a user defined error threshold track with minimum communication overhead.
UCI Database Group
Sensor Model
Wireless sensors : battery operated, energy constrained
Operate on the received acoustic waveforms
Signal attenuation of target object given by : Is(t) = P /4 r2
P : source object power
r= distance of object from sensor
Is(t) = intensity reading at time t at ith sensor
Ith : Intensity threshold at ith sensor
UCI Database Group
Sensor States
• S0 : Monitor ( processor on, sensor on, radio off )
– shift to S1 if intensity above threshold
• S1 : Active state ( processor on, sensor on, radio on)
– send intensity readings to base station.
– On receiving message from BS containing error tolerance shift to S2
• S2 : Quasi-active (processor on, sensor on, radio intermittent)
– send intensity reading to BS if error from previous reading exceeds error threshold
Quasar Collection approach used in Quasi-active state
S0
(Initial state)
S2
S1
Receive BS message
Ii < I th
Ii > I th
Ii < I th
UCI Database Group
Server side protocol
Server maintains:
list of sensors in the active/ quasi-active state
history of their intensity readings over a period of time
Server Side Protocol
convert track quality to a relative intensity error at sensors
Send relative intensity error to sensor when sensor state = S1(
quasi- active state)
Triangulate using n sensor readings at discrete time intervals.
UCI Database Group
Basic Triangulation Algorithm (using 3 sensor readings)
(x1, y1) (x2, y2)
(x3, y3)
P: source object power, Ii = intensity reading at ith
sensor
(x-x1)2 + (y- y1)2 = P/4 I1
(x-x2)2 + (y- y2)2 = P/4 I2
(x-x3)2 + (y- y3)2 = P/4 I3
Solving we get (x, y)=f(x1,x2,x3,y1,y2,y3, P,I1, I2 , I3, )
(x, y)
More complex approaches to amalgamate more than three sensor readings possible
Based on numerical methods -- do not provide a closed form equation between sensor reading and tracking location !
Server can use simple triangulation to convert track quality to sensor intensity quality tolerances and a more complex approach to track.
UCI Database Group
Adaptive Tracking : Mapping track quality to sensor reading
Intensity ( I1 )
time
Intensity ( I2 )
time
Intensity ( I3 )
time
t i t( i+1 )
t i t( i+1 )
t i t( i+1 )
X (m)
Y (m)
Claim 1 (power constant)
Let Ii be the intensity value of sensor
If then, track quality is guaranteed to be within track
where and C is a constant derived from the known locations of the sensors and the power of the object.
Claim 2 (power varies between [Pmin , Pmax ])
If then
track quality is guaranteed to be within track
where C’ = C/ P2 and is a constant .
The above constraint is a conservative
estimate. Better bounds possible
)ξI /(1ξI|IΔ| i2ii
track
][|| max'
22
2max
min PIC
IP
PI i
trackii
Ctrack /2
UCI Database Group
Communication overhead further reduced by exploiting the predictability of the object being tracked
Static Prediction : sensor & server agree on a set of prediction models
only 2 models used: stationary & constant velocity Who Predicts: sensor based mobility prediction protocol
Every sensor by default follows a stationary model
Based on its history readings may change to constant velocity model (number of readings limited by sensor memory size)
informs server of model switch
Adaptive Tracking: prediction to improve performance
UCI Database Group
Actual Track versus track on Adaptive Tracking (error tolerance 20m)
• A restricted random motion : the object starts at (0,d) and moves from one node to another randomly chosen node until it walks out of the grid.
UCI Database Group
Energy Savings due to Adaptive Tracking
total energy consumption over all sensor nodes for random mobility model with varying track or track error.
significant energy savings using adaptive precision protocol over non adaptive tracking ( constant line in graph)
for a random model, prediction does not work well !
UCI Database Group
Energy consumption with Distance from BS
total energy consumption over all sensor nodes for random mobility model with varying base station distance from sensor grid.
As base station moves away, one can expect energy consumption to increase since transmission cost varies as d n ( n =2 )
adaptive precision algorithm gives us better results with increasing base station distance
UCI Database Group
Challenges & Ongoing Work
• Ongoing Work:– Supporting a larger class of SQL queries– Supporting continuous monitoring queries – Larger class of sensors (e.g., video sensors)– Better approaches to model fitting/switching in prediction
• In the future:– distributed Quasar architecture– optimizing quality given resource constraints– supporting applications with real-time constraints– dealing with failures
UCI Database Group
The DAS Project**Goals:
Support Database as a Service on the Internet
Collaboration:
IBM (Dr. Bala Iyer)
UCI (Gene Tsudik)
** Supported in part by NSF ITR grant entitled “Privacy in Database as a Service” and by the IBM Corporation
UCI Database Group
Software as a Service
• Get …– what you need
– when you need
• Pay …– what you use
• Don’t worry …– how to deploy, implement, maintain, upgrade
UCI Database Group
Software As a Service: Why?
• Advantages– reduced cost to client
• pay for what you use and not for hardware, software infrastructure or personnel to deploy, maintain, upgrade…
– reduced overall cost• cost amortization across
users
– Better service• leveraging experts
across organizations
• Driving Forces– Faster, cheaper, more accessible
networks– Virtualization in server and storage
technologies– Established e-business
infrastructures
• Already in Market– ERP and CRM (many examples)– More horizontal storage services,
disaster recovery services, e-mail services, rent-a-spreadsheet services etc.
– Sun ONE, Oracle Online Services, Microsoft .NET My Services etc
Better Service for Cheaper
UCI Database Group
Database As a Service
40%
51%
51%
57%
58%
0 10 20 30 40 50 60 70
% of respondents (Source: InfoWeek Research)
Platform Independence
Qualified Programmers
Compatibility
Qualified Administrators
Ease of Administration
Most Significant DB Execution Problems
• Why?– Most organizations need DBMSs– DBMSs extremely complex to deploy, setup, maintain– require skilled DBAs with high cost
UCI Database Group
What do we want to do?
• Database as a Service (DAS) Model – DB management transferred to service provider for
• backup, administration, restoration, space management, upgrades etc.
– use the database “as a service” provided by an ASP• use SW, HW, human resources of ASP, instead of your own
User
Application Service Provider (ASP)
Server
Internet
BUT….
UCI Database Group
Challenges
• Economic/business model?– How to charge for service, what kind of service guarantees can
be offered, costing of guarantees, liability of service provider.
• Powerful interfaces to support complete application development environment– User Interface for SQL, support for embedded SQL
programming, support for user defined interfaces, etc.
• Scalability in the web environment– overheads due to network latency (data proxies?)
• Privacy and Security – Protecting data at service providers from intruders and attacks.– Protecting clients from misuse of data by service providers– Ensuring result integrity
UCI Database Group
Data privacy from service provider
User Encrypted User Database
User Data
The problem is we do not trust “the service provider” for sensitive information!
Fact 1: Theft of intellectual property due to database vulnerabilities costs American businesses $103 billion annually
Fact 2: 45% of those attacks are conducted by insiders! (CSI/FBI Computer Crime and Security
Survey, 2001) encrypt the data and store it but still be able to run queries over the encrypted data do most of the work at the server
Server
Application Service ProviderUntrusted
Server Site
UCI Database Group
System Architecture
Encrypted User
Database
Query Translator
Server Site
Temporary Results
Result Filter
MetadataOriginal Query
Server Side Query
Encrypted Results
Actual Results
Service Provider
User
Client Site
Client Side Query ?
? ?
UCI Database Group
NetDB2 Service
• Developed in collaboration with IBM
• Deployed on the Internet about 2 years ago– Been used by 15
universities and more than 2500 students to help teaching database classes
• Currently offered through IBM Scholars Program
4
2
3
1
UCI Database Group
MARS Project**
Goals: integration of similarity retrieval and query refinement over structured and semi-structured databases to help interactive data analysis/mining
**Supported in part by NSF CAREER award, NSF grant entitled “learning digital behavior” and a KDD grant entitled “Mining events and entities over large spatio-temporal data sets”
UCI Database Group
Similarity Search in Databases (SR)
Alice
Honda sedan, inexpensive,
after 1994,around LA
Used Car Catalog
Year Model Mileage
TransmissionLocation
Color Price...
Bob
Honda sedan, inexpensive,after 1994,around LA
M3975LA90K94Honda Accord
A3500LA150K95Honda Accord
Similarity search (Bob – location more important)
Similarity is Subjective:
results reflect personal interpretation of
`around’,`inexpensive’, and relative importance
Exact Search semantics (unranked)
Similarity search (Alice – price more important)
.8A6000LA50K95Honda Prelude
.7A6500LA30K98Honda Accord
Honda Accord
Honda Accord
Honda Accord
94
95
94
60K
150K
90K
Irvine
LA
LA
5000
3500
3975
A
A
M
.5
1
1
1M3975LA90K94Honda Accord
.8A3500Malibu100K94Toyota Camry
Honda Accord
Honda Accord
Honda Accord
94
94
95
70K
60K
150K
San Diego
Irvine
LA
4500
5000
3500
A
A
A
.6
.7
1
MARS-QLselect * from user_car_catalogwhere model ~= Honda Accord, year >= 1994, price <= 4K, location ~= LA
UCI Database Group
Refined Results
Query Refinement (QR)
select * from user_car_catalogwhere model ~= Honda Accord, year >= 1994, price <= 4K, location ~= LA,
mileage~=60K
Refined Query
1M3975LA90K94Honda Accord–
.8A3500Malibu100K94Toyota Camry
Honda Accord
Honda Accord
Honda Accord
94
94
95
70K
60K
150K
San Diego
Irvine
LA
4500
5000
3500
A
A
A
.6
.7
1
Results
Mileage also important
.6A3500Malibu100K94Toyota Camry
.6A4500San Diego
70K94Honda Accord
.8A5000Irvine60K94Honda Accord
.9M3975LA90K94Honda Accord
Honda Accord 93 80K San Diego
4500 A .5
UCI Database Group
Why are SR and QR important?
• Most queries are similarity searches – Specially in exploratory data analysis tasks (e.g., catalog search)– Users have only a partial idea of their information need
• Existing Search technologies (text retrieval, SQL) do not provide appropriate support for SR and (almost) no support for QR.– Users must artificially convert similarity queries to keyword-searches
or exact-match queries
– Good mappings difficult or not feasible• Lack of good knowledge of the underlying data or its structure• Exact-match may be meaningless for certain data types (e.g., images, text, multimedia)
UCI Database Group
Similarity Access and Interactive Mining Architecture Search Client
Query SessionManager
Query LogManager/Miner
SimilarityQuery
Processor
Feedback-basedRefinement
Method
ORDBMS
SimilarityOperators
Types
FeedbackTable
Database
QueryLog
AnswerTable
ScoresTable
RefinementManager
History-basedRefinement
Method
Query/Feedback Ranked Results
InitialQuery
Feedback
RankedResults
Query Results
Schemes
RankingRules
Legend:--- logging__ Process
UCI Database Group
MARS Challenges...
• Learning queries from– user interactions– user profiles– past history of other users
• Efficient implementation of – similarity queries– refined queries
• Role of similarity queries in– OLAP– interactive data mining
UCI Database Group
Query-Session Manager
-parse the query- check query validity-generate schema for support tables - maintain sessions registry
Similarity Query Processor
-executes query on ORDBMS- ranks results (e.g. can exclude already see tuples, etc) - logs query(query or Top-k)
Refinement Manager
- maintains a registry of query refinement policies (content/collaborative)- generates the scores table- identifies and invokes intra-predicate refiners.
Query Log Manager/Miner
- maintains query log . Initial-Final pair . Top-K results . Complete trajectory - Query-query similarity (can have multiple policies) - Query clustering
UCI Database Group
Text Search Technologies (Altavista, Verity, Vality, Infoseek)
Strengths
support ranked retrieval
can handle missing data, synonyms, data entry errors
Approach
convert enterprise structured data into a searchable text index.
Limitations
cannot capture semantics of relationships in data
cannot capture semantics of non-text fields (e.g., multimedia)
limited support for refinement or preferences in current systems
cannot express similarity queries over structured or semi-structured data (e.g., price, location)
Honda accord near LA
approx. $4000
… … …
… … …
… … …
… … …… … …
… … …
MoviesActors
Directors
Al Pacino acted in a
movie directed by Francis
Ford Coppola
UCI Database Group
SQL-based Search TechnologiesOracle, Informix, DB2, Mercado
Approach
translate similarity query into exact SQL query.
Strengths
support structured as well as semi-structured data
support for arbitrary data types
Scalable attribute-based lookup
Limitations
translation is difficult or not possible
difficult to guess right ranges
causes near misses
not feasible for non-numeric fields
cannot rank answers based on relevance
does not account for user preference or query refinement
1994 Honda accord near LA approx.
$4000
select *from user_car_catalogwhere model = Honda Accord and 1993 year 1995 and dist(90210) 50 and price < 5000
top related