CONTROL Overview
CONTROL group: Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie (UC Berkeley); Peter Haas (IBM Almaden)
Context (wild assertions)
• Value from information
– the pressing problem in CS (?) (!!)
• “Point” querying and data management is a solved problem
– at least for traditional data (business data, documents)
• “Big picture” analysis is still hard
Data Analysis c. 1998
• Complex: people use many tools
– SQL aggregation (decision support systems, OLAP)
– AI-style WYGIWIGY systems (e.g. data mining, IR)
• Both are black boxes
– users must iterate to get what they want
– batch processing (big picture = big wait)
• We are failing important users!
– decision support is for decision-makers!
– a black box is the world’s worst UI
Black Box Begone!
• Black boxes are bad
– cannot be observed while running
– cannot be controlled while running
• These tools can be very slow
– exacerbates the previous problems
• Thesis:
– there will always be slow computer programs, usually data-intensive
– the fundamental issue is looking into the box...
Crystal Balls
• Allow users to observe processing
– as opposed to “lucite watches”
• Allow users to predict the future
• Ideally, allow users to change the future
– online control of processing
• The CONTROL project:
– online delivery, estimation, and control for data-intensive processes
Performance Regime for CONTROL
• Online performance:
– maximize the 1st derivative of the “mirth index”
(Figure: percent complete (up to 100%) vs. time, comparing CONTROL with traditional batch processing.)
Examples
• Online aggregation
– Informix Dynamic Server
• enhanced by UCB students with CONTROL algorithms
• lots of algorithmics, many fussy end-to-end system issues [Avnur, Hellerstein, Raman DMKD ’00]
– IBM has an ongoing project to do this in DB2
– IBM buys Informix (4/01)
• Online visualization
– visual enumeration & aggregation
• Interactive data cleaning & analysis
– Potter’s Wheel ABC
– online “enumeration” and discrepancy detection
Example: Online Aggregation
SELECT AVG(gpa) FROM students
GROUP BY college;
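As a concrete illustration of what this interface computes, here is a minimal Python sketch (ours, not the Informix implementation): running per-group averages with large-sample confidence intervals that tighten as tuples stream in. The group names and data are synthetic.

```python
import math
import random

def online_avg(rows, z=1.96):
    """Progressive per-group AVG with large-sample confidence intervals.

    After each tuple, yields {group: (running_mean, half_width)} where
    half_width ~ z * s / sqrt(n) from the Central Limit Theorem.
    Running sums keep the memory per group constant.
    """
    stats = {}  # group -> (n, sum, sum_of_squares)
    for group, value in rows:
        n, s, ss = stats.get(group, (0, 0.0, 0.0))
        stats[group] = (n + 1, s + value, ss + value * value)
        snapshot = {}
        for g, (n, s, ss) in stats.items():
            mean = s / n
            var = max(ss / n - mean * mean, 0.0)
            half = z * math.sqrt(var / n) if n > 1 else float("inf")
            snapshot[g] = (mean, half)
        yield snapshot

# Usage on synthetic students: estimates are available immediately and
# the intervals tighten as the scan proceeds.
random.seed(0)
rows = [(random.choice(["engineering", "law"]), random.uniform(2.0, 4.0))
        for _ in range(1000)]
for snapshot in online_avg(rows):
    final = snapshot
```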
Example: Online Data Visualization
• In Tioga DataSplash
Visual Transformation Shot
Scalable Spreadsheets
Decision-Support in DBMSs
• Aggregation queries
– compute a set of qualifying records
– partition the set into groups
– compute aggregation functions on the groups
– e.g.:
SELECT college, AVG(grade)
FROM enroll
GROUP BY college;
Interactive Decision Support?
• Precomputation
– the typical “OLAP” approach (a.k.a. data cubes)
– doesn’t scale, no ad hoc analysis
– blindingly fast when it works
• Sampling
– makes real people nervous?
– no ad hoc precision
• sample in advance
• can’t vary stats requirements
– per-query granularity only
Online Aggregation
• Think “progressive” sampling
– a la images in a web browser
– good estimates quickly, improving over time
• Shift in performance goals
– the online mirth index
• Shift in the science
– UI emphasis drives system design
– leads to different data delivery and result estimation
– motivates online control
Not everything can be CONTROLed
• “needle in haystack” scenarios
– the nemesis of any sampling approach
– e.g. highly selective queries, MIN, MAX, MEDIAN
• not useless, though
– unlike presampling, users can get some info (e.g. max-so-far)
• we advocate a mixed approach
– explore the big picture with online processing
– when you drill down to the needles, or want full precision, go batch-style
– can do both in parallel
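A tiny sketch, on made-up data, of why such queries defeat sampling: the running max-so-far is always a valid lower bound, but one extreme outlier placed anywhere in the scan means no prefix gives a trustworthy estimate of the true MAX.

```python
import random

# Illustrative data: the "needle" value and its position are made up.
random.seed(1)
data = [random.gauss(100, 10) for _ in range(10_000)]
data[7_777] = 10_000.0  # one extreme outlier: the needle

# Track max-so-far over a sequential scan.
max_so_far = []
m = float("-inf")
for x in data:
    m = max(m, x)
    max_so_far.append(m)

# Halfway through, max-so-far is a valid lower bound but far from the
# true MAX; only the full scan finds the needle.
halfway, exact = max_so_far[5_000], max_so_far[-1]
```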
New Techniques
• Online reordering
– gives control of group delivery rates
– applicable outside the RDBMS setting
• Ripple join family of join algorithms
– comes in naïve, block & hash flavors
• Statistical estimators & confidence intervals
– for single-table & multi-table queries
– for AVG, SUM, COUNT, STDEV
– leave it to Peter
• Visual estimators & analysis
Online Reordering
• users perceive data being processed over time
– prioritize processing for “interesting” tuples
– interest based on user-specified preferences
• reorder the dataflow so that interesting tuples go first
• encapsulate reordering as a pipelined dataflow operator
• online aggregation
– for SQL aggregate queries, give gradually improving estimates with confidence intervals
– allow users to speed up estimate refinement for groups of interest
– prioritize processing at a per-group granularity
SELECT AVG(gpa) FROM students
GROUP BY college;
Context: an application of reordering
Framework for Online Reordering
• want no delay in processing, so in general reordering can only be best-effort
• typically process/consume is slower than produce
– exploit the throughput difference to reorder
• two aspects
– mechanism for best-effort reordering
– reordering policy
(Figure: a produce → reorder → process/consume pipeline, e.g. over a network transfer; the input stream “abcdabc…” is reordered to “acddbadb…” according to a user-interest function f(t).)
Juggle mechanism for reordering
• two threads
– prefetch from the input
– spool/enrich from an auxiliary side disk
• juggle data between the buffer and the side disk
– keep the buffer full of “interesting” items
– getNext chooses the best item currently in the buffer
• getNext and enrich/spool decisions are based on the reordering policy
• side-disk management
– hash index, populated in a way that postpones random I/O
(Figure: produce → prefetch → buffer; spool/enrich between the buffer and the side disk; getNext feeds process/consume.)
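A single-threaded toy sketch of the juggle idea follows. All names are ours; the real operator runs produce and consume in separate threads and uses a hash-indexed side disk, whereas this version only shows the spool/enrich/getNext logic with a static per-group priority standing in for the reordering policy.

```python
import heapq
from collections import defaultdict

class Juggle:
    """Toy juggle operator: bounded buffer + side store (the "side disk").

    Items arrive tagged with a group. The buffer is kept full of
    interesting items; overflow is spooled to the side store, and
    get_next hands the consumer the best buffered item.
    """
    def __init__(self, priority, capacity=8):
        self.priority = priority       # group -> importance (bigger = better)
        self.capacity = capacity
        self.buffer = []               # heap of (-priority, seq, group, item)
        self.side = defaultdict(list)  # side-disk stand-in
        self.seq = 0

    def put(self, group, item):
        self.seq += 1
        heapq.heappush(self.buffer, (-self.priority(group), self.seq, group, item))
        if len(self.buffer) > self.capacity:
            # Spool the least interesting buffered item to the side store.
            worst = max(self.buffer)
            self.buffer.remove(worst)
            heapq.heapify(self.buffer)
            self.side[worst[2]].append(worst[3])

    def get_next(self):
        if not self.buffer:
            # Enrich: fall back to the side store when the buffer is empty.
            for group, items in self.side.items():
                if items:
                    return group, items.pop()
            return None
        _, _, group, item = heapq.heappop(self.buffer)
        return group, item
```

Usage: with a high priority on group "D", a "D" tuple is delivered first even if it arrived last, and spooled tuples reappear once the buffer drains.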
Reordering policies
• quality of feedback for a prefix t1 t2 … tk:
QOF(UP(t1), UP(t2), …, UP(tk)), where UP = user preference
– determined by the application
• goodness of reordering: dQOF/dt
• implication for the juggle mechanism
– process gets the item from the buffer that increases QOF the most
– the juggle tries to maintain the buffer with such items
(Figure: QOF vs. time. GOAL: deliver a “good” permutation of the items t1…tn.)
QOF in Online Aggregation
• average weighted confidence interval
• preference acts as a weight on the confidence interval
(Recall from the Central Limit Theorem that the sample mean’s confidence-interval half-width is proportional to σ/√n. Conservative (Hoeffding) confidence intervals also have a √n in the denominator. So…)
QOF = −Σi UPi/√ni,  where ni = number of tuples processed from group i
– process pulls items from the group with max UPi/ni^(3/2)
– desired ratio of group-i tuples in the buffer = UPi^(2/3) / Σj UPj^(2/3)
– the juggle tries to maintain this by enrich/spool
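The policy arithmetic above can be played out numerically. In this sketch the preferences are made up; the 2/3-power buffer ratios and the UPi/ni^(3/2) pull rule are the formulas from the slide.

```python
# Made-up preferences; the policy targets QOF = -sum_i UP_i / sqrt(n_i).
UP = {"A": 1.0, "B": 1.0, "C": 1.0, "D": 5.0, "E": 3.0}
n = {g: 1 for g in UP}  # tuples processed so far (start at 1 to avoid /0)

# Desired buffer mix: group i should hold UP_i^(2/3) / sum_j UP_j^(2/3).
total = sum(p ** (2 / 3) for p in UP.values())
ratio = {g: p ** (2 / 3) / total for g, p in UP.items()}

def next_group():
    # Pull from the group with the largest marginal QOF gain UP_i / n_i^(3/2).
    return max(UP, key=lambda g: UP[g] / n[g] ** 1.5)

for _ in range(200):  # simulate 200 consumer pulls
    n[next_group()] += 1
```

Running this, the high-preference groups dominate both the desired buffer mix and the processing counts, which is exactly the "interesting groups go faster" behavior.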
Other QOF functions
• rate of processing (for a group) proportional to preference
– QOF = −Σi (ni − n·UPi)²  (variance from the ideal proportions)
– process pulls items from the group with max (n·UPi − ni)
– desired ratio of group-i tuples in the buffer = UPi
Results: Reordering in Online Aggregation
• implemented in the Informix UDO server
• experiments with modified TPC-D queries
• questions:
– how much throughput difference is needed for reordering?
– can reordering handle skewed data?
• one stress test: skew, very small processing cost
– index-only join
– 5 order priorities, Zipf-distributed preferences:
A: 1, B: 1/2, C: 1/3, D: 1/4, E: 1/5
(Query plan: scan → juggle → index-only join → process/consume.)
SELECT AVG(o_totalprice), o_orderpriority
FROM order
WHERE EXISTS (SELECT * FROM lineitem
              WHERE l_orderkey = o_orderkey)
GROUP BY o_orderpriority;
Performance results
                      A    B    C    D    E
initial preferences   1    1    1    5    3
after T1              1    1    3.5  0.5  1
• 3 times faster for interesting groups
• 2% completion-time overhead
(Figures: number of tuples processed per group vs. time, and confidence-interval width vs. time for groups E, C and A.)
Ripple Joins
• Good confidence intervals for joins of samples
– vs. samples of joins!
– requires a “cross-product CLT”
• Progressively refining join:
– ever-larger rectangles in R × S
– we can update confidence intervals at the “corners”
– comes in loop, index and hash flavors
• Benefits:
– sample from both relations simultaneously
– “animation rate”: a goal for the next “corner”, which determines an optimization problem based on observations so far
• old-fashioned systems are one extreme
– adaptively tune the “aspect ratio” for the next “corner”
• sample faster from the higher-variance relation
– intimate relationship between delivery and estimation
(Figure: a traditional join sweeps R × S one relation at a time; a ripple join grows an ever-larger square over R × S.)
Haas & Hellerstein, SIGMOD 99
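A simplified square (naïve) ripple join can be sketched in a few lines; this toy version, on made-up data, ignores aspect ratios, blocking and hashing, and just shows the L-shaped increments and the running estimate available at each corner.

```python
import random

def ripple_count(R, S, steps):
    """Toy square ripple join estimating |R equi-join S|.

    At corner k every pair in R[:k] x S[:k] has been examined, so
    matches * |R| * |S| / k^2 estimates the join size when both
    inputs are in random order. Yields (k, estimate).
    """
    matches = 0
    for k in range(1, steps + 1):
        # L-shaped increment: the new R-tuple against the old S-prefix,
        # then the new S-tuple against the new R-prefix.
        for s in S[:k - 1]:
            matches += R[k - 1] == s
        for r in R[:k]:
            matches += r == S[k - 1]
        yield k, matches * len(R) * len(S) / (k * k)

# Usage on synthetic relations: estimates are available long before
# the full cross product has been examined.
random.seed(2)
R = [random.randrange(50) for _ in range(2000)]
S = [random.randrange(50) for _ in range(2000)]
estimates = [est for _, est in ripple_count(R, S, 400)]
```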
Aspect Ratios
• Consider an extreme example:
• In general, to get to the next corner:
– need a cost model parameterized by relation
• different for block and hash
– “benefit”: change in confidence interval
– an online linear optimization problem
• Arguments about estimates converging quickly, stabilizing…
Fussy Implementation Details
• How to implement as an iterator? Issues:
– need cursors on all inputs (as usual)
– need to maintain aspect ratios
– need to maintain the current “inner” & its cursor
• i.e. the relation currently being scanned
– need to know the current sampling step
• to know how far to scan the current “inner”
– need to know the “starter” for the next step
• determines the length of the scan (see pic) and the end of the sampling step
• and pass that role along at EOF
Ensuring Aspect Ratios
Ripple Join Performance
• Too lazy to fetch graphs, but…
– typically orders of magnitude benefit vs. batch…
CONTROL Lessons
• Dream about UIs, work on systems
– user needs drive systems design!
• Systems and statistics intertwine
– “what unlike things must meet and mate”
• “Art”, Herman Melville
• Sloppy, adaptive systems are a promising direction
• Sloppy, adaptive systems a promising direction
Questions
• Where else do these lessons apply?
– outside of data analysis and manipulation
• Systems people think a lot about interfaces (APIs)…
– encapsulation, narrow interfaces…
– in the CONTROL regime, how do you design these APIs and build systems?
• Ubiquitous computing:
– is it about portable computing and point access/delivery?
– or sensors/actuators, dataflow, and big-picture queries?
More?
• CONTROL: http://control.cs.berkeley.edu
– Overview: IEEE Computer, 8/99
• Telegraph: http://db.cs.berkeley.edu/telegraph
Backup slides
• The following slides may be used to answer questions...
Sampling
• Much is known here
– Olken’s thesis
– the DB sampling literature
– more recent work by Peter Haas
• Progressive random sampling
– can use a randomized access method (watch for dups!)
– can maintain the file in random order
– can verify statistically that values are independent of the stored order
Estimators & Confidence Intervals
• Conservative confidence intervals
– extensions of Hoeffding’s inequality
– appropriate early on; give wide intervals
• Large-sample confidence intervals
– use the Central Limit Theorem
– appropriate after “a while” (~dozens of tuples)
– linear memory consumption
– tight bounds
• Deterministic intervals
– only useful in “the endgame”