Christopher Ré: Joint work with the Hazy Team (http://www.cs.wisc.edu/hazy)
![Page 1: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/1.jpg)
Christopher Ré
Joint work with the Hazy Team
http://www.cs.wisc.edu/hazy
![Page 2: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/2.jpg)
Two Trends that Drive Hazy
1. Data in unprecedented number of formats
2. Arms race for deeper understanding of data
Hazy integrates statistical techniques into an RDBMS
[Diagram labels: Automated Statistical AND Manage Data RDBMS]
Hazy Hypothesis: Handful of statistical operators capture a diverse set of applications.
![Page 3: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/3.jpg)
Outline
Three Application Areas for Hazy
Drill Down: One Text Application
Maintaining the Output of Classification
Hazy Heads to the South Pole
![Page 4: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/4.jpg)
Data constantly generated on the Web, Twitter, Blogs, and Facebook
Build tools to lower cost of analysis
Extract and Classify sentiment about products, ad campaigns, and customer facing entities.
Statistical tools for extraction (e.g., CRFs) and classification (e.g., SVMs). Performance and maintenance are data management challenges (DMC).
![Page 5: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/5.jpg)
DMC: Transform and maintain large volumes of sensor data and derived analysis
A physicist interpolates sensor readings and uses regression to more deeply understand their data
Models that map sequences of words to entities are similar to some models that map sensor readings to meaning
![Page 6: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/6.jpg)
OCR and Speech
DMC: Process large volumes of statistical data
Getting text is challenging! (statistical model of transcription errors)
A social scientist wants to extract the frequency of synonyms of English words in 18th century texts.
Output of speech and OCR models similar to output of text labeling models
![Page 7: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/7.jpg)
Takeaway and Implications
Statistical processing on large data enables a wide variety of new applications.
Key challenges are maintenance and performance (data management challenges)
Hazy Hypothesis: Handful of statistical operators capture a diverse set of applications
![Page 8: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/8.jpg)
Outline
Three Application Areas for Hazy
Drill Down: One Text Application
Maintaining the Output of Classification
Hazy Heads to the South Pole
![Page 9: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/9.jpg)
Classify publications by subject area
![Page 10: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/10.jpg)
The workflow requires several steps
Classify publication by subject area
Hazy Evidence: We know names for these operators
Simplified workflow
1. Paper references are crawled from the Web.
2. Entities (Papers, Authors, …) are extracted and deduplicated.
3. Each paper is classified by subject area.
4. DB is queried to render Web page.
We still use the RDBMS for rendering, reports, etc.
![Page 11: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/11.jpg)
How Hazy Helps
![Page 12: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/12.jpg)
Statistical Computations Specified Declaratively
Hazy/RDBMS
CREATE CLASSIFICATION VIEW V(id,label) ENTITIES FROM Papers
EXAMPLES FROM Example
Declarative SQL-Like Program
Tuples In. Tuples out. Hazy handles the statistical and traditional details.
![Page 13: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/13.jpg)
Hazy Helps with Corrections
Hazy/ RDBMS
Paper 10 is not about query optimization -- it is about Information Extraction
Easy as an INSERT: Update fixes that entry – and perhaps more – automatically.
CREATE CLASSIFICATION VIEW V(id,label) ENTITIES FROM Papers
EXAMPLES FROM Example
Declarative SQL-Like Program
![Page 14: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/14.jpg)
Design Goals: Hazy should…
• … look like SQL as much as possible
– Ideal: application unaware of statistical techniques
– Build on solutions for classical data management problems
• … automate routine tasks
– E.g., updates propagate through the system
– Eventually, order operators for performance
![Page 15: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/15.jpg)
Where Hazy is Now
![Page 16: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/16.jpg)
• In PostgreSQL, we’ve built:
– Classification: SVMs, Least Squares
– Deduplication: synonym detection and coref
– Factor Analysis: Low-Rank for Netflix
– Transducers for Sequences: Text, Audio, & OCR
– Sophisticated Reasoning: Markov Logic Networks
Building Like Mad (Cows)
Developer declares task to Hazy using SQL-like views
CREATE CLASSIFICATION VIEW V(id,label)
ENTITIES FROM Paper(id, vec)
EXAMPLES FROM EX_Paper (id,vec,label)
USING SVM_L2
Model-based Views (Deshpande et al.)
![Page 17: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/17.jpg)
Reasoning by Analogy…
| Classical | Hazy Operator |
|---|---|
| Selection | Classification |
| Projection | Deduplication |
| Join | Factor Analysis |
| SQL’s LIKE | Transducer algebra |
| Constraints | Markov Logic Networks |
Hazy Hypothesis: Handful of statistical operators capture a diverse set of applications.
![Page 18: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/18.jpg)
Outline
Three Application Areas for Hazy
Drill Down: One Text Application
Maintaining the Output of Classification
Hazy Heads to the South Pole
![Page 19: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/19.jpg)
Maintenance: What about corrections?
Hazy/ RDBMS
Paper 10 is not about query optimization -- it is about Information Extraction
Easy as an INSERT: Update fixes that entry and others automatically! How does Hazy do this?
CREATE CLASSIFICATION VIEW …. ENTITIES FROM PAPERS …
EXAMPLES FROM Ex…
Declarative SQL-Like Program
![Page 20: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/20.jpg)
Background: Linear Models
Experts: Logistic Regressions, SVMs, with/without Kernels. We leverage that they all perform inference the same way.
Label papers as DB Papers or Non-DB Papers
1. Map each paper to R^d
2. Classify via plane w
[Figure: papers 1–5 plotted as points; the plane w separates DB Papers from Non-DB Papers]
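The two steps on this slide can be sketched in a few lines. This is only an illustration: the 2-d feature vectors, the plane `(w, b)`, and the convention that +1 means "DB Paper" are made-up assumptions, not values from the talk.

```python
from typing import Sequence


def classify(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Return +1 if x lies on the non-negative side of the plane w.x = b, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) - b
    return 1 if score >= 0 else -1


# Step 1 (assumed already done): each paper is mapped to a vector in R^d, here d = 2.
# These vectors are hypothetical.
papers = {
    1: (2.0, 1.0),
    2: (1.5, 0.5),
    3: (0.6, 0.5),
    4: (-1.0, 0.2),
    5: (-2.0, -1.0),
}

# Step 2: classify via a plane. w and b are hypothetical model parameters.
w, b = (1.0, 0.0), 0.5
labels = {pid: classify(w, b, vec) for pid, vec in papers.items()}
```

Logistic regression and SVMs differ in how they learn `w`, but at inference time they all reduce to this sign test, which is the property the slide says Hazy leverages.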
![Page 21: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/21.jpg)
What happens on an update?
Paper 3 is not a Database Paper!
Oh no! The model (w) changes in wild and crazy ways! … well not really.
[Figure: papers 1–5 plotted as points with the plane w separating DB Papers from Non-DB Papers]
![Page 22: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/22.jpg)
Intuition: Model Changes only Slightly
Paper 3 is not a Database Paper!
It would be a waste of effort to relabel all of 1, 4, and 5. Can we just focus on 2 and 3?
[Figure: papers 1–5 with the old plane w and the updated plane w’; DB Papers and Non-DB Papers regions]
That is, ||w – w’|| is small.
![Page 23: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/23.jpg)
Hazy-Classify
Cluster data by how likely to change classes
Prop: There exist hw and lw, functions of ||w – w’||, s.t. pid can change labels only if pid.eps in [lw, hw]
[Figure: papers 1–5 with planes w and w’; e4 and e5 marked; the band between lw and hw is labeled "only relabel here"]

| PID | eps |
|---|---|
| 1 | +0.2 |
| 3 | +0.1 |
| 2 | -0.2 |
| 5 | -0.4 |
| 4 | -3 |
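A minimal sketch of the idea behind the proposition, under stated assumptions. Here eps is taken to be the score w·x under the model used at clustering time (no bias term), and the band is derived from the Cauchy-Schwarz bound |w’·x − w·x| ≤ ||w − w’|| · max ||x||. The class `EpsIndex`, its methods, and this particular choice of lw and hw are hypothetical illustrations, not Hazy's actual bounds or API.

```python
import bisect
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


class EpsIndex:
    """Keeps entities sorted by eps = w0 . x so only a narrow band is relabeled."""

    def __init__(self, points, w0):
        self.points = points                      # pid -> feature vector
        self.w0 = w0                              # model at clustering time
        self.max_norm = max(math.sqrt(dot(x, x)) for x in points.values())
        self.entries = sorted((dot(w0, x), pid) for pid, x in points.items())
        self.keys = [eps for eps, _ in self.entries]
        self.labels = {pid: (1 if eps >= 0 else -1) for eps, pid in self.entries}
        self.prev_hw = 0.0                        # band used by the previous update

    def update(self, w_new):
        """Relabel only entities whose eps falls in [lw, hw]; return their pids."""
        delta = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_new, self.w0)))
        hw = self.max_norm * delta                # Cauchy-Schwarz bound (assumption)
        # Also cover the previous band, so labels changed earlier get re-checked.
        band = max(hw, self.prev_hw)
        lo = bisect.bisect_left(self.keys, -band)
        hi = bisect.bisect_right(self.keys, band)
        touched = []
        for _, pid in self.entries[lo:hi]:
            self.labels[pid] = 1 if dot(w_new, self.points[pid]) >= 0 else -1
            touched.append(pid)
        self.prev_hw = hw
        return touched


# Hypothetical data: five papers in R^2, clustered under w0 = (1, 0).
points = {1: (2.0, 0.0), 2: (0.3, 1.0), 3: (-0.2, 1.0), 4: (-3.0, 0.0), 5: (-0.5, -1.0)}
index = EpsIndex(points, (1.0, 0.0))
touched = index.update((1.0, 0.25))
```

In this toy run only papers 2, 3, and 5 are re-scored; papers 1 and 4 sit far enough from the plane that their labels cannot change for a model this close to the clustering-time one.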
![Page 24: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/24.jpg)
But the clustering may get out of date!
Need to recluster periodically. How do we decide?
Setup: Measure the time to recluster; call that C. Set a timer T = 0 (intuition: T is the wasted time).
On each update: Run the algorithm from the previous slide and add its time to T. If T > C, then recluster and set T = 0.
Two claims that can be made precise (theorems):
A. The algorithm is within a factor of 2 of the optimal run time on any instance.
B. It is essentially the optimal deterministic strategy.
On DBLife, Citeseer, and ML datasets, Hazy is 10x+ faster than scan.
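The timer rule on this slide can be sketched as a small policy object. The class name `ReclusterPolicy` and the use of simulated costs instead of measured wall-clock times are assumptions made so the example is deterministic.

```python
class ReclusterPolicy:
    """Recluster once the accumulated update time T exceeds the recluster cost C."""

    def __init__(self, recluster_cost: float):
        self.C = recluster_cost   # measured time of one reclustering
        self.T = 0.0              # accumulated (wasted) time since the last recluster
        self.reclusters = 0

    def on_update(self, update_cost: float) -> bool:
        """Record the time spent on one update; return True if we should recluster."""
        self.T += update_cost
        if self.T > self.C:
            self.reclusters += 1
            self.T = 0.0
            return True
        return False


# Simulated update costs in arbitrary time units (hypothetical numbers).
policy = ReclusterPolicy(recluster_cost=10.0)
decisions = [policy.on_update(cost) for cost in [3.0, 4.0, 5.0, 2.0, 9.0]]
```

The rule has the shape of a rent-or-buy decision: keep paying the per-update cost until the total paid matches the one-time cost of reclustering, then pay the one-time cost.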
![Page 25: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/25.jpg)
Other Features of Hazy-Classify
• Hazy has a main-memory (MM) engine
• Hazy-Classify supports Eager and Lazy Materialization Strategies
– Improves either by an order of magnitude
• An index that keeps in memory only elements likely to change classes
– Allows 1% of data in memory with MM perf.
– Enables active learning on 100Gb+ corpus.
![Page 26: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/26.jpg)
Hazy Heads to the South Pole
![Page 27: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/27.jpg)
IceCube
Digital Optical Module (DOM)
![Page 28: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/28.jpg)
Workflow of IceCube
In Ice: Detection occurs.
At Pole: Algorithm says “Interesting!”
In Madison: Lots of data analysis.
Via satellite: Interesting DOM readings
![Page 29: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/29.jpg)
A Key Phase: Detecting Direction
Mathematical structure used to help track neutrinos is similar to labeling text/tracking/OCR!
Here, Speed ≈ Quality
![Page 30: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/30.jpg)
Framework: Regression Problems
Examples:
1. Neutrino Tracking: yi is a sensor reading
2. CRFs: yi is (token, label)
3. Netflix: yi is (user, movie, rating)
Other tools also fit this model, e.g., SVMs
$$\min_x \sum_{i=1}^{N} f(x, y_i) + P(x)$$
x: the model
y_i: a data item
f: scores the error
P: enforces prior
Claim: General data analysis technique that is amenable to RDBMS processing
![Page 31: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/31.jpg)
Background: Gradient Methods
$$F(x) = \sum_{i=1}^{N} f(x, y_i) + P(x)$$
Gradient Methods: Iterative.
1. Take current x,
2. Differentiate F wrt x,
3. Move in opposite direction
$$x_{k+1} = x_k - \nabla F(x_k)$$
[Figure: plot of F(x) showing the step from x_k to x_{k+1}]
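A minimal sketch of the three-step loop on a toy instance. The choices here are assumptions for illustration: a scalar model x, squared error f(x, y) = (x − y)², prior P(x) = λx², and a fixed step size `alpha` (the slide's update rule shows no step size).

```python
def grad_F(x: float, ys: list, lam: float) -> float:
    """Full gradient of F(x) = sum_i (x - y_i)^2 + lam * x^2."""
    return sum(2.0 * (x - y) for y in ys) + 2.0 * lam * x


def gradient_descent(ys: list, lam: float, alpha: float, iters: int) -> float:
    x = 0.0                                # 1. take the current x
    for _ in range(iters):
        g = grad_F(x, ys, lam)             # 2. differentiate F wrt x
        x = x - alpha * g                  # 3. move in the opposite direction
    return x


# Hypothetical data; the closed-form minimizer is sum(ys) / (N + lam) = 6 / 4 = 1.5.
x_star = gradient_descent([1.0, 2.0, 3.0], lam=1.0, alpha=0.1, iters=50)
```

Each iteration touches every data item once to compute the full gradient, which is the cost the incremental variant avoids.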
![Page 32: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/32.jpg)
Incremental Gradient Methods
$$F(x) = \sum_{i=1}^{N} f(x, y_i) + P(x)$$
Gradient Methods: Iterative.
1. Take current x,
2. Approximate derivative of F wrt x,
3. Move in opposite direction
$$x_{k+1} = x_k - \nabla F(x_k)$$
$$\nabla F(x) \approx N \nabla f(x, y_j) + \nabla P(x)$$
Can use a single data item to approximate
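A minimal sketch of an incremental gradient pass using the single-item approximation N∇f(x, y_j) + ∇P(x). The toy problem (scalar x, f(x, y) = (x − y)², P(x) = λx²), the diminishing step size, and the per-epoch shuffle are assumptions chosen for illustration.

```python
import random


def incremental_gradient(ys: list, lam: float, epochs: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    n = len(ys)
    x = 0.0
    k = 0
    for _ in range(epochs):
        order = ys[:]
        rng.shuffle(order)                         # visit items in random order
        for y in order:
            k += 1
            # Approximate grad F(x) from ONE data item: N * grad f(x, y) + grad P(x)
            g = n * 2.0 * (x - y) + 2.0 * lam * x
            # Diminishing step size (an assumption; the slide gives none)
            step = 1.0 / (2.0 * (n + lam) * k)
            x = x - step * g
    return x


# Same toy data as a full-gradient run would use; the minimizer is 6 / 4 = 1.5.
x_igm = incremental_gradient([1.0, 2.0, 3.0], lam=1.0, epochs=20)
```

Each step reads a single data item, so one pass over the data costs about as much as one full-gradient evaluation while making N updates to the model.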
![Page 33: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/33.jpg)
Incremental Gradient Methods (iGMs)
Why use iGMs? Provably, iGMs converge to an optimum for many problems, but the real reason is:
iGMs are fast.
Technical connection: iGM processing ≈ a single tuple at a time. RDBMS processing techniques apply.
No more complicated than a COUNT.
![Page 34: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/34.jpg)
Hazy’s SQL version of Incremental Gradient
Code generated automatically. Hazy Params: $mid and $model.
-- (1) Curry (cache) the model, x
SELECT cache_model($mid, $x);
-- (2) Shuffle
SELECT * INTO Shuffled FROM Data ORDER BY RANDOM();
-- (3) Execute the Gradient Steps
SELECT GRAD($mid, y) FROM Shuffled;
-- (4) Write the model back to the model instance table
UPDATE model_instance SET model=retrieve_model($mid) WHERE mid=$mid;
Input: Data(id,y), GRAD
Hazy does more optimization. This is a basic block.
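The four steps above can be imitated with Python's built-in `sqlite3` module by registering the gradient step as a user-defined aggregate, which is one way to read "no more complicated than a COUNT". This is a sketch under assumptions, not Hazy's generated code: SQLite stands in for PostgreSQL, `MODEL_CACHE` stands in for the cached model, and the toy problem (scalar model, squared error, L2 prior, diminishing step size) is made up.

```python
import sqlite3

MODEL_CACHE = {}   # mid -> current model x (hypothetical stand-in for the cached model)
N, LAM = 3, 1.0    # data size and prior weight for the toy problem


class Grad:
    """Aggregate that applies one incremental gradient step per input tuple."""

    def __init__(self):
        self.k = 0
        self.mid = None

    def step(self, mid, y):
        self.k += 1
        self.mid = mid
        x = MODEL_CACHE[mid]
        g = N * 2.0 * (x - y) + 2.0 * LAM * x          # N * grad f(x, y) + grad P(x)
        MODEL_CACHE[mid] = x - g / (2.0 * (N + LAM) * self.k)

    def finalize(self):
        return None if self.mid is None else MODEL_CACHE[self.mid]


conn = sqlite3.connect(":memory:")
conn.create_aggregate("GRAD", 2, Grad)
conn.execute("CREATE TABLE Data (id INTEGER PRIMARY KEY, y REAL)")
conn.executemany("INSERT INTO Data (id, y) VALUES (?, ?)",
                 [(1, 1.0), (2, 2.0), (3, 3.0)])
conn.execute("CREATE TABLE model_instance (mid INTEGER PRIMARY KEY, model REAL)")
conn.execute("INSERT INTO model_instance VALUES (7, 0.0)")

mid = 7
# (1) Cache the model
MODEL_CACHE[mid] = conn.execute(
    "SELECT model FROM model_instance WHERE mid = ?", (mid,)).fetchone()[0]
# (2) Shuffle
conn.execute("CREATE TABLE Shuffled AS SELECT * FROM Data ORDER BY RANDOM()")
# (3) Execute the gradient steps, one per tuple
(model,) = conn.execute("SELECT GRAD(?, y) FROM Shuffled", (mid,)).fetchone()
# (4) Write the model back to the model instance table
conn.execute("UPDATE model_instance SET model = ? WHERE mid = ?", (model, mid))
stored = conn.execute(
    "SELECT model FROM model_instance WHERE mid = ?", (mid,)).fetchone()[0]
```

The aggregate has the same shape as COUNT: a small piece of state, a per-tuple `step`, and a `finalize`. That is why the database's existing scan machinery can drive the learning loop.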
![Page 35: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/35.jpg)
More applications than a cube of ice!
- Recommending Movies on Netflix
– Experts: Low-rank Factorization.
– Old SOTA: 4+ hours.
– In RDBMS: 40 minutes.
– Hazy-MM: 2 minutes.
Prof. Ben Recht
Buzzwords: A novel parallel execution strategy for incremental gradient methods to optimize convex relaxations with constraints or proximal point operators.
Hazy-MM: We compile plans using g++ with a main memory engine (useful in IceCube).
Same Quality
![Page 36: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/36.jpg)
A Common Backbone
All of Hazy’s operators can have a weight learning or regression phase.

| Classical | Hazy Enhanced |
|---|---|
| Selection | Classification |
| Projection | Deduplication |
| Join | Factor Analysis |
| SQL’s LIKE | Transducer algebra |
| Constraints | Markov Logic Networks |
![Page 37: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/37.jpg)
Futuring (I learned this term from my wife)
• A main-memory engine for use in IceCube
• We are releasing our algorithms to Mahout
• We have some corporate partners who have given access to their data.
![Page 38: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/38.jpg)
Incomplete Related Work
Numeric methods to Hadoop: Ricardo [Das et al 2010], Mahout [Ng et al].
Incremental Gradients: Bottou, Vowpal Wabbit (Y!), Pegasos
Declarative IE: System T from IBM, DBLife [Doan et al], [Wang et al 2010]
Deduplication: Coref Systems (UIUC), Dedupalog [ICDE09]
Rules+Probability: MLNs [Richardson 05], PRMs [Koller 99]
Model-Based Views: MauveDB [Deshpande et al. 05]
![Page 39: Christopher Ré Joint work with the Hazy Team cs.wisc /hazy](https://reader036.vdocument.in/reader036/viewer/2022062811/56815ffa550346895dcef999/html5/thumbnails/39.jpg)
Conclusion
Future of data management is in managing these less precise sources
Key challenges: performance and maintenance. Hazy attacks this.
Hazy Hypothesis: Handful of statistical operators capture a diverse set of applications.