© Cloudera, Inc. All rights reserved.
Engines, Algorithms, and Data Models
Josh Wills | Senior Director of Data Science
From Dimensional Modeling to Machine Learning
My First Data Warehouse
My Current Data Warehouse
The Rise of the Data Scientist
Data Scientist Supply vs. Data Scientist Demand
Moneyball and Data Science
Choosing The Right Metrics
1. Analyzing “Unstructured” Data Sources
2. Building Machine Learning Models
3. Turn Static Reports Into Analytical Applications
Answering More Questions in Less Time
How To Answer Questions Like A Data Scientist
The Life of a Data Processing Job
1. Read and deserialize input data.
2. Project/filter input records.
3. Shuffle: serialize it, send over the network, deserialize it.
4. Apply aggregation logic.
5. Serialize output data.
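
The five steps of this slide can be sketched as a toy, single-process job. This is only an illustration under stated assumptions: the record fields (`user`, `bytes`), the JSON encoding, and the function name `run_job` are hypothetical and do not come from the talk, and the "network" is simulated by an in-memory dictionary of partitions.

```python
import json
from collections import defaultdict


def run_job(raw_lines, num_partitions=2):
    """Toy walk-through of a data processing job (hypothetical record layout)."""
    # 1. Read and deserialize input data.
    records = [json.loads(line) for line in raw_lines]

    # 2. Project/filter input records: keep two fields, drop empty transfers.
    projected = [(r["user"], r["bytes"]) for r in records if r["bytes"] > 0]

    # 3. Shuffle: serialize each record, route it to a partition chosen by key
    #    (standing in for a network send), then deserialize it on the other side.
    partitions = defaultdict(list)
    for key, value in projected:
        payload = json.dumps([key, value])
        partitions[sum(key.encode()) % num_partitions].append(payload)

    # 4. Apply aggregation logic: total bytes per user.
    totals = defaultdict(int)
    for payloads in partitions.values():
        for payload in payloads:
            key, value = json.loads(payload)
            totals[key] += value

    # 5. Serialize output data.
    return [json.dumps({"user": k, "total_bytes": v})
            for k, v in sorted(totals.items())]
```

Even in this toy version, records are serialized or deserialized in steps 1, 3, and 5, and step 3 does both.
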
Handling the Cost of Serialization
The Traditional RDBMS Approach
The Cost of The Traditional RDBMS Approach
Query Scheduling and Exploratory Data Analysis
The Spark Approach
The Cost of the Spark Approach
The MapReduce Approach
MapReduce In The Hands of a Data Scientist
Example: Hive Multi-Insert
Our Goal: Public Transit for Questions
Data Modeling for Data Science
Motivating Example: Spelling Correction
Event Series Analytics
A Simple Star Schema for Spell Correction
The Combinatorial Explosion
Designing the Spell Correction Data Product
• What parameters does this model need…
  • during the analysis phase?
  • during deployment?
• Some Candidates
  • Lag time between events
  • Similarity of queries
  • What else?
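
The two candidate parameters named on this slide, lag time between events and similarity of queries, could be computed roughly as follows. This is a minimal sketch, not the method from the talk: the use of Levenshtein edit distance as the similarity measure, the thresholds `max_lag_seconds` and `max_distance`, and the `(timestamp, query)` event layout are all assumptions made for illustration.

```python
from datetime import datetime


def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed similarity measure)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def candidate_pairs(events, max_lag_seconds=30, max_distance=2):
    """Find consecutive queries that look like a typo followed by a fix.

    events: time-ordered list of (timestamp, query) for one user.
    Both thresholds are hypothetical tuning parameters.
    """
    pairs = []
    for (t1, q1), (t2, q2) in zip(events, events[1:]):
        lag = (t2 - t1).total_seconds()
        dist = edit_distance(q1, q2)
        if lag <= max_lag_seconds and 0 < dist <= max_distance:
            pairs.append((q1, q2, lag, dist))
    return pairs
```

During analysis the thresholds can be swept over a range to see how the candidate set changes; at deployment they would be fixed values.
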
A Supernova Schema for Search
Spell Correction in SQL
Exhibit: http://github.com/jwills/exhibit
Querying Nested Types with Impala
Trading Up: From Data Analyst to Data Scientist
• Core Metric: # Outputs / # Jobs
• Measure on both an individual and aggregate level
• Drive the marginal cost of asking one additional question towards zero
• Point business analysts at output tables for interactive analysis with Impala
• Self-serve BI frees up resources (compute + data science time)
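
As a rough illustration of the core metric, the ratio of outputs to jobs can be computed per individual and in aggregate from a simple job log. The log layout (one `(analyst, num_outputs)` pair per job) and the function name `outputs_per_job` are hypothetical; the slide does not specify how the counts are collected.

```python
from collections import defaultdict


def outputs_per_job(job_log):
    """Compute # outputs / # jobs per analyst and overall.

    job_log: iterable of (analyst, num_outputs), one entry per job run.
    This layout is an assumption made for the sketch.
    """
    jobs = defaultdict(int)
    outputs = defaultdict(int)
    for analyst, num_outputs in job_log:
        jobs[analyst] += 1
        outputs[analyst] += num_outputs

    per_analyst = {a: outputs[a] / jobs[a] for a in jobs}
    total_jobs = sum(jobs.values())
    aggregate = sum(outputs.values()) / total_jobs if total_jobs else 0.0
    return per_analyst, aggregate
```

A job that writes several output tables from one pass over the data raises the ratio, which is how the marginal cost of one more question moves towards zero.
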
Thanks! @josh_wills