The slides for HadoopCon 2014 in Taiwan.
![Page 1: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/1.jpg)
Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto
Liang-Chi Hsieh
HadoopCon 2014 in Taiwan
1
![Page 2: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/2.jpg)
In today’s talk
• Introduction of Presto
• Distributed architecture
• Query model
• Deployment and configuration
• Data visualization with Presto - Demo
2
![Page 3: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/3.jpg)
SQL on/over Hadoop
• Hive
  • Mature and proven solution (0.13.x)
  • Drawback: execution model based on MapReduce
  • Better execution engines: Hive-Tez and Hive-Spark
• Alternative and usually faster options include
  • Impala, Presto, Drill, ...
3
![Page 4: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/4.jpg)
Presto
• Presto is a distributed SQL query engine optimized for ad-hoc analysis at interactive speed
• Data scale: GBs to PBs
• Deployment at:
  • Facebook, Netflix, Dropbox, Treasure Data, Airbnb, Qubole
4
![Page 5: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/5.jpg)
History of Presto
• Fall 2012
  • Development of Presto started at Facebook
• Spring 2013
  • It was rolled out to the entire company and became the major interactive data warehouse
• Winter 2013
  • Open-sourced
5
![Page 6: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/6.jpg)
The Problems to Solve
• Hive is not optimized for interactive data analysis as the data size grows to petabyte scale
• In practice, we need to keep reduced data in an interactive DB that provides quick query responses
  • Redundant maintenance cost, out-of-date data views, data transfer, ...
• The need to incorporate other data that is not stored in HDFS
6
![Page 7: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/7.jpg)
Typical Batch Data Architecture
7
[Diagram: Data Flow → HDFS → Batch Run → DB ← Query]
• Views generated in batch may be out of date
• Batch workflow is too slow
![Page 8: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/8.jpg)
Interactive Query on HDFS
8
[Diagram: Data Flow → HDFS ← Interactive query (Presto) ← Query]
![Page 9: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/9.jpg)
Interactive Query on HDFS and other Data Sources
9
[Diagram: Data Flow → HDFS, MySQL, Cassandra ← Interactive query (Presto) ← Query]
![Page 10: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/10.jpg)
Distributed Architecture
• Coordinator
  • Parsing statements
  • Planning queries
  • Managing Presto workers
• Worker
  • Executing tasks
  • Processing data
10
![Page 11: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/11.jpg)
11
![Page 12: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/12.jpg)
Storage Plugins
• Connectors
  • Providing interfaces for fetching metadata, getting data locations, and accessing the data
• Current connectors (v0.76)
  • Hive: Hadoop 1.x, Hadoop 2.x, CDH 4, CDH 5
  • Cassandra
  • MySQL
  • Kafka
  • PostgreSQL
12
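The three connector responsibilities listed on this slide (metadata, data locations, data access) can be pictured with a toy interface. This is only a sketch: Presto's real connector SPI is written in Java, and the class and method names below (`Connector`, `list_columns`, `get_splits`, `read_split`, `InMemoryConnector`) are hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical interface; NOT Presto's actual (Java) connector SPI."""

    @abstractmethod
    def list_columns(self, table):
        """Metadata: return the column names of a table."""

    @abstractmethod
    def get_splits(self, table):
        """Data locations: return identifiers of the chunks of a table."""

    @abstractmethod
    def read_split(self, table, split):
        """Data access: iterate over the rows stored in one split."""


class InMemoryConnector(Connector):
    """Toy connector that serves tables held in a Python dict."""

    def __init__(self, tables, rows_per_split=2):
        # tables maps a table name to (column_names, rows)
        self._tables = tables
        self._rows_per_split = rows_per_split

    def list_columns(self, table):
        return list(self._tables[table][0])

    def get_splits(self, table):
        rows = self._tables[table][1]
        n = self._rows_per_split
        # Ceiling division: number of fixed-size chunks needed for all rows
        return list(range((len(rows) + n - 1) // n))

    def read_split(self, table, split):
        rows = self._tables[table][1]
        n = self._rows_per_split
        return iter(rows[split * n:(split + 1) * n])


def scan(connector, table):
    """Read a whole table split by split, as a table-scan stage would."""
    for split in connector.get_splits(table):
        yield from connector.read_split(table, split)


toy = InMemoryConnector(
    {"logs": (["name", "bytes"], [("a", 1), ("b", 2), ("a", 3)])}
)
print(toy.list_columns("logs"), list(scan(toy, "logs")))
```

The point of the separation is that the engine only needs these three capabilities to query a source, which is why Hive, MySQL, Cassandra, and others can sit behind the same engine.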
![Page 13: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/13.jpg)
13
![Page 14: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/14.jpg)
Presto Clients
• Protocol: HTTP + JSON
• Client libraries available in several programming languages:
  • Python, PHP, Ruby, Node.js, Java, R
• ODBC through Prestogres
14
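Because the protocol is plain HTTP + JSON, a client can be quite small. The sketch below assumes the statement endpoint `/v1/statement`, the `X-Presto-*` request headers, and the `nextUri`/`data`/`error` response fields; the server address, user, catalog, and schema are placeholders, and the `transport` parameter is an addition made here so the paging loop can be exercised without a running server.

```python
import json
import urllib.request


def http_json(method, url, body=None, headers=None):
    """Send one HTTP request and decode the JSON response body."""
    data = body.encode("utf-8") if body is not None else None
    req = urllib.request.Request(
        url, data=data, method=method, headers=headers or {}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


def run_query(sql, server="http://localhost:8080", user="demo",
              catalog="hive", schema="default", transport=http_json):
    """Submit a statement and follow nextUri links until results end."""
    headers = {
        "X-Presto-User": user,
        "X-Presto-Catalog": catalog,
        "X-Presto-Schema": schema,
    }
    # The SQL text is sent as the request body of the initial POST.
    result = transport("POST", server + "/v1/statement", sql, headers)
    rows = []
    while True:
        if "error" in result:
            raise RuntimeError(result["error"].get("message", "query failed"))
        # Each response page may or may not carry a batch of rows.
        rows.extend(result.get("data", []))
        next_uri = result.get("nextUri")
        if next_uri is None:
            return rows
        result = transport("GET", next_uri, None, headers)
```

With a reachable coordinator, `run_query("SELECT name, count(*) FROM logs GROUP BY name")` would return the rows as lists; the `logs` table is the one used as an example later in the slides.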
![Page 15: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/15.jpg)
Query Model
• Presto’s execution engine does not use MapReduce
• It employs a custom query and execution engine
• Based on a DAG, which makes it more like Apache Tez, Spark, or MPP databases
15
![Page 16: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/16.jpg)
Query Execution
• Presto executes ANSI-compatible SQL statements
• Coordinator
  • SQL parser
  • Query planner
  • Execution planner
• Workers
  • Task execution scheduler
16
![Page 17: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/17.jpg)
Query Execution
[Diagram: AST → Query planner → Query plan → Execution planner → Execution plan, with inputs from Connector, Metadata, and NodeManager]
17
![Page 18: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/18.jpg)
Query Planner
SQL:
SELECT name, count(*) FROM logs GROUP BY name
Logical query plan:
Table scan → GROUP BY → Output
Distributed query plan:
Stage-2: Table scan → Partial aggregation → Output buffer
Stage-1: Exchange client → Final aggregation → Output buffer
Stage-0: Exchange client → Output
18
![Page 19: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/19.jpg)
Distributed query plan:
Stage-2: Table scan → Partial aggregation → Output buffer
Stage-1: Exchange client → Final aggregation → Output buffer
Stage-0: Exchange client → Output
[Diagram: Worker 1 and Worker 2 each run a Stage-2 task (Table scan → Partial aggregation → Output buffer) and a Stage-1 task (Exchange client → Final aggregation → Output buffer); the Stage-0 task (Exchange client → Output) runs on one worker]
* Tasks run on workers
19
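The partial/final aggregation split in the distributed plan can be imitated in a few lines of Python. This is a simulation, not Presto code: the splits, the names in them, and the CRC-based partitioning in `partition_of` are assumptions made for illustration, standing in for the output buffers and exchange clients between stages.

```python
import zlib
from collections import Counter


def partial_aggregation(split):
    """Stage-2 task: scan one split and count names locally."""
    return Counter(name for (name,) in split)


def partition_of(name, n_tasks):
    """Choose the Stage-1 task for a group key (illustrative hashing)."""
    return zlib.crc32(name.encode("utf-8")) % n_tasks


def exchange(partials, n_tasks):
    """Route every partial count to the task that owns its group key."""
    buckets = [[] for _ in range(n_tasks)]
    for partial in partials:
        for name, count in partial.items():
            buckets[partition_of(name, n_tasks)].append((name, count))
    return buckets


def final_aggregation(bucket):
    """Stage-1 task: sum the partial counts received for its keys."""
    total = Counter()
    for name, count in bucket:
        total[name] += count
    return total


# SELECT name, count(*) FROM logs GROUP BY name, over two splits of "logs"
splits = [
    [("alice",), ("bob",), ("alice",)],
    [("bob",), ("carol",)],
]
partials = [partial_aggregation(s) for s in splits]
finals = [final_aggregation(b) for b in exchange(partials, n_tasks=2)]

# Stage-0: gather the rows from all Stage-1 tasks (sorted for readability).
result = sorted(row for total in finals for row in total.items())
print(result)
```

Pre-aggregating on each worker means only one row per distinct name per split crosses the network, rather than every scanned row.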
![Page 20: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/20.jpg)
Query Execution on Presto
• SQL is converted into stages, tasks, drivers
• Tasks operate on splits, which are sections of the data
• Lowest stages retrieve splits from connectors
20
![Page 21: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/21.jpg)
Query Execution on Presto
• Tasks are run in parallel
• Pipelined to reduce wait time between stages
• If one task fails, the query fails
• No disk I/O
  • If aggregated data does not fit in memory, the query fails
  • May spill to disk in the future
21
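The pipelined, memory-only behaviour described on this slide can be sketched with Python generators. This is an analogy rather than Presto's implementation: `max_groups` is a hypothetical stand-in for a memory limit such as `task.max-memory`, and `QueryFailed` is an exception type invented here.

```python
class QueryFailed(Exception):
    """Raised when the toy query cannot complete."""


def table_scan(rows):
    """Yield rows one at a time, so downstream operators start at once."""
    for row in rows:
        yield row


def filter_rows(rows, predicate):
    """Pass through matching rows without materializing the input."""
    for row in rows:
        if predicate(row):
            yield row


def count_by_key(rows, max_groups):
    """Aggregate in memory only; fail instead of spilling to disk."""
    counts = {}
    for key, _ in rows:
        if key not in counts and len(counts) >= max_groups:
            raise QueryFailed("aggregation state exceeds the memory limit")
        counts[key] = counts.get(key, 0) + 1
    return counts


logs = [("a", 10), ("b", 200), ("a", 300), ("c", 400)]

# Rows stream through scan -> filter -> aggregation with no intermediate
# result written anywhere between the operators.
pipeline = filter_rows(table_scan(logs), lambda row: row[1] >= 100)
small = count_by_key(pipeline, max_groups=10)
print(small)
```

Calling `count_by_key` on the same data with `max_groups=2` raises `QueryFailed` on the third distinct key, mirroring a query that fails when its aggregation state does not fit in memory.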
![Page 22: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/22.jpg)
Deployment & Configuration
• Basically, there are four configurations to set up for Presto
  • Node properties: environment configuration specific to each node
  • JVM config
  • Config properties: configuration for the Presto server
  • Catalog properties: configuration for connectors
• Detailed documentation is provided on the Presto site
22
![Page 23: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/23.jpg)
Node Properties
• etc/node.properties
• Minimal configuration:
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data
23
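Since `node.id` has to be unique for every node, generating the file is a natural thing to script when provisioning a cluster. A minimal sketch, assuming a random UUID is an acceptable identifier; the helper name `render_node_properties` is hypothetical and the function only builds the text, leaving the write to `etc/node.properties` to the caller.

```python
import uuid


def render_node_properties(environment, data_dir, node_id=None):
    """Build the text of etc/node.properties for one node."""
    # A fresh random UUID is used unless an existing id is passed in,
    # e.g. to keep the same id when a node is reinstalled.
    node_id = node_id or str(uuid.uuid4())
    lines = [
        f"node.environment={environment}",
        f"node.id={node_id}",
        f"node.data-dir={data_dir}",
    ]
    return "\n".join(lines) + "\n"


print(render_node_properties("production", "/var/presto/data"))
```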
![Page 24: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/24.jpg)
Config Properties
• etc/config.properties
• Minimal configuration for coordinator:
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080
24
![Page 25: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/25.jpg)
Config Properties
• Minimal configuration for worker:
coordinator=false
http-server.http.port=8080
task.max-memory=1GB
discovery.uri=http://example.net:8080
25
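The coordinator and worker files differ in only a few keys, which a small helper can make explicit. A sketch using just the properties shown on these two slides; `render_config_properties` is a hypothetical helper, and its defaults simply mirror the slide values.

```python
def render_config_properties(is_coordinator, discovery_uri,
                             http_port=8080, task_max_memory="1GB"):
    """Build the text of etc/config.properties for one role."""
    lines = [f"coordinator={'true' if is_coordinator else 'false'}"]
    if is_coordinator:
        # Keep query work off the coordinator, as in the slide's example.
        lines.append("node-scheduler.include-coordinator=false")
    lines.append(f"http-server.http.port={http_port}")
    lines.append(f"task.max-memory={task_max_memory}")
    if is_coordinator:
        # The slide's coordinator also hosts the discovery server.
        lines.append("discovery-server.enabled=true")
    # Both roles point at the same discovery URI.
    lines.append(f"discovery.uri={discovery_uri}")
    return "\n".join(lines) + "\n"


coordinator_conf = render_config_properties(True, "http://example.net:8080")
worker_conf = render_config_properties(False, "http://example.net:8080")
print(coordinator_conf)
print(worker_conf)
```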
![Page 26: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/26.jpg)
Catalog Properties
• Presto connectors are mounted in catalogs
• Create catalog properties in etc/catalog
• For example, the configuration etc/catalog/hive.properties for Hive connector:
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
26
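Each file in `etc/catalog` mounts one connector, and the file name (without `.properties`) becomes the catalog name. The sketch below lists the catalogs a directory would define; `parse_properties` and `list_catalogs` are hypothetical helpers, and the parser handles only simple `key=value` lines, not the full Java properties syntax.

```python
import os
import tempfile


def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def list_catalogs(catalog_dir):
    """Map each catalog name to the connector its file mounts."""
    catalogs = {}
    for filename in sorted(os.listdir(catalog_dir)):
        name, ext = os.path.splitext(filename)
        if ext != ".properties":
            continue
        path = os.path.join(catalog_dir, filename)
        with open(path, encoding="utf-8") as f:
            catalogs[name] = parse_properties(f.read()).get("connector.name")
    return catalogs


# Demonstration in a temporary directory standing in for etc/catalog.
with tempfile.TemporaryDirectory() as catalog_dir:
    with open(os.path.join(catalog_dir, "hive.properties"), "w",
              encoding="utf-8") as f:
        f.write("connector.name=hive-hadoop2\n"
                "hive.metastore.uri=thrift://example.net:9083\n")
    found = list_catalogs(catalog_dir)
print(found)
```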
![Page 27: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/27.jpg)
Presto’s Roadmap
• In the next year:
  • Complex data structures
  • Create table with partitioning
  • Huge joins and aggregations
  • Spill to disk
  • Basic task recovery
  • Native store
  • Authentication & authorization
* Based on the Presto Meetup, May 2014
27
![Page 28: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/28.jpg)
Data Visualization with Presto - Demo
• There will be an official ODBC driver for connecting Presto to major BI tools, according to Presto’s roadmap
• Prestogres provides an alternative solution for now
  • Use PostgreSQL’s ODBC driver
• It is also not difficult to integrate Presto with other data visualization tools such as Grafana
28
![Page 29: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/29.jpg)
Grafana
• An open source metrics dashboard and graph editor for Graphite, InfluxDB & OpenTSDB
• But we may not be satisfied with these DBs or just want to visualize data on HDFS, especially for large-scale data
29
![Page 30: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/30.jpg)
Integrating Presto with Grafana
• Presto provides many useful date & time functions
  • current_date → date
  • current_time → time with time zone
  • current_timestamp → timestamp with time zone
  • from_unixtime(unixtime) → timestamp
  • localtime → time
  • now() → timestamp with time zone
  • to_unixtime(timestamp) → double
30
![Page 31: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/31.jpg)
Integrating Presto with Grafana
• Presto also supports many common aggregation functions
  • avg(x) → double
  • count(x) → bigint
  • max(x) → [same as input]
  • min(x) → [same as input]
  • sum(x) → [same as input]
  • ...
31
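Dashboards typically need one aggregated value per time bucket, and the date/time and aggregation functions listed on these two slides are enough to express that in SQL. The helper below is a sketch of how such a query might be assembled, not the query used in the talk: the function name `time_series_sql`, the table `logs`, and the columns `ts` (assumed to be a timestamp) and `latency_ms` are all hypothetical.

```python
def time_series_sql(table, time_column, value_column,
                    start_epoch, end_epoch, bucket_seconds):
    """Build a Presto query returning avg(value) per fixed time bucket."""
    # to_unixtime() turns the timestamp into seconds; dividing, flooring,
    # and multiplying back snaps it to the start of its bucket.
    bucket = (
        f"floor(to_unixtime({time_column}) / {bucket_seconds})"
        f" * {bucket_seconds}"
    )
    # Identifiers are interpolated as-is, so they must come from trusted
    # configuration, never from user input.
    return (
        f"SELECT {bucket} AS bucket, avg({value_column}) AS value\n"
        f"FROM {table}\n"
        f"WHERE {time_column} >= from_unixtime({start_epoch})\n"
        f"  AND {time_column} < from_unixtime({end_epoch})\n"
        f"GROUP BY {bucket}\n"
        f"ORDER BY 1"
    )


sql = time_series_sql("logs", "ts", "latency_ms", 0, 3600, 60)
print(sql)
```

Swapping `avg` for `count`, `max`, `min`, or `sum` gives the other series a dashboard panel might plot.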
![Page 32: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/32.jpg)
Integrating Presto with Grafana
• So we implemented a custom datasource for Presto to work with Grafana
• Interactively visualize data on HDFS
[Diagram: HDFS ← Interactive query (Presto) ← Grafana]
32
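The slides do not show the datasource itself, so the following is only a guess at the kind of glue it needs: turning rows of (bucket time, value) from a Presto query into a plottable series. The output shape, a `target` name plus `[value, timestamp in milliseconds]` pairs, is an assumption modelled on Graphite-style series, and `to_grafana_series` is a hypothetical helper, not the authors' code.

```python
def to_grafana_series(target, rows):
    """Convert (bucket_epoch_seconds, value) rows into one series dict."""
    datapoints = [
        [value, int(bucket) * 1000]           # seconds -> milliseconds
        for bucket, value in sorted(rows, key=lambda row: row[0])
        if value is not None                  # drop buckets without data
    ]
    return {"target": target, "datapoints": datapoints}


rows = [(120.0, 2.5), (60.0, 1.0), (180.0, None)]
series = to_grafana_series("avg latency", rows)
print(series)
```

Sorting by bucket time matters because a query without `ORDER BY` returns rows in no particular order, while a graph expects points in time order.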
![Page 33: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/33.jpg)
Demo
33
![Page 34: Speed up Interactive Analytic Queries over Existing Big Data on Hadoop with Presto](https://reader031.vdocument.in/reader031/viewer/2022020723/549fddc9ac795982328b4619/html5/thumbnails/34.jpg)
References
• Martin Traverso, “Presto: Interacting with petabytes of data at Facebook”
• Sadayuki Furuhashi, “Presto: Interactive SQL Query Engine for Big Data”
• Dain Sundstrom, “Presto: Past, Present, and Future”
• “Presto Concepts” in the Presto documentation
34