Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
SPRINGONE2GXWASHINGTON, DC
Federated Queries with HAWQ SQL on Hadoop and Beyond
By Christian Tzolov @christzolov
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Safe Harbor Statement
The following is intended to outline the general direction of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. This information is provided without warranty or any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. These purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this presentation remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward looking information in this presentation.
2
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/ 3
BioChristian Tzolov
Pivotal Technical Architect,
BigData, Hadoop, SpringXD
Apache Committer, Crunch PMC member
@christzolov
blog.tzolov.net
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Contents
• Data System Concepts• Data System Reference Architectures.• Real Life Use Case: OLTP and OLAP Data Systems Integration• Passive Data Synchronization (Demo)• Federated Queries With HAWQ• HAWQ Web Tables• HAWQ PXF Architecture• Geode PXF (Demo)
4
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
How Compute Arbitrary Functions
on Arbitrary Data
Query = Function (All Data)
5
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data System Concepts
6
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data
What is Data?
• Data is Information
• Derived vs. Fact Information
• Not derived information will be called Data
• Data hold to be true simply because it exists
• All other information is derived from the Data
7
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data System
System that computes arbitrary functions on arbitrary data.
• Low Latency
• High Throughput
• Scalability
• Ad hoc queries
• Provenance
8
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data System Principles
The 6-elements Application = Data-Native System
• Fact Data
• Immutable Data
• Deterministic Functions
• Data-Lineage
• Data Locality - space or temporal
• All Data vs. Working Set
9
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data System – Key Properties
10
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data System Architectures
11
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data System Architectural Patterns
How to build Data Systems?
• Data Lake
• Lambda
• Kappa / Samza
• Tachyon
12
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Data Lake (From ETL to ELT)
Single Store For All Data
• Store Raw Data
• No Data Schema Limitations
• No Data Volume Limitations
• Late evaluation
• Extract-Transform-Load (ETL) vs. Extract-Load-Transform (ELT)
• Data Lineage
13
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
• Batch Layer
• Serving Layer
• Speed Layer
14
Lambda Architecture
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
• Durable Queues (Kafka)• Replaying History• Versioned Outputs• Dynamic Output Switch
15
Kappa Architecture
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Tachyon
No write speed constrains
• Data-Lineage Based Fault Tolerance
• Asynchronous Checkpoints
• Unbounded Writes
16
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Real Life Use Case
17
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Real Life Use Case
• Integrate an In-Memory Data Grid - Geode (e.g. GemFire) with SQL-On-Hadoop OLAP system - HAWQ
• Provide an unified data view across both systems
• Use Geode as Slowly Changing Dimensions (SCDs) store for HAWQ
• Keep the Operational and Historical data in Sync
18
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
• Cache - Performance / Consistency / Resiliency
• Region
• Distributed j.u.Map on steroids
• Highly available, redundant
• Distributed
• Locator, Server, Client
Apache Geode & GemFire (OLTP)
19
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Geode use cases…
20
China RailwayCorporation
5,700 train stations4.5 million tickets per day20 million daily users1.4 billion page views per day40,000 visits per second
* http://pivotal.io/big-data/pivotal-gemfire
Indian Railways
7,000 stations72,000 miles of track23 million passengers daily120,000 concurrent users10,000 transactions per minute
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
HAWQ (SQL-on-HADOOP)
• Enterprise SQL on Hadoop analytic engine built around a high-performance Greenplum MPP Database
• HAWQ runs natively on Apache Hadoop, HDFS and YARN.• Supports multiple Hadoop file formats: Parquet, HDFS and Avro• 100% ANSI SQL compliant (ANSI SQL-92, SQL-99, and SQL-2003…)• ODBC and JDBC connectivity• Implemented in C and C++
21
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
HAWQ (SQL-on-HADOOP)
• Runs TPC-DS benchmark in half the wall clock time that Impala • Beats Impala by overall 454% in performance• Compared to Hive, HAWQ provides an additional of 344% of performance
improvement on complex queries• Impala and Apache Hive™ do not support all standard TPC-DS queries• HAWQ runs 100% of them natively.
22
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Test Environment
• HDFS – Distributed file system. PHD or HDP Hadoop distributions.
• HAWQ - SQL-on-Hadoop.
• Geode/GemFire - In-memory data grid.
• SpringXD – runtime integration and streaming platform.
• Apache Ambari – Manages all Clusters.
• Apache Zeppelin – Web interface for interacting with different Data Systems.
23
GEODE Locator, Servers
HAWQ Master,
Segments
HDFS NameNode, DataNodes
SPRING XD
Admin, Containers
Zeppelin Notebook
Ambari
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Test Environment - Ambari Management
24
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Passive Data Synchronization
25
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Passive Synchronization (Case 1)
26
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Passive Synchronization (Case 1) - Demo
27
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Passive Synchronization (Case 2) - gpfdist
28
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Passive Synchronization (Case 2): Demo
29
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Federated Queries With HAWQ
30
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Access dynamic data sources on a web server or by executing OS scripts.
31
HAWQ Web Table – Architecture
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
HAWQ Web Table
• HAWQ Web Table - access dynamic data sources on a web server or by executing OS scripts.
• Leverage Geode REST API and OQL.
• RESTful SpringBoot app to convert JSON into TSV.
32
CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...) EXECUTE E'curl http://<adapter proxy>/gemfire-api/v1/ queries/adhoc?q=<URLencoded OQL statement>' ON MASTER FORMAT 'text' (delimiter '|' null 'null' escape E'\\');
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
HAWQ Web Table – Limitations
• Not Scalable
• No “Filter Push Down”
• Inflexibility
• No Compression
• Requires Additional Components
33
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
HAWQ PXF/Geode External Table
34
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Pivotal eXtension Framework (PXF)
• Java-Based
• Parallel, High Throughput Data Access
• Heterogeneous Data Sources.
• ANSI-compliant SQL On Any Dataset
• Wide variety of PXF plugins
35
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
PXF Architecture
36
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
PXF Data Model
• Data Source is modeled as a collection of one or more Fragments.
• Each Fragment consists of many Rows that in turn are split into typed Fields.
• Analyzer (optional) provides PXF statistical data for the HAWQ query optimizer
• Metadata about the data source locations, access attributes, table schemas formats, SQL queries filters, etc
37
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
PXF Processors
38
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
PXF Runtime Model
39
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
CREATE EXTERNAL TABLE ext_table_name <attr list, ...> LOCATION('pxf://<host>:<port>/path/to/data? FRAGMENTER=package.name.FragmenterForX& ACCESSOR=package.name.AccessorForX& RESOLVER=package.name.ResolverForX& <Other custom user options>=<value>') FORMAT ‘custom'(formatter='pxfwritable_import');
40
PXF External Tables Definition
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Available PXF Plugins
• HdfsTextSimple• HdfsTextMulti• Hive• HiveRC• HiveText• Hbase• Avro
41
• Accumulo• Casandra• JSON• Redis• Geode/Gemfire• Pipes
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Build Geode PXF Plugin
42
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
<profile> <name>GEMFIRE</name> <description>A profile for reading Gemfire data</description> <plugins> <fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</fragmenter> <accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor> <resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver> </plugins> </profile>
43
PXF Geode Profile
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Geode External Table Definition
CREATE EXTERNAL TABLE <GEMFIRE_TABLE_NAME> (...) LOCATION('pxf://<namenode>/<path>? PROFILE=GEMFIRE & LOCATORS=<gemfire-server:port> & REGION=<region-name>') FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
44
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
public class GemfireFragmenter extends Fragmenter { public GemfireFragmenter(InputData inputData) throws Exception { super(inputData); }
public List<Fragment> getFragments() throws Exception { return Arrays.asList (new Fragment( inputData.getDataSource(), new String[] { inputData.getUserProperty("URL-HOST") }, new byte[0])); }}
45
Geode Fragmenter
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
public class GemfireAccessor extends Plugin implements ReadAccessor { public GemfireAccessor(InputData input) throws Exception { super(input); }
public boolean openForRead() throws Exception { … } public OneRow readNextObject() throws Exception {…}
public void closeForRead() throws Exception {…}}
46
Geode Accessor
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
public class GemfireResolver extends Plugin implements ReadResolver {
public GemfireResolver(InputData input) { super(input); } public List<OneField> getFields(OneRow paramOneRow) throws Exception { }}
47
Geode Resolver
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Demo 3: Federated queries with PXF/Geode
48
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/
Demo 3: Federated queries with PXF/Geode
49
Unless otherwise indicated, these s l ides are © 2013-2015 Pivotal Software, Inc. and l icensed under a Creat ive Commons Attr ibut ion-NonCommercial l icense: ht tp: / /creat ivecommons.org/ l icenses/by-nc/3.0/ 50
By Christian Tzolov @christzolov
HADOOP WORKFLOWS AND DISTRIBUTED YARN APPS USING SPRING TECHNOLOGIES
Learn More. Stay Connected.
@springcentral Spring.io/video