Bruno Guedes - Hadoop Real Time for Dummies - NoSQL Matters Paris 2015
TRANSCRIPT
• CTO for Zenika
• In charge of BigData/NoSQL consulting/training
• Trainer
• Pleasant guy
Agenda
1. Hadoop Reminder
2. HAWQ – SQL on Hadoop
3. PXF – Accessing sources
4. Demo – Tweets Analytics
Hadoop Reminder
Storage
• Semi-structured
• Unstructured
• Large files
• Large amounts of data
• Write once, read many
Process
• Processing large amounts of data in parallel
• Commodity hardware
• Derived from functional programming
HDFS
• Provides high-throughput access to data blocks
• Provides a limited interface for managing the file system, allowing it to scale
• Creates multiple replicas of each data block
• Distributes them across the cluster to enable reliable and rapid data access
[Diagram: HDFS cluster – one NameNode JVM and six DataNode JVMs]
NameNode (single)
• Manages the file-system content tree
• Manages file & directory metadata
• Manages DataNodes and the blocks they hold
DataNode (multiple)
• Stores & retrieves data blocks (64 MB/128 MB)
• Reports block usage to the NameNode
[Diagram: file input/logfiles/2014-12-12.log (200 MB) requires 4 blocks (A, B, C, D) spread across the DataNodes – stored on blocks 11, 22, 44, 66; replicated on blocks 33, 99, 55, 111; replicated again on blocks 77, 88, 10, 20]
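The arithmetic behind this example can be sketched in plain Python (assuming the 64 MB block size from the previous slide and HDFS's default replication factor of 3):

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=64, replication=3):
    """Number of HDFS blocks a file occupies, and total replicas stored."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partial
    return blocks, blocks * replication

blocks, replicas = hdfs_block_count(200)  # the 200 MB log file above
print(blocks)    # 4 blocks: A, B, C, D
print(replicas)  # 12 block replicas spread across the DataNodes
```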
[Diagram: the same cluster with one DataNode marked FAILURE – its blocks (A, D) remain available through replicas on the other DataNodes]
MapReduce
• Performs distributed data processing using the MapReduce programming paradigm
• Allows a user-defined map phase: parallel, share-nothing processing of the input
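As a toy illustration of the paradigm (ordinary Python, not Hadoop code), a word count can be expressed as a map phase, a shuffle that groups by key, and a reduce phase:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # map: emit a (word, 1) pair for every word in an input line
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # reduce: sum the counts for a single key
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]

# each mapper works on its own input split, share-nothing
mapped = [kv for line in lines for kv in map_phase(line)]

# shuffle: group the intermediate pairs by key
mapped.sort(key=itemgetter(0))
counts = dict(
    reduce_phase(key, (v for _, v in group))
    for key, group in groupby(mapped, key=itemgetter(0))
)
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```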
[Diagram: MapReduce cluster – one JobTracker JVM and six TaskTracker JVMs]
JobTracker (single)
• Launches and manages jobs
TaskTracker (multiple)
• Runs individual tasks (mappers/reducers)
• Resides on the DataNodes
HAWQ – SQL on Hadoop
• HAdoop With Queries?
• Based on PostgreSQL
• Uses HDFS for storage
• Alternative to Hive for querying
• Supports ANSI SQL-92 and analytic extensions from SQL:2003
• Cost-based parallel query optimiser
• ODP – Standardizing the Hadoop ecosystem
• ODP Core for building a versioned, packaged, tested set of Hadoop components
• Developing a platform
• Pivotal and Hortonworks alliance to simplify adoption
• Joint engineering efforts
• Support services
• HAWQ open-sourced
[Diagram: master/worker architecture over a network interconnect]
[Diagram: HAWQ architecture – a HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF) and three HAWQ Segment Hosts (Query Executor, PXF), connected over the network interconnect to the HDFS NameNode]
HAWQ Master
• Based on PostgreSQL
• Handles SQL commands
• Maintains the global system catalog
• Contains no data
HAWQ Segment Host
• Processes its partition of the query
• Based on PostgreSQL
• Stateless
• Manages communication with the NameNode
• User/table data stored in HDFS files
[Diagram: clients submit SQL to the HAWQ Master over JDBC/ODBC]
Example query plan:
Gather Motion
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion
                Seq Scan on nation
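A plan of this shape would come from a TPC-H-style join-and-aggregate query; the following is a hypothetical reconstruction (the slides give only the operator names, so the columns and join keys below are assumed):

```sql
-- hypothetical query matching the plan's shape, not taken from the talk
SELECT c.c_name, n.n_name, sum(l.l_extendedprice) AS revenue
FROM lineitem l
JOIN orders o   ON l.l_orderkey  = o.o_orderkey   -- lower HashJoin
JOIN customer c ON o.o_custkey   = c.c_custkey    -- upper HashJoin
JOIN nation n   ON c.c_nationkey = n.n_nationkey  -- nation is broadcast
GROUP BY c.c_name, n.n_name                       -- HashAggregate
ORDER BY revenue DESC;                            -- Sort, then Gather Motion
```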
The same parallel plan runs on every segment host:
MotionGather
  Project s.beer, s.price
    HashJoin b.name = s.bar
      Filter b.city = 'San Francisco'
        Scan Bars b
      MotionRedist(b.name)
        Scan Sells s
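Reconstructed from the operator labels above, the query behind this plan would look roughly like:

```sql
SELECT s.beer, s.price
FROM Bars b
JOIN Sells s ON b.name = s.bar   -- HashJoin; Sells is redistributed on the join key
WHERE b.city = 'San Francisco';  -- Filter applied during the scan of Bars
```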
Pivotal HD
[Chart: TPC-DS queries completed out of 111 – HAWQ: 111/111; other SQL-on-Hadoop engines: 31/111 and 20/111]
http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
PXF – Accessing sources
• Allows access to Hadoop data (HDFS files, HBase, Hive) as external tables
• Allows joins between HAWQ (internal) and external tables
• Integrates with third-party systems (Cassandra, Accumulo)
• Provides an extensible framework API to enable custom development for other data sources
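Once an external table is defined through PXF, it can be joined with an internal HAWQ table like any other relation; for example (both table names here are hypothetical):

```sql
-- internal HAWQ table joined with a PXF external table
SELECT u.name, count(*)
FROM users u          -- internal HAWQ table
JOIN ext_events e     -- PXF external table (HDFS/HBase/Hive)
  ON e.user_id = u.id
GROUP BY u.name;
```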
Xtension Framework (HDFS, HBase, Hive)
Fragmenter
• Gets the locations of the fragments of a table
Accessor
• Understands and reads a fragment, returns records to the Resolver
Resolver
• Converts the records into a SQL-engine format
Analyzer
• Provides source stats to the query optimizer
Demo – Tweets Analytics
[Diagram: demo architecture – SpringXD (stream processing/scoring) stores tweets directly into HDFS on PHD (or any ODP Core-based Hadoop distribution); HAWQ (SQL on Hadoop) reads them through the Xtension Framework JSON extension (Json-ext)]
https://github.com/spring-projects/spring-xd-samples/tree/master/analytics-dashboard
http://pivotal-field-engineering.github.io/pxf-field/json.html
stream create tweets --definition "twitterstream | hdfs --idleTimeout=3000 --fileExtension=json"
stream create tweetlang --definition "tap:stream:tweets > field-value-counter --fieldName=lang" --deploy
stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy
stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy
stream deploy tweets
<profile>
    <name>JSON</name>
    <description>A profile for JSON data, one JSON record per line</description>
    <plugins>
        <fragmenter>com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>com.pivotal.pxf.plugins.json.JsonAccessor</accessor>
        <resolver>com.pivotal.pxf.plugins.json.JsonResolver</resolver>
        <analyzer>com.pivotal.pxf.plugins.hdfs.HdfsAnalyzer</analyzer>
    </plugins>
</profile>
CREATE EXTERNAL TABLE ext_tweets_json (
    created_at TEXT,
    id_str TEXT,
    text TEXT,
    source TEXT,
    "user.id" INTEGER,
    "user.location" TEXT,
    "coordinates.coordinates[0]" DOUBLE PRECISION,
    "coordinates.coordinates[1]" DOUBLE PRECISION
)
LOCATION ('pxf://pivhdsne:50070/xd/tweets/*.json?PROFILE=JSON')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
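With the external table in place, ordinary SQL works directly against the raw JSON files; for instance, a hypothetical query (not shown in the talk) counting tweets per location:

```sql
SELECT "user.location", count(*) AS tweets
FROM ext_tweets_json
WHERE "user.location" IS NOT NULL
GROUP BY "user.location"
ORDER BY tweets DESC
LIMIT 10;
```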