Bruno Guedes - Hadoop Real Time for Dummies - NoSQL Matters Paris 2015
TRANSCRIPT
• CTO for Zenika
• In charge of BigData/NoSQL consulting/training
• Trainer
• Pleasant guy
Agenda
1. Hadoop Reminder
2. HAWQ – SQL on Hadoop
3. PXF – Accessing sources
4. Demo – Tweets Analytics
Hadoop Reminder
Storage
• Semi-structured
• Unstructured
• Large files
• Large amounts of data
• Write once, read many
Process
• Processing large amounts of data in parallel
• Commodity hardware
• Derived from functional programming
HDFS
• Provides high-throughput access to data blocks
• Provides a limited interface for managing the file system, allowing it to scale
• Creates multiple replicas of each data block
• Distributes them across the cluster to enable reliable and rapid data access
[Diagram: HDFS cluster – one NameNode JVM and six DataNode JVMs]
NameNode (single)
• Manages the file-system content tree
• Manages file & directory metadata
• Manages DataNodes and the blocks they hold
DataNode (multiple)
• Stores & retrieves data blocks (64 MB/128 MB)
• Reports block usage to the NameNode
[Diagram: file input/logfiles/2014-12-12.log (200 MB) requires 4 blocks (A, B, C, D) spread across the DataNodes – stored on blocks 11, 22, 44, 66; replicated on blocks 33, 99, 55, 111; replicated again on blocks 77, 88, 10, 20]
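The arithmetic behind this example can be sketched in plain Python (assuming the 64 MB block size from the previous slide and HDFS's default replication factor of 3):

```python
import math

def hdfs_block_count(file_size_mb, block_size_mb=64, replication=3):
    """Number of HDFS blocks a file occupies, and total replicas stored."""
    blocks = math.ceil(file_size_mb / block_size_mb)  # last block may be partial
    return blocks, blocks * replication

blocks, replicas = hdfs_block_count(200)  # the 200 MB log file above
print(blocks)    # 4 blocks: A, B, C, D
print(replicas)  # 12 block replicas spread across the DataNodes
```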
[Diagram: the same cluster with one DataNode marked FAILURE – its blocks (A, D) remain available through replicas on the other DataNodes]
MapReduce
• Performs distributed data processing using the MapReduce programming paradigm
• Allows a user-defined map phase: parallel, share-nothing processing of the input
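As a toy illustration of the paradigm (ordinary Python, not Hadoop code), a word count can be expressed as a map phase, a shuffle that groups by key, and a reduce phase:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(line):
    # map: emit a (word, 1) pair for every word in an input line
    return [(word, 1) for word in line.split()]

def reduce_phase(key, values):
    # reduce: sum the counts for a single key
    return key, sum(values)

lines = ["hadoop stores data", "hadoop processes data"]

# each mapper works on its own input split, share-nothing
mapped = [kv for line in lines for kv in map_phase(line)]

# shuffle: group the intermediate pairs by key
mapped.sort(key=itemgetter(0))
counts = dict(
    reduce_phase(key, (v for _, v in group))
    for key, group in groupby(mapped, key=itemgetter(0))
)
print(counts)  # {'data': 2, 'hadoop': 2, 'processes': 1, 'stores': 1}
```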
[Diagram: MapReduce cluster – one JobTracker JVM and six TaskTracker JVMs]
JobTracker (single)
• Launches and manages jobs
TaskTracker (multiple)
• Runs individual tasks (mappers/reducers)
• Resides on the DataNodes
HAWQ – SQL on Hadoop
• HAdoop With Queries?
• Based on PostgreSQL
• Uses HDFS for storage
• Alternative to Hive for querying
• Supports ANSI SQL-92 and analytic extensions from SQL:2003
• Cost-based parallel query optimiser
• ODP – Standardizing the Hadoop ecosystem
• ODP Core for building a versioned, packaged, tested set of Hadoop components
• Developing a platform
• Pivotal and Hortonworks alliance to simplify adoption
• Joint engineering efforts
• Support services
• HAWQ open-sourced
[Diagram: master/worker architecture over a network interconnect]
[Diagram: HAWQ architecture – a HAWQ Master (Parser, Query Optimizer, Dispatch, Local TM, Query Executor, PXF) and three HAWQ Segment Hosts (Query Executor, PXF), connected over the network interconnect to the HDFS NameNode]
HAWQ Master
• Based on PostgreSQL
• Handles SQL commands
• Maintains the global system catalog
• Contains no data
HAWQ Segment Host
• Processes its partition of the query
• Based on PostgreSQL
• Stateless
• Manages communication with the NameNode
• User/table data stored in HDFS files
[Diagram: clients submit SQL to the HAWQ Master over JDBC/ODBC]
Example query plan:
Gather Motion
  Sort
    HashAggregate
      HashJoin
        Redistribute Motion
          HashJoin
            Seq Scan on lineitem
            Hash
              Seq Scan on orders
        Hash
          HashJoin
            Seq Scan on customer
            Hash
              Broadcast Motion
                Seq Scan on nation
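A plan of this shape would come from a TPC-H-style join-and-aggregate query; the following is a hypothetical reconstruction (the slides give only the operator names, so the columns and join keys below are assumed):

```sql
-- hypothetical query matching the plan's shape, not taken from the talk
SELECT c.c_name, n.n_name, sum(l.l_extendedprice) AS revenue
FROM lineitem l
JOIN orders o   ON l.l_orderkey  = o.o_orderkey   -- lower HashJoin
JOIN customer c ON o.o_custkey   = c.c_custkey    -- upper HashJoin
JOIN nation n   ON c.c_nationkey = n.n_nationkey  -- nation is broadcast
GROUP BY c.c_name, n.n_name                       -- HashAggregate
ORDER BY revenue DESC;                            -- Sort, then Gather Motion
```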
The same parallel plan runs on every segment host:
MotionGather
  Project s.beer, s.price
    HashJoin b.name = s.bar
      Filter b.city = 'San Francisco'
        Scan Bars b
      MotionRedist(b.name)
        Scan Sells s
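Reconstructed from the operator labels above, the query behind this plan would look roughly like:

```sql
SELECT s.beer, s.price
FROM Bars b
JOIN Sells s ON b.name = s.bar   -- HashJoin; Sells is redistributed on the join key
WHERE b.city = 'San Francisco';  -- Filter applied during the scan of Bars
```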
Pivotal HD
[Chart: TPC-DS queries completed out of 111 – HAWQ: 111/111; other SQL-on-Hadoop engines: 31/111 and 20/111]
http://www.tpc.org/tpcds/spec/tpcds_1.1.0.pdf
PXF – Accessing sources
• Allows access to Hadoop data (HDFS files, HBase, Hive) as external tables
• Allows joins between HAWQ (internal) and external tables
• Integrates with third-party systems (Cassandra, Accumulo)
• Provides an extensible framework API to enable custom development for other data sources
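Once an external table is defined through PXF, it can be joined with an internal HAWQ table like any other relation; for example (both table names here are hypothetical):

```sql
-- internal HAWQ table joined with a PXF external table
SELECT u.name, count(*)
FROM users u          -- internal HAWQ table
JOIN ext_events e     -- PXF external table (HDFS/HBase/Hive)
  ON e.user_id = u.id
GROUP BY u.name;
```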
Xtension Framework (HDFS, HBase, Hive)
Fragmenter
• Gets the locations of the fragments of a table
Accessor
• Understands and reads a fragment, returns records to the Resolver
Resolver
• Converts the records into a SQL-engine format
Analyzer
• Provides source stats to the query optimizer
Demo – Tweets Analytics
[Diagram: demo architecture – SpringXD (stream processing/scoring) stores tweets directly into HDFS on PHD (or any ODP Core-based Hadoop distribution); HAWQ (SQL on Hadoop) reads them through the Xtension Framework JSON extension (Json-ext)]
https://github.com/spring-projects/spring-xd-samples/tree/master/analytics-dashboard
http://pivotal-field-engineering.github.io/pxf-field/json.html
stream create tweets --definition "twitterstream | hdfs --idleTimeout=3000 --fileExtension=json"
stream create tweetlang --definition "tap:stream:tweets > field-value-counter --fieldName=lang" --deploy
stream create tweetcount --definition "tap:stream:tweets > aggregate-counter" --deploy
stream create tagcount --definition "tap:stream:tweets > field-value-counter --fieldName=entities.hashtags.text --name=hashtags" --deploy
stream deploy tweets
<profile>
    <name>JSON</name>
    <description>A profile for JSON data, one JSON record per line</description>
    <plugins>
        <fragmenter>com.pivotal.pxf.plugins.hdfs.HdfsDataFragmenter</fragmenter>
        <accessor>com.pivotal.pxf.plugins.json.JsonAccessor</accessor>
        <resolver>com.pivotal.pxf.plugins.json.JsonResolver</resolver>
        <analyzer>com.pivotal.pxf.plugins.hdfs.HdfsAnalyzer</analyzer>
    </plugins>
</profile>
CREATE EXTERNAL TABLE ext_tweets_json (
    created_at TEXT,
    id_str TEXT,
    text TEXT,
    source TEXT,
    "user.id" INTEGER,
    "user.location" TEXT,
    "coordinates.coordinates[0]" DOUBLE PRECISION,
    "coordinates.coordinates[1]" DOUBLE PRECISION
)
LOCATION ('pxf://pivhdsne:50070/xd/tweets/*.json?PROFILE=JSON')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
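With the external table in place, ordinary SQL works directly against the raw JSON files; for instance, a hypothetical query (not shown in the talk) counting tweets per location:

```sql
SELECT "user.location", count(*) AS tweets
FROM ext_tweets_json
WHERE "user.location" IS NOT NULL
GROUP BY "user.location"
ORDER BY tweets DESC
LIMIT 10;
```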