Percona University - Everything You Need to Know About Hadoop Now
TRANSCRIPT
The Leader in Big Data Consulting
www.mammothdata.com | @mammothdataco
Everything You Need To Know About Hadoop Now
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
PaaKow Acquah, Lead Consultant
Joined OSI January 2008
Currently leading a Data Lake design and implementation for a large Californian utility company
www.mammothdata.com | @mammothdataco
Open Software Integrators
Open Software Integrators is a Big Data consulting and services company specializing in Hadoop, Cassandra, MongoDB and other NoSQL technologies. OSI focuses on executive strategy, initial install, design and implementation.
Founded January 2008 by Andrew C. Oliver
Based in downtown Durham, NC
Partnered with Hortonworks, MongoDB, DataStax, Cloudera, Couchbase, Cloudbees & Neo Technology
www.mammothdata.com | @mammothdataco
Overview
What is Hadoop anyhow?
What is Hadoop Good For?
What isn’t it good for?
How do you get data into Hadoop?
How do you get data out of Hadoop?
How do you process data in Hadoop?
How do you analyze data in Hadoop?
How do you secure Hadoop?
www.mammothdata.com | @mammothdataco
But first...
This is an overview talk intended as a roadmap to point you at the most important bits to learn along the way...
It is not comprehensive training...
It is not an in-depth look at any one part of Hadoop
It is a rather high-level, selective overview of the Hadoop ecosystem
www.mammothdata.com | @mammothdataco
What Is Hadoop Anyhow?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
Hadoop is...
A platform for distributed computing
2011: HDFS, Hive
2012: HDFS, YARN, Hive, HBase
2014: HDFS, Hive, YARN, HBase, Spark, Storm, Kafka, Mahout, Sqoop, Oozie, ...
www.mammothdata.com | @mammothdataco
Hadoop is...
HDFS
Distributed Filesystem similar to Gluster, Ceph, etc.
You can use other distributed filesystems in place of HDFS
Blocks are distributed and, by default, replicated to other nodes (HDFS defaults to 3 copies of each block)
128 MB default block size
REST API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?) (see the CLI sketch below)
DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
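A rough sketch of day-to-day HDFS use from the CLI and the REST (WebHDFS) API; hostnames, paths, and filenames here are made up for illustration:

# copy a local extract onto HDFS and look at it
hdfs dfs -mkdir -p /landing/orders
hdfs dfs -put orders-2014-01-01.csv /landing/orders/
hdfs dfs -ls /landing/orders
hdfs dfs -cat /landing/orders/orders-2014-01-01.csv | head

# the same directory listing over the WebHDFS REST API
curl -s "http://namenode:50070/webhdfs/v1/landing/orders?op=LISTSTATUS"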
www.mammothdata.com | @mammothdataco
Hadoop is...
YARN
Yet Another Resource Negotiator
schedules “work” among nodes, distributes the “processing”
MapReduce is
an API
an algorithm: data is mapped to nodes, the answers are “reduced” to a single answer
Hive is
HDFS/Hadoop-based data warehousing
SQL, JDBC, ODBC
Tables map to files on HDFS (see the sketch below)
No updates, deletes, or transactions (but coming in “Stinger.next”)
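A minimal sketch of the “tables map to files” idea using the hive CLI (database, table, columns, and path are hypothetical):

# an external Hive table laid over delimited files already sitting on HDFS
hive -e "CREATE EXTERNAL TABLE orders (orderid STRING, customerid STRING, total DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/landing/orders';"

# standard SQL over those files, executed as a YARN job
hive -e "SELECT customerid, count(*) FROM orders GROUP BY customerid;"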
www.mammothdata.com | @mammothdataco
Hadoop is...
HBase
a column family database
ACID (at the single-row level)
relatively low-latency
And a whole lot more
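For a feel of the column-family model, a quick sketch using the HBase shell (table name, column family, and row key are hypothetical):

# create a table with one column family, write a cell, read it back
echo "create 'metrics', 'cf'
put 'metrics', 'server1-20140101', 'cf:cpu', '42'
get 'metrics', 'server1-20140101'" | hbase shell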
www.mammothdata.com | @mammothdataco
Hadoop is...
An ecosystem of tools for distributed processing and storage of data.
www.mammothdata.com | @mammothdataco
What Is Hadoop Good For?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
What Is Hadoop Good For?
Working with large amounts of data in batch
ETL processing / Data Transformation
Analytics / BI
Integration (Data Lake, Enterprise Data Hub)
Working with streams of data
Events
Log data
Time series or similar data (HBase)
www.mammothdata.com | @mammothdataco
What Is Hadoop Bad At?
Quick jobs - Hive/MapReduce job setup time is measured in seconds to minutes.
Lots of small files (with a 128 MB block size, every tiny file still costs its own block and NameNode entry)
General DBMS stuff - HBase is a much more “specific” database than MySQL/etc.
High Availability
WHA???
Knox, Oozie, etc. all have shaky support, if any, for HA NameNodes.
www.mammothdata.com | @mammothdataco
How Do You Get Data Into/Out Of Hadoop?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
How Do You Get Data Into Hadoop?
Sqoop it from an RDBMS
Use JDBC or ODBC and push into Hive from an external DB
Push data into Hive with the REST API
Put an extract file onto HDFS with the REST API
process it into Hive directly with a LOAD DATA statement
transform/process it into Hive using Pig
use Java
Message it in there with Kafka, RabbitMQ, or a similar MQ and a custom “spout” for Storm
Use any multitude of APIs that write data into HDFS, HBase, Hive, etc.
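A few command-line sketches of the Sqoop and LOAD DATA options above (connection strings, tables, and paths are placeholders):

# Sqoop a table straight out of MySQL into Hive
sqoop import --connect jdbc:mysql://dbhost/shop --username etl -P \
  --table orders --hive-import --hive-table stage.orders

# or push an extract file onto HDFS and load it into Hive yourself
hdfs dfs -put orders.csv /landing/orders/
hive -e "LOAD DATA INPATH '/landing/orders/orders.csv' INTO TABLE stage.orders;"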
www.mammothdata.com | @mammothdataco
How Do You Get Data Out Of Hadoop?
Should you be getting it out or should you process it there?
JDBC/ODBC to Hive
HBase tables can be mapped into Hive (and queried with SQL)
REST APIs for Hive/HDFS
APIs for Kafka, Spark, Storm, etc. (subscribe to the stream)
DistCp it to another HDFS cluster
Mount it with FUSE and use your favorite Linux tool
hadoop fs -cat /path/to/file/on/hdfs |grep stuff > mynewlocalfile
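If you do need to pull data out, a couple more command-line sketches (hostnames, tables, and paths are hypothetical):

# query Hive over JDBC with beeline and save the result locally
beeline -u jdbc:hive2://hiveserver:10000 --outputformat=csv2 -e "SELECT * FROM sometable" > sometable.csv

# copy a whole directory to another HDFS cluster with DistCp
hadoop distcp hdfs://cluster-a/data/orders hdfs://cluster-b/data/orders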
www.mammothdata.com | @mammothdataco
How Do You Process Data In Hadoop?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
How Do You Process Data In Hadoop?
MapReduce Java API
Hive supports SQL (today a subset, soon to be not a subset)
Pig can munge files on HDFS and can work with Hive
Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
There are numerous toolkits
Mahout - common machine learning algorithms (many not very parallelizable)
MLlib - machine learning built on Spark
GraphX - graph processing built on Spark
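From the command line these all boil down to jobs you submit to the cluster; a rough sketch (jar, class, and script names are illustrative):

# a classic MapReduce example job
hadoop jar hadoop-mapreduce-examples.jar wordcount /landing/logs /tmp/wordcount-out

# a Pig script stored on HDFS
pig hdfs://some/dir/myscript.pig

# a Spark application submitted to YARN
spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar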
www.mammothdata.com | @mammothdataco
How Do You Analyze Data In Hadoop?
Most major BI tools now support Hadoop: Tableau, Pentaho, Datameer, your favorite is probably here
All that stuff is for l4m3rs, use the command line interface :-)
hive -e 'select * from sometable'
pig hdfs://some/dir/myscript.pig
Use RStudio and write some R to predict what sales will be next month (you will probably be sort of wrong)
Use your favorite SQL tool that supports JDBC/ODBC
Use Hue
www.mammothdata.com | @mammothdataco
How Do You Secure Hadoop?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
How Do You Secure Hadoop?
HDFS supports POSIX (that means Linux-style) filesystem security
The most complete authentication throughout Hadoop is based on Kerberos (yeah, I know)
You can do it with just straight LDAP too, but it isn't integrated
Knox supplies “perimeter-based security” for (only): Hive, HDFS, Oozie, HBase, HCatalog
Supposedly Argus will save us from all of this!
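Day to day, that looks roughly like this on a Kerberized cluster (principal, realm, group, and paths are hypothetical):

# get a Kerberos ticket before the cluster will talk to you
kinit analyst@EXAMPLE.COM

# ordinary POSIX-style ownership and permissions on HDFS
hdfs dfs -chown analyst:analysts /data/sales
hdfs dfs -chmod 750 /data/sales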
www.mammothdata.com | @mammothdataco
Other Considerations
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
Cacophony
Disaster recovery: Falcon (alpha quality)
Data flow (collecting and moving log/event data): Flume
Schedule/trigger/orchestrate those ETL jobs: Oozie
Install, configure, monitor Hadoop: Ambari
Use tables in both Pig and Hive: HCatalog
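For example, kicking off and checking an Oozie-scheduled ETL job from the CLI (server URL, properties file, and job id are hypothetical):

# submit and run a workflow described by job.properties
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# check its status later
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>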
www.mammothdata.com | @mammothdataco
Ambari
www.mammothdata.com | @mammothdataco
Hue
www.mammothdata.com | @mammothdataco
Hue Editing Oozie
www.mammothdata.com | @mammothdataco
Pig Script

REGISTER file:///usr/lib/pig/piggybank.jar;

define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

rows = load '$FILEPATH' using
    org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a') as (
        a0:chararray,
        a1:chararray,
        a2:chararray,
        a3:chararray,
        a4:chararray,
        a5:chararray,
        a6:chararray,
        a7:chararray,
        a8:chararray,
        a9:chararray);

row = foreach rows GENERATE
    REPLACE((TRIM($0)),'NULL','') as orderid,
    REPLACE((TRIM($1)),'NULL','') as customerid,
    REPLACE((TRIM($2)),'NULL','') as customername,
    REPLACE((TRIM($3)),'NULL','') as address,
    REPLACE((TRIM($4)),'NULL','') as city,
    REPLACE((TRIM($5)),'NULL','') as state,
    REPLACE((TRIM($6)),'NULL','') as zip,
    REPLACE((TRIM($7)),'NULL','') as status,
    REPLACE((TRIM($8)),'NULL','') as
store row into 'stage.orders' using ...
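To run a parameterized script like this, $FILEPATH is supplied on the command line (the path and script name here are hypothetical):

pig -param FILEPATH=/landing/orders/orders.csv orders_to_stage.pig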
www.mammothdata.com | @mammothdataco
Thank you for attending!
{Percona University | Raleigh}