Percona University - Everything You Need to Know About Hadoop Now
TRANSCRIPT
The Leader in Big Data Consulting
www.mammothdata.com | @mammothdataco
Everything You Need To Know About Hadoop Now
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
PaaKow Acquah, Lead Consultant
Joined OSI January 2008
Currently leading a Data Lake design and implementation for a large Californian utility company
www.mammothdata.com | @mammothdataco
Open Software Integrators
Open Software Integrators is a Big Data consulting and services company specializing in Hadoop, Cassandra, MongoDB and other NoSQL technologies. OSI focuses on executive strategy, initial install, design and implementation.
Founded January 2008 by Andrew C. Oliver
Based in downtown Durham, NC
Partnered with Hortonworks, MongoDB, DataStax, Cloudera, Couchbase, Cloudbees & Neo Technology
www.mammothdata.com | @mammothdataco
Overview
What is Hadoop anyhow?
What is Hadoop Good For?
What isn’t it good for?
How do you get data into Hadoop?
How do you get data out of Hadoop?
How do you process data in Hadoop?
How do you analyze data in Hadoop?
How do you secure Hadoop?
www.mammothdata.com | @mammothdataco
But first...
This is an overview talk intended as a roadmap to point you at the most important bits to learn along the way...
It is not comprehensive training...
It is not an in-depth look at any one part of Hadoop
It is a rather high-level, selective overview of the Hadoop ecosystem
www.mammothdata.com | @mammothdataco
What Is Hadoop Anyhow?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
Hadoop is...
A platform for distributed computing
2011: HDFS, Hive
2012: HDFS, YARN, Hive, HBase
2014: HDFS, Hive, YARN, HBase, Spark, Storm, Kafka, Mahout, Sqoop, Oozie, ...
www.mammothdata.com | @mammothdataco
Hadoop is...
HDFS
Distributed Filesystem similar to Gluster, Ceph, etc.
You can use other distributed filesystems in place of HDFS
Blocks are distributed and, by default, replicated to other nodes (HDFS defaults to 3 copies of each block)
128 MB default block size
REST API, CLI tools, third-party tools to “mount” HDFS on Linux (stable), Windows (ymmv), Mac (?) (see the CLI sketch below)
DO NOT PUT YOUR DATA NODES ON A SAN! IT IS WRONG! DO NOT DO IT! EVEN ON THURSDAY!
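A rough sketch of day-to-day HDFS use from the CLI and the REST (WebHDFS) API; hostnames, paths, and filenames here are made up for illustration:

# copy a local extract onto HDFS and look at it
hdfs dfs -mkdir -p /landing/orders
hdfs dfs -put orders-2014-01-01.csv /landing/orders/
hdfs dfs -ls /landing/orders
hdfs dfs -cat /landing/orders/orders-2014-01-01.csv | head

# the same directory listing over the WebHDFS REST API
curl -s "http://namenode:50070/webhdfs/v1/landing/orders?op=LISTSTATUS"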
www.mammothdata.com | @mammothdataco
Hadoop is...
YARN
Yet Another Resource Negotiator
schedules “work” among nodes, distributes the “processing”
MapReduce is
an API
an algorithm: data is mapped to nodes, the answers are “reduced” to a single answer
Hive is
HDFS/Hadoop-based data warehousing
SQL, JDBC, ODBC
Tables map to files on HDFS (see the sketch below)
No updates, deletes, or transactions (but coming in “Stinger.next”)
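A minimal sketch of the “tables map to files” idea using the hive CLI (database, table, columns, and path are hypothetical):

# an external Hive table laid over delimited files already sitting on HDFS
hive -e "CREATE EXTERNAL TABLE orders (orderid STRING, customerid STRING, total DOUBLE)
         ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
         LOCATION '/landing/orders';"

# standard SQL over those files, executed as a YARN job
hive -e "SELECT customerid, count(*) FROM orders GROUP BY customerid;"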
www.mammothdata.com | @mammothdataco
Hadoop is...
HBase
a column family database
ACID (at the single-row level)
relatively low-latency
And a whole lot more
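For a feel of the column-family model, a quick sketch using the HBase shell (table name, column family, and row key are hypothetical):

# create a table with one column family, write a cell, read it back
echo "create 'metrics', 'cf'
put 'metrics', 'server1-20140101', 'cf:cpu', '42'
get 'metrics', 'server1-20140101'" | hbase shell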
www.mammothdata.com | @mammothdataco
Hadoop is...
An ecosystem of tools for distributed processing and storage of data.
www.mammothdata.com | @mammothdataco
What Is Hadoop Good For?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
What Is Hadoop Good For?
Working with large amounts of data in batch
ETL processing / Data Transformation
Analytics / BI
Integration (Data Lake, Enterprise Data Hub)
Working with streams of data
Events
Log data
Time series or similar data (HBase)
www.mammothdata.com | @mammothdataco
What Is Hadoop Bad At?
Quick jobs - Hive/MapReduce job setup time is measured in seconds to minutes.
Lots of small files (with a 128 MB block size, every tiny file still costs its own block and NameNode entry)
General DBMS stuff - HBase is a much more “specific” database than MySQL/etc.
High Availability
WHA???
Knox, Oozie, etc. all have shaky support, if any, for HA NameNodes.
www.mammothdata.com | @mammothdataco
How Do You Get Data Into/Out Of Hadoop?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
How Do You Get Data Into Hadoop?
Sqoop it from an RDBMS
Use JDBC or ODBC and push into Hive from an external DB
Push data into Hive with the REST API
Put an extract file onto HDFS with the REST API
process it into Hive directly with a LOAD DATA statement
transform/process it into Hive using Pig
use Java
Message it in there with Kafka, RabbitMQ, or a similar MQ and a custom “spout” for Storm
Use any multitude of APIs that write data into HDFS, HBase, Hive, etc.
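A few command-line sketches of the Sqoop and LOAD DATA options above (connection strings, tables, and paths are placeholders):

# Sqoop a table straight out of MySQL into Hive
sqoop import --connect jdbc:mysql://dbhost/shop --username etl -P \
  --table orders --hive-import --hive-table stage.orders

# or push an extract file onto HDFS and load it into Hive yourself
hdfs dfs -put orders.csv /landing/orders/
hive -e "LOAD DATA INPATH '/landing/orders/orders.csv' INTO TABLE stage.orders;"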
www.mammothdata.com | @mammothdataco
How Do You Get Data Out Of Hadoop?
Should you be getting it out or should you process it there?
JDBC/ODBC to Hive
HBase tables can be mapped into Hive (and queried with SQL)
REST APIs for Hive/HDFS
APIs for Kafka, Spark, Storm, etc. (subscribe to the stream)
DistCp it to another HDFS cluster
Mount it with FUSE and use your favorite Linux tool
hadoop fs -cat /path/to/file/on/hdfs |grep stuff > mynewlocalfile
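If you do need to pull data out, a couple more command-line sketches (hostnames, tables, and paths are hypothetical):

# query Hive over JDBC with beeline and save the result locally
beeline -u jdbc:hive2://hiveserver:10000 --outputformat=csv2 -e "SELECT * FROM sometable" > sometable.csv

# copy a whole directory to another HDFS cluster with DistCp
hadoop distcp hdfs://cluster-a/data/orders hdfs://cluster-b/data/orders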
www.mammothdata.com | @mammothdataco
How Do You Process Data In Hadoop?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
How Do You Process Data In Hadoop?
MapReduce Java API
Hive supports SQL (today a subset, soon to be not a subset)
Pig can munge files on HDFS and can work with Hive
Storm and Spark have their own APIs for dealing with events or so-called micro-batches of data
There are numerous toolkits
Mahout - common machine learning algorithms (many not very parallelizable)
MLlib - machine learning built on Spark
GraphX - graph processing built on Spark
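From the command line these all boil down to jobs you submit to the cluster; a rough sketch (jar, class, and script names are illustrative):

# a classic MapReduce example job
hadoop jar hadoop-mapreduce-examples.jar wordcount /landing/logs /tmp/wordcount-out

# a Pig script stored on HDFS
pig hdfs://some/dir/myscript.pig

# a Spark application submitted to YARN
spark-submit --master yarn-cluster --class com.example.MyJob my-job.jar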
www.mammothdata.com | @mammothdataco
How Do You Analyze Data In Hadoop?
Most major BI tools now support Hadoop: Tableau, Pentaho, Datameer, your favorite is probably here
All that stuff is for l4m3rs, use the command line interface :-)
hive -e 'select * from sometable'
pig hdfs://some/dir/myscript.pig
Use RStudio and write some R to predict what sales will be next month (you will probably be sort of wrong)
Use your favorite SQL tool that supports JDBC/ODBC
Use Hue
www.mammothdata.com | @mammothdataco
How Do You Secure Hadoop?
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
How Do You Secure Hadoop?
HDFS supports POSIX (that means Linux-style) filesystem security
The most complete authentication throughout Hadoop is based on Kerberos (yeah, I know)
You can do it with just straight LDAP too, but it isn't integrated
Knox supplies “perimeter-based security” for (only): Hive, HDFS, Oozie, HBase, HCatalog
Supposedly Argus will save us from all of this!
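Day to day, that looks roughly like this on a Kerberized cluster (principal, realm, group, and paths are hypothetical):

# get a Kerberos ticket before the cluster will talk to you
kinit analyst@EXAMPLE.COM

# ordinary POSIX-style ownership and permissions on HDFS
hdfs dfs -chown analyst:analysts /data/sales
hdfs dfs -chmod 750 /data/sales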
www.mammothdata.com | @mammothdataco
Other Considerations
{Percona University | Raleigh}
www.mammothdata.com | @mammothdataco
Cacophony
Disaster recovery: Falcon (alpha quality)
Data flow (collecting and moving log/event data): Flume
Schedule/trigger/orchestrate those ETL jobs: Oozie
Install, configure, monitor Hadoop: Ambari
Use tables in both Pig and Hive: HCatalog
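For example, kicking off and checking an Oozie-scheduled ETL job from the CLI (server URL, properties file, and job id are hypothetical):

# submit and run a workflow described by job.properties
oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run

# check its status later
oozie job -oozie http://oozie-host:11000/oozie -info <job-id>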
www.mammothdata.com | @mammothdataco
Ambari
www.mammothdata.com | @mammothdataco
Hue
www.mammothdata.com | @mammothdataco
Hue Editing Oozie
www.mammothdata.com | @mammothdataco
Pig Script

REGISTER file:///usr/lib/pig/piggybank.jar;

define SUBSTRING org.apache.pig.piggybank.evaluation.string.SUBSTRING();

rows = load '$FILEPATH' using
    org.apache.pig.piggybank.storage.CSVExcelStorage('\u001a') as (
        a0:chararray,
        a1:chararray,
        a2:chararray,
        a3:chararray,
        a4:chararray,
        a5:chararray,
        a6:chararray,
        a7:chararray,
        a8:chararray,
        a9:chararray);

row = foreach rows GENERATE
    REPLACE((TRIM($0)),'NULL','') as orderid,
    REPLACE((TRIM($1)),'NULL','') as customerid,
    REPLACE((TRIM($2)),'NULL','') as customername,
    REPLACE((TRIM($3)),'NULL','') as address,
    REPLACE((TRIM($4)),'NULL','') as city,
    REPLACE((TRIM($5)),'NULL','') as state,
    REPLACE((TRIM($6)),'NULL','') as zip,
    REPLACE((TRIM($7)),'NULL','') as status,
    REPLACE((TRIM($8)),'NULL','') as
store row into 'stage.orders' using ...
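To run a parameterized script like this, $FILEPATH is supplied on the command line (the path and script name here are hypothetical):

pig -param FILEPATH=/landing/orders/orders.csv orders_to_stage.pig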
www.mammothdata.com | @mammothdataco
Thank you for attending!
{Percona University | Raleigh}