bw tech meetup: hadoop and the rise of big data

Hadoop and the Rise of Big Data

February 21, 2013
Donald Miner
@donaldpminer
Donald.Miner@emc.com

About Don

Hadoop

• Distributed platform up to thousands of nodes
• Data storage and application framework
• Started at Yahoo!
• Open source
• Based on a few Google papers (2003, 2004)
• Runs on commodity hardware

I’M HERE TO TELL YOU WHY HADOOP IS AWESOME

Hadoop users
• Yahoo!
• Facebook
• eBay
• AOL
• Riot Games
• ComScore
• Twitter
• LinkedIn

Hadoop Companies
• Cloudera, Hortonworks, EMC/Greenplum, IBM
• Numerous startups

Buzzword glossary

• Unstructured & Structured Data
• NoSQL
• Big Data (volume, velocity, variety)
• Data Science
• Cloud computing

Hadoop component overview

• Core components:
  – HDFS (Hadoop Distributed File System)
  – MapReduce (data analysis framework)

• Ecosystem
  – HBase (key-value store)
  – Pig (high-level data analysis language)
  – Hive (SQL-like data analysis language)
  – ZooKeeper (coordination service; stores small pieces of shared metadata)
  – Other stuff

Use cases

• Text processing
  – Indexing, counting, processing
• Large-scale reports
• Data science
• Mixing data sources (data lakes)
• Ad targeting
• Image/Video/Audio processing
• Cybersecurity

HDFS

• Stores files in folders (that’s it)
  – Nobody cares what’s in your files
• Chunks large files into blocks (~64MB-1GB)
• Blocks are scattered all over the place
• 3 replicas of each block (better safe than sorry)
• One NameNode (might be sorry: it is a single point of failure)
  – Knows which computers blocks live on
  – Knows which blocks belong to which files
• One DataNode per computer (slaves!)
  – Hosts the blocks that make up files
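The bookkeeping on this slide can be sketched in a few lines of Python. This is a toy model, not HDFS code: `split_into_blocks`, `place_replicas`, the file path, and the DataNode names are all hypothetical, the round-robin placement is an assumption (real HDFS placement is more involved), and the block size is shrunk to a few bytes so the output stays readable.

```python
import itertools

BLOCK_SIZE = 4    # bytes; tiny on purpose, real blocks are tens of MB or more
REPLICATION = 3   # three copies of every block, as on the slide


def split_into_blocks(data, block_size=BLOCK_SIZE):
    """Chunk a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin.

    Hypothetical policy for illustration only.
    """
    if replication > len(datanodes):
        raise ValueError("need at least as many DataNodes as replicas")
    ring = itertools.cycle(datanodes)
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [next(ring) for _ in range(replication)]
    return placement


data = b"hello hadoop!!"                      # 14 bytes -> 4 blocks
blocks = split_into_blocks(data)
datanodes = ["dn1", "dn2", "dn3", "dn4"]      # made-up host names

# The two maps the NameNode keeps in this toy model:
file_to_blocks = {"/logs/a.txt": list(range(len(blocks)))}   # file -> block ids
block_to_nodes = place_replicas(len(blocks), datanodes)      # block id -> hosts

for block_id in file_to_blocks["/logs/a.txt"]:
    print(block_id, blocks[block_id], block_to_nodes[block_id])
```

Losing any single DataNode in this sketch still leaves two copies of every block, which is the "better safe than sorry" point.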

HDFS Demonstration

MapReduce
• Analyzes data in HDFS where the data is
• Jobs are split into Mappers and Reducers
• JobTracker – keeps track of running jobs
• TaskTracker – one per computer, executes tasks
• Mappers (you code this)
  – Loads data from HDFS
  – Filter, transform, parse
  – Outputs (key, value) pairs
• Reducers (you code this, too)
  – Groups by the mapper’s output key
  – Aggregate, count, statistics
  – Outputs to HDFS
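A minimal word-count sketch of the mapper/reducer split, in plain Python rather than the Hadoop Java API. The `run_job` function is a hypothetical single-process stand-in for the framework (it does the grouping by key that Hadoop performs between the two phases); only `mapper` and `reducer` correspond to the parts "you code".

```python
from collections import defaultdict


def mapper(line):
    """Parse one input line and emit (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def reducer(key, values):
    """Aggregate all values that share a key."""
    return key, sum(values)


def run_job(lines):
    """Stand-in for the framework: map every line, group by key, reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))


counts = run_job(["big data big cluster", "data locality"])
print(counts)
```

In a real job the input lines would come from HDFS blocks and the mappers would run on the machines holding those blocks; the shape of the two functions stays the same.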

MapReduce Demonstration

Hadoop ecosystem

• HDFS and MapReduce don’t do everything
• Pig – high-level language

    -- assumes a relation named logs (with userAgent and timeMicroSec fields) was loaded earlier
    grpd = GROUP logs BY userAgent;
    counts = FOREACH grpd GENERATE group, AVG(logs.timeMicroSec)/1.0E+06 AS loadTimeSec;
    byCount = ORDER counts BY loadTimeSec DESC;
    top = LIMIT byCount 15;

• Hive – high-level SQL-like language

    SELECT grp, SUM(col2), COUNT(*) FROM table1 GROUP BY grp;

• HBase – key/value store
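For readers who do not know Pig, here is roughly what the Pig script on this slide computes, written as ordinary Python. The field names `userAgent` and `timeMicroSec` come from the script; the function name `slowest_user_agents` and the sample records are made up for illustration.

```python
from collections import defaultdict


def slowest_user_agents(logs, limit=15):
    """Average load time in seconds per user agent, slowest first, top `limit` rows."""
    # GROUP logs BY userAgent
    grouped = defaultdict(list)
    for record in logs:
        grouped[record["userAgent"]].append(record["timeMicroSec"])

    # FOREACH ... GENERATE group, AVG(timeMicroSec) / 1.0E+06
    averages = [
        (agent, sum(times) / len(times) / 1.0e6)
        for agent, times in grouped.items()
    ]

    # ORDER ... BY loadTimeSec DESC, then LIMIT
    averages.sort(key=lambda row: row[1], reverse=True)
    return averages[:limit]


# Invented sample records, not data from the talk.
logs = [
    {"userAgent": "A", "timeMicroSec": 2_000_000},
    {"userAgent": "A", "timeMicroSec": 4_000_000},
    {"userAgent": "B", "timeMicroSec": 1_000_000},
]
result = slowest_user_agents(logs)
print(result)
```

The Pig version is shorter and runs as MapReduce jobs across the cluster, which is the reason the higher-level languages exist.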

Cool thing #1: Linear Scalability

• HDFS and MapReduce scale linearly
• If you have twice as many computers, things run twice as fast
• If you have twice as much data, things run twice as slow
• If you have twice as many computers, you can store twice as much data
• This stays true (some minor caveats)
• DATA LOCALITY!!
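The linear-scaling claims amount to simple arithmetic. The sketch below is an idealized model only: `estimated_runtime_hours`, `usable_capacity_tb`, and the per-node throughput and disk figures are hypothetical, and the model ignores the "minor caveats" (startup overhead, skew, shuffle traffic).

```python
def estimated_runtime_hours(data_tb, nodes, tb_per_node_hour=1.0):
    """Idealized model: runtime = data / (nodes * per-node throughput)."""
    if nodes <= 0 or tb_per_node_hour <= 0:
        raise ValueError("nodes and throughput must be positive")
    return data_tb / (nodes * tb_per_node_hour)


def usable_capacity_tb(nodes, disk_tb_per_node, replication=3):
    """Raw disk divided by the replication factor."""
    return nodes * disk_tb_per_node / replication


baseline = estimated_runtime_hours(data_tb=100, nodes=10)
more_nodes = estimated_runtime_hours(data_tb=100, nodes=20)   # twice the computers
more_data = estimated_runtime_hours(data_tb=200, nodes=10)    # twice the data

print(baseline, more_nodes, more_data)
print(usable_capacity_tb(10, 12), usable_capacity_tb(20, 12))
```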

Cool thing #2: Schema on Read

LOAD DATA ???? PROFIT!!

Data is parsed/interpreted as it is loaded out of HDFS

What implications does this have?

Before: ETL, schema design, tossing out original data

Keep original data around!
Have multiple views of the same data!
Store first, figure out what to do with it later!
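A small sketch of schema on read, assuming a made-up space-separated log format. The raw lines are stored untouched; each "view" applies its own parsing only when the data is read. `RAW_LINES` and both view functions are hypothetical names for illustration.

```python
# Raw data as it might sit in storage: no schema was applied when it was written.
RAW_LINES = [
    "2013-02-21T10:00:00 alice GET /index.html 200",
    "2013-02-21T10:00:05 bob GET /missing 404",
    "2013-02-21T10:00:09 alice POST /login 200",
]


def view_requests_per_user(lines):
    """View 1: interpret only the user field and count requests per user."""
    counts = {}
    for line in lines:
        _timestamp, user, *_rest = line.split()
        counts[user] = counts.get(user, 0) + 1
    return counts


def view_error_paths(lines):
    """View 2: interpret only path and status, keeping paths that returned 4xx/5xx."""
    errors = []
    for line in lines:
        *_head, path, status = line.split()
        if int(status) >= 400:
            errors.append(path)
    return errors


per_user = view_requests_per_user(RAW_LINES)
error_paths = view_error_paths(RAW_LINES)
print(per_user, error_paths)
```

Because the original lines are never rewritten, a third view can be added later without reloading or re-transforming anything.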

Cool thing #3: Transparent Parallelism

Network programming?

Inter-process communication?

Threading?

Distributed stuff?

With MapReduce, I DON’T CARE

MapReduce

Solution

… I just have to fit my solution into this tiny box

Fault tolerance?

Code deployment?

RPC?

Message passing?

Locking?

Data center fires?

Cool thing #4: Cheap

• Commodity hardware (meh)
• Open source (people cost more though)
• Add more hardware later

How to get started

• Install Hadoop in a Linux VM
  – Wait how is this helpful?? Hadoop is distributed!

• Use Google (seriously)

• Some prerequisites: Java, Linux, Data, Time

Stuff Hadoop is good at

• Batch processing
• Processing lots of data
• Outputting lots of data
• Storing lots of historical data
• Flexible analysis of data
• Dealing with unstructured or structured data

Stuff Hadoop is not good at

• Hadoop is a freight truck, not a sports car
• Updating data (think “append-only”)
• Being easy to use
  – Java
  – Administration

• Hadoop is not good storage (don’t throw away your EMC stuff!)

QUESTIONS?
