Introduction to Hadoop


Page 1: Introduction to Hadoop

Hadoop – Taming Big Data
Jax ArcSig, June 2012

Ovidiu Dimulescu

Page 2: Introduction to Hadoop

About @odimulescu

• Working on the Web since 1997
• Likes stuff well done
• Into engineering cultures and all-around automation
• Speaker at local user groups
• Organizer for the local Mobile User Group, jaxmug.com

Page 3: Introduction to Hadoop

Agenda

• Introduction
• Use cases
• Architecture
• MapReduce examples
• Q&A

Page 4: Introduction to Hadoop

What is Hadoop?

• Apache Hadoop is an open-source Java software framework for running data-intensive applications on large clusters of commodity hardware

• Created by Doug Cutting (Lucene & Nutch creator)

• Named after Doug's son's toy elephant

Page 5: Introduction to Hadoop

What is it solving, and how?

• Processing diverse, large datasets in practical time at low cost

• Consolidates data in a distributed file system

• Moves computation to data rather than data to computation

• Simpler programming model

Page 6: Introduction to Hadoop

Why does it matter?

• Volume, Velocity, Variety and Value

• Datasets do not fit on local HDDs, let alone in RAM

• Data grows at a tremendous pace

• Data is heterogeneous

• Scaling up is expensive (licensing, CPUs, disks, interconnects, etc.)

• Scaling up has a ceiling (physical, technical, etc.)

Page 7: Introduction to Hadoop

Why does it matter?

[Chart: data types by volume]

• Complex data - 80%: images, video, logs, documents, call records, sensor data, mail archives
• Structured data - 20%: user profiles, CRM, HR records

* Chart source: IDC White Paper

Page 8: Introduction to Hadoop

Why does it matter?

• Need to process a 10 TB dataset

• Assume sustained transfer of 75 MB/s

• On 1 node - scanning the data takes ~ 2 days

• On a 10-node cluster - scanning takes ~ 5 hrs (rough arithmetic below)

• Low $/TB for commodity drives

• Low-end servers are multicore capable
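As a rough sanity check of those numbers (a back-of-the-envelope estimate assuming purely sequential scanning and ignoring seek and coordination overhead):

10 TB ≈ 10,000,000 MB
10,000,000 MB ÷ 75 MB/s ≈ 133,000 s ≈ 37 hrs — call it ~2 days with real-world overhead
÷ 10 nodes ≈ 3.7 hrs — call it ~5 hrs with overhead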

Page 9: Introduction to Hadoop

Use  Cases    

• ETL - Extract, Transform, Load

• Recommendation Engines

• Customer Churn Analysis

• Ad Targeting

• Data "sandbox"

Page 10: Introduction to Hadoop

Use Cases - Typical ETL

[Diagram: Live DB and Logs feed ETL jobs (ETL 1, ETL 2) that load a Reporting DB and a Data Warehouse, which in turn serve BI Applications]

Page 11: Introduction to Hadoop

Use Cases - Hadoop ETL

[Diagram: Live DB and Logs are loaded into Hadoop, which does the transformation work and loads the results into the Reporting DB and Data Warehouse serving BI Applications]

Page 12: Introduction to Hadoop

Use Cases – Analysis methods

• Pattern recognition

• Index building

• Text mining

• Collaborative filtering

• Prediction models

• Sentiment analysis

• Graph creation and traversal

Page 13: Introduction to Hadoop

Who  uses  it?  

Page 14: Introduction to Hadoop

Who  supports  it?  

Page 15: Introduction to Hadoop

Why  use  Hadoop?  

• Practical to do things that were previously not

ü Shorter execution time
ü Costs less
ü Simpler programming model

• Open system with greater flexibility

• Large and growing ecosystem

Page 16: Introduction to Hadoop

Hadoop  –  Silver  bullet?  

• Not a database replacement

• Not a data warehouse (it complements one)

• Not for interactive reporting

• Not a general-purpose storage mechanism

• Not for problems that are not parallelizable in a shared-nothing fashion

Page 17: Introduction to Hadoop

Architecture  –  Design  Axioms  

•  System  Shall  Manage  and  Heal  Itself  

•  Performance  Shall  Scale  Linearly    

• Compute Should Move to Data

• Simple Core, Modular and Extensible

Page 18: Introduction to Hadoop

Architecture  –  Core  Components  

HDFS

Distributed filesystem designed for low-cost storage and high-bandwidth access across the cluster.

MapReduce

Programming model for processing and generating large data sets.

Page 19: Introduction to Hadoop

Architecture  –  Official  Extensions  

[Diagram: the official extension stack, by layer]

• Storage: HDFS, HBase
• Data Processing: MapReduce framework
• Management: ZooKeeper, Chukwa
• Data Access: Pig (Data Flow), Hive (SQL), Avro

Page 20: Introduction to Hadoop

Architecture – CDH Distribution

1. CDH – Cloudera's Distribution of Hadoop
2. Image credit: Cloudera presentation @ Microstrategy World 2011

Page 21: Introduction to Hadoop

HDFS - Design

• Based on Google's GFS

• Files are stored as blocks (64 MB default size)

• Configurable data replication (3x, rack-aware; see the sketch below)

• Fault tolerant; expects HW failures

• Huge files; expects streaming access, not low latency

• Mostly WORM (write once, read many)
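Both the block size and the replication factor are tunable per cluster, and even per client. A minimal sketch using the Hadoop 1.x-era Configuration API (the property names are the 1.x ones; HdfsSettings is an illustrative class name):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class HdfsSettings {
  public static void main(String[] args) throws Exception {
    // Loads core-site.xml / hdfs-site.xml from the classpath if present
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 3);                  // replicate each block 3x (the default)
    conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // 64 MB blocks (the 1.x default)
    // Files written through this FileSystem handle pick up the settings above
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Connected to: " + fs.getUri());
  }
}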

Page 22: Introduction to Hadoop

HDFS - Architecture

[Diagram: Namenode (NN) master coordinating Datanodes 1..N]

Namenode - Master
• Filesystem metadata
• Controls read/write to files
• Manages block replication
• Applies the transaction log on startup

Datanode - Slaves
• Read/write blocks to/from clients
• Replicate blocks at the master's request

Read path: the client asks the NN for a file, the NN returns the DNs that host its blocks, and the client reads the data directly from those DNs.
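That read path is hidden behind the FileSystem API. A minimal client sketch (the /data/sample.txt path is hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // cluster settings from core-site.xml
    FileSystem fs = FileSystem.get(conf);
    // Opening a file consults the NameNode for the block locations...
    FSDataInputStream in = fs.open(new Path("/data/sample.txt"));
    byte[] buf = new byte[4096];
    int n;
    while ((n = in.read(buf)) > 0) {
      System.out.write(buf, 0, n);  // ...while the bytes stream directly from the DataNodes
    }
    in.close();
    fs.close();
  }
}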

Page 23: Introduction to Hadoop

HDFS – Fault tolerance

• DataNode

§ Uses CRC checksums to detect corruption
§ Data is replicated on other nodes (3x)

• NameNode

§ Checkpoint NameNode
§ Backup NameNode
§ Failover is manual

Page 24: Introduction to Hadoop

MapReduce - Design

• Based on Google's MR paper
• Borrows from functional programming
• Simpler programming model

§ map(in_key, in_value) -> (out_key, intermediate_value) list

§ reduce(out_key, intermediate_value list) -> out_value list

• No user-level synchronization or coordination is required

Input -> Map -> Reduce -> Output

Page 25: Introduction to Hadoop

MapReduce - Architecture

[Diagram: JobTracker (JT) master coordinating TaskTrackers 1..N]

JobTracker - Master
• Accepts MR jobs submitted by clients
• Assigns Map and Reduce tasks to TaskTrackers, aware of data locality
• Monitors tasks and TaskTracker status, re-executes tasks upon failure
• Speculative execution

TaskTracker - Slaves
• Run Map and Reduce tasks received from the JobTracker
• Manage storage and transmission of intermediate output

Client launches a job, supplying:
- Configuration
- Mapper
- Reducer
- Input
- Output

Page 26: Introduction to Hadoop

Hadoop - Core Architecture

[Diagram: the JobTracker and NameNode masters sit above the worker nodes; each worker runs a TaskTracker and a DataNode side by side, so compute is co-located with storage]

Hadoop acts like a mini OS:
• File system (HDFS)
• Scheduler (MapReduce)

Page 27: Introduction to Hadoop

MapReduce – Head First Style

Credit: http://www.slideshare.net/esaliya/mapreduce-in-simple-terms

Page 28: Introduction to Hadoop

MapReduce – Mapper Types

One-to-One
map(k, v) = emit(k, transform(v))

Exploder
map(k, v) = foreach p in v: emit(k, p)

Filter
map(k, v) = if cond(v) then emit(k, v)
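As a concrete instance of the Filter type, a minimal Java sketch that keeps only non-empty records (FilterMapper and the condition are illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class FilterMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Filter mapper: emit (k, v) unchanged only when cond(v) holds
    if (value.getLength() > 0) {
      context.write(key, value);
    }
  }
}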

Page 29: Introduction to Hadoop

MapReduce  –  Reducer  Types  

Sum Reducer

reduce(k, vals) =
    sum = 0
    foreach v in vals: sum += v
    emit(k, sum)

Page 30: Introduction to Hadoop

MapReduce  –  High  level  pipeline  

[Diagram: mapper outputs with keys K1 and K2 are partitioned during the shuffle so that all values for a given key arrive at the same reducer]

Page 31: Introduction to Hadoop

MapReduce  –  Detailed  pipeline  

Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html

Page 32: Introduction to Hadoop

MapReduce  –  Combiner  Phase  

• Optional
• Runs on mapper nodes after the map phase
• "Mini-reduce," only on local map output
• Used to save bandwidth before sending data to the full reducer
• The Reducer can be the Combiner if:

1. Output key/values are the same as input key/values
2. The operation is commutative and associative (SUM, MAX ok, but AVG is not)

Diagram: http://developer.yahoo.com/hadoop/tutorial/module4.html
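A quick check on condition 2: SUM is safe because partial sums can simply be summed again, but the average of partial averages is not the overall average. For example, avg(1, 2, 3) = 2, yet if one mapper holds (1, 2) and another holds (3), combining per-mapper averages gives avg(avg(1, 2), avg(3)) = avg(1.5, 3) = 2.25.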

Page 33: Introduction to Hadoop

Installation

1.  Download  &  configure  single-­‐node  cluster  

hadoop.apache.org/common/releases.html  

2.  Download  a  demo  VM    

Cloudera, Hortonworks

3.  Use  a  hosted  environment  (Amazon’s  EMR,  Azure)  

Page 34: Introduction to Hadoop

Installation – Platform Notes

Production

    Linux – official

Development

    Linux
    OS X
    Windows via Cygwin
    *nix

Page 35: Introduction to Hadoop

MapReduce – Client Languages

Java, any JVM language – native

C++ – Pipes framework – socket IO

Any language – Streaming – stdin/stdout

Pig Latin, Hive HQL, C via JNI

hadoop pipes -input path_in -output path_out -program exec_program

hadoop jar hadoop-streaming.jar -mapper map_prog -reducer reduce_prog -input path_in -output path_out

hadoop jar jar_path main_class input_path output_path

Page 36: Introduction to Hadoop

MapReduce  –  Client  Anatomy  

• Main Program (aka Driver)

    Configures the Job
    Initiates the Job

• Input Location
• Mapper
• Combiner (optional)
• Reducer
• Output Location
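A minimal driver sketch wiring those pieces together in the org.apache.hadoop.mapreduce API (WordCountDriver and the mapper/reducer class names are illustrative; they match the word-count sketches on the following slides):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");          // Hadoop 1.x-era constructor
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);   // optional: SUM is commutative and associative
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input location
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output location
    System.exit(job.waitForCompletion(true) ? 0 : 1);        // configure, then initiate
  }
}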

Page 37: Introduction to Hadoop

MapReduce  –  Word  Count  Example  
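In outline, with a made-up two-line input:

Input:    "the quick fox" / "the lazy fox"
Map:      (the,1) (quick,1) (fox,1) (the,1) (lazy,1) (fox,1)
Shuffle:  the -> [1,1], quick -> [1], fox -> [1,1], lazy -> [1]
Reduce:   (the,2) (quick,1) (fox,2) (lazy,1)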

Page 38: Introduction to Hadoop

MapReduce  –  C#  Mapper  

Page 39: Introduction to Hadoop

MapReduce  –  C#  Reducer  

Page 40: Introduction to Hadoop

MapReduce  –  Java  Mapper  
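A minimal word-count mapper sketch in the org.apache.hadoop.mapreduce API (an Exploder-style mapper that emits (word, 1) per token; WordCountMapper is an illustrative name):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Emit (token, 1) for every word in the input line
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}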

Page 41: Introduction to Hadoop

MapReduce  –  Java  Reducer  
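The matching sum reducer, again a minimal sketch (it can also serve as the combiner, since summing is commutative and associative):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // Sum the per-word counts emitted by the mappers
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}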

Page 42: Introduction to Hadoop

MapReduce  –  JavaScript  Mapper  

Page 43: Introduction to Hadoop

MapReduce  –  JavaScript  Reducer  

Page 44: Introduction to Hadoop

Summary  

Hadoop is an economical, scalable, distributed data processing system which enables data:

ü Consolidation (Structured or Not)
ü Query Flexibility (Any Language)
ü Agility (Evolving Schemas)

Page 45: Introduction to Hadoop

Questions?

Page 46: Introduction to Hadoop

References  

Hadoop at Yahoo!, by Yahoo! Developer Network
MapReduce in Simple Terms, by Saliya Ekanayake
Hadoop Architecture, by Philippe Julio
10 Hadoop-able Problems, by Cloudera
Hadoop, An Industry Perspective, by Amr Awadallah
Anatomy of a MapReduce Job Run, by Tom White
MapReduce Jobs in Hadoop