IBM Research ® © 2007 IBM Corporation A Brief Overview of Hadoop Eco-System

Upload: phebe-loraine-gordon

Post on 19-Jan-2016


Hive

SQL-like language (HiveQL) to query data stored on HDFS

Example – "SELECT c.ID, c.Name, c.AGE, o.Amount FROM Customers c JOIN Orders o ON (c.ID = o.CUSTOMER)"

Data Model: Tables – column types (int, float, string, date, boolean)

Supports array / map / struct for JSON-like data

Metastore: namespace containing the set of tables, their columns and column types, and SerDe (serializer/deserializer) info

CLI

Other languages – Jaql, Pig
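The join in the example above can be tried without a cluster. The following is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive. The table contents and the OID column are made up for illustration, and the ORDER BY clause is added only to make the output deterministic. Hive itself would compile such a query into jobs over files in HDFS rather than run it in-process.

```python
import sqlite3


def run_join():
    """Run the slide's join query against hypothetical in-memory tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Customers (ID INTEGER, Name TEXT, AGE INTEGER);
        CREATE TABLE Orders (OID INTEGER, CUSTOMER INTEGER, Amount REAL);
        INSERT INTO Customers VALUES (1, 'Asha', 34), (2, 'Ravi', 28), (3, 'Meera', 41);
        INSERT INTO Orders VALUES (100, 1, 250.0), (101, 1, 75.5), (102, 3, 120.0);
        """
    )
    # Same SELECT ... JOIN as on the slide; ORDER BY is an addition for stable output.
    rows = conn.execute(
        "SELECT c.ID, c.Name, c.AGE, o.Amount "
        "FROM Customers c JOIN Orders o ON (c.ID = o.CUSTOMER) "
        "ORDER BY c.ID, o.Amount"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in run_join():
        print(row)
```

Customer 2 has no orders in this sample data, so the inner join leaves that row out of the result.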

HBase

Hadoop (MapReduce over HDFS) performs only batch processing; data is accessed only sequentially.

One has to scan the entire dataset even for the simplest of jobs.

HBase provides random read/write access to data in HDFS.

Data Model –

A table is a collection of rows

A row is a collection of column families

A column family is a collection of columns

A column is a collection of key-value pairs
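The nesting described above can be pictured as dictionaries inside dictionaries. This is only a toy model in plain Python, not the HBase client API: the class name ToyTable, its methods, and the integer counter used as a timestamp are all assumptions made for illustration.

```python
class ToyTable:
    """Toy model: row key -> column family -> column -> {timestamp: value}."""

    def __init__(self, families):
        self.families = set(families)  # column families are declared up front
        self.rows = {}
        self._clock = 0  # stand-in for a real timestamp (assumption)

    def put(self, row_key, family, column, value):
        """Store a new version of one cell."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self._clock += 1
        cell = (
            self.rows.setdefault(row_key, {})
            .setdefault(family, {})
            .setdefault(column, {})
        )
        cell[self._clock] = value

    def get(self, row_key, family, column):
        """Return the most recently written value, or None if the cell is absent."""
        versions = self.rows.get(row_key, {}).get(family, {}).get(column, {})
        if not versions:
            return None
        return versions[max(versions)]


if __name__ == "__main__":
    table = ToyTable(["personal", "orders"])
    table.put("row1", "personal", "name", "Asha")
    table.put("row1", "personal", "name", "Asha K.")
    print(table.get("row1", "personal", "name"))
```

Looking up a cell by row key, column family, and column is a direct dictionary walk, which is the random-access property the slide contrasts with sequential scans.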

HBase

Reading – Get and Scan. A reader always sees the most recently written value.

Rows are ordered by row key.

HBase is not a SQL database: it is not relational and has no joins or built-in secondary indexes.

Horizontally Scalable
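Because rows are kept in key order, a Scan can return a contiguous key range without touching the rest of the table, while a Get fetches a single row by key. The sketch below imitates this in plain Python with a sorted key list and bisect. The function names and sample keys are hypothetical, and this is not the HBase client API. Like an HBase scan by default, the start key is inclusive and the stop key is exclusive.

```python
from bisect import bisect_left


def get(rows, row_key):
    """Point lookup of a single row by its key."""
    return rows.get(row_key)


def scan(rows, start_row, stop_row):
    """Yield (row_key, value) for start_row <= row_key < stop_row, in key order."""
    keys = sorted(rows)  # a real store keeps keys sorted; this toy sorts on demand
    lo = bisect_left(keys, start_row)
    hi = bisect_left(keys, stop_row)
    for key in keys[lo:hi]:
        yield key, rows[key]


if __name__ == "__main__":
    rows = {"user2": "b", "user10": "c", "user1": "a", "video1": "d"}
    print(get(rows, "user2"))
    print(list(scan(rows, "user", "user2")))
```

Keys compare lexicographically, so "user10" sorts before "user2"; row-key design usually has to take this ordering into account.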

Oozie

Workflow management and coordination of workflows

A workflow consists of action nodes (MR, Pig, Hive) and control nodes, specified through an XML file.
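The split between action nodes and control nodes can be illustrated with a small graph walker. This is a plain-Python toy, not Oozie's XML format or engine: the node names, the dictionary layout, and the callables standing in for MR or Hive jobs are all hypothetical. It mirrors the idea that each action has an "ok" and an "error" transition, and that control nodes such as start, end, and kill decide where the workflow begins and stops.

```python
# Hypothetical workflow: two actions in sequence, with a kill node for failures.
WORKFLOW = {
    "start": {"type": "start", "to": "clean-logs"},
    "clean-logs": {"type": "action", "ok": "aggregate", "error": "fail"},
    "aggregate": {"type": "action", "ok": "end", "error": "fail"},
    "end": {"type": "end"},
    "fail": {"type": "kill"},
}


def run_workflow(nodes, actions):
    """Follow transitions from the start node until an end or kill node is reached.

    `actions` maps an action node name to a callable returning True on success.
    Returns (final_status, list_of_action_names_that_ran).
    """
    current = nodes["start"]["to"]
    trace = []
    while True:
        node = nodes[current]
        kind = node["type"]
        if kind == "action":
            trace.append(current)
            succeeded = actions[current]()
            current = node["ok"] if succeeded else node["error"]
        elif kind == "end":
            return "SUCCEEDED", trace
        elif kind == "kill":
            return "KILLED", trace
        else:
            raise ValueError(f"unsupported node type: {kind}")


if __name__ == "__main__":
    jobs = {"clean-logs": lambda: True, "aggregate": lambda: True}
    print(run_workflow(WORKFLOW, jobs))
```

Fork, join, and decision nodes are left out here to keep the sketch short.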

Cascading and Scalding

Word-Count in Java
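Word count is the canonical MapReduce example. As a language-neutral illustration of its three phases, here is a minimal in-memory sketch in plain Python. It is not the Hadoop Java API, and the function names are chosen for this example only.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Sum the counts collected for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce over an iterable of text lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick fox", "The lazy dog the end"]))
```

In a real job the map and reduce functions run on different machines and the shuffle moves data across the network; here everything happens in one process.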

Apache Mahout

Cascading

A simple, high-level Java API for MapReduce that is easy to understand and work with

Scalding

The power of Scala on top of Cascading

No boilerplate code

Sqoop

Apache Sqoop is designed for efficiently transferring bulk data between Apache Hadoop and relational databases (RDBMS)

Imports data from external structured datastores into HDFS or related systems like HBase
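The idea of a parallel bulk import can be sketched as follows: find the minimum and maximum of a split column, divide that key range among several workers, and have each worker write its slice as delimited text. The code below is a loose imitation in plain Python using the built-in sqlite3 module as the source database. It is not Sqoop itself: the function names are hypothetical, the "files" are returned as an in-memory dictionary rather than written to HDFS, the slices are read one after another instead of in parallel, and an integer split column is assumed.

```python
import sqlite3


def split_ranges(lo, hi, num_splits):
    """Divide the inclusive integer range [lo, hi] into at most num_splits ranges."""
    size = (hi - lo + 1 + num_splits - 1) // num_splits  # ceiling division
    ranges = []
    start = lo
    while start <= hi:
        end = min(start + size - 1, hi)
        ranges.append((start, end))
        start = end + 1
    return ranges


def import_table(conn, table, split_by, num_splits):
    """Return {part_file_name: [csv_line, ...]}, one entry per key range.

    `table` and `split_by` are interpolated into SQL, so they must be trusted names.
    """
    lo, hi = conn.execute(
        f"SELECT MIN({split_by}), MAX({split_by}) FROM {table}"
    ).fetchone()
    if lo is None:  # empty table: nothing to import
        return {}
    parts = {}
    for i, (start, end) in enumerate(split_ranges(lo, hi, num_splits)):
        rows = conn.execute(
            f"SELECT * FROM {table} WHERE {split_by} BETWEEN ? AND ? ORDER BY {split_by}",
            (start, end),
        ).fetchall()
        parts[f"part-m-{i:05d}"] = [",".join(str(v) for v in row) for row in rows]
    return parts


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)", [(i, f"c{i}") for i in range(1, 11)]
    )
    for name, lines in import_table(conn, "customers", "id", 4).items():
        print(name, lines)
```

Splitting on an evenly distributed key keeps the slices similar in size; a skewed key would leave some workers with most of the rows.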

Mahout