Big Data and Hadoop – History, Technical Deep Dive, and Industry Trends

Esther Kundin, Bloomberg LP

Upload: esther-kundin

Post on 12-Jul-2015

TRANSCRIPT

Page 1: Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

BIG DATA AND HADOOP

Esther Kundin

Bloomberg LP

History, Technical Deep Dive, and Industry

Trends

Page 2

About Me

Page 3

Big Data – What is It?

Page 4

Outline

• What Is Big Data?

• A History Lesson

• Hadoop – dive into the details

• HDFS

• MapReduce

• HBase

• Industry Trends

• Questions

Page 5

What is Big Data?

Page 6

A History Lesson

Page 7

Big Data Origins

• Indexing the web requires lots of storage

• Petabytes of data!

• Economic problem – reliable servers expensive!

• Solution:

• Cram in as many cheap machines as possible

• Replace them when they fail

• Solve reliability via software!

Page 8

Big Data Origins Cont’d

• DBs are slow and expensive

• Lots of unneeded features

RDBMS            NoSQL
ACID             Eventual consistency
Strongly-typed   No type checking
Complex joins    Get/Put
RAID storage     Commodity hardware

Page 9

Big Data Origins Cont’d

• Google publishes papers about:

• GFS (2003)

• MapReduce (2004)

• BigTable (2006)

• Hadoop, originally developed at Yahoo, accepted as an Apache top-level project in 2008

Page 10

Translation

GFS        →  HDFS
MapReduce  →  Hadoop MapReduce
BigTable   →  HBase

Page 11

Why Hadoop?

• Huge and growing ecosystem of services

• Pace of development is swift

• Tons of money and talent pouring in

Page 12

Diving into the details!

Page 13

Hadoop Ecosystem

• HDFS – Hadoop Distributed File System

• Pig: a scripting language that simplifies the creation of MapReduce jobs and excels at exploring and transforming data

• Hive: provides SQL-like access to your Big Data

• HBase: the Hadoop database

• HCatalog: for defining and sharing schemas

• Ambari: for provisioning, managing, and monitoring Apache Hadoop clusters

• ZooKeeper: an open-source server which enables highly reliable distributed coordination

• Sqoop: for efficiently transferring bulk data between Hadoop and relational databases

• Oozie: a workflow scheduler system to manage Apache Hadoop jobs

• Mahout: a scalable machine-learning library

Page 14

HDFS

• Hadoop Distributed File System

• The foundation – all other tools are built on top of it

• Allows for distributed workloads
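As a rough illustration of how HDFS turns one big file into distributed, redundant storage, here is a toy block-placement sketch. The block size, node names, and round-robin placement are illustrative (3-way replication is the HDFS default, and real HDFS placement is rack-aware); `place_blocks` is a hypothetical helper, not an HDFS API.

```python
# Toy sketch: split a file into fixed-size blocks and assign each block
# to several DataNodes for redundancy.
BLOCK_SIZE = 128 * 1024 * 1024  # a common HDFS block size, in bytes
REPLICATION = 3                 # the default HDFS replication factor

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [datanodes holding a replica])."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    placement = []
    for i in range(n_blocks):
        # Round-robin placement; real HDFS is rack-aware.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        placement.append((i, replicas))
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
plan = place_blocks(file_size=300 * 1024 * 1024, datanodes=nodes)
# 300 MB in 128 MB blocks -> 3 blocks, each with 3 replicas
```

Losing any single node leaves at least two replicas of every block, which is exactly the "solve reliability via software" idea from the history slides.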

Page 15

HDFS details

Page 16

HDFS Demo

Page 17

MapReduce

Page 18: Big Data and Hadoop - History, Technical Deep Dive, and Industry Trends

MapReduce demo

• To run a MapReduce job, you can use:

• A custom Java application

• Pig – a nicer interface

• Hadoop Streaming + any executable, e.g. Python

• Thanks to: http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

• Hive – SQL over MapReduce – “we put the SQL in NoSQL”
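In the Hadoop Streaming style shown in the tutorial linked above, a word-count job is a mapper that emits (word, 1) pairs and a reducer that sums counts over key-sorted input. The sketch below simulates the map → shuffle/sort → reduce pipeline locally with plain Python functions; the function names are illustrative, not a Hadoop API.

```python
# Word count in the Hadoop Streaming style. The reducer assumes its input
# is sorted by key, which is what Hadoop's shuffle/sort phase guarantees.
from itertools import groupby

def mapper(lines):
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reducer(pairs):
    # pairs must be sorted by key, as after Hadoop's shuffle/sort
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

# Simulate the pipeline: map -> sort -> reduce
text = ["the quick brown fox", "the lazy dog"]
counts = dict(reducer(sorted(mapper(text))))
# counts["the"] == 2
```

On a real cluster the same logic runs as two stdin/stdout scripts passed to the Hadoop Streaming JAR, with the cluster doing the sort between them.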

Page 19

HBase

• Database running on top of HDFS

• NoSQL – key/value store

• Distributed

• Good for sparse requests, rather than scans like MapReduce

• Sorted

• Strongly consistent for single-row operations
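A minimal sketch (not the HBase implementation) of the sorted key/value access pattern these bullets describe: point Gets and Puts, plus cheap range Scans, which keeping keys sorted makes possible.

```python
# Toy sorted key/value store with Get/Put/Scan, the HBase access pattern.
import bisect

class SortedKV:
    def __init__(self):
        self._keys, self._vals = [], []

    def put(self, key, value):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            self._vals[i] = value            # overwrite existing key
        else:
            self._keys.insert(i, key)        # keep keys sorted
            self._vals.insert(i, value)

    def get(self, key):
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._vals[i]
        return None

    def scan(self, start, stop):
        # range scan: start inclusive, stop exclusive, in key order
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return list(zip(self._keys[lo:hi], self._vals[lo:hi]))

store = SortedKV()
store.put("row-b", "vb")
store.put("row-a", "va")
```

Because keys are sorted, a scan touches only the contiguous slice it needs, instead of the whole-table pass a MapReduce job would make.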

Page 20

HBase Architecture

[Diagram: a Client and a ZooKeeper quorum of three ZK peers; two HMasters, a Meta RegionServer, and three RegionServers, all on top of HDFS]

Page 21

HBase Read

[Diagram: same architecture as the previous slide]

Client requests the Meta RegionServer address

Page 22

HBase Architecture

[Diagram: same architecture as the previous slide]

Client determines which RegionServer to contact and caches that data

Page 23

HBase Architecture

[Diagram: same architecture as the previous slide]

Client requests data from the RegionServer, which gets the data from HDFS

Page 24

HBase Demo

Page 25

HMaster

• Only one active master at a time – ensured by ZooKeeper

• Keeps track of all table metadata

• Used in table creation, modification, and deletion.

• Not used for reads

Page 26

Region Server

• This is the worker node of HBase

• Performs Gets, Puts, and Scans for the regions it handles

• Multiple regions are handled by each RegionServer

• On startup:

• Registers with ZooKeeper

• HMaster assigns it regions

• Physical blocks on HDFS may or may not be on the same machine

• Regions are split if they get too big

• Data is stored in a format called HFile

• Caching is what gives good performance; the cache is based on blocks, not rows
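The "regions are split if they get too big" bullet can be sketched as key-range routing: each region owns a contiguous range of the sorted row-key space, and a split just introduces a new start key. This is a toy model; the class, names, and split policy are illustrative, not HBase code.

```python
# Toy model of row -> region routing. Each region owns [start_key, next_start).
import bisect

class RegionMap:
    def __init__(self):
        # sorted region start keys; "" is the first region's start
        self.starts = [""]

    def region_for(self, row_key):
        # rightmost region whose start key <= row_key
        return bisect.bisect_right(self.starts, row_key) - 1

    def split(self, region_index, midpoint_key):
        # splitting a region just adds a new start key after it
        self.starts.insert(region_index + 1, midpoint_key)

regions = RegionMap()
regions.split(0, "m")   # the first region splits at key "m"
```

After the split, keys below "m" stay in region 0 and keys from "m" upward route to region 1; clients that cached the old mapping simply re-look it up, which is the caching step shown in the read slides.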

Page 27

HBase Write – step 1

[Diagram: a RegionServer with a WAL (on HDFS), a MemStore, and several HFiles]

RegionServer persists the write at the end of the WAL

Page 28

HBase Write – step 2

[Diagram: same RegionServer components as the previous slide]

RegionServer saves the write in a sorted map in memory, in the MemStore

Page 29

HBase Write – offline

[Diagram: same RegionServer components as the previous slide]

When the MemStore reaches a configurable size, it is flushed to an HFile
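The three write slides above can be condensed into a toy write path: append to the WAL, insert into an in-memory sorted map, and flush to an immutable sorted file when a size threshold is crossed. The class name and threshold are illustrative only, not HBase internals.

```python
# Toy HBase-style write path: WAL append -> MemStore insert -> HFile flush.
class ToyRegionServer:
    def __init__(self, flush_threshold=3):
        self.wal = []          # stand-in for the WAL on HDFS
        self.memstore = {}     # in-memory map, sorted at flush time
        self.hfiles = []       # each flush produces one sorted, immutable file
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))       # step 1: persist to the WAL
        self.memstore[key] = value          # step 2: sorted map in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                    # offline: flush to an "HFile"

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()

rs = ToyRegionServer(flush_threshold=2)
rs.put("b", 1)
rs.put("a", 2)   # second put crosses the threshold and triggers a flush
```

The WAL entry is what makes the write durable before it is acknowledged: if the server dies with data still in the MemStore, replaying the WAL reconstructs it.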

Page 30

Minor Compaction

• When writing a MemStore to an HFile, a minor compaction may be triggered

• Combines many small HFiles into one large one

• Saves disk reads

• May block further MemStore flushes, so try to keep them to a minimum
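A minor compaction can be sketched as merging several small sorted files into one, with later files winning on duplicate keys. The function below is a toy model, not HBase's actual compaction code.

```python
# Toy minor compaction: merge small sorted (key, value) files into one.
def minor_compact(hfiles):
    """Merge sorted lists of (key, value) pairs; later files win on ties."""
    merged = {}
    for kv_list in hfiles:       # later HFiles overwrite earlier keys
        merged.update(kv_list)
    return sorted(merged.items())

small = [[("a", 1), ("c", 3)], [("b", 2)], [("c", 30)]]
# merging yields one sorted file with the newest value for "c"
```

After compaction, a Get for any of these keys touches one file instead of three, which is the "saves disk reads" point above.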

Page 31

Major Compaction

• Happens at configurable times for the system

• e.g. once a week on weekends

• Defaults to once every 24 hours

• Resource-intensive

• Don’t set it to “never”

• Reads in all HFiles and makes sure there is one HFile per region per column family

• Purges deleted records

• Ensures that HDFS files are local

Page 32

Tuning your DB - HBase Keys

• Row key – a byte array

• Best performance for single-row Gets

• Best caching performance

• Key design:

• Should distribute well – usually accomplished by hashing the natural key

• MD5

• SHA-1
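The hashing advice above might look like the following sketch, which prefixes the natural key with part of its MD5 digest so that lexicographically close natural keys land in different regions. The key format (8 hex characters, an underscore, then the natural key) is a made-up example, not an HBase convention.

```python
# Sketch: spread writes across regions by prefixing keys with a hash.
import hashlib

def hashed_row_key(natural_key: str) -> bytes:
    digest = hashlib.md5(natural_key.encode("utf-8")).hexdigest()
    # keep the natural key as a readable suffix after the hash prefix
    return f"{digest[:8]}_{natural_key}".encode("utf-8")

key = hashed_row_key("customer-42")
```

The trade-off: sequential natural keys no longer hot-spot one RegionServer, but you lose the ability to range-scan by natural key order, so this suits Get-heavy workloads.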

Page 33

Tuning your DB - BlockCache

• Each RegionServer has a BlockCache where it stores file blocks it has already read

• Every read served from already-cached blocks improves performance

• You don’t want your blocks to be much bigger than your rows

• Modes of caching:

• 2-level LRU cache, by default

• Other options: BucketCache – can use DirectByteBuffers to manage off-heap RAM, for better garbage-collection behavior on the RegionServer
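The LRU behaviour can be sketched with a single-level LRU cache keyed by block (the real BlockCache is multi-level, as noted above). Caching whole blocks rather than rows is why blocks much larger than your rows waste cache space.

```python
# Toy single-level LRU block cache, in the spirit of the BlockCache.
from collections import OrderedDict

class BlockCache:
    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()

    def get(self, block_id):
        if block_id not in self.blocks:
            return None                      # cache miss: caller reads HDFS
        self.blocks.move_to_end(block_id)    # mark as most recently used
        return self.blocks[block_id]

    def put(self, block_id, data):
        self.blocks[block_id] = data
        self.blocks.move_to_end(block_id)
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict least recently used
```

Every Get that hits the cache skips an HDFS read entirely; every miss both pays that read and may evict a block another hot row needed.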

Page 34

Tuning your DB - Columns and Column Families

• All columns in a column family are accessed together for reads

• Different column families are stored in different HFiles

• All column families are flushed together when any MemStore is full

• Example:

• Storing package tracking information:

• Need package shipping info

• Need to store each location in the path

Page 35

Tuning your DB – Bloom Filters

• Can be set on rows or columns

• Keep an extra index of available keys

• Slows down reads and writes a bit

• Increases storage

• Saves time checking if keys exist

• Turn them on if it is likely that clients will request missing data
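A toy Bloom filter makes the listed trade-offs concrete: adds cost a little extra write work and memory, membership tests can return false positives but never false negatives, and a "no" answer lets a read skip the file entirely. The sizes and hash scheme below are illustrative, not HBase's implementation.

```python
# Minimal Bloom filter: a bit array plus several hash functions.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                        # an integer used as a bit array

    def _positions(self, key):
        # derive several positions by salting the key with the hash index
        for i in range(self.num_hashes):
            h = hashlib.md5(f"{i}:{key}".encode("utf-8")).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means definitely absent; True means possibly present
        return all(self.bits & (1 << pos) for pos in self._positions(key))
```

In the HBase setting, a filter answering "definitely absent" for a row key means the RegionServer never has to open that HFile at all, which is where the read-time savings for missing data come from.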

Page 36

Tuning your DB – Short-Circuit Reads

• HDFS exposes a service interface

• If the file is actually local, it is much faster to just read the HFile directly off the disk

Page 37

Current Industry Trends

Page 38

Big Data in Finance – the challenges

• Real-Time financial analysis

• Reliability

• “medium-data”

Page 39

What Bloomberg is Working on

• Working with Hortonworks on fixing real-time issues in Hadoop

• Creating a framework for reliably serving real-time data

• Presenting at Hadoop World and Hadoop Summit

• Open-source Chef recipes for running a Hadoop cluster on OpenStack-managed VMs

Page 40

Questions?

• Thank you!