Intro to Big Data - choco devday, 23-01-2014
Big Data - Askhat Murzabayev

TRANSCRIPT

Page 1: Intro to big data   choco devday - 23-01-2014

Big Data
Askhat Murzabayev

Page 2: Intro to big data   choco devday - 23-01-2014

Intro to Big Data
Askhat Murzabayev

Page 3: Intro to big data   choco devday - 23-01-2014

Explicit attempt at self-promotion

• 23 years old

• Suleyman Demirel University, BSc in CS 2012

• Chocomart.kz - Apr 2013 - present

• Product Manager at Twitter, Dec 2012 - May 2013 (Twitter API & Android app)

• SE at Twitter, Sep 2011 - Dec 2012 (Search, Relevance and Machine Learning dept.)

• Sold diploma thesis (computer vision algorithm) to Microsoft; used in Bing Maps

• Sold an image processing algorithm (better pattern recognition of objects) to Microsoft Research

• Scalable Machine Learning algorithms are my passion

Page 4: Intro to big data   choco devday - 23-01-2014
Page 5: Intro to big data   choco devday - 23-01-2014

Numbers

• 1 zettabyte = 1,000,000 petabytes

• 2006 - 0.18 zettabytes

• 2011 - 1.8 zettabytes

• 2012 - 2.8 zettabytes(3% analyzed)

• Estimate: 2020 - 40 zettabytes

Page 6: Intro to big data   choco devday - 23-01-2014

Numbers Everyone Should Know

• L1 cache reference 0.5 ns

• Branch mispredict 5 ns

• L2 cache reference 7 ns

• Mutex lock/unlock 100 ns

• Main memory reference 100 ns

• Compress 1K bytes with Zippy 10,000 ns

• Send 2K bytes over 1 Gbps network 20,000 ns

• Read 1 MB sequentially from memory 0.25 ms

Page 7: Intro to big data   choco devday - 23-01-2014

Numbers Everyone Should Know part 2

• Round trip within same datacenter 0.5 ms

• Disk seek 10 ms

• Read 1 MB sequentially from network 10 ms

• Read 1 MB sequentially from disk 30 ms

• Send packet CA->Netherlands->CA 150 ms

• Send package via Kazpost - everlasting

Page 8: Intro to big data   choco devday - 23-01-2014

Conclusion

• time(CPU) < time(RAM) < time(Disk) < time(Network)

• amount(CPU) < amount(RAM) <<< amount(Disk) < amount(Network)

Page 9: Intro to big data   choco devday - 23-01-2014

Problem statement

• Tons of data

• F*cking tons of data

• We need to process it

• Process it fast

• The idea is to “parallelize” the processing of data

Page 10: Intro to big data   choco devday - 23-01-2014

The “Joys” of Real Hardware (typical first year for a new cluster, per Jeff Dean)

• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

• ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

• ~1 network rewiring (rolling ~5% of machines down over 2-day span)

Page 11: Intro to big data   choco devday - 23-01-2014

• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

• ~5 racks go wonky (40-80 machines see 50% packet loss)

• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

• ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)

• ~3 router failures (have to immediately pull traffic for an hour)

• ~dozens of minor 30-second blips for DNS

• ~1000 individual machine failures

• ~thousands of hard drive failures

• slow disks, bad memory, misconfigured machines, flaky machines, etc

Page 12: Intro to big data   choco devday - 23-01-2014

Problem statement (2)

• A lot of data

• Fast processing

• Reliable

• “Cheap”

• Scale

• Shouldn’t require much manual work

• Should work with many programming languages and platforms

Page 13: Intro to big data   choco devday - 23-01-2014

• Google File System (GFS)

• Distributed filesystem

• Fault tolerant

• MapReduce

• Distributed processing framework

Page 14: Intro to big data   choco devday - 23-01-2014

Apache Hadoop

• “Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project”

Page 15: Intro to big data   choco devday - 23-01-2014

Ecosystem

• Apache Hadoop

• Commons

• HDFS (Hadoop Distributed FileSystem)

• MapReduce (v1, v2)

• Apache HBase

• Apache Pig

• Apache Hive

• Apache Zookeeper

• Apache Oozie

• Apache Sqoop

Page 16: Intro to big data   choco devday - 23-01-2014

Example

Page 17: Intro to big data   choco devday - 23-01-2014

awk processing

Page 18: Intro to big data   choco devday - 23-01-2014

MapReduce

Input to the Map function (key = offset in file, value = raw NCDC record):

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999...)

Output from the Map function (year, temperature):

(1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)

Input to the Reduce function (map output grouped by key):

(1949, [111, 78]) (1950, [0, 22, −11])

Output from the Reduce function (maximum temperature per year):

(1949, 111) (1950, 22)
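
Below is a minimal sketch of the mapper and reducer that would produce the flow above, written against Hadoop's org.apache.hadoop.mapreduce API. The NCDC field offsets and the missing-value check are assumptions based on the classic max-temperature example, not code taken from these slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses (offset, raw record) into (year, air temperature).
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999; // assumed NCDC "missing" marker

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);            // assumed year offset
    int airTemperature;
    if (line.charAt(87) == '+') {                    // assumed temperature offset and sign handling
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    if (airTemperature != MISSING) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reducer: for each year, emits the maximum of the grouped temperatures.
class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}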

Page 19: Intro to big data   choco devday - 23-01-2014
Page 20: Intro to big data   choco devday - 23-01-2014
Page 21: Intro to big data   choco devday - 23-01-2014
Page 22: Intro to big data   choco devday - 23-01-2014
Page 23: Intro to big data   choco devday - 23-01-2014

Data locality optimization

• HDFS block size 64 MB (default)

Page 24: Intro to big data   choco devday - 23-01-2014

MapReduce dataflow

Page 25: Intro to big data   choco devday - 23-01-2014

Combiner Functions

• map1

• (1950, 0) (1950, 20) (1950, 10)

• map2

• (1950, 25) (1950, 15)

• reduce

• input: (1950, [0, 20, 10, 25, 15])

• output: (1950, 25)

• job.setCombinerClass(MaxTemperatureReducer.class);
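
For context, a hypothetical driver showing where setCombinerClass() fits; the class names reuse the mapper/reducer sketch above and the input/output paths are placeholders, not taken from the slides.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureWithCombiner.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path (placeholder)

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);       // combiner reuses the reducer
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as the combiner only works because taking a maximum is associative and commutative; a mean, for example, could not be combined this way.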

Page 26: Intro to big data   choco devday - 23-01-2014
Page 27: Intro to big data   choco devday - 23-01-2014

HDFS: Design and Concepts

Page 28: Intro to big data   choco devday - 23-01-2014

The Design of HDFS

• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Page 29: Intro to big data   choco devday - 23-01-2014

HDFS is not a good fit for:

• Low-latency data access (use HBase instead)

• Lots of small files

• Multiple writers, arbitrary file modifications

Page 30: Intro to big data   choco devday - 23-01-2014

HDFS Concepts

• Blocks

• Block size on a “normal” filesystem: a few KB (disk blocks are 512 bytes)

• Block size in HDFS: 64 MB

• A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
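
As an aside, a small sketch of how a client could inspect this chunking through the FileSystem API (the cluster URI is a placeholder; the program prints one line per block):

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints the offset, length and hosting datanodes of each block of an HDFS file.
public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
    }
  }
}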

Page 31: Intro to big data   choco devday - 23-01-2014

Why is the block size so large?

• Disk seek time 10 ms

• Transfer rate is 100 MB/s

• Goal is to make the seek time 1% of the transfer time

• So we need a block size of around 100 MB
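
Back-of-envelope check with the figures above: transferring a 100 MB block at 100 MB/s takes about 1,000 ms, and a 10 ms seek is 1% of that; a much smaller block would make the fixed seek cost a much larger fraction of every read.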

Page 32: Intro to big data   choco devday - 23-01-2014

Why blocks?

• A file can be larger than any disk in the network

• Making the unit of abstraction a block rather than a file simplifies the storage subsystem

• Blocks are a fixed size, so it is easy to calculate how many can be stored on a given disk

• Blocks don’t need to carry file metadata (permissions, creation time, owner, etc.); that is handled separately

Page 33: Intro to big data   choco devday - 23-01-2014

Namenodes and Datanodes

• Namenode = master

• Manages the filesystem namespace (filesystem tree, metadata for directories and files)

• Namespace image and edit log are stored persistently on disk

• Keeps track of which datanodes store the blocks of a given file (held in RAM only)

• Datanode = worker (slave)

• Stores and retrieves blocks when requested

Page 34: Intro to big data   choco devday - 23-01-2014

Troubles

• If namenode fails - God save us, it hurts…

Page 35: Intro to big data   choco devday - 23-01-2014

Solutions

• Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems

• Secondary namenode: main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.

Page 36: Intro to big data   choco devday - 23-01-2014

HDFS Federation (since 2.x)

• Allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace, e.g.:

• /user

• /share

• Each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.

• Namespace volumes are independent of each other.

• Block pool storage is not partitioned: datanodes register with every namenode in the cluster and store blocks from multiple block pools.

Page 37: Intro to big data   choco devday - 23-01-2014

HDFS High Availability (since 2.x)

• Even with a secondary namenode, the namenode is still a SPOF (Single Point of Failure)

• If it fails, no MapReduce jobs can run and files cannot be read, written, or listed

• Recovery procedure (could take 30 minutes): the new namenode must

• load its namespace image into memory,

• replay its edit log, and

• receive enough block reports from the datanodes to leave safe mode.

Page 38: Intro to big data   choco devday - 23-01-2014

HDFS HA

• Switching namenodes could take 1-2 minutes

• The namenodes must use highly available shared storage to share the edit log.

• Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.

• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.

Page 39: Intro to big data   choco devday - 23-01-2014

Reading data
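
The slide presumably shows the anatomy of a file read; as a minimal client-side sketch (the URI is a placeholder), the whole read path hides behind FileSystem.open():

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Streams an HDFS file to stdout; the namenode supplies block locations,
// the data itself is read from the closest datanodes.
public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://namenode/user/someone/file.txt (placeholder)
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}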

Page 40: Intro to big data   choco devday - 23-01-2014

Network distance in Hadoop

Page 41: Intro to big data   choco devday - 23-01-2014

Writing data
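
Likewise for the write path: a minimal sketch (paths are placeholders) where the replication pipeline through the datanodes sits behind FileSystem.create():

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Copies a local file into HDFS; blocks are written through a pipeline of datanodes.
public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1]; // e.g. hdfs://namenode/user/someone/copy.txt (placeholder)
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    FileSystem fs = FileSystem.get(URI.create(dst), new Configuration());
    OutputStream out = fs.create(new Path(dst));
    IOUtils.copyBytes(in, out, 4096, true); // true: close both streams when done
  }
}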

Page 42: Intro to big data   choco devday - 23-01-2014

Moving large datasets into HDFS

• Apache Flume

• Designed for moving large quantities of streaming data into HDFS, e.g. collecting log data from a bank of web servers and aggregating it in HDFS for later analysis.

• Supports tail, syslog, and Apache log4j

• Apache Sqoop

• Designed for performing bulk imports of data into HDFS from structured data stores, such as relational databases.

• An example of a Sqoop use case is an organization that runs a nightly Sqoop import to load the day’s data from a production database into a Hive data warehouse for analysis.

Page 43: Intro to big data   choco devday - 23-01-2014

Parallel Copying with distcp

• % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

• Creates a foo directory inside bar on namenode2

• Runs as a map-only MapReduce job (no reducers); the -m option sets the number of map tasks

• % hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo

• % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar

Page 44: Intro to big data   choco devday - 23-01-2014

Balancer

• Only one balancer can run on a cluster at a time

• Utilization = used space / total capacity

• Moves blocks until the utilization of every datanode differs from the utilization of the cluster by no more than a threshold value

• Starting the balancer: % start-balancer.sh [-threshold N] (optional; the default threshold is 10%)

Page 45: Intro to big data   choco devday - 23-01-2014

Hadoop Archives (HAR)

• HDFS stores small files inefficiently.

• Note: Small files do not take up any more disk space than is required to store the raw contents of the file.

• 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.

Page 46: Intro to big data   choco devday - 23-01-2014

• The archive tool runs as a MapReduce job

• A HAR is a directory (of index and part files), not a single file

• % hadoop archive -archiveName files.har /my/files /my

Page 47: Intro to big data   choco devday - 23-01-2014

Limitations of HAR

• No compression

• Immutable

• MapReduce input splits are still inefficient (a HAR doesn’t make many small files look like one large file)

Page 48: Intro to big data   choco devday - 23-01-2014

Thanks
Questions?