Intro to Big Data - choco devday, 23-01-2014
Big Data - Askhat Murzabayev

TRANSCRIPT

Page 1: Intro to big data   choco devday - 23-01-2014

Big Data
Askhat Murzabayev

Page 2: Intro to big data   choco devday - 23-01-2014

Intro to Big Data
Askhat Murzabayev

Page 3: Intro to big data   choco devday - 23-01-2014

Explicit attempt at self-promotion

• 23 years old

• Suleyman Demirel University, BSc in CS 2012

• Chocomart.kz - Apr 2013 - present

• Product Manager at Twitter, Dec 2012 - May 2013 (Twitter API & Android app)

• SE at Twitter, Sep 2011 - Dec 2012 (Search, Relevance and Machine Learning dept.)

• Sold diploma thesis (computer vision algorithm) to Microsoft; used in Bing Maps

• Sold an image processing algorithm (better pattern recognition of objects) to Microsoft Research

• Scalable Machine Learning algorithms are my passion

Page 4: Intro to big data   choco devday - 23-01-2014
Page 5: Intro to big data   choco devday - 23-01-2014

Numbers

• 1 zettabyte = 1,000,000 petabytes

• 2006 - 0.18 zettabytes

• 2011 - 1.8 zettabytes

• 2012 - 2.8 zettabytes(3% analyzed)

• Estimate: 2020 - 40 zettabytes

Page 6: Intro to big data   choco devday - 23-01-2014

Numbers Everyone Should Know

• L1 cache reference 0.5 ns

• Branch mispredict 5 ns

• L2 cache reference 7 ns

• Mutex lock/unlock 100 ns

• Main memory reference 100 ns

• Compress 1K bytes with Zippy 10,000 ns

• Send 2K bytes over 1 Gbps network 20,000 ns

• Read 1 MB sequentially from memory 0.25 ms

Page 7: Intro to big data   choco devday - 23-01-2014

Numbers Everyone Should Know part 2

• Round trip within same datacenter 0.5 ms

• Disk seek 10 ms

• Read 1 MB sequentially from network 10 ms

• Read 1 MB sequentially from disk 30 ms

• Send packet CA->Netherlands->CA 150 ms

• Send package via Kazpost - everlasting

Page 8: Intro to big data   choco devday - 23-01-2014

Conclusion

• time(CPU) < time(RAM) < time(Disk) < time(Network)

• amount(CPU) < amount(RAM) <<< amount(Disk) < amount(Network)

Page 9: Intro to big data   choco devday - 23-01-2014

Problem statement

• Tons of data

• F*cking tons of data

• We need to process it

• Process it fast

• The idea is to “parallelize” the processing of data

Page 10: Intro to big data   choco devday - 23-01-2014

The “Joys” of Real Hardware (typical first year for a new cluster, per Jeff Dean)

• ~0.5 overheating (power down most machines in <5 mins, ~1-2 days to recover)

• ~1 PDU failure (~500-1000 machines suddenly disappear, ~6 hours to come back)

• ~1 rack-move (plenty of warning, ~500-1000 machines powered down, ~6 hours)

• ~1 network rewiring (rolling ~5% of machines down over 2-day span)

Page 11: Intro to big data   choco devday - 23-01-2014

• ~20 rack failures (40-80 machines instantly disappear, 1-6 hours to get back)

• ~5 racks go wonky (40-80 machines see 50% packet loss)

• ~8 network maintenances (4 might cause ~30-minute random connectivity losses)

• ~12 router reloads (takes out DNS and external VIPs for a couple of minutes)

• ~3 router failures (have to immediately pull traffic for an hour)

• ~dozens of minor 30-second blips for DNS

• ~1000 individual machine failures

• ~thousands of hard drive failures

• slow disks, bad memory, misconfigured machines, flaky machines, etc

Page 12: Intro to big data   choco devday - 23-01-2014

Problem statement (2)

• A lot of data

• Fast processing

• Reliable

• “Cheap”

• Scale

• Shouldn’t require much manual work

• Should work with many programming languages and platforms

Page 13: Intro to big data   choco devday - 23-01-2014

• Google File System (GFS)

• Distributed filesystem

• Fault tolerant

• MapReduce

• Distributed processing framework

Page 14: Intro to big data   choco devday - 23-01-2014

Apache Hadoop

• “Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open source web search engine, itself a part of the Lucene project”

Page 15: Intro to big data   choco devday - 23-01-2014

Ecosystem

• Apache Hadoop

• Commons

• HDFS (Hadoop Distributed FileSystem)

• MapReduce (v1, v2)

• Apache HBase

• Apache Pig

• Apache Hive

• Apache Zookeeper

• Apache Oozie

• Apache Sqoop

Page 16: Intro to big data   choco devday - 23-01-2014

Example

Page 17: Intro to big data   choco devday - 23-01-2014

awk processing

Page 18: Intro to big data   choco devday - 23-01-2014

MapReduce

Input to the Map function (key = offset in file, value = raw NCDC record):

(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004…0500001N9+00781+99999999999...)

Output from the Map function (year, temperature):

(1950, 0) (1950, 22) (1950, −11) (1949, 111) (1949, 78)

Input to the Reduce function (map output grouped by key):

(1949, [111, 78]) (1950, [0, 22, −11])

Output from the Reduce function (maximum temperature per year):

(1949, 111) (1950, 22)
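
Below is a minimal sketch of the mapper and reducer that would produce the flow above, written against Hadoop's org.apache.hadoop.mapreduce API. The NCDC field offsets and the missing-value check are assumptions based on the classic max-temperature example, not code taken from these slides.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: parses (offset, raw record) into (year, air temperature).
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999; // assumed NCDC "missing" marker

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);            // assumed year offset
    int airTemperature;
    if (line.charAt(87) == '+') {                    // assumed temperature offset and sign handling
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    if (airTemperature != MISSING) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}

// Reducer: for each year, emits the maximum of the grouped temperatures.
class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}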

Page 19: Intro to big data   choco devday - 23-01-2014
Page 20: Intro to big data   choco devday - 23-01-2014
Page 21: Intro to big data   choco devday - 23-01-2014
Page 22: Intro to big data   choco devday - 23-01-2014
Page 23: Intro to big data   choco devday - 23-01-2014

Data locality optimization

• HDFS block size 64 MB (default)

Page 24: Intro to big data   choco devday - 23-01-2014

MapReduce dataflow

Page 25: Intro to big data   choco devday - 23-01-2014

Combiner Functions

• map1

• (1950, 0) (1950, 20) (1950, 10)

• map2

• (1950, 25) (1950, 15)

• reduce

• input: (1950, [0, 20, 10, 25, 15])

• output: (1950, 25)

• job.setCombinerClass(MaxTemperatureReducer.class);
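
For context, a hypothetical driver showing where setCombinerClass() fits; the class names reuse the mapper/reducer sketch above and the input/output paths are placeholders, not taken from the slides.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureWithCombiner.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));    // input path (placeholder)
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path (placeholder)

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class);       // combiner reuses the reducer
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Reusing the reducer as the combiner only works because taking a maximum is associative and commutative; a mean, for example, could not be combined this way.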

Page 26: Intro to big data   choco devday - 23-01-2014
Page 27: Intro to big data   choco devday - 23-01-2014

HDFS: Design and Concepts

Page 28: Intro to big data   choco devday - 23-01-2014

The Design of HDFS

• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.

Page 29: Intro to big data   choco devday - 23-01-2014

HDFS is not a good fit for:

• Low-latency data access (use HBase instead)

• Lots of small files

• Multiple writers, arbitrary file modifications

Page 30: Intro to big data   choco devday - 23-01-2014

HDFS Concepts

• Blocks

• Block size on a “normal” filesystem: a few KB (disk blocks are 512 bytes)

• Block size in HDFS: 64 MB

• A file in HDFS that is smaller than a single block does not occupy a full block’s worth of underlying storage
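
As an aside, a small sketch of how a client could inspect this chunking through the FileSystem API (the cluster URI is a placeholder; the program prints one line per block):

import java.net.URI;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Prints the offset, length and hosting datanodes of each block of an HDFS file.
public class ListBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode/"), new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(), Arrays.toString(block.getHosts()));
    }
  }
}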

Page 31: Intro to big data   choco devday - 23-01-2014

Why is the block size so large?

• Disk seek time 10 ms

• Transfer rate is 100 MB/s

• Goal is to make the seek time 1% of the transfer time

• So we need a block size of around 100 MB
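
Back-of-envelope check with the figures above: transferring a 100 MB block at 100 MB/s takes about 1,000 ms, and a 10 ms seek is 1% of that; a much smaller block would make the fixed seek cost a much larger fraction of every read.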

Page 32: Intro to big data   choco devday - 23-01-2014

Why blocks?

• A file can be larger than any disk in the network

• Making the unit of abstraction a block rather than a file simplifies the storage subsystem

• Blocks are a fixed size, so it is easy to calculate how many can be stored on a given disk

• Blocks don’t need to carry file metadata (permissions, creation time, owner, etc.); that is handled separately

Page 33: Intro to big data   choco devday - 23-01-2014

Namenodes and Datanodes

• Namenode = master

• Manages the filesystem namespace (filesystem tree, metadata for directories and files)

• Namespace image and edit log are stored persistently on disk

• Keeps track of which datanodes store the blocks of a given file (held in RAM only)

• Datanode = worker (slave)

• Stores and retrieves blocks when requested

Page 34: Intro to big data   choco devday - 23-01-2014

Troubles

• If namenode fails - God save us, it hurts…

Page 35: Intro to big data   choco devday - 23-01-2014

Solutions

• Hadoop can be configured so that the namenode writes its persistent state to multiple filesystems

• Secondary namenode: main role is to periodically merge the namespace image with the edit log to prevent the edit log from becoming too large.

Page 36: Intro to big data   choco devday - 23-01-2014

HDFS Federation (since 2.x)

• Allows a cluster to scale by adding namenodes, each of which manages a portion of the filesystem namespace, e.g.:

• /user

• /share

• Each namenode manages a namespace volume, which is made up of the metadata for the namespace, and a block pool containing all the blocks for the files in the namespace.

• Namespace volumes are independent of each other.

• Block pool storage is not partitioned: datanodes register with every namenode in the cluster and store blocks from multiple block pools.

Page 37: Intro to big data   choco devday - 23-01-2014

HDFS High Availability (since 2.x)

• Even with a secondary namenode, the namenode is still a SPOF (Single Point of Failure)

• If it fails, no MapReduce jobs can run and files cannot be read, written, or listed

• Recovery procedure (could take 30 minutes): the new namenode must

• load its namespace image into memory,

• replay its edit log, and

• receive enough block reports from the datanodes to leave safe mode.

Page 38: Intro to big data   choco devday - 23-01-2014

HDFS HA

• Switching namenodes could take 1-2 minutes

• The namenodes must use highly available shared storage to share the edit log.

• Datanodes must send block reports to both namenodes because the block mappings are stored in a namenode’s memory, and not on disk.

• Clients must be configured to handle namenode failover, using a mechanism that is transparent to users.

Page 39: Intro to big data   choco devday - 23-01-2014

Reading data
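
The slide presumably shows the anatomy of a file read; as a minimal client-side sketch (the URI is a placeholder), the whole read path hides behind FileSystem.open():

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Streams an HDFS file to stdout; the namenode supplies block locations,
// the data itself is read from the closest datanodes.
public class ReadFromHdfs {
  public static void main(String[] args) throws Exception {
    String uri = args[0]; // e.g. hdfs://namenode/user/someone/file.txt (placeholder)
    FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
    InputStream in = null;
    try {
      in = fs.open(new Path(uri));
      IOUtils.copyBytes(in, System.out, 4096, false);
    } finally {
      IOUtils.closeStream(in);
    }
  }
}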

Page 40: Intro to big data   choco devday - 23-01-2014

Network distance in Hadoop

Page 41: Intro to big data   choco devday - 23-01-2014

Writing data
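
Likewise for the write path: a minimal sketch (paths are placeholders) where the replication pipeline through the datanodes sits behind FileSystem.create():

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

// Copies a local file into HDFS; blocks are written through a pipeline of datanodes.
public class WriteToHdfs {
  public static void main(String[] args) throws Exception {
    String localSrc = args[0];
    String dst = args[1]; // e.g. hdfs://namenode/user/someone/copy.txt (placeholder)
    InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
    FileSystem fs = FileSystem.get(URI.create(dst), new Configuration());
    OutputStream out = fs.create(new Path(dst));
    IOUtils.copyBytes(in, out, 4096, true); // true: close both streams when done
  }
}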

Page 42: Intro to big data   choco devday - 23-01-2014

Moving large datasets into HDFS

• Apache Flume

• Designed for moving large quantities of streaming data into HDFS, e.g. collecting log data from a bank of web servers and aggregating it in HDFS for later analysis.

• Supports tail, syslog, and Apache log4j

• Apache Sqoop

• Designed for performing bulk imports of data into HDFS from structured data stores, such as relational databases.

• An example of a Sqoop use case is an organization that runs a nightly Sqoop import to load the day’s data from a production database into a Hive data warehouse for analysis.

Page 43: Intro to big data   choco devday - 23-01-2014

Parallel Copying with distcp

• % hadoop distcp hdfs://namenode1/foo hdfs://namenode2/bar

• Creates a foo directory inside bar on namenode2

• Runs as a map-only MapReduce job (no reducers); the -m option sets the number of map tasks

• % hadoop distcp -update hdfs://namenode1/foo hdfs://namenode2/bar/foo

• % hadoop distcp webhdfs://namenode1:50070/foo webhdfs://namenode2:50070/bar

Page 44: Intro to big data   choco devday - 23-01-2014

Balancer

• Only one balancer can run on a cluster at a time

• Utilization = used space / total capacity

• Moves blocks until the utilization of every datanode differs from the utilization of the cluster by no more than a threshold value

• Starting the balancer: % start-balancer.sh [-threshold N] (optional; the default threshold is 10%)

Page 45: Intro to big data   choco devday - 23-01-2014

Hadoop Archives (HAR)

• HDFS stores small files inefficiently.

• Note: Small files do not take up any more disk space than is required to store the raw contents of the file.

• 1 MB file stored with a block size of 128 MB uses 1 MB of disk space, not 128 MB.

Page 46: Intro to big data   choco devday - 23-01-2014

• The archive tool runs as a MapReduce job

• A HAR is a directory (of index and part files), not a single file

• % hadoop archive -archiveName files.har /my/files /my

Page 47: Intro to big data   choco devday - 23-01-2014

Limitations of HAR

• No compression

• Immutable

• MapReduce input splits are still inefficient (a HAR doesn’t make many small files look like one large file)

Page 48: Intro to big data   choco devday - 23-01-2014

Thanks
Questions?