Hadoop Introduction 2
TRANSCRIPT
Hadoop Introduction II: K-means && Python && Dumbo
Outline
• Dumbo
• K-means
• Python and Data Mining
12/20/12 2
Hadoop in Python
• Jython: Happy
• Cython: Pydoop
  • components (RecordReader, RecordWriter and Partitioner)
  • get configuration, set counters and report status
  • CPython: can use any module
  • HDFS API
• Hadoopy: another Cython-based option
• Streaming:
  • Dumbo
  • other small Map-Reduce wrappers
Dumbo
Hadoop in Python
Hadoop in Python Extension
Hadoop in Python
Integration with Pipes (C++) + integration with libhdfs (C)
Dumbo
• Dumbo is a project that allows you to easily write and run Hadoop programs in Python. More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.
• Advantages:
  • Easy: Dumbo strives to be as Pythonic as possible.
  • Efficient: Dumbo programs communicate with Hadoop very efficiently by relying on typed bytes, a serialization mechanism that was added to Hadoop specifically with Dumbo in mind.
  • Flexible: we can extend it.
  • Mature.
Dumbo: Review WordCount
Dumbo – Word Count
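The word-count slides above are images in this transcript. The canonical Dumbo word count from the project's documentation looks like this; the `dumbo.run` call is only reached when the script is launched through Hadoop:

```python
def mapper(key, value):
    # value is one line of input text; emit (word, 1) for every word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over all the counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```

It is started with Dumbo's command-line wrapper, e.g. `dumbo start wordcount.py -hadoop /path/to/hadoop -input brian.txt -output brianwc`.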
Dumbo IP counts
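The IP-count slides are likewise images. A sketch in the same style, assuming each input line is an access-log record whose first whitespace-separated field is the client IP (the log format is an assumption, not from the slides):

```python
def mapper(key, value):
    # assumed log format: the client IP is the first field of the line
    fields = value.split()
    if fields:
        yield fields[0], 1

def reducer(key, values):
    # total number of requests seen from this IP
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```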
K-means in Map-Reduce
• Normal K-means:
  • Inputs: a set of n d-dimensional points && a number of desired clusters k.
  • Step 1: Randomly choose k of the n sample points as initial centers.
  • Step 2: Compute the distance from every point to each of the k centers and assign it to the closest one.
  • Step 3: Using this assignment of points to cluster centers, recalculate each cluster center as the centroid of its member points.
  • Step 4: Iterate this process until convergence is reached.
  • Final: points are reassigned to centers, and centroids recalculated, until the k cluster centers shift by less than some delta value.

• K-means is a surprisingly parallelizable algorithm.
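The four steps above can be sketched in plain Python. This is a minimal single-machine version; the `initial` parameter is added here so Step 1's random choice can be made reproducible or overridden:

```python
import random

def closest(point, centers):
    # index of the center with minimal squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def kmeans(points, k, initial=None, delta=1e-4, seed=0):
    # Step 1: randomly choose k of the points as initial centers
    centers = initial or random.Random(seed).sample(points, k)
    while True:
        # Step 2: assign every point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Step 3: recalculate each center as the centroid of its member points
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        # Step 4: iterate until the centers shift by less than delta
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1)) ** 0.5
                    for c0, c1 in zip(centers, new))
        centers = new
        if shift < delta:
            return centers
```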
K-means in Map-Reduce
• Key points:
  • We want to come up with a scheme where we can operate on each point in the data set independently.
  • Only a small amount of data is shared (the cluster centers).
  • When we partition points among MapReduce nodes, we also distribute a copy of the cluster centers. This results in a small amount of data duplication, but it is minimal, and each point can then be operated on independently.
Hadoop Phase
• Map:
  • In: points in the data set.
  • Out: a (ClusterID, Point) pair for each point, where ClusterID is the integer ID of the cluster center closest to the point.
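A sketch of this map phase in plain Python; `CENTERS` stands in for the shared cluster centers that, per the earlier slide, are copied to every map node:

```python
# stand-in for the shared cluster centers distributed to each map node
CENTERS = [(0.0, 0.0), (10.0, 10.0)]

def mapper(key, point):
    # emit (ClusterID, Point), where ClusterID indexes the closest center
    cid = min(range(len(CENTERS)),
              key=lambda i: sum((p - c) ** 2
                                for p, c in zip(point, CENTERS[i])))
    yield cid, point
```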
Hadoop Phase
• Reduce Phase:
  • In: (ClusterID, Point) pairs.
  • Operation:
    • the outputs of the map phase are grouped by ClusterID;
    • for each ClusterID, the centroid of the points associated with that ClusterID is calculated.
  • Out: (ClusterID, Centroid) pairs, which represent the newly calculated cluster centers.
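A matching sketch of the reduce phase: Hadoop has already grouped the map output by ClusterID, so the reducer only has to average the member points:

```python
def reducer(cluster_id, points):
    # centroid = coordinate-wise mean of the cluster's member points
    points = list(points)
    centroid = tuple(sum(xs) / len(points) for xs in zip(*points))
    yield cluster_id, centroid
```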
External Program
• Each iteration of the algorithm is structured as a single MapReduce job.
• After each iteration, our library reads the output, determines whether convergence has been reached by calculating how far the cluster centers have moved, and then runs another MapReduce job if needed.
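A sketch of that driver loop; `run_iteration` is a hypothetical function standing in for launching one MapReduce job and reading the new centers back from its output:

```python
import math

def converged(old, new, delta):
    # stop once the largest center movement drops below delta
    return max(math.dist(a, b) for a, b in zip(old, new)) < delta

def drive(run_iteration, centers, delta=1e-3, max_iters=50):
    # run MapReduce jobs until the centers stop moving (or we give up)
    for _ in range(max_iters):
        new = run_iteration(centers)   # one MapReduce job
        done = converged(centers, new, delta)
        centers = new
        if done:
            break
    return centers
```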
Write in Dumbo
Results
Next
• Write n-times iteration wrapper
• Optimize K-means
• Result visualization with Python
Optimize
• Partial centroids for the clusters can be computed on the map nodes themselves (local aggregation in the mapper!), and a weighted average of those partial centroids is then taken by the reducer. In the naive version the mapping was one-to-one: for every point input, our mapper emitted a single record, all of which had to be sorted and transferred to a reducer.
• We can use a Combiner!
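A sketch of the combinable form, assuming the mapper emits partial `(sums, count)` pairs instead of raw points; the combiner and reducer then share the same merge step, and the reducer's final division is the weighted average:

```python
def merge(partials):
    # fold (vector, count) pairs into one (coordinate sums, total count)
    sums, count = None, 0
    for vec, n in partials:
        sums = vec if sums is None else tuple(a + b for a, b in zip(sums, vec))
        count += n
    return sums, count

def combiner(cluster_id, partials):
    # runs on the map node: pre-aggregate before anything hits the network
    yield cluster_id, merge(partials)

def reducer(cluster_id, partials):
    # same merge, then divide to obtain the weighted-average centroid
    sums, count = merge(partials)
    yield cluster_id, tuple(s / count for s in sums)
```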
Dumbo Usage
• Very easy to use
• You can write your own code for Dumbo
• Easy to debug
• Simple commands
Python and Data Mining
• Books:
  • Scientific Computing with Python (用Python进行科学计算)
  • Programming Collective Intelligence (集体智慧编程)
  • Mining the Social Web (挖掘社交网络)
  • Natural Language Processing with Python (用Python进行自然语言处理)
  • Think Stats: Python and Data Analysis (Think Stats: Python与数据分析)
Python and Data Mining
• Tools:
  • NumPy
  • SciPy
  • Orange (association rule mining with Orange, 利用Orange进行关联规则挖掘)
Thanks!