Hadoop Introduction 2
TRANSCRIPT
Hadoop Introduction II: K-means && Python && Dumbo
Outline
• Dumbo
• K-means
• Python and Data Mining
12/20/12 2
Hadoop in Python
• Jython: Happy
• Cython: Pydoop
  • components (RecordReader, RecordWriter and Partitioner)
  • get configuration, set counters and report status
  • CPython: can use any module
  • HDFS API
• Hadoopy: another Cython-based option
• Streaming:
  • Dumbo
  • other small Map-Reduce wrappers
Dumbo
Hadoop in Python
Hadoop in Python Extension
Hadoop in Python
Integration with Pipes (C++) + integration with libhdfs (C)
Dumbo
• Dumbo is a project that allows you to easily write and run Hadoop programs in Python. More generally, Dumbo can be considered a convenient Python API for writing MapReduce programs.
• Advantages:
  • Easy: Dumbo strives to be as Pythonic as possible.
  • Efficient: Dumbo programs communicate with Hadoop very efficiently by relying on typed bytes, a serialization mechanism that was added to Hadoop specifically with Dumbo in mind.
  • Flexible: we can extend it.
  • Mature.
Dumbo: Review WordCount
Dumbo – Word Count
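The word-count slides above are images in this transcript. The canonical Dumbo word count from the project's documentation looks like this; the `dumbo.run` call is only reached when the script is launched through Hadoop:

```python
def mapper(key, value):
    # value is one line of input text; emit (word, 1) for every word
    for word in value.split():
        yield word, 1

def reducer(key, values):
    # values iterates over all the counts emitted for this word
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```

It is started with Dumbo's command-line wrapper, e.g. `dumbo start wordcount.py -hadoop /path/to/hadoop -input brian.txt -output brianwc`.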
Dumbo IP counts
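The IP-count slides are likewise images. A sketch in the same style, assuming each input line is an access-log record whose first whitespace-separated field is the client IP (the log format is an assumption, not from the slides):

```python
def mapper(key, value):
    # assumed log format: the client IP is the first field of the line
    fields = value.split()
    if fields:
        yield fields[0], 1

def reducer(key, values):
    # total number of requests seen from this IP
    yield key, sum(values)

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer, combiner=reducer)
```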
K-means in Map-Reduce
• Normal K-means:
  • Inputs: a set of n d-dimensional points && a number of desired clusters k.
  • Step 1: Randomly choose k of the n sample points as initial centers.
  • Step 2: Compute the distance from every point to each of the k centers and assign it to the closest one.
  • Step 3: Using this assignment of points to cluster centers, recalculate each cluster center as the centroid of its member points.
  • Step 4: Iterate this process until convergence is reached.
  • Final: points are reassigned to centers, and centroids recalculated, until the k cluster centers shift by less than some delta value.

• K-means is a surprisingly parallelizable algorithm.
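The four steps above can be sketched in plain Python. This is a minimal single-machine version; the `initial` parameter is added here so Step 1's random choice can be made reproducible or overridden:

```python
import random

def closest(point, centers):
    # index of the center with minimal squared Euclidean distance
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

def kmeans(points, k, initial=None, delta=1e-4, seed=0):
    # Step 1: randomly choose k of the points as initial centers
    centers = initial or random.Random(seed).sample(points, k)
    while True:
        # Step 2: assign every point to its closest center
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest(p, centers)].append(p)
        # Step 3: recalculate each center as the centroid of its member points
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centers[i]
               for i, c in enumerate(clusters)]
        # Step 4: iterate until the centers shift by less than delta
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1)) ** 0.5
                    for c0, c1 in zip(centers, new))
        centers = new
        if shift < delta:
            return centers
```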
K-means in Map-Reduce
• Key points:
  • We want to come up with a scheme where we can operate on each point in the data set independently.
  • Only a small amount of data is shared (the cluster centers).
  • When we partition points among MapReduce nodes, we also distribute a copy of the cluster centers. This results in a small amount of data duplication, but it is minimal, and each point can then be operated on independently.
Hadoop Phase
• Map:
  • In: points in the data set.
  • Out: a (ClusterID, Point) pair for each point, where ClusterID is the integer ID of the cluster center closest to the point.
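A sketch of this map phase in plain Python; `CENTERS` stands in for the shared cluster centers that, per the earlier slide, are copied to every map node:

```python
# stand-in for the shared cluster centers distributed to each map node
CENTERS = [(0.0, 0.0), (10.0, 10.0)]

def mapper(key, point):
    # emit (ClusterID, Point), where ClusterID indexes the closest center
    cid = min(range(len(CENTERS)),
              key=lambda i: sum((p - c) ** 2
                                for p, c in zip(point, CENTERS[i])))
    yield cid, point
```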
Hadoop Phase
• Reduce Phase:
  • In: (ClusterID, Point) pairs.
  • Operation:
    • the outputs of the map phase are grouped by ClusterID;
    • for each ClusterID, the centroid of the points associated with that ClusterID is calculated.
  • Out: (ClusterID, Centroid) pairs, which represent the newly calculated cluster centers.
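A matching sketch of the reduce phase: Hadoop has already grouped the map output by ClusterID, so the reducer only has to average the member points:

```python
def reducer(cluster_id, points):
    # centroid = coordinate-wise mean of the cluster's member points
    points = list(points)
    centroid = tuple(sum(xs) / len(points) for xs in zip(*points))
    yield cluster_id, centroid
```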
External Program
• Each iteration of the algorithm is structured as a single MapReduce job.
• After each iteration, our library reads the output, determines whether convergence has been reached by calculating how far the cluster centers have moved, and then runs another MapReduce job if needed.
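A sketch of that driver loop; `run_iteration` is a hypothetical function standing in for launching one MapReduce job and reading the new centers back from its output:

```python
import math

def converged(old, new, delta):
    # stop once the largest center movement drops below delta
    return max(math.dist(a, b) for a, b in zip(old, new)) < delta

def drive(run_iteration, centers, delta=1e-3, max_iters=50):
    # run MapReduce jobs until the centers stop moving (or we give up)
    for _ in range(max_iters):
        new = run_iteration(centers)   # one MapReduce job
        done = converged(centers, new, delta)
        centers = new
        if done:
            break
    return centers
```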
Write in Dumbo
Results
Next
• Write n-times iteration wrapper
• Optimize K-means
• Result visualization with Python
Optimize
• Partial centroids for the clusters can be computed on the map nodes themselves (local aggregation in the mapper!), and a weighted average of those partial centroids is then taken by the reducer. In the naive version the mapping was one-to-one: for every point input, our mapper emitted a single record, all of which had to be sorted and transferred to a reducer.
• We can use a Combiner!
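A sketch of the combinable form, assuming the mapper emits partial `(sums, count)` pairs instead of raw points; the combiner and reducer then share the same merge step, and the reducer's final division is the weighted average:

```python
def merge(partials):
    # fold (vector, count) pairs into one (coordinate sums, total count)
    sums, count = None, 0
    for vec, n in partials:
        sums = vec if sums is None else tuple(a + b for a, b in zip(sums, vec))
        count += n
    return sums, count

def combiner(cluster_id, partials):
    # runs on the map node: pre-aggregate before anything hits the network
    yield cluster_id, merge(partials)

def reducer(cluster_id, partials):
    # same merge, then divide to obtain the weighted-average centroid
    sums, count = merge(partials)
    yield cluster_id, tuple(s / count for s in sums)
```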
Dumbo Usage
• Very easy to use
• You can write your own code for Dumbo
• Easy to debug
• Simple commands
Python and Data Mining
• Books:
  • Scientific Computing with Python (用Python进行科学计算)
  • Programming Collective Intelligence (集体智慧编程)
  • Mining the Social Web (挖掘社交网络)
  • Natural Language Processing with Python (用Python进行自然语言处理)
  • Think Stats: Python and Data Analysis (Think Stats: Python与数据分析)
Python and Data Mining
• Tools:
  • NumPy
  • SciPy
  • Orange (association rule mining with Orange, 利用Orange进行关联规则挖掘)
Thanks!