Machine Learning on MapReduce Framework

Abhijit Kumar Behera

M.Tech (CSE)

Roll No. 1350001

School of Computer Engineering

Guided By : Dr. Laxman Sahoo

Contents

Introduction

Apache Hadoop related projects

Application of Mahout

Literature Survey

Plan of Action

Conclusion

References

Introduction

•The K-means algorithm is one of the best-known clustering algorithms and has been applied to a wide variety of problems.

•MapReduce, a widely used parallel framework in cloud computing, handles massive data effectively, so K-means clustering based on MapReduce has become a focus of research.

Components of Hadoop

HDFS •Name Node •Data Node •Secondary Name Node

Map Reduce •Map() •Combine() •Reduce()

YARN •Resource Manager •Node Manager (in Hadoop 1, before YARN, the MapReduce Job Tracker and Task Tracker filled these roles)

MapReduce Word count process
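The word count process can be sketched in plain Python to show how Map(), Combine() and Reduce() fit together. This is a minimal single-machine illustration, not Hadoop code: the function names (`map_phase`, `combine_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for this sketch, and the `shuffle` step stands in for the grouping that the framework performs between the map and reduce stages.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine(): pre-aggregate counts locally on one mapper's output."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce(): sum all partial counts received for one word."""
    return word, sum(counts)


def word_count(splits):
    """Run the pipeline; each element of `splits` is the list of lines seen by one mapper."""
    combined = []
    for split in splits:
        mapped = chain.from_iterable(map_phase(line) for line in split)
        combined.extend(combine_phase(mapped))
    return dict(reduce_phase(w, c) for w, c in shuffle(combined).items())


if __name__ == "__main__":
    # Hypothetical input: two splits, as if handled by two mappers.
    splits = [["Deer Bear River", "Car Car River"], ["Deer Car Bear"]]
    print(word_count(splits))
```

The combiner is optional for correctness here because addition is associative and commutative; its purpose is to reduce the amount of data moved between the map and reduce stages.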

Apache Hadoop Projects

•Hadoop (HDFS and MapReduce) •Mahout •Zookeeper •Sqoop

Application of Mahout

Collaborative Filtering •Matrix factorization based recommenders •A user based Recommender

Clustering •Canopy Clustering •K-Means Clustering •Fuzzy K-Means •Affinity Propagation Clustering

Classification •Naive Bayes •Random forest classifier

Literature Survey

An Improved Parallel K-means Clustering Algorithm with MapReduce

Authors Name: Qing Liao, Fan Yang, Jingming Zhao

Journal: Communication Technology (ICCT), IEEE

Year of Publication: 2014

Parallel K-means Algorithm 1) Initial 2) Mapper 3) Reducer
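The three steps above (Initial, Mapper, Reducer) can be sketched as follows. This is a minimal single-machine sketch of the basic parallel K-means structure, not the improved algorithm of the surveyed paper: the names `kmeans_map`, `kmeans_reduce` and `kmeans`, the `max_iter` parameter, and the sample points are assumptions made for illustration. In a real MapReduce job each iteration would be a separate job, with the current centroids distributed to all mappers.

```python
from collections import defaultdict


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_map(point, centroids):
    """Mapper: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reducer: average the (partial sum, count) pairs assigned to one centroid."""
    total, count = None, 0
    for point, n in values:
        total = list(point) if total is None else [t + x for t, x in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans(points, centroids, max_iter=20):
    """Initial: start from the given centroids, then repeat map and reduce until stable."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        groups = defaultdict(list)  # stands in for the shuffle between map and reduce
        for point in points:
            key, value = kmeans_map(point, centroids)
            groups[key].append(value)
        # A centroid that received no points keeps its previous position.
        new_centroids = [kmeans_reduce(groups[i]) if i in groups else centroids[i]
                         for i in range(len(centroids))]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


if __name__ == "__main__":
    # Hypothetical data: two well-separated groups of 2-D points.
    points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (9.0, 10.0)]
    print(kmeans(points, [(0.0, 0.0), (10.0, 10.0)]))
```

Emitting `(point, 1)` rather than the bare point lets a combiner add up points and counts on each mapper before the reducer computes the mean, which is the usual way to cut the data sent over the network in this scheme.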

Literature Survey...

Clouds for Scalable Big Data Analytics

Authors Name: Domenico Talia

Journal: IEEE Computer Society

Year of Publication: 2013

In this paper, the author describes how cloud computing enhances the development and functionality of big data analytics when analytics is deployed in the cloud.

Cloud Service Model: Data analytics software as a service

•Features: A single and complete data mining application or task (including data sources) offered as a service

•Users: End users, analytics managers, data analysts

Cloud Service Model: Data analytics platform as a service

•Features: A data analysis suite or framework for programming or developing high-level applications, hiding the cloud infrastructure and data storage

•Users: Data mining application developers, data scientists

Cloud Service Model: Data analytics infrastructure as a service

•Features: A set of virtualized resources provided to a programmer or data mining researcher for developing, configuring, and running data analysis frameworks or applications

•Users: Data mining programmers, data management developers, data mining researchers

Plan of Action

August - October 2014: Literature survey is done.

November 2014: Problem definition formulation is done; the problem-solving outline is yet to be done.

December 2014 - January 2015: An appropriate solution to the problem is yet to be formulated.

February - May 2015: Final implementation of the solution, with results, is yet to be done.

Conclusion

Large-scale data mining has become a new challenge in recent years. Big data analytics can be accomplished using the MapReduce framework. The K-means algorithm is one of the best-known clustering algorithms, but its processing performance usually hits a bottleneck when it is applied to massive data. A parallel K-means algorithm implemented with MapReduce shows a clear advantage in handling massive data.

References

[1] Walisa Romsaiyud, Wichian Premchaiswadi, "An Adaptive Machine Learning on Map-Reduce Framework for Improving Performance of Large-Scale Data Analysis on EC", Eleventh IEEE Int'l Conf. on ICT and Knowledge Engineering, 2014

[2] Domenico Talia, "Clouds for Scalable Big Data Analytics", IEEE Computer Society, 2013

[3] Feng Ye, Zhijan Wang, "Cloud-based Big Data Mining & Analyzing Services Platform integrating R", IEEE International Conference on Advance Cloud and Big Data, 2013

[4] Apache Hadoop, http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F
