Machine Learning on MapReduce Framework
Post on 14-Jul-2015
Abhijit Kumar Behera
M.Tech (CSE)
Roll No. 1350001
School of Computer Engineering
Guided By : Dr. Laxman Sahoo
Contents
Introduction
Apache Hadoop related projects
Application of Mahout
Literature Survey
Plan of Action
Conclusion
References
Introduction
• K-means is one of the best-known clustering algorithms and has been applied to a wide variety of problems.
• MapReduce, one of the most popular parallel frameworks for cloud computing, handles massive data effectively, so K-means clustering based on MapReduce has become a focus of research.
Components of Hadoop
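HDFS
• NameNode
• DataNode
• Secondary NameNode
MapReduce
• Map()
• Combine()
• Reduce()
YARN
• ResourceManager
• NodeManager
(In Hadoop 1.x, before YARN, the JobTracker and TaskTracker daemons of MapReduce filled these roles.)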
HBase
MapReduce Word count process
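The word count process can be sketched in plain Python to show how Map(), Combine() and Reduce() fit together. This is only a single-machine simulation, not Hadoop code: the function names (`map_phase`, `combine_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample input are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map(): emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def combine_phase(pairs):
    """Combine(): pre-aggregate the pairs produced by a single mapper."""
    local = defaultdict(int)
    for word, count in pairs:
        local[word] += count
    return list(local.items())


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(word, counts):
    """Reduce(): sum the partial counts for one word."""
    return word, sum(counts)


def word_count(lines):
    # Each line stands in for the input split of one mapper.
    combined = [combine_phase(map_phase(line)) for line in lines]
    grouped = shuffle(chain.from_iterable(combined))
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    sample = ["deer bear river", "car car river", "deer car bear"]
    print(word_count(sample))
```

The combiner is optional in MapReduce. It is used here because summing is associative, so partial sums per mapper reduce the data sent to the reducers without changing the result.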
Apache Hadoop Projects
• Hadoop (HDFS and MapReduce)
• HBase
• Mahout
• Spark
• Hive
• ZooKeeper
• Sqoop
• Pig
Application of Mahout
Collaborative Filtering
• Matrix factorization based recommenders
• A user-based recommender
Clustering
• Canopy Clustering
• K-Means Clustering
• Fuzzy K-Means
• Affinity Propagation Clustering
Classification
• Naive Bayes
• Random forest classifier
Literature Survey
An Improved Parallel K-means Clustering Algorithm with MapReduce
Authors: Qing Liao, Fan Yang, Jingming Zhao
Journal: Communication Technology (ICCT), IEEE
Year of Publication: 2014
Parallel K-means Algorithm
1) Initial
2) Mapper
3) Reducer
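The three steps named above (Initial, Mapper, Reducer) can be sketched as a single-machine simulation in Python. This follows the usual parallel K-means pattern and is not the improved algorithm from the paper, whose details are not given here. The function names, the seeded random initialization, and the exact-equality stopping test are assumptions made for illustration.

```python
import random
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def initial(points, k, seed=0):
    """Initial: pick k distinct input points as starting centroids (assumed strategy)."""
    return random.Random(seed).sample(points, k)


def mapper(point, centroids):
    """Mapper: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda i: squared_distance(point, centroids[i]))
    return nearest, point


def reducer(cluster_points):
    """Reducer: the new centroid is the coordinate-wise mean of the assigned points."""
    n = len(cluster_points)
    return tuple(sum(coords) / n for coords in zip(*cluster_points))


def kmeans(points, k, max_iterations=20, seed=0):
    centroids = initial(points, k, seed)
    for _ in range(max_iterations):
        # Shuffle step: group the mapper output by centroid index.
        grouped = defaultdict(list)
        for point in points:
            key, value = mapper(point, centroids)
            grouped[key].append(value)
        # A centroid that attracted no points keeps its old position.
        updated = [reducer(grouped[i]) if grouped[i] else centroids[i]
                   for i in range(k)]
        if updated == centroids:  # assignments no longer change
            break
        centroids = updated
    return centroids
```

In a real MapReduce job each iteration is a separate job: the current centroids are distributed to all mappers, and the reducer output becomes the centroid set for the next iteration.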
Clouds for Scalable Big Data Analytics
Author: Domenico Talia
Journal: IEEE Computer Society
Year of Publication: 2013
In this paper, the author describes how cloud computing enhances the development and functionality of Big Data analytics when the analytics is deployed on the cloud.
Cloud service models, with their features and users:
Data analytics software as a service
• Features: A single and complete data mining application or task (including data sources) offered as a service
• Users: End users, analytics managers, data analysts
Data analytics platform as a service
• Features: A data analysis suite or framework for programming or developing high-level applications, hiding the cloud infrastructure and data storage
• Users: Data mining application developers, data scientists
Data analytics infrastructure as a service
• Features: A set of virtualized resources provided to a programmer or data mining researcher for developing, configuring, and running data analysis frameworks or applications
• Users: Data mining programmers, data management developers, data mining researchers
Plan of Action
• August - October 2014: Literature survey completed.
• November 2014: Problem definition formulated; the problem-solving outline is yet to be done.
• December 2014 - January 2015: Find an appropriate solution to the problem (yet to be formulated).
• February - May 2015: Final implementation of the solution, with results (yet to be done).
Conclusion
Large-scale data mining has become a new challenge in recent years, and big data analytics can be accomplished using the MapReduce framework. The K-means algorithm is one of the best-known clustering algorithms, but its processing performance usually hits a bottleneck when it is applied to massive data. A parallel K-means algorithm implemented with MapReduce shows a clear advantage in handling massive data.
References
[1] Walisa Romsaiyud, Wichian Premchaiswadi, "An Adaptive Machine Learning on Map-Reduce Framework for Improving Performance of Large-Scale Data Analysis on EC", Eleventh IEEE Int'l Conf. on ICT and Knowledge Engineering, 2014
[2] Domenico Talia, "Clouds for Scalable Big Data Analytics", IEEE Computer Society, 2013
[3] Feng Ye, Zhijan Wang, "Cloud-based Big Data Mining & Analyzing Services Platform integrating R", IEEE International Conference on Advance Cloud and Big Data, 2013
[4] Apache Hadoop, http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F