Presentation on Radoop

Presentation on Radoop. Presented by: Sudipta Mahapatra, Reg. No. 1021209061, Roll No. 60, 7th Sem, IT

Upload: siliconsudipt

Post on 03-Jul-2015


Category:

Education



DESCRIPTION

Radoop

TRANSCRIPT

Page 1: Présentation on radoop

Presentation on Radoop

Presented by: Sudipta Mahapatra

Reg. No.: 1021209061

Roll No.: 60, 7th Sem, IT

Page 2: Présentation on radoop

Contents

• Introduction

• Why Radoop

• Architecture

• Radoop sub-parts

• Benefits

• Conclusion and future work

Page 3: Présentation on radoop

Introduction

• Radoop is a tool used for data analysis.

• It was developed by Gábor Makrai.

• Radoop closely integrates the highly optimized data analytics capabilities of Hadoop clusters, the distributed data warehouse Hive, and Mahout into the user-friendly interface of RapidMiner. This results in a powerful and easy-to-use data analytics solution for Hadoop.

Page 4: Présentation on radoop

Why Radoop?

[Figure: bar chart of data volume by year: 130 in 2005, 1,227 in 2010, 7,910 in 2015; vertical axis from 0 to 9,000.]

Data is growing. Quickly. And it’s everywhere.

Page 5: Présentation on radoop

New kinds of data

Structured data vs. unstructured data growth

[Figure: growth curves labeled “Complex, Unstructured” and “Relational”, a curve labeled “Our ability to analyze”, and the “Analysis gap” between them.]

“The sexy job in the next 10 years will be statisticians”

– Hal Varian, Chief Economist at Google

• The digital universe grew by 62% last year to 800,000 petabytes and will grow to 1.2 zettabytes this year.
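The two figures on this slide use different units, so a short conversion helps when comparing them. The sketch below assumes decimal (SI) units, where one zettabyte is one million petabytes; the variable names are illustrative only.

```python
# Decimal (SI) storage units: 1 petabyte = 10**15 bytes, 1 zettabyte = 10**21 bytes.
PB_PER_ZB = 10**21 // 10**15  # one million petabytes per zettabyte

last_year_pb = 800_000  # "800K petabytes" from the slide
this_year_zb = 1.2      # projected size from the slide

# Convert last year's figure to zettabytes so both values share a unit.
last_year_zb = last_year_pb / PB_PER_ZB

# Relative growth implied by the projection.
growth = (this_year_zb - last_year_zb) / last_year_zb

print(f"Last year: {last_year_zb} ZB")
print(f"Implied growth this year: {growth:.0%}")
```

Under that assumption, 800,000 petabytes is 0.8 zettabytes, so the projection to 1.2 zettabytes implies roughly 50% growth.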

Page 6: Présentation on radoop

Architecture

• Hive: a data warehouse infrastructure for data summarization and ad hoc querying.

• Mahout: a scalable machine learning and data mining library.

• HDFS is a distributed file system designed to hold very large amounts of data and provide high-throughput access to this information.

• The MapReduce programming model is a framework for distributed processing of large data sets.

• RapidMiner is a toolkit for data mining.

Page 7: Présentation on radoop

RapidMiner

RapidMiner, formerly YALE (Yet Another Learning Environment), is an environment for machine learning, data mining, text mining, predictive analytics, and business analytics.

It is used for research, education, training, application development, and industrial applications.

RapidMiner provides data mining and machine learning procedures including data loading and transformation (ETL), data preprocessing and visualization, modeling, evaluation, and deployment. Like MS Excel, it can also generate charts.

It is also used for analyzing data generated by high-throughput instruments used in processes such as genotyping, proteomics, and mass spectrometry.

Page 8: Présentation on radoop

Hive and Mahout

• Hive is a data warehouse infrastructure built on top of Hadoop; that is, it uses Hadoop's distributed file system and its efficient access technologies. Hive was initially developed by Facebook and is now used and developed by many other companies for their distributed data warehouses.

• Mahout is a machine learning library that already offers many scalable machine learning algorithms, likewise implemented on top of Hadoop and its MapReduce paradigm. Hence, Mahout is one of the first distributed data analytics frameworks to make use of the power of Hadoop.
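Hive is queried through a SQL-like language (HiveQL), so an ad hoc summarization looks much like an ordinary GROUP BY query. The sketch below is only an illustration of that kind of query: it uses Python's built-in sqlite3 module as a local stand-in, not Hive itself, and the page_views table and its rows are made up for the example.

```python
import sqlite3

# Local stand-in for a warehouse table; sqlite3 is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (country TEXT, page TEXT, duration_s INTEGER)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("HU", "/home", 12),
        ("HU", "/pricing", 40),
        ("IN", "/home", 8),
        ("IN", "/home", 20),
        ("US", "/docs", 65),
    ],
)

# Ad hoc summarization: number of views and average duration per country.
summary = conn.execute(
    """
    SELECT country, COUNT(*) AS views, AVG(duration_s) AS avg_duration_s
    FROM page_views
    GROUP BY country
    ORDER BY country
    """
).fetchall()

for row in summary:
    print(row)

conn.close()
```

On a Hadoop cluster, a query of this shape would run over data stored in HDFS rather than an in-memory table.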

Page 9: Présentation on radoop

Map Reduce

The MapReduce programming model

– Framework for distributed processing of large data sets

Natural for:

– Log processing

– Web search indexing

– Ad hoc queries
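The log-processing case can be sketched in a few lines. The example below is a single-process illustration of the map, shuffle, and reduce steps, not Hadoop's API; the log format and the function names are assumptions made for this sketch.

```python
from collections import defaultdict
from typing import Iterable, Iterator

# Hypothetical access-log lines in the form "<client> <path> <status>".
LOG_LINES = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
    "10.0.0.3 /index.html 500",
    "10.0.0.2 /index.html 200",
]


def map_phase(line: str) -> Iterator[tuple[str, int]]:
    """Emit (status_code, 1) for one log line."""
    status = line.split()[-1]
    yield status, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Group all emitted values by key, the step a framework performs between map and reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Sum the counts collected for one status code."""
    return key, sum(values)


mapped = (pair for line in LOG_LINES for pair in map_phase(line))
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each map call sees only one line and each reduce call sees only one key, a framework can spread both phases across many machines.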

Page 10: Présentation on radoop

HDFS

HDFS is a distributed file system designed to hold very large amounts of data and provide high-throughput access to this information.

Very Large Distributed File System

– 10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware

– Files are replicated to handle hardware failure

– Detects failures and recovers from them

Optimized for Batch Processing

– Data locations exposed so that computations can move to where data resides

– Provides very high aggregate bandwidth

Runs in user space on heterogeneous operating systems

No RAID required.
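The replication idea can be illustrated with a toy model. The sketch below is a simplification and not HDFS's actual placement policy: it assigns replicas round-robin, whereas real HDFS placement is rack-aware. The node names and the 128 MB block size are assumptions for the example; a replication factor of 3 is the HDFS default.

```python
import itertools

REPLICATION = 3      # HDFS's default replication factor
BLOCK_SIZE_MB = 128  # assumed block size for this sketch
NODES = ["node1", "node2", "node3", "node4", "node5"]  # hypothetical cluster


def place_blocks(file_size_mb: int) -> dict[int, list[str]]:
    """Split a file into blocks and give each block REPLICATION distinct nodes (round-robin)."""
    n_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    ring = itertools.cycle(NODES)
    return {block: [next(ring) for _ in range(REPLICATION)] for block in range(n_blocks)}


def readable_after_failure(placement: dict[int, list[str]], failed: set[str]) -> bool:
    """A file stays readable if every block keeps at least one replica on a live node."""
    return all(
        any(node not in failed for node in replicas)
        for replicas in placement.values()
    )


placement = place_blocks(300)  # a 300 MB file needs three 128 MB blocks
print(placement)
print(readable_after_failure(placement, {"node1", "node2"}))
print(readable_after_failure(placement, {"node1", "node2", "node3"}))
```

With three replicas per block, any two nodes can fail without losing data in this model, which is why commodity hardware without RAID is sufficient.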

Page 11: Présentation on radoop

Advantages

Scalability: Even data volumes in the terabyte and petabyte range can be analyzed.

Radoop provides a user-friendly interface for editing and running ETL, analytics, and machine learning processes on Hadoop.

Radoop provides an easy-to-use graphical interface.

It eliminates ETL bottlenecks.

Page 12: Présentation on radoop

Conclusion and future work

Experiments have shown that, in a comparable time, Radoop can analyze 1 to 8 GB of data on 4 to 16 nodes, whereas RapidMiner alone can analyze up to 1 GB.

“We believe more than half of the world’s data will be stored in Apache Hadoop within 5 years”

– Hortonworks

Radoop is opening the doors for people who are less comfortable with Hadoop but want to use Hadoop for Big Data analysis.

Page 13: Présentation on radoop

Questions?