rm world 2014: a user interface for big data with rapidminer


Post on 05-Dec-2014


Page 1: RM World 2014: A user interface for big data with RapidMiner

A User Interface For Big Data With RapidMiner

Marcelo Beckmann

Nelson F. F. Ebecken

Beatriz S. L. Pires de Lima

Myrian Christina de Aragão Costa

Page 2: RM World 2014: A user interface for big data with RapidMiner

Agenda

Introduction

Previous Work

Motivations

Architecture

Operators

Mouse Runner

Experiments

Conclusion

Page 3: RM World 2014: A user interface for big data with RapidMiner

Introduction

Since 2012, 2.5 exabytes of data have been created every day, and this volume is still growing;

How can useful information be extracted from this daily mountain of data?

The MapReduce paradigm and its related frameworks answered the question;

Google, Yahoo, Netflix, Amazon, YouTube, Facebook, and Apple are good examples of successful big data projects;

The Hadoop environment is the result of the great effort made by open source initiatives since 2004;

Page 4: RM World 2014: A user interface for big data with RapidMiner

Introduction

Despite the great progress made in the backend engines, there is a lack of user interfaces;

Nowadays, most of the work to configure and run Hadoop components is done through scripting;

This work aims to help improve this scenario.

Page 5: RM World 2014: A user interface for big data with RapidMiner

Previous Work

Since the advent of MapReduce, research and development have focused mostly on backend engines;

In recent years, several initiatives have started to make the Hadoop environment more user-friendly;

Companies like Cloudera, Pentaho, Talend, and Hortonworks have made huge contributions to improving tool usability in the Hadoop environment, especially in execution control, ETL, and databases;

Radoop (*) made significant contributions to integrating Mahout with RapidMiner through a proprietary solution.

* Radoop was acquired by RapidMiner in July 2014.

Page 6: RM World 2014: A user interface for big data with RapidMiner

Motivations

The Hadoop environment, and especially the Mahout engine, still lacks an open source UI integration;

In terms of Java coding, starting jobs, making remote API calls, and retrieving results from the Hadoop environment are too complex. An encapsulation is needed to simplify these activities;

There are integration and connectivity problems in heterogeneous environments and complex network infrastructure.

Page 7: RM World 2014: A user interface for big data with RapidMiner

Architecture

Our research

Page 8: RM World 2014: A user interface for big data with RapidMiner

Architecture

Big Data Extension

RapidMiner is easy to extend;

A RapidMiner extension with 14 operators was created;

Big data operators can be mixed with existing RapidMiner operators in order to run jobs and analyze results;

Integrated with Hadoop, HDFS, Hive, Mahout;

Open Source.

Mouse Runner

Provides an extra layer for remote calls and activation;

Reduces the coupling between the presentation tier and business services;

Starts jobs and retrieves results from the Hadoop-related components.

Page 9: RM World 2014: A user interface for big data with RapidMiner

Operators

Page 10: RM World 2014: A user interface for big data with RapidMiner

Operators

Masters node – Contains all the configuration necessary to connect the operators to a Hadoop environment;

IO Operators – Execute operations in HDFS and the Hive database;

Read Hive Database – Executes queries in the Hive database and returns an ExampleSet with samples that also points to a file in HDFS. Other Big Data operators refer to this file, not to the samples;

Clustering – Clustering algorithms from Mahout;

Transformation – Performs transformations in the Hive database;

Utility – Runs scripts through an SSH connection and kills jobs.
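The behaviour described for Read Hive Database (a small sample for inspection plus a pointer to the full data in HDFS) can be pictured with the following minimal sketch. The slides do not show the extension's classes, so `HdfsBackedSample` and `DownstreamOperator` are hypothetical names used only for illustration; the real operator returns a RapidMiner ExampleSet.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder: a small in-memory preview plus a pointer to the full data in HDFS. */
final class HdfsBackedSample {
    private final List<String> previewRows;
    private final String hdfsPath;

    HdfsBackedSample(List<String> previewRows, String hdfsPath) {
        // Copy the preview so later changes to the caller's list do not leak in.
        this.previewRows = Collections.unmodifiableList(new ArrayList<>(previewRows));
        this.hdfsPath = hdfsPath;
    }

    /** Rows shown to the user in the interface; only a sample of the data. */
    List<String> previewRows() {
        return previewRows;
    }

    /** Location of the complete data set on the cluster. */
    String hdfsPath() {
        return hdfsPath;
    }
}

/** Hypothetical downstream step: it reads the HDFS location and ignores the preview. */
final class DownstreamOperator {
    static String inputFor(HdfsBackedSample data) {
        return data.hdfsPath();
    }
}
```

With this split, the preview can stay small enough for the desktop client while cluster-side jobs always read the complete file.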

Page 11: RM World 2014: A user interface for big data with RapidMiner

Mouse Runner

Mouse Runner simplifies the call to Hadoop components

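// Configure the connection to the Hadoop cluster
KMeansRunner runner = new KMeansRunner();
runner.setHost("192.168.13.131");
runner.setHdfsPort("9000");
runner.setMapredPort("9001");

// Input and output locations in HDFS
runner.setInputPath("/user/hadoop-users/testdata");
runner.setOutputPath("/user/hadoop-user/output");

// K-means parameters: number of clusters and maximum number of runs
runner.setK(5);
runner.setMaxRuns(10);

// Start the remote job and retrieve the clustering result
ClusterResult result = runner.run();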
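The slides do not show how Mouse Runner turns these setters into Hadoop settings. As a rough sketch only, the three connection values could be mapped to the Hadoop 1.x properties `fs.default.name` and `mapred.job.tracker`. The class `RunnerSettings` and its method are hypothetical and are not part of Mouse Runner.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the actual Mouse Runner source. */
final class RunnerSettings {
    private final String host;
    private final String hdfsPort;
    private final String mapredPort;

    RunnerSettings(String host, String hdfsPort, String mapredPort) {
        this.host = host;
        this.hdfsPort = hdfsPort;
        this.mapredPort = mapredPort;
    }

    /** Builds Hadoop 1.x style properties from the three connection values. */
    Map<String, String> toHadoopProperties() {
        Map<String, String> props = new LinkedHashMap<>();
        // NameNode address used by HDFS clients.
        props.put("fs.default.name", "hdfs://" + host + ":" + hdfsPort);
        // JobTracker address used to submit MapReduce jobs.
        props.put("mapred.job.tracker", host + ":" + mapredPort);
        return props;
    }
}
```

Hiding this mapping behind a runner means the calling code only deals with a host name and two ports, which is the kind of encapsulation the Motivations slide asks for.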

Page 12: RM World 2014: A user interface for big data with RapidMiner

Mouse Runner

Ports to open:

9000, 9001, 50070, 50075, 50090, 50105, 50030, 50060, 8020, 50010, 50020, 50100, 10000, ...

Integration among heterogeneous OS and networks

Page 13: RM World 2014: A user interface for big data with RapidMiner

Mouse Runner

Ports to open:

9999, 10000
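Before running jobs, it can help to confirm that the listed ports are reachable from the client machine. The sketch below uses only the standard `java.net.Socket` API; the class `PortCheck` and its methods are illustrative names and are not part of the extension.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper that tests TCP reachability of a host and port. */
final class PortCheck {

    /** Returns true if a TCP connection to host:port succeeds within the timeout. */
    static boolean isOpen(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timed out, or host unreachable.
            return false;
        }
    }

    /** Returns the subset of ports that could not be reached. */
    static List<Integer> closedPorts(String host, int[] ports, int timeoutMillis) {
        List<Integer> closed = new ArrayList<>();
        for (int port : ports) {
            if (!isOpen(host, port, timeoutMillis)) {
                closed.add(port);
            }
        }
        return closed;
    }
}
```

For example, `PortCheck.closedPorts("192.168.13.131", new int[] {9999, 10000}, 2000)` returns the ports from this slide that are still blocked for the host used in the earlier example.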

Page 14: RM World 2014: A user interface for big data with RapidMiner

Experiments

• K-means clustering comparison between RapidMiner and Mahout using the Davies–Bouldin index;

• Davies–Bouldin index (DBI): an internal evaluation method that measures the quality of clusters; the lower the DBI, the better the cluster quality;

• The aim is to validate the integration with Mahout, using the RapidMiner K-Means as the baseline;

• Datasets: Synthetic Control, Covertype, and Household from the UCI Machine Learning Repository;

• The results obtained in terms of the Davies–Bouldin index were very similar;

• RapidMiner had an instant response on the smaller dataset;

• Mahout scaled better on the larger datasets.
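For reference, the Davies–Bouldin index averages, over all clusters, the worst ratio of within-cluster scatter to between-centroid distance. The following is a minimal sketch of the standard definition using Euclidean distance; it is not the code used in the experiments, and the class name `DaviesBouldin` is illustrative. It assumes at least two clusters, none of them empty, with distinct centroids.

```java
/** Minimal Davies-Bouldin index for points with hard cluster labels (illustrative sketch). */
final class DaviesBouldin {

    /** Euclidean distance between two vectors of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * @param points one row per sample
     * @param labels cluster id in [0, k) for each sample
     * @param k      number of clusters (assumed k >= 2, no empty cluster)
     */
    static double index(double[][] points, int[] labels, int k) {
        int dim = points[0].length;
        double[][] centroids = new double[k][dim];
        int[] counts = new int[k];

        // Centroid of each cluster: the mean of its members.
        for (int n = 0; n < points.length; n++) {
            counts[labels[n]]++;
            for (int d = 0; d < dim; d++) {
                centroids[labels[n]][d] += points[n][d];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int d = 0; d < dim; d++) {
                centroids[c][d] /= counts[c];
            }
        }

        // Scatter of each cluster: mean distance from its members to the centroid.
        double[] scatter = new double[k];
        for (int n = 0; n < points.length; n++) {
            scatter[labels[n]] += distance(points[n], centroids[labels[n]]);
        }
        for (int c = 0; c < k; c++) {
            scatter[c] /= counts[c];
        }

        // For each cluster take the largest similarity ratio, then average over clusters.
        double total = 0.0;
        for (int i = 0; i < k; i++) {
            double worst = 0.0;
            for (int j = 0; j < k; j++) {
                if (i == j) {
                    continue;
                }
                double ratio = (scatter[i] + scatter[j]) / distance(centroids[i], centroids[j]);
                worst = Math.max(worst, ratio);
            }
            total += worst;
        }
        return total / k;
    }
}
```

Compact, well-separated clusters give a small numerator and a large denominator, which is why a lower value indicates better clustering.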

Page 15: RM World 2014: A user interface for big data with RapidMiner

Experiments

Page 16: RM World 2014: A user interface for big data with RapidMiner

Experiments

Page 17: RM World 2014: A user interface for big data with RapidMiner

Conclusion

• An open source extension for RapidMiner called “Big Data” was created;

• This extension integrates RapidMiner with Hadoop, HDFS, Hive, and Mahout;

• It initially includes 14 operators;

• A component called “Mouse Runner” was created, which provides remote activation facilities and a simplified API for activation and result retrieval for Hadoop-related components;

• A comparison between the K-Means operators from RapidMiner and Mahout showed similar results in terms of the Davies–Bouldin index;

• Mahout scaled better on the larger datasets. RapidMiner had an instant response on the smaller dataset.

Page 18: RM World 2014: A user interface for big data with RapidMiner

• Thank you for your attention!
