MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)
published in Mining Software Repositories 2009
Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan
Presenter: Jihun Park
SELab
2013.06.07
LAB Seminar
Mining Software Repositories
[Diagram] Version control system: source code, patches (per revision), author info. Bug tracking system: the cause, severity, and status of bugs.
Research Questions
• What kinds of patches are likely to be buggy? (LOC, # of methods, ...)
• Can we use XX information of code entities to predict additional change locations? (co-change, structure, ...)
• How does the software evolve? (when does refactoring occur?, ...)
• How can we predict buggy files / patches? (code change complexity, machine learning algorithms)
[Diagram] Revisions 324, 332, 352, and 370: some revisions carry a patch for a single file, others carry patches for many files.
Suggestion
• Figure out modified methods
• Identify the modified lines of code
• Identify the relationships between changed methods and other methods
• Connect fix revisions to bug reports
• …
The size of repositories is getting bigger!
Commit log: Fix bug 12345
=========================
void foo(){
+++ int a = 0;
----- int a = 1;
}
void bar(){
}
…
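One of the mining tasks listed above, identifying the modified lines of code, can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation, and it assumes simplified unified-diff-style `+`/`-` markers rather than the notation shown in the example commit log:

```python
# Sketch: split a patch into added and removed source lines.
# Assumes unified-diff-style markers ("+"/"-" prefix changed lines,
# "+++"/"---" prefix file headers and are skipped).

def modified_lines(patch_text):
    """Return (added, removed) lists of source lines from a patch."""
    added, removed = [], []
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:].strip())
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:].strip())
    return added, removed

patch = """void foo(){
+ int a = 0;
- int a = 1;
}"""
# modified_lines(patch) -> (["int a = 0;"], ["int a = 1;"])
```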
Motivation and Findings
• Motivation
– Mining Software Repositories (MSR) is one of the main research areas in software engineering.
– Software repositories are getting bigger.
– Big data analysis technique (e.g., MapReduce) can facilitate analyzing large repositories.
• Findings
– It is easy to migrate existing algorithms to a distributed system.
– MapReduce can improve the analysis speed.
Outline
• Mining Software Repositories
• Motivation
• Big Data Analysis and MapReduce
• Approach
• Experimental Setup
• Evaluation
• My Research Area
Big Data Analysis
• Big data requires exceptional technologies to efficiently process large quantities of data.
• Big data techniques include
– Association rule mining
– Machine learning
– Genetic algorithm
– Pattern recognition
– …
• These methodologies already exist; the problem is scalability.
Scaling out (distributed systems) is often preferable to scaling up (bigger, more powerful machines).
MapReduce
• A popular programming model for parallel, distributed processing of large data sets on a cluster.
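The model can be illustrated without a cluster. Below is a minimal pure-Python sketch of the map, shuffle, and reduce steps, counting word frequencies across commit messages; the function names and sample data are illustrative, not part of the paper:

```python
from collections import defaultdict

def map_phase(commit_message):
    """Mapper: emit a (word, 1) pair for every word in one commit message."""
    for word in commit_message.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return (key, sum(values))

commits = ["Fix bug 12345", "Fix typo", "Refactor parser"]
mapped = [pair for msg in commits for pair in map_phase(msg)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["fix"] == 2
```

In a real Hadoop job the mappers and reducers run on different nodes and the framework performs the shuffle; the data flow, however, is exactly this.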
Approach Overview
• Can MapReduce support MSR research?
1. Adaptability - is it easy to migrate to a MapReduce approach?
2. Efficiency - is it faster than the non-distributed approach?
3. Scalability - does it scale with the size of the input data?
4. Flexibility - does it run on different types of machines?
[Diagram] J-REX and the three DJ-REX variants, each consisting of the Extraction, Parsing, and Analysis phases.
J-REX
1. Extraction: extract a series of snapshots for each file.
2. Parsing: parse each snapshot into XML using the JDT AST parser.
3. Analysis: analyze the parsed snapshots to obtain evolutionary change data such as change type, message, time, etc.
MapReduce Strategy for DJ-REX
• DJ-REX1: Extraction and parsing are done by one machine; distributed machines then perform the analysis.
• DJ-REX2: Extraction is done by one machine; the remaining phases are done by distributed machines.
• DJ-REX3: Every phase is done by distributed machines.
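The three strategies differ only in which pipeline phases are handed to the cluster. A small sketch makes the split explicit; the data structures and names here are illustrative, not taken from the DJ-REX code:

```python
# Sketch of the three DJ-REX strategies: each J-REX phase
# (extraction -> parsing -> analysis) runs either locally or on the cluster.

PHASES = ["extraction", "parsing", "analysis"]

STRATEGIES = {
    "DJ-REX1": {"analysis"},                          # only analysis distributed
    "DJ-REX2": {"parsing", "analysis"},               # parsing + analysis distributed
    "DJ-REX3": {"extraction", "parsing", "analysis"}, # fully distributed
}

def plan(strategy):
    """Return (phase, location) pairs describing where each phase runs."""
    distributed = STRATEGIES[strategy]
    return [(p, "cluster" if p in distributed else "local") for p in PHASES]
```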
Experimental setup
• Two server machines and two desktop machines.
• The server machines have SSDs, which are much faster than normal hard disks.
• Experiments are done with three open-source projects.
Characteristics of Eclipse, BIRT, and Datatools:

            Repository Size   # Source Code Files   Length of History   # Revisions
Datatools   394 MB            10,552                2 years             2,398
BIRT        810 MB            13,002                4 years             19,583
Eclipse     4.2 GB            56,851                8 years             82,682
1. Adaptability
• Does the Hadoop migration take a long time?
• Migration is very easy.
– Hadoop provides input-splitting classes such as “MultiFileSplit” and “DBInputSplit”.
– Hadoop has well-defined and simple APIs.
– Several code examples are available.
J-REX logic: no change
MapReduce strategy for DJ-REX1: 400 LOC, 2 hours
MapReduce strategy for DJ-REX2: 400 LOC, 2 hours
MapReduce strategy for DJ-REX3: 300 LOC, 1 hour
Deployment configuration: 1 hour
Reconfiguration: 1 minute
2. Efficiency

Experimental results for DJ-REX in Hadoop (running times, h:mm:ss):

Repository   Desktop   Server     Strategy   2 nodes   3 nodes   4 nodes
Datatools    0:35:50   0:34:14    DJ-REX3    0:19:52   0:14:32   0:16:40
BIRT         2:44:09   2:05:55    DJ-REX1    2:03:51   1:40:22   1:08:36
                                  DJ-REX2    2:05:02   1:40:32   0:50:33
                                  DJ-REX3    2:16:03   1:47:26   0:45:16
Eclipse      -         12:35:34   DJ-REX3    -         -         3:49:05
• The experiments show two main sub-conclusions for efficiency.
– A faster machine can speed up the mining process.
– All DJ-REX solutions outperform the non-distributed approach.
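The reported running times can be turned into speedup factors directly. A small sketch, using the best DJ-REX3 time per project from the results table against the non-distributed server baseline (helper names are illustrative):

```python
# Speedup of DJ-REX3 over the non-distributed server run,
# using the h:mm:ss times reported in the results table.

def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

server_baseline = {"Datatools": "0:34:14", "BIRT": "2:05:55", "Eclipse": "12:35:34"}
djrex3_best = {"Datatools": "0:14:32", "BIRT": "0:45:16", "Eclipse": "3:49:05"}

speedups = {p: round(to_seconds(server_baseline[p]) / to_seconds(djrex3_best[p]), 2)
            for p in server_baseline}
# e.g., the largest repository (Eclipse) shows the largest speedup
```

Note how the speedup grows with repository size, which anticipates the scalability conclusion on the next slide.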
2. Efficiency (cont’d)
• Preprocessing time is the time needed for the non-distributed phases.
• The data-copying time increases when adding nodes.
• The fully distributed DJ-REX3 is the most efficient.
Comparison of the running time of the 3 flavors of DJ-REX for BIRT
3. Scalability
• The bigger the repository, the more time can be saved by Hadoop.
• Hadoop scales well for different numbers of nodes.
• The overhead of copying input data to another node can outweigh the benefit of parallelizing tasks across nodes.
Running time comparison for BIRT and Datatools with DJ-REX3
4. Flexibility
• Hadoop runs on many different platforms (Windows, Mac, Unix, etc.).
• In this experiment, several different machines (two desktops and two servers) were used.
• Load balancing in Hadoop ensures a fair distribution of work.
Conclusion
• It is easy to migrate existing mining algorithms to a distributed system.
• The big data analysis technique MapReduce can improve the analysis speed.
• There is a data distribution overhead, which determines the optimal number of nodes.
• Adding a machine is very easy, which means the approach is scalable.
Discussion
• The MapReduce approach can also be used for splitting mailing lists, mapping bug reports, etc.
• Copying into HDFS can be an overhead; finding the optimal Hadoop configuration is future work.
My Research Area
[Diagram] Development history: bug reports (e.g., Bug 22, Bug 31) linked to fix commits; Bug 22 has a single fix commit, while Bug 31 has an initial patch followed by supplementary patches (several “Fix #31” commits).
• Type 1 bugs: bug IDs that were mentioned in only one fix commit.
• Type 2 bugs: bug IDs that were mentioned in multiple fix revisions.
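The classification above is easy to operationalize: scan the commit logs, count how many fix commits mention each bug ID, and label the ID accordingly. A minimal sketch, assuming a simple "Fix #N" commit-message convention (the regex and sample logs are illustrative, not from the actual study):

```python
import re
from collections import Counter

# Sketch: classify bug IDs as Type 1 (one fix commit) or
# Type 2 (multiple fix revisions) from commit logs.

BUG_ID = re.compile(r"fix #(\d+)", re.IGNORECASE)

def classify_bugs(commit_logs):
    """Map each bug ID to 'type1' or 'type2' by counting fix commits."""
    mentions = Counter()
    for log in commit_logs:
        for bug_id in set(BUG_ID.findall(log)):  # one count per commit
            mentions[bug_id] += 1
    return {bug: ("type1" if n == 1 else "type2") for bug, n in mentions.items()}

logs = ["Fix #22: null check",
        "Fix #31: initial patch",
        "Fix #31: supplementary patch",
        "Fix #31: another follow-up"]
# classify_bugs(logs) -> {"22": "type1", "31": "type2"}
```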
My Research Area
• Empirical results
– A considerable portion of bugs requires supplementary patches.
– Type 2 bugs are more severe.
– Type 2 bugs take a longer time to be fixed.
– Incomplete patches are larger in size and more scattered.
• Can previous prediction approaches be used to predict supplementary change locations?
– Code clones
– Co-change
– Structural relationships (inheritance, method calls, etc.)
My Research Area
• Existing approaches are not enough to predict supplementary change locations!
– Only a small portion of supplementary patches are code clones of the initial patch.
– A considerable portion of supplementary patches cannot be predicted using structural dependencies (e.g., method calls, inheritance).
– Historical co-change has low precision for this prediction.
• How can we predict additional change locations?
• Anyone who is interested in the MSR area, feel free to contact me.
Thank you for listening
C-REX Change schema
Three DJ-REX approaches
[Diagram] J-REX and the three DJ-REX variants, each with Extraction, Parsing, and Analysis phases.
• DJ-REX1: Use MapReduce for the analysis phase.
• DJ-REX2: Use MapReduce for the parsing and analysis phases.
• DJ-REX3: Use MapReduce for all phases.
Running time of the basic J-REX on a desktop and server machine, and of DJ-REX3 on 3 virtual machines on the same server machine