MapReduce as a General Framework to Support Research in Mining Software Repositories (MSR)
published in Mining Software Repositories 2009
Weiyi Shang, Zhen Ming Jiang, Bram Adams, Ahmed E. Hassan
Presenter: Jihun Park
SELab
2013.06.07
LAB Seminar
Mining Software Repositories
[Diagram] Version control system: source code, patches (per revision), author info. Bug tracking system: the cause, severity, and status of bugs.
Research Questions
• What kinds of patches are likely to be buggy? (LOC, # of methods, ...)
• Can we use XX information of code entities to predict additional change locations? (co-change, structure, ...)
• How does the software evolve? (when does refactoring occur?, ...)
• How can we predict buggy files / patches? (code change complexity, machine learning algorithms)
[Diagram] Revisions 324, 332, 352, and 370: some revisions carry a patch for a single file, others carry patches for many files.
Suggestion
• Figure out modified methods
• Identify the modified lines of code
• Identify the relationships between changed methods and other methods
• Connect fix revisions to bug reports
• …
The size of repositories is getting bigger!
Commit log: Fix bug 12345
=========================
void foo(){
+++ int a = 0;
----- int a = 1;
}
void bar(){
}
…
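One of the mining tasks listed above, identifying the modified lines of code, can be sketched in a few lines of Python. This is an illustrative sketch, not the paper's implementation, and it assumes simplified unified-diff-style `+`/`-` markers rather than the notation shown in the example commit log:

```python
# Sketch: split a patch into added and removed source lines.
# Assumes unified-diff-style markers ("+"/"-" prefix changed lines,
# "+++"/"---" prefix file headers and are skipped).

def modified_lines(patch_text):
    """Return (added, removed) lists of source lines from a patch."""
    added, removed = [], []
    for line in patch_text.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            added.append(line[1:].strip())
        elif line.startswith("-") and not line.startswith("---"):
            removed.append(line[1:].strip())
    return added, removed

patch = """void foo(){
+ int a = 0;
- int a = 1;
}"""
# modified_lines(patch) -> (["int a = 0;"], ["int a = 1;"])
```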
Motivation and Findings
• Motivation
– Mining Software Repositories (MSR) is one of the main research areas in software engineering.
– Software repositories are getting bigger.
– Big data analysis technique (e.g., MapReduce) can facilitate analyzing large repositories.
• Findings
– It is easy to migrate existing algorithms to a distributed system.
– MapReduce can improve the analysis speed.
Outline
• Mining Software Repositories
• Motivation
• Big Data Analysis and MapReduce
• Approach
• Experimental Setup
• Evaluation
• My Research Area
Big Data Analysis
• Big data requires exceptional technologies to efficiently process large quantities of data.
• Big data techniques include
– Association rule mining
– Machine learning
– Genetic algorithm
– Pattern recognition
– …
• These methodologies already exist; the problem is scalability.
Scaling out (distributed systems) is often preferable to scaling up (bigger, more powerful machines).
MapReduce
• A popular programming model for parallel, distributed processing of large data sets on a cluster.
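The model can be illustrated without a cluster. Below is a minimal pure-Python sketch of the map, shuffle, and reduce steps, counting word frequencies across commit messages; the function names and sample data are illustrative, not part of the paper:

```python
from collections import defaultdict

def map_phase(commit_message):
    """Mapper: emit a (word, 1) pair for every word in one commit message."""
    for word in commit_message.lower().split():
        yield (word, 1)

def shuffle(mapped_pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reducer: sum the counts for one word."""
    return (key, sum(values))

commits = ["Fix bug 12345", "Fix typo", "Refactor parser"]
mapped = [pair for msg in commits for pair in map_phase(msg)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["fix"] == 2
```

In a real Hadoop job the mappers and reducers run on different nodes and the framework performs the shuffle; the data flow, however, is exactly this.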
Approach Overview
• Can MapReduce support MSR research?
1. Adaptability - is it easy to migrate to a MapReduce approach?
2. Efficiency - is it faster than the non-distributed approach?
3. Scalability - does it scale with the size of the input data?
4. Flexibility - does it run on different types of machines?
[Diagram] J-REX and the three DJ-REX variants, each consisting of the Extraction, Parsing, and Analysis phases.
J-REX
1. Extraction: extract a series of snapshots for each file.
2. Parsing: parse each snapshot into XML using the JDT AST parser.
3. Analysis: analyze the parsed snapshots to obtain evolutionary change data such as change type, message, time, etc.
MapReduce Strategy for DJ-REX
• DJ-REX1: Extraction and parsing are done by one machine; distributed machines then perform the analysis.
• DJ-REX2: Extraction is done by one machine; the remaining phases are done by distributed machines.
• DJ-REX3: Every phase is done by distributed machines.
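The three strategies differ only in which pipeline phases are handed to the cluster. A small sketch makes the split explicit; the data structures and names here are illustrative, not taken from the DJ-REX code:

```python
# Sketch of the three DJ-REX strategies: each J-REX phase
# (extraction -> parsing -> analysis) runs either locally or on the cluster.

PHASES = ["extraction", "parsing", "analysis"]

STRATEGIES = {
    "DJ-REX1": {"analysis"},                          # only analysis distributed
    "DJ-REX2": {"parsing", "analysis"},               # parsing + analysis distributed
    "DJ-REX3": {"extraction", "parsing", "analysis"}, # fully distributed
}

def plan(strategy):
    """Return (phase, location) pairs describing where each phase runs."""
    distributed = STRATEGIES[strategy]
    return [(p, "cluster" if p in distributed else "local") for p in PHASES]
```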
Experimental setup
• Two server machines and two desktop machines.
• The server machines have SSDs, which are much faster than normal hard disks.
• Experiments are done with three open-source projects.
Characteristics of Eclipse, BIRT, and Datatools:

            Repository Size   # Source Code Files   Length of History   # Revisions
Datatools   394 MB            10,552                2 years             2,398
BIRT        810 MB            13,002                4 years             19,583
Eclipse     4.2 GB            56,851                8 years             82,682
1. Adaptability
• Does the Hadoop migration take a long time?
• Migration is very easy.
– Hadoop provides input-splitting classes such as “MultiFileSplit” and “DBInputSplit”.
– Hadoop has well-defined and simple APIs.
– Several code examples are available.
J-REX logic: no change
MapReduce strategy for DJ-REX1: 400 LOC, 2 hours
MapReduce strategy for DJ-REX2: 400 LOC, 2 hours
MapReduce strategy for DJ-REX3: 300 LOC, 1 hour
Deployment configuration: 1 hour
Reconfiguration: 1 minute
2. Efficiency

Experimental results for DJ-REX in Hadoop (running times, h:mm:ss):

Repository   Desktop   Server     Strategy   2 nodes   3 nodes   4 nodes
Datatools    0:35:50   0:34:14    DJ-REX3    0:19:52   0:14:32   0:16:40
BIRT         2:44:09   2:05:55    DJ-REX1    2:03:51   1:40:22   1:08:36
                                  DJ-REX2    2:05:02   1:40:32   0:50:33
                                  DJ-REX3    2:16:03   1:47:26   0:45:16
Eclipse      -         12:35:34   DJ-REX3    -         -         3:49:05
• The experiments show two main sub-conclusions for efficiency.
– A faster machine can speed up the mining process.
– All DJ-REX solutions outperform the non-distributed approach.
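The reported running times can be turned into speedup factors directly. A small sketch, using the best DJ-REX3 time per project from the results table against the non-distributed server baseline (helper names are illustrative):

```python
# Speedup of DJ-REX3 over the non-distributed server run,
# using the h:mm:ss times reported in the results table.

def to_seconds(hms):
    """Convert an 'h:mm:ss' string to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

server_baseline = {"Datatools": "0:34:14", "BIRT": "2:05:55", "Eclipse": "12:35:34"}
djrex3_best = {"Datatools": "0:14:32", "BIRT": "0:45:16", "Eclipse": "3:49:05"}

speedups = {p: round(to_seconds(server_baseline[p]) / to_seconds(djrex3_best[p]), 2)
            for p in server_baseline}
# e.g., the largest repository (Eclipse) shows the largest speedup
```

Note how the speedup grows with repository size, which anticipates the scalability conclusion on the next slide.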
2. Efficiency (cont’d)
• Preprocessing time is the time needed for the non-distributed phases.
• The data-copying time increases when adding nodes.
• The fully distributed DJ-REX3 is the most efficient.
Comparison of the running time of the 3 flavors of DJ-REX for BIRT
3. Scalability
• The bigger the repository, the more time can be saved by Hadoop.
• Hadoop scales well for different numbers of nodes.
• The overhead of copying input data to another node can outweigh the benefit of parallelizing tasks across nodes.
Running time comparison for BIRT and Datatools with DJ-REX3
4. Flexibility
• Hadoop runs on many different platforms (Windows, Mac, Unix, etc.).
• In this experiment, several different machines (two desktops and two servers) were used.
• Load balancing in Hadoop ensures a fair distribution of work.
Conclusion
• It is easy to migrate existing mining algorithms to a distributed system.
• The big data analysis technique MapReduce can improve the analysis speed.
• There is a data distribution overhead, which determines the optimal number of nodes.
• Adding a machine is very easy, which means the approach is scalable.
Discussion
• The MapReduce approach can also be used for splitting mailing lists, mapping bug reports, etc.
• Copying into HDFS can be an overhead; finding the optimal Hadoop configuration is future work.
My Research Area
[Diagram] Development history: bug reports (e.g., Bug 22, Bug 31) linked to fix commits; Bug 22 has a single fix commit, while Bug 31 has an initial patch followed by supplementary patches (several “Fix #31” commits).
• Type 1 bugs: bug IDs that were mentioned in only one fix commit.
• Type 2 bugs: bug IDs that were mentioned in multiple fix revisions.
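The classification above is easy to operationalize: scan the commit logs, count how many fix commits mention each bug ID, and label the ID accordingly. A minimal sketch, assuming a simple "Fix #N" commit-message convention (the regex and sample logs are illustrative, not from the actual study):

```python
import re
from collections import Counter

# Sketch: classify bug IDs as Type 1 (one fix commit) or
# Type 2 (multiple fix revisions) from commit logs.

BUG_ID = re.compile(r"fix #(\d+)", re.IGNORECASE)

def classify_bugs(commit_logs):
    """Map each bug ID to 'type1' or 'type2' by counting fix commits."""
    mentions = Counter()
    for log in commit_logs:
        for bug_id in set(BUG_ID.findall(log)):  # one count per commit
            mentions[bug_id] += 1
    return {bug: ("type1" if n == 1 else "type2") for bug, n in mentions.items()}

logs = ["Fix #22: null check",
        "Fix #31: initial patch",
        "Fix #31: supplementary patch",
        "Fix #31: another follow-up"]
# classify_bugs(logs) -> {"22": "type1", "31": "type2"}
```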
My Research Area
• Empirical results
– A considerable portion of bugs requires supplementary patches.
– Type 2 bugs are more severe.
– Type 2 bugs take a longer time to be fixed.
– Incomplete patches are larger in size and more scattered.
• Can previous prediction approaches be used to predict supplementary change locations?
– Code clones
– Co-change
– Structural relationships (inheritance, method calls, etc.)
My Research Area
• Existing approaches are not enough to predict supplementary change locations!
– Only a small portion of supplementary patches are code clones of the initial patch.
– A considerable portion of supplementary patches cannot be predicted using structural dependencies (e.g., method calls, inheritance).
– Historical co-change has low precision for this prediction.
• How can we predict additional change locations?
• Anyone who is interested in the MSR area, feel free to contact me.
Thank you for listening
C-REX Change schema
Three DJ-REX approaches
[Diagram] J-REX and the three DJ-REX variants, each with Extraction, Parsing, and Analysis phases.
• DJ-REX1: Use MapReduce for the analysis phase.
• DJ-REX2: Use MapReduce for the parsing and analysis phases.
• DJ-REX3: Use MapReduce for all phases.
Running time of the basic J-REX on a desktop and server machine, and of DJ-REX3 on 3 virtual machines on the same server machine