Efficiently Mining Source Code with Boa
Robert Dyer
The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.
Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen
What do I mean by software repository?
What features do they have?
What do I mean by mining software repositories (MSR)?
What are some examples of software repository mining?
What is the most used programming language?
How many words are in commit messages?
Words[] = update, 30715
Words[] = cleanup, 19073
Words[] = updated, 18737
Words[] = refactoring, 11981
Words[] = fix, 11705
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
How has unit testing evolved over time?
JUnit 4 release
What makes this ultra-large-scale mining?
Previous examples queried...
Projects 699,331
Code Repositories 494,158
Revisions 15,063,073
Unique Files 69,863,970
File Snapshots 147,074,540
AST Nodes 18,651,043,23
Over 250GB of pre-processed data
What does bringing BIGDATA to the masses mean?
How has unit testing evolved over time?
How can we solve this task?
[Flowchart: for each project, mine project metadata. Has repository? If yes, access repository and mine revisions, find all source files and mine sources, then find all methods. Method has @Test? If yes, add to results.]
Challenge: Volume
Projects 699,331
Code Repositories 494,158
Revisions 15,063,073
Unique Files 69,863,970
File Snapshots 147,074,540
AST Nodes 18,651,043,23
How do you:
Find such a large dataset?
Access this data?
Store the data?
Transform the data for analysis?
Efficiently analyze the data?
Challenge: Velocity
Challenge: Variety
Ultra-large-scale Software Repository Mining
The Boa Experience
[ICSE'14][ICSE'13][GPCE'13][SPLASH'13 SRC][TOSEM] (under review)
Boa's Architecture
Boa's Data Infrastructure: Replicate and Transform, cache, Stored on cluster
Boa's Computing Infrastructure: User submits query, Compiled into Hadoop program, Deployed and executed on cluster, Query result returned via web
Challenge: Volume
Challenge: Velocity
Challenge: Variety
A better solution...
Tests: output sum[timestamp] of int;
cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
Automatically parallelized
Analyzes 18 billion AST nodes in minutes
Only 10 lines of code
No external libraries
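The annotation filter in the query rests on one regular expression. As a rough illustration only, the Python sketch below applies the same pattern to a few annotation names. The helper name is_junit_test_annotation and the sample names are hypothetical, and Python's re module stands in for Boa's match function on the assumption that the two treat this anchored pattern the same way.

```python
import re

# Same pattern as in the Boa query: an optional "org.junit." prefix,
# followed by exactly "Test" and nothing else.
JUNIT_TEST = re.compile(r"^(org\.junit\.)?Test$")


def is_junit_test_annotation(annotation_name: str) -> bool:
    """Hypothetical helper: True for 'Test' or 'org.junit.Test' only."""
    return JUNIT_TEST.search(annotation_name) is not None


# Sample annotation names (made up for illustration).
for name in ["Test", "org.junit.Test", "Testable", "org.testng.Test"]:
    print(name, is_junit_test_annotation(name))
```

Note that an unqualified Test is accepted without checking imports, so a simple-name annotation from another framework would also be counted under this pattern.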
How has unit testing evolved over time?
Step 1: declare the output variable, Tests: output sum[timestamp] of int;
Step 2: traverse the input project, visit(input, visitor { ... });
Step 3: handle every modifier, before n: Modifier -> ...
Step 4: keep only annotations, n.kind == ModifierKind.ANNOTATION
Step 5: keep only JUnit test annotations, match(`^(org\.junit\.)?Test$`, n.annotation_name)
Step 6: emit one count per match, Tests[cur_time] << 1;
Step 7: track the time of the current revision, cur_time: timestamp; and before n: Revision -> cur_time = n.commit_date;
[Figure: the dataset is split by project (input = project1, input = project2, input = project3, ..., input = projectn) and a copy of the same Boa program runs on each one. Each copy emits values to the Tests output variable, for example Tests[631152000] << 1; and Tests[631154020] << 1;.]
Emitted values:
631152000, 1
631154020, 1
631152000, 1
631154020, 1
631154020, 1
631161103, 1
Output:
Tests[631152000] = 5
Tests[631154020] = 12
Tests[631161103] = 14
Tests[631172392] = 18
. . .
Automatic Parallelization
Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.
Compiler generates Hadoop MapReduce code
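To make the sum aggregator concrete, here is a minimal Python sketch of the idea, not Boa's generated Hadoop code. It assumes each project run emits (timestamp, 1) pairs, which a reduce step then sums per timestamp. The function names run_on_project and sum_aggregator and the three sample projects are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One emitted value: (timestamp, value), modelling Tests[cur_time] << 1.
Emit = Tuple[int, int]


def run_on_project(test_commit_times: Iterable[int]) -> List[Emit]:
    """Map step (hypothetical): emit 1 for each @Test annotation found,
    indexed by the commit time of the revision that contains it."""
    return [(t, 1) for t in test_commit_times]


def sum_aggregator(emits: Iterable[Emit]) -> Dict[int, int]:
    """Reduce step: add up all values that share the same index."""
    totals: Dict[int, int] = defaultdict(int)
    for timestamp, value in emits:
        totals[timestamp] += value
    return dict(totals)


# Three made-up projects; each list holds commit times of @Test annotations.
projects = [
    [631152000, 631154020],
    [631152000, 631154020, 631154020],
    [631161103],
]

# Each project is processed independently, so this loop could run in parallel.
all_emits = [emit for project in projects for emit in run_on_project(project)]
totals = sum_aggregator(all_emits)
print(totals)
```

Because the map step touches one project at a time and addition is order-independent, the per-project work can be distributed freely, which is the property the compiler relies on.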
Abstracting MSR with Types
Custom domain-specific types for mining software repositories
5 base types and 9 types for source code
No need to understand multiple data formats or APIs
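[Diagram: base types. Project (1) has CodeRepository (1..*); CodeRepository (1) has Revision (*); Revision (1) has ChangedFile (*); ChangedFile (1) has ASTRoot (0..1).]
[Diagram: source code types. ASTRoot contains Namespace, which contains Declaration; a Declaration (1) contains Method (*), Variable (*), and Type (*); Statement and Expression sit below these.]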
Challenge: How can we make mining source code easier?
Answer: Declarative Visitors
Background: Visitor Pattern
Without visitors:
Rectangle: draw(Graphics g), scale(int x, int y)
Circle: draw(Graphics g), scale(int x, int y)
Triangle: draw(Graphics g), scale(int x, int y)
With visitors:
Rectangle: accept(Visitor v)
Circle: accept(Visitor v)
Triangle: accept(Visitor v)
DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
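The shape example can be sketched in Python as follows. This is an illustrative adaptation rather than the code behind the slide: Python has no method overloading, so the visit methods carry the shape name (visit_rectangle and so on), and the fields and the behaviour of each visitor are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    width: int
    height: int

    def accept(self, visitor):
        # Double dispatch: the shape picks the matching visit method.
        return visitor.visit_rectangle(self)


@dataclass
class Circle:
    radius: int

    def accept(self, visitor):
        return visitor.visit_circle(self)


@dataclass
class Triangle:
    base: int
    height: int

    def accept(self, visitor):
        return visitor.visit_triangle(self)


class DrawVisitor:
    """Returns a text description instead of drawing on a Graphics object."""

    def visit_rectangle(self, r):
        return f"rectangle {r.width}x{r.height}"

    def visit_circle(self, c):
        return f"circle r={c.radius}"

    def visit_triangle(self, t):
        return f"triangle base={t.base} height={t.height}"


class ScaleVisitor:
    """Scales shapes in place by factors x and y."""

    def __init__(self, x: int, y: int):
        self.x = x
        self.y = y

    def visit_rectangle(self, r):
        r.width *= self.x
        r.height *= self.y

    def visit_circle(self, c):
        c.radius *= self.x  # assumption: a circle scales by x only

    def visit_triangle(self, t):
        t.base *= self.x
        t.height *= self.y


shapes = [Rectangle(2, 3), Circle(1), Triangle(4, 5)]
for shape in shapes:
    shape.accept(ScaleVisitor(2, 2))
drawn = [shape.accept(DrawVisitor()) for shape in shapes]
print(drawn)
```

Adding a new operation now means adding one visitor class, with no change to the shape classes.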
Easing Source Code Mining with Visitors
id := visitor {
    before T -> statement;
    after T -> statement;
};
visit(node, id);

id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};
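The before and after clauses can be modelled in a few lines of Python. This is a simplified sketch of the semantics, not Boa's implementation. The Node and Visitor classes and the node kinds are hypothetical, and it assumes that the wildcard clause applies only to types with no clause of their own and that children are traversed depth-first by default.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Node:
    kind: str  # stands in for the node's type, e.g. "Declaration"
    children: List["Node"] = field(default_factory=list)


Clause = Callable[[Node], None]


@dataclass
class Visitor:
    before: Dict[str, Clause] = field(default_factory=dict)
    after: Dict[str, Clause] = field(default_factory=dict)


def visit(node: Node, v: Visitor) -> None:
    """Run the before clause, traverse children, then run the after clause."""
    clause = v.before.get(node.kind, v.before.get("_"))  # "_" is the wildcard
    if clause is not None:
        clause(node)
    for child in node.children:  # default depth-first traversal
        visit(child, v)
    clause = v.after.get(node.kind, v.after.get("_"))
    if clause is not None:
        clause(node)


log: List[str] = []


def named(n: Node) -> None:
    log.append(f"enter {n.kind}")


def other(n: Node) -> None:
    log.append("enter other")


def leave_root(n: Node) -> None:
    log.append("leave ASTRoot")


tree = Node("ASTRoot", [Node("Namespace", [
    Node("Declaration", [Node("Method"), Node("Variable")]),
])])

# One clause shared by two types (like "before T2, T3"), plus a wildcard.
v = Visitor(
    before={"Declaration": named, "Method": named, "_": other},
    after={"ASTRoot": leave_root},
)
visit(tree, v)
print(log)
```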
[Diagram: the AST traversed by a visitor: ASTRoot, Namespace, Declaration, then Method, Variable, and Type, then Statement and Expression.]
[Diagram: the same AST, with a custom traversal below Declaration that reaches only Variable.]
before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}
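The difference between the two clauses above can be seen in a small Python model. It is a sketch under assumptions, not Boa's traversal engine: a before clause signals stop by returning a STOP marker, fields are modelled as children of kind "Variable", and all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)


STOP = "stop"  # marker a clause returns to suppress the default traversal
Clause = Callable[[Node], Optional[str]]


def visit(node: Node, before: Dict[str, Clause]) -> None:
    clause = before.get(node.kind)
    if clause is not None and clause(node) == STOP:
        return  # the clause did its own traversal; skip the children
    for child in node.children:
        visit(child, before)


def run(with_stop: bool) -> List[str]:
    """Visit a small tree and return the kinds of the leaves reached."""
    seen: List[str] = []
    before: Dict[str, Clause] = {}

    def on_leaf(n: Node) -> None:
        seen.append(n.kind)

    def on_declaration(n: Node) -> Optional[str]:
        # Custom traversal: visit only the fields of the declaration.
        for child in n.children:
            if child.kind == "Variable":
                visit(child, before)
        return STOP if with_stop else None

    before.update({
        "Declaration": on_declaration,
        "Method": on_leaf,
        "Variable": on_leaf,
    })
    tree = Node("ASTRoot", [Node("Namespace", [
        Node("Declaration", [Node("Method"), Node("Variable")]),
    ])])
    visit(tree, before)
    return seen


print(run(with_stop=False))  # fields reached twice, the method once
print(run(with_stop=True))   # only the fields are reached
```

Without the stop, the default traversal still runs after the clause, so the fields are visited a second time along with everything else under the declaration.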
Summary
Ultra-large-scale software repository mining poses several challenges
Boa provides abstractions to address these challenges
Automatically parallelizes queries
Domain-specific language, types, and functions to make mining software repositories easier
Ultra-large-scale dataset with almost 700k projects
Boa's Global Impact
90+ users from over 20 countries!