TRANSCRIPT
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
Software Evolution and Architecture Lab
University of Zurich, Switzerland
{alexandru,panichella,gall}@ifi.uzh.ch
22.02.2017
SANER’17
Klagenfurt, Austria
The Problem Domain
• Static analysis (e.g. #Attr., McCabe, coupling...)
• Many revisions, fine-grained historical data
[timeline: v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0]
A Typical Analysis Process
select project → clone from www → select revision → checkout → apply analysis tool → store analysis results → more revisions? → more projects?
Redundancies all over...
Redundancies in historical code analysis

Across Revisions:
• Few files change
• Only small parts of a file change
• Changes may not even affect results

Across Languages:
• Each language has its own toolchain
• Yet they share many metrics

Impact on Code Study Tools:
• Repeated analysis of "known" code
• Storing redundant results
• Re-implementing identical analyses
• Generalizability is expensive
Important!
Techniques implemented in LISA vs. your favourite analysis features: pick what you like!
#1: Avoid Checkouts

With checkouts: clone → checkout → read/write → read → analyze
For every file: 2 read ops + 1 write op. Checkout includes irrelevant files. Need 1 CWD for every revision to be analyzed in parallel.
Avoid checkouts: clone → read → analyze
Only read relevant files in a single read op. No write ops. No overhead for parallelization.

Git ↔ File Abstraction Layer ↔ Analysis Tool

E.g. for the JDK Compiler:

    class JavaSourceFromCharArray(name: String, val code: CharBuffer)
        extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {
      override def getCharContent(ignoreEncodingErrors: Boolean): CharSequence = code
    }
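The idea behind the file abstraction layer can be sketched in Python (illustrative names only; LISA itself is a Scala tool reading Git's object database directly). A content-addressed blob store stands in for Git: identical file contents are stored once, each revision's tree maps paths to blob ids, and the analysis reads exactly the files it needs with one read op each and no writes.

```python
import hashlib

# Minimal stand-in for Git's object database: blobs are stored once,
# keyed by content hash; each revision's tree maps paths to blob ids.
class BlobStore:
    def __init__(self):
        self.blobs = {}

    def put(self, content):
        oid = hashlib.sha1(content.encode()).hexdigest()
        self.blobs[oid] = content  # identical content is stored only once
        return oid

    def get(self, oid):
        return self.blobs[oid]     # a single read op, no checkout needed

store = BlobStore()
rev1 = {"src/Main.java": store.put("class Main {}")}
rev2 = {"src/Main.java": store.put("class Main { int x; }"),
        "src/Util.java": store.put("class Util {}")}

# Analyze only the relevant file of each revision, straight from the store:
sources = [store.get(tree["src/Main.java"]) for tree in (rev1, rev2)]
```

No working directory is ever materialized, so any number of revisions can be analyzed in parallel from the same store.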
#2: Use a multi-revision representation of your sources

Merge ASTs: instead of keeping a separate tree for rev. 1, rev. 2, rev. 3 and rev. 4, merge them into a single graph in which unchanged subtrees are shared and each node carries the revision range in which it exists (e.g. rev. range [1-2], rev. range [1-4]).

AspectJ (~440k LOC): 1 commit: 2.2M nodes. All >7000 commits: 6.5M nodes.
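The merging can be sketched as follows (a simplification of LISA's actual graph; node labels are illustrative): each node records the set of revisions in which it exists, so a subtree that is identical across revisions 1-4 is stored once rather than four times.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.revs = set()        # revisions in which this node exists
        self.children = {}       # child label -> Node

    def merge(self, label_path, rev):
        """Insert one AST path for one revision, reusing existing nodes."""
        self.revs.add(rev)
        if not label_path:
            return
        head, rest = label_path[0], label_path[1:]
        child = self.children.setdefault(head, Node(head))
        child.merge(rest, rev)

    def size(self):
        return 1 + sum(c.size() for c in self.children.values())

root = Node("CompilationUnit")
for rev in (1, 2, 3, 4):                        # unchanged across revisions
    root.merge(["Class:Demo", "Method:run"], rev)
root.merge(["Class:Demo", "Method:helper"], 4)  # added only in rev. 4
```

Four revisions of the file cost 4 nodes instead of 4 separate trees; this is why all >7000 AspectJ commits fit in 6.5M nodes while a single commit alone already needs 2.2M.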
#3: Store AST nodes only if they're needed for analysis

    public class Demo {
        public void run() {
            for (int i = 1; i < 100; i++) {
                if (i % 3 == 0 || i % 5 == 0) {
                    System.out.println(i);
                }
            }
        }
    }

What's the complexity (1+#forks) and name for each method and class?

Full parse: 140 AST nodes (using ANTLR), e.g. CompilationUnit → TypeDeclaration → Modifiers → public; Members → Method → Modifiers, Name (run), Parameters, ReturnType → PrimitiveType → VOID, Body → Statements → ...

Filtered parse: 7 AST nodes (using ANTLR): TypeDeclaration, Name (Demo), Method, Name (run), ForStatement, IfStatement, ConditionalExpression.
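A filtered parse can be approximated with Python's own `ast` module standing in for ANTLR (node type names therefore differ from the slide; the Demo class is transliterated to Python): keep only the node types a registered analysis cares about, then compute complexity as 1 + the number of fork nodes among what was retained.

```python
import ast

SOURCE = """
class Demo:
    def run(self):
        for i in range(1, 100):
            if i % 3 == 0 or i % 5 == 0:
                print(i)
"""

# Only these node types are needed to answer "name" and
# "complexity (1 + #forks)" for each method and class.
RELEVANT = (ast.ClassDef, ast.FunctionDef, ast.For, ast.While, ast.If, ast.BoolOp)

tree = ast.parse(SOURCE)
all_nodes = list(ast.walk(tree))                 # full parse
kept = [n for n in all_nodes if isinstance(n, RELEVANT)]  # filtered parse

# Complexity of run(): 1 + forks among its retained descendants.
run = next(n for n in kept if isinstance(n, ast.FunctionDef))
forks = sum(isinstance(n, (ast.For, ast.While, ast.If, ast.BoolOp))
            for n in ast.walk(run))
complexity = 1 + forks
```

The `for`, the `if` and the `or` are the three forks, so `run` has complexity 4; everything else in the tree never needs to be materialized.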
#4: Use non-duplicative data structures to store your results

Instead of storing a label/#attr/mcc record for each of rev. 1 through rev. 4, store one record per revision range in which the values stay constant, e.g. for InnerClass: [1-1] label/#attr/mcc, [2-3] label/#attr/mcc, [4-4] label/#attr/mcc.
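This storage scheme can be sketched as run-length compression over revisions (illustrative code, not LISA's on-disk format; the metric values are made up for the example): a new record is appended only when a value changes, so long runs of identical results collapse into a single revision range.

```python
def compress(values_per_rev):
    """Collapse consecutive equal metric values into (start, end, value) ranges."""
    ranges = []
    for rev, value in enumerate(values_per_rev, start=1):
        if ranges and ranges[-1][2] == value and ranges[-1][1] == rev - 1:
            start, _, v = ranges[-1]
            ranges[-1] = (start, rev, v)    # extend the open range
        else:
            ranges.append((rev, rev, value))
    return ranges

# #attr/mcc for a class across four revisions (hypothetical values):
history = [{"attr": 0, "mcc": 4}, {"attr": 1, "mcc": 2},
           {"attr": 1, "mcc": 2}, {"attr": 1, "mcc": 4}]
records = compress(history)   # three ranges: [1-1], [2-3], [4-4]
```

Four per-revision records shrink to three range records here; over thousands of mostly-unchanging revisions the savings dominate.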
LISA also does:
#5: Parallel parsing
#6: Asynchronous graph computation
#7: Generic graph computations applying to ASTs from compatible languages
To Summarize...

The LISA Analysis Process
select project → clone from www → parallel parse into merged graph → async. compute (analysis formulated as a graph computation, runs on the graph) → store analysis results → more projects?

An ANTLRv4 grammar generates the parser. Language mappings (grammar to analysis) are used by both parsing and computation: they determine which AST nodes are loaded and which data is persisted.
How well does it work, then?

Marginal cost for +1 revision
Average parsing+computation time per revision when analyzing n revisions of AspectJ (10 common metrics):

n = 1: 41.670s | 10: 4.633s | 100: 0.525s | 1000: 0.109s | 2000: 0.082s | 3000: 0.071s | 4000: 0.052s | 5000: 0.041s | 6000: 0.032s | 7000: 0.033s
Overall Performance Stats
Language Java C# JavaScript
#Projects 100 100 100
#Revisions 646'261 489'764 204'301
#Files (parsed!) 3'235'852 3'234'178 507'612
#Lines (parsed!) 1'370'998'072 961'974'773 194'758'719
Total Runtime (RT)¹ 18:43h 52:12h 29:09h
Median RT¹ 2:15min 4:54min 3:43min
Tot. Avg. RT per Rev.² 84ms 401ms 531ms
Med. Avg. RT per Rev.² 30ms 116ms 166ms
¹ Including cloning and persisting results
² Excluding cloning and persisting results
What's the catch? (There are a few...)
The (not so) minor stuff
• Must implement analyses from scratch
• No help from a compiler
• Non-file-local analyses need some effort
• Moved files/methods etc. add overhead
• Uniquely identifying files/entities is hard
• (No impact on results, though)
Language matters
[chart comparing per-revision runtimes: JavaScript, C#, Java]
E.g.: JavaScript takes longer because:
• Larger files, less modularization
• Slower parser (automatic semicolon insertion)
LISA is EXTREME
complex, feature-rich, heavyweight ↔ simple, generic, lightweight
Thank you for your attention
SANER ‘17, Klagenfurt, 22.02.2017
Read the paper: http://t.uzh.ch/Fj
Try the tool: http://t.uzh.ch/Fk
Get the slides: http://t.uzh.ch/Fm
Contact me: [email protected]
Parallelize Parsing

Single Git tree traversal:
src/Main.java     {1: 1251a4}, {3: fc2452}, {4: 251929}
src/Settings.java {2: fa255a}
src/Foo.java      {1: 512fc2}, {4: 791c2a}, {5: bcb215}
src/Bar.java      {4: 8a23b2}, {5: b2399f}

Obtain the sequence of Git blob ids for old versions of each unique path. Parse files with different paths in parallel. Some files will have more revisions, taking longer to parse in total; parsing only takes roughly as long as required for the file with the most revisions.
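The scheme above can be sketched with a thread pool (illustrative only; LISA runs on the JVM): one task per unique path parses that path's blob sequence oldest to newest, so wall time is bounded roughly by the path with the most revisions.

```python
from concurrent.futures import ThreadPoolExecutor

# Blob ids per unique path, as produced by a single Git tree traversal:
blobs_by_path = {
    "src/Main.java":     ["1251a4", "fc2452", "251929"],
    "src/Settings.java": ["fa255a"],
    "src/Foo.java":      ["512fc2", "791c2a", "bcb215"],
    "src/Bar.java":      ["8a23b2", "b2399f"],
}

def parse_history(path, blob_ids):
    # Stand-in for parsing each revision of one file into the merged graph;
    # revisions of the same path stay ordered within one task.
    return (path, [f"ast({oid})" for oid in blob_ids])

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(parse_history, p, b)
               for p, b in blobs_by_path.items()]
    parsed = dict(f.result() for f in futures)
```

Parsing within a path stays sequential (needed for merging consecutive revisions), while distinct paths proceed concurrently.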
"Speed-up factor" for each technique
• Parallel parsing: roughly 2
• Merged ASTs: >1000 for many revisions
• Filtered parsing: >10 during computation, depends on how much is filtered
• All depends on file sizes / parser speed
Asynchronous Graph Analysis
Depending on the node type:
- Signal specific data
- Collect specific data
- Store specific data
- Create new nodes & edges
Example - Method Count (NOM):
Signal: METHOD – nom: 1
Collect: CLASS-like – nom += s.nom
Store: CLASS-like – nom
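The NOM rule can be sketched as a bottom-up pass over the tree (a sequential stand-in for LISA's asynchronous graph computation; node labels and the nesting policy are illustrative): METHOD nodes signal nom = 1 upward, CLASS-like nodes collect the sum, and only CLASS-like nodes store a result.

```python
results = {}

def analyze(label, children=()):
    """Return the value signalled upward; store nom at CLASS-like nodes."""
    collected = sum(analyze(*c) for c in children)   # collect from children
    if label.startswith("METHOD"):
        return 1                                     # signal: nom = 1
    if label.startswith("CLASS"):
        results[label] = collected                   # store: nom at the class
        return collected   # assumption: nested classes re-signal their count
    return collected                                 # other nodes pass through

tree = ("CLASS:Demo", [("METHOD:run", []),
                       ("METHOD:stop", []),
                       ("CLASS:Inner", [("METHOD:helper", [])])])
analyze(*tree)
```

Whether an inner class's methods count toward the outer class is a design choice of the mapping; here they do.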
Static source code analysis?

Simple Code Metrics (NOC, NOM, WMC, Complexity, ...)
Structure (Coupling, Inheritance, ...)

Practice: code smells, refactoring advice, hot-spot detection, bug prediction
Research: understanding software evolution, identifying patterns & anti-patterns, code quality assessment techniques, ... → code studies
(Many) existing studies
• investigate a small number of projects (sources: github.com, openhub.net)
• analyze a few snapshots of multi-year projects
• focus on very few programming languages
Why?
Only a few languages | Only a few projects | Only a few snapshots

• Manual work
• Time & resource intensive analysis
• Each commit analyzed separately
• Many preconditions for analyses

Analysis tools are purpose-built, not designed for large-scale studies.
Possible solution: LISA
A fast, multi-purpose, graph-based analysis approach

#3: Use asynchronous graph analysis techniques
Rapid Analysis using LISA
A fast, general purpose code analysis tool aimed specifically at large scale:
• Only a few languages → all code is equal (given a parser)
• Only a few snapshots → all commits
• Manual work → «Point and Shoot»
• Time & resource intensive analysis → quick analysis of 1000s of commits
• Each commit analyzed separately → analyze all commits in parallel
• Many preconditions for analyses → analysis directly on source code
# of Commits: Linear Scaling
[chart: total runtime grows linearly with the number of commits]
Multi-Project Parallelization

AspectJ: 7642 commits, 440k LOC
Requirements: 20 GB memory, 45 min on 4 cores

Parallelization scenario: 10x Amazon EC2 r3.8xlarge (10x 244 GB memory, 10x 32 cores)
Potential to analyze 160 AspectJ-sized projects per hour
(Most projects on GitHub are much smaller)
Benchmarking
Is this good or bad? What does it even mean? Is this metric too high? Here's a code smell - should I fix it?
→ Create reference points for orientation
Cluster & Compare
• Discover "phenotypes"
  • By metric values
  • By metric evolution over time
• Compare programming languages / paradigms

Find interesting projects for further study & identify evolution patterns to find (un-)desirable ones
Machine Learning
Sequence of Changes (observable) + Bug Reports (training) → HMM → Bug-Proneness (hidden)

Leverage big data to train classifiers and predictors
Study Replication
Conclusions X, Y, Z: replicable with 1000s of projects?
→ Confirmation or rebuttal of existing assumptions