TRANSCRIPT
Carol V. Alexandru, Sebastiano Panichella, Harald C. Gall
Software Evolution and Architecture Lab
University of Zurich, Switzerland
{alexandru,panichella,gall}@ifi.uzh.ch
22.02.2017
SANER’17
Klagenfurt, Austria
The Problem Domain
• Static analysis (e.g. #Attr., McCabe, coupling...)
• Many revisions, fine-grained historical data
[timeline: v0.7.0 v1.0.0 v1.3.0 v2.0.0 v3.0.0 v3.3.0 v3.5.0]
A Typical Analysis Process
select project → clone from www → select revision → checkout → apply analysis tool → store analysis results → more revisions? → more projects?
Redundancies all over...
Redundancies in historical code analysis

Across Revisions:
• Few files change
• Only small parts of a file change
• Changes may not even affect results

Across Languages:
• Each language has its own toolchain
• Yet they share many metrics

Impact on Code Study Tools:
• Repeated analysis of "known" code
• Storing redundant results
• Re-implementing identical analyses
• Generalizability is expensive
Important!
Techniques implemented in LISA vs. your favourite analysis features: pick what you like!
#1: Avoid Checkouts

With checkouts: clone → checkout → read/write → read → analyze
For every file: 2 read ops + 1 write op. Checkout includes irrelevant files. Need 1 CWD for every revision to be analyzed in parallel.
Avoid checkouts: clone → read → analyze
Only read relevant files in a single read op. No write ops. No overhead for parallelization.

Git ↔ File Abstraction Layer ↔ Analysis Tool

E.g. for the JDK Compiler:

    class JavaSourceFromCharArray(name: String, val code: CharBuffer)
        extends SimpleJavaFileObject(URI.create("string:///" + name), Kind.SOURCE) {
      override def getCharContent(ignoreEncodingErrors: Boolean): CharSequence = code
    }
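The idea behind the file abstraction layer can be sketched in Python (illustrative names only; LISA itself is a Scala tool reading Git's object database directly). A content-addressed blob store stands in for Git: identical file contents are stored once, each revision's tree maps paths to blob ids, and the analysis reads exactly the files it needs with one read op each and no writes.

```python
import hashlib

# Minimal stand-in for Git's object database: blobs are stored once,
# keyed by content hash; each revision's tree maps paths to blob ids.
class BlobStore:
    def __init__(self):
        self.blobs = {}

    def put(self, content):
        oid = hashlib.sha1(content.encode()).hexdigest()
        self.blobs[oid] = content  # identical content is stored only once
        return oid

    def get(self, oid):
        return self.blobs[oid]     # a single read op, no checkout needed

store = BlobStore()
rev1 = {"src/Main.java": store.put("class Main {}")}
rev2 = {"src/Main.java": store.put("class Main { int x; }"),
        "src/Util.java": store.put("class Util {}")}

# Analyze only the relevant file of each revision, straight from the store:
sources = [store.get(tree["src/Main.java"]) for tree in (rev1, rev2)]
```

No working directory is ever materialized, so any number of revisions can be analyzed in parallel from the same store.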
#2: Use a multi-revision representation of your sources

Merge ASTs: instead of keeping a separate tree for rev. 1, rev. 2, rev. 3 and rev. 4, merge them into a single graph in which unchanged subtrees are shared and each node carries the revision range in which it exists (e.g. rev. range [1-2], rev. range [1-4]).

AspectJ (~440k LOC): 1 commit: 2.2M nodes. All >7000 commits: 6.5M nodes.
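The merging can be sketched as follows (a simplification of LISA's actual graph; node labels are illustrative): each node records the set of revisions in which it exists, so a subtree that is identical across revisions 1-4 is stored once rather than four times.

```python
class Node:
    def __init__(self, label):
        self.label = label
        self.revs = set()        # revisions in which this node exists
        self.children = {}       # child label -> Node

    def merge(self, label_path, rev):
        """Insert one AST path for one revision, reusing existing nodes."""
        self.revs.add(rev)
        if not label_path:
            return
        head, rest = label_path[0], label_path[1:]
        child = self.children.setdefault(head, Node(head))
        child.merge(rest, rev)

    def size(self):
        return 1 + sum(c.size() for c in self.children.values())

root = Node("CompilationUnit")
for rev in (1, 2, 3, 4):                        # unchanged across revisions
    root.merge(["Class:Demo", "Method:run"], rev)
root.merge(["Class:Demo", "Method:helper"], 4)  # added only in rev. 4
```

Four revisions of the file cost 4 nodes instead of 4 separate trees; this is why all >7000 AspectJ commits fit in 6.5M nodes while a single commit alone already needs 2.2M.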
#3: Store AST nodes only if they're needed for analysis

    public class Demo {
        public void run() {
            for (int i = 1; i < 100; i++) {
                if (i % 3 == 0 || i % 5 == 0) {
                    System.out.println(i);
                }
            }
        }
    }

What's the complexity (1+#forks) and name for each method and class?

Full parse: 140 AST nodes (using ANTLR), e.g. CompilationUnit → TypeDeclaration → Modifiers → public; Members → Method → Modifiers, Name (run), Parameters, ReturnType → PrimitiveType → VOID, Body → Statements → ...

Filtered parse: 7 AST nodes (using ANTLR): TypeDeclaration, Name (Demo), Method, Name (run), ForStatement, IfStatement, ConditionalExpression.
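A filtered parse can be approximated with Python's own `ast` module standing in for ANTLR (node type names therefore differ from the slide; the Demo class is transliterated to Python): keep only the node types a registered analysis cares about, then compute complexity as 1 + the number of fork nodes among what was retained.

```python
import ast

SOURCE = """
class Demo:
    def run(self):
        for i in range(1, 100):
            if i % 3 == 0 or i % 5 == 0:
                print(i)
"""

# Only these node types are needed to answer "name" and
# "complexity (1 + #forks)" for each method and class.
RELEVANT = (ast.ClassDef, ast.FunctionDef, ast.For, ast.While, ast.If, ast.BoolOp)

tree = ast.parse(SOURCE)
all_nodes = list(ast.walk(tree))                 # full parse
kept = [n for n in all_nodes if isinstance(n, RELEVANT)]  # filtered parse

# Complexity of run(): 1 + forks among its retained descendants.
run = next(n for n in kept if isinstance(n, ast.FunctionDef))
forks = sum(isinstance(n, (ast.For, ast.While, ast.If, ast.BoolOp))
            for n in ast.walk(run))
complexity = 1 + forks
```

The `for`, the `if` and the `or` are the three forks, so `run` has complexity 4; everything else in the tree never needs to be materialized.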
#4: Use non-duplicative data structures to store your results

Instead of storing a label/#attr/mcc record for each of rev. 1 through rev. 4, store one record per revision range in which the values stay constant, e.g. for InnerClass: [1-1] label/#attr/mcc, [2-3] label/#attr/mcc, [4-4] label/#attr/mcc.
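This storage scheme can be sketched as run-length compression over revisions (illustrative code, not LISA's on-disk format; the metric values are made up for the example): a new record is appended only when a value changes, so long runs of identical results collapse into a single revision range.

```python
def compress(values_per_rev):
    """Collapse consecutive equal metric values into (start, end, value) ranges."""
    ranges = []
    for rev, value in enumerate(values_per_rev, start=1):
        if ranges and ranges[-1][2] == value and ranges[-1][1] == rev - 1:
            start, _, v = ranges[-1]
            ranges[-1] = (start, rev, v)    # extend the open range
        else:
            ranges.append((rev, rev, value))
    return ranges

# #attr/mcc for a class across four revisions (hypothetical values):
history = [{"attr": 0, "mcc": 4}, {"attr": 1, "mcc": 2},
           {"attr": 1, "mcc": 2}, {"attr": 1, "mcc": 4}]
records = compress(history)   # three ranges: [1-1], [2-3], [4-4]
```

Four per-revision records shrink to three range records here; over thousands of mostly-unchanging revisions the savings dominate.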
LISA also does:
#5: Parallel parsing
#6: Asynchronous graph computation
#7: Generic graph computations applying to ASTs from compatible languages
To Summarize...

The LISA Analysis Process
select project → clone from www → parallel parse into merged graph → async. compute (analysis formulated as a graph computation, runs on the graph) → store analysis results → more projects?

An ANTLRv4 grammar generates the parser. Language mappings (grammar to analysis) are used by both parsing and computation: they determine which AST nodes are loaded and which data is persisted.
How well does it work, then?

Marginal cost for +1 revision
Average parsing+computation time per revision when analyzing n revisions of AspectJ (10 common metrics):

n = 1: 41.670s | 10: 4.633s | 100: 0.525s | 1000: 0.109s | 2000: 0.082s | 3000: 0.071s | 4000: 0.052s | 5000: 0.041s | 6000: 0.032s | 7000: 0.033s
Overall Performance Stats
Language Java C# JavaScript
#Projects 100 100 100
#Revisions 646'261 489'764 204'301
#Files (parsed!) 3'235'852 3'234'178 507'612
#Lines (parsed!) 1'370'998'072 961'974'773 194'758'719
Total Runtime (RT)¹ 18:43h 52:12h 29:09h
Median RT¹ 2:15min 4:54min 3:43min
Tot. Avg. RT per Rev.² 84ms 401ms 531ms
Med. Avg. RT per Rev.² 30ms 116ms 166ms
¹ Including cloning and persisting results
² Excluding cloning and persisting results
What's the catch? (There are a few...)
The (not so) minor stuff
• Must implement analyses from scratch
• No help from a compiler
• Non-file-local analyses need some effort
• Moved files/methods etc. add overhead
• Uniquely identifying files/entities is hard
• (No impact on results, though)
Language matters
[chart comparing per-revision runtimes: JavaScript, C#, Java]
E.g.: JavaScript takes longer because:
• Larger files, less modularization
• Slower parser (automatic semicolon insertion)
LISA is EXTREME
complex, feature-rich, heavyweight ↔ simple, generic, lightweight
Thank you for your attention
SANER ‘17, Klagenfurt, 22.02.2017
Read the paper: http://t.uzh.ch/Fj
Try the tool: http://t.uzh.ch/Fk
Get the slides: http://t.uzh.ch/Fm
Contact me: [email protected]
Parallelize Parsing

Single Git tree traversal:
src/Main.java     {1: 1251a4}, {3: fc2452}, {4: 251929}
src/Settings.java {2: fa255a}
src/Foo.java      {1: 512fc2}, {4: 791c2a}, {5: bcb215}
src/Bar.java      {4: 8a23b2}, {5: b2399f}

Obtain the sequence of Git blob ids for old versions of each unique path. Parse files with different paths in parallel. Some files will have more revisions, taking longer to parse in total; parsing only takes roughly as long as required for the file with the most revisions.
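The scheme above can be sketched with a thread pool (illustrative only; LISA runs on the JVM): one task per unique path parses that path's blob sequence oldest to newest, so wall time is bounded roughly by the path with the most revisions.

```python
from concurrent.futures import ThreadPoolExecutor

# Blob ids per unique path, as produced by a single Git tree traversal:
blobs_by_path = {
    "src/Main.java":     ["1251a4", "fc2452", "251929"],
    "src/Settings.java": ["fa255a"],
    "src/Foo.java":      ["512fc2", "791c2a", "bcb215"],
    "src/Bar.java":      ["8a23b2", "b2399f"],
}

def parse_history(path, blob_ids):
    # Stand-in for parsing each revision of one file into the merged graph;
    # revisions of the same path stay ordered within one task.
    return (path, [f"ast({oid})" for oid in blob_ids])

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(parse_history, p, b)
               for p, b in blobs_by_path.items()]
    parsed = dict(f.result() for f in futures)
```

Parsing within a path stays sequential (needed for merging consecutive revisions), while distinct paths proceed concurrently.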
"Speed-up factor" for each technique
• Parallel parsing: roughly 2
• Merged ASTs: >1000 for many revisions
• Filtered parsing: >10 during computation, depends on how much is filtered
• All depends on file sizes / parser speed
Asynchronous Graph Analysis
Depending on the node type:
- Signal specific data
- Collect specific data
- Store specific data
- Create new nodes & edges
Example - Method Count (NOM):
Signal: METHOD – nom: 1
Collect: CLASS-like – nom += s.nom
Store: CLASS-like – nom
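The NOM rule can be sketched as a bottom-up pass over the tree (a sequential stand-in for LISA's asynchronous graph computation; node labels and the nesting policy are illustrative): METHOD nodes signal nom = 1 upward, CLASS-like nodes collect the sum, and only CLASS-like nodes store a result.

```python
results = {}

def analyze(label, children=()):
    """Return the value signalled upward; store nom at CLASS-like nodes."""
    collected = sum(analyze(*c) for c in children)   # collect from children
    if label.startswith("METHOD"):
        return 1                                     # signal: nom = 1
    if label.startswith("CLASS"):
        results[label] = collected                   # store: nom at the class
        return collected   # assumption: nested classes re-signal their count
    return collected                                 # other nodes pass through

tree = ("CLASS:Demo", [("METHOD:run", []),
                       ("METHOD:stop", []),
                       ("CLASS:Inner", [("METHOD:helper", [])])])
analyze(*tree)
```

Whether an inner class's methods count toward the outer class is a design choice of the mapping; here they do.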
Static source code analysis?

Simple Code Metrics (NOC, NOM, WMC, Complexity, ...)
Structure (Coupling, Inheritance, ...)

Practice: code smells, refactoring advice, hot-spot detection, bug prediction
Research: understanding software evolution, identifying patterns & anti-patterns, code quality assessment techniques, ... → code studies
(Many) existing studies
• investigate a small number of projects (sources: github.com, openhub.net)
• analyze a few snapshots of multi-year projects
• focus on very few programming languages
Why?
Only a few languages | Only a few projects | Only a few snapshots

• Manual work
• Time & resource intensive analysis
• Each commit analyzed separately
• Many preconditions for analyses

Analysis tools are purpose-built, not designed for large-scale studies.
Possible solution: LISA
A fast, multi-purpose, graph-based analysis approach

#3: Use asynchronous graph analysis techniques
Rapid Analysis using LISA
A fast, general purpose code analysis tool aimed specifically at large scale:
• Only a few languages → all code is equal (given a parser)
• Only a few snapshots → all commits
• Manual work → «Point and Shoot»
• Time & resource intensive analysis → quick analysis of 1000s of commits
• Each commit analyzed separately → analyze all commits in parallel
• Many preconditions for analyses → analysis directly on source code
# of Commits: Linear Scaling
[chart: total runtime grows linearly with the number of commits]
Multi-Project Parallelization

AspectJ: 7642 commits, 440k LOC
Requirements: 20 GB memory, 45 min on 4 cores

Parallelization scenario: 10x Amazon EC2 r3.8xlarge (10x 244 GB memory, 10x 32 cores)
Potential to analyze 160 AspectJ-sized projects per hour
(Most projects on GitHub are much smaller)
Benchmarking
Is this good or bad? What does it even mean? Is this metric too high? Here's a code smell - should I fix it?
→ Create reference points for orientation
Cluster & Compare
• Discover "phenotypes"
  • By metric values
  • By metric evolution over time
• Compare programming languages / paradigms

Find interesting projects for further study & identify evolution patterns to find (un-)desirable ones
Machine Learning
Sequence of Changes (observable) + Bug Reports (training) → HMM → Bug-Proneness (hidden)

Leverage big data to train classifiers and predictors
Study Replication
Conclusions X, Y, Z: replicable with 1000s of projects?
→ Confirmation or rebuttal of existing assumptions