cross-project build co-change prediction

Cross-Project Build Co-change Prediction

Shane McIntosh, Ahmed E. Hassan, Emad Shihab, David Lo, Xin Xia

[email protected] | @shane_mcintosh | shanemcintosh.org

Upload: shane-mcintosh

Post on 19-Jul-2015


TRANSCRIPT


What is a build system?

[Figure: source code (.tex, .c, .cc) is translated through intermediate files (.o, .dvi, .a) into deliverables (.exe, .pdf, .deb)]

Build systems describe how sources are translated into deliverables

The build system is at the heart of techniques like Continuous Integration (CI)

[Figure: CI pipeline with the stages Commit, Build, Test, Report, applied to commit 9719cf0 (.c, .mk)]

Report: Commit 9719cf0 was successfully integrated

“...nothing can be said to be certain, except death and taxes” - Benjamin Franklin

The Build “Tax”

An Empirical Study of Build Maintenance Effort

S. McIntosh, B. Adams, T. H. D. Nguyen, Y. Kamei, A. E. Hassan

[ICSE 2011]

Up to 27% of source changes require build changes, too!

Neglected build maintenance is a frequent cause of build breakage

[Figure: CI pipeline with the stages Commit, Build, Test, Report, applied to commit aedd38; files shown: .c, .mk]

Report: Commit aedd38 broke the build!

Neglected build maintenance can even impact end users

Not working due to linking of incorrect SQLite library version

When are build changes necessary?

Overview of the studied systems

29 years of historical data

Proprietary and open source systems

Grouping related changes according to the work items that they address

Changes: .c, .c, .c, .mk

Transactions: "Fix for bug #1234", "Add feature #2121", "Missed code in #2121"

Work items: 1234, 2121
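The grouping step above can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: the commit ids, file names, the `#<number>` message convention, and the `.mk` suffix rule for spotting build files are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical commit records: (commit id, message, changed files).
# The messages mirror the slide; the ids and file names are invented.
COMMITS = [
    ("c1", "Fix for bug #1234", ["parser.c"]),
    ("c2", "Add feature #2121", ["ui.c"]),
    ("c3", "Missed code in #2121", ["ui.c", "ui.mk"]),
]

# Assumption: work items are referenced in messages as "#<digits>".
WORK_ITEM = re.compile(r"#(\d+)")


def group_by_work_item(commits):
    """Merge the files of all commits that mention the same work item id."""
    items = defaultdict(set)
    for _commit_id, message, files in commits:
        for item_id in WORK_ITEM.findall(message):
            items[item_id].update(files)
    return dict(items)


def needs_build_change(files, build_suffixes=(".mk",)):
    """Label a work item: does it touch at least one build file?"""
    return any(name.endswith(build_suffixes) for name in files)


work_items = group_by_work_item(COMMITS)
labels = {item: needs_build_change(files) for item, files in work_items.items()}
```

Here the two commits that mention #2121 collapse into one work item, which is labelled as build co-changing because one of them touches a `.mk` file.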

We train classifiers to identify code changes that require build co-changes

[Figure: the changes (.c, .c, .c, .mk) of work items 1 and 2 are fed to a classification model, which outputs either "Build change necessary" or "No build change necessary"]

Prior work shows that within-project build co-change prediction can be accurate

Mining Co-Change Information to Understand when Build Changes are Necessary

S. McIntosh, B. Adams, M. Nagappan, A. E. Hassan

[ICSME 2014]

Build co-change classifiers can achieve an AUC of 0.60-0.88
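For readers unfamiliar with AUC, a minimal sketch of how it can be computed: AUC is the probability that a randomly chosen positive instance (here, a build co-changing work item) receives a higher score than a randomly chosen negative one, with ties counted as half. The scores and labels below are invented for illustration and are not data from the study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented classifier scores and true labels (1 = build change necessary).
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 0, 1, 0, 0]
example_auc = auc(scores, labels)  # 5 of the 6 positive/negative pairs are ordered correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the 0.60-0.88 range above should be read.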

However, a large amount of historical data was used to train the classifiers

What about new projects?

…or projects with poorly-recorded historical data?

Can we leverage these large corpora for the small ones?

How well do build co-change prediction models perform on sparse data?

[Figure: Precision, Recall, F1-score, and AUC on a 0 to 1 scale; series: 5%, 50%, 90%]

Challenge 1: Very small datasets tend to yield models that under-perform

How well do build co-change prediction models perform on other datasets?

[Figure: Precision, Recall, F1-score, and AUC on a 0 to 1 scale; series: Eclipse => Mozilla, Jazz => Mozilla, Lucene => Mozilla]

Challenge 2: Cross-project build co-change models tend to under-perform
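As a reminder of what the precision, recall, and F1-score axes measure, here is a small sketch using the standard definitions. The predictions and labels are made up for the example; `1` stands for "build change necessary".

```python
def precision_recall_f1(predicted, actual):
    """Standard precision, recall, and F1 for the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 written as 2TP / (2TP + FP + FN), equivalent to the harmonic mean.
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Invented data: 2 of 10 work items truly need a build change.
actual = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10

mixed_result = precision_recall_f1(predicted, actual)
negative_result = precision_recall_f1(always_negative, actual)
```

In this toy data, a model that labels every change as "no build change necessary" is right for 8 of 10 items yet scores zero on all three measures, which is why these measures are reported rather than plain accuracy.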

Domain-specific project characteristics may limit the applicability of cross-project models

[Figure: a classification model is built from the training corpus and applied to the testing corpus; the outcome is marked "?"]

Using transfer learning to provide some domain knowledge to the training corpus

Move some training data from target system to the training corpus

[Figure: a classification model is built from the augmented training corpus and applied to the testing corpus; the outcome is marked "?"]
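The corpus-mixing idea can be sketched as follows. The slides do not say how the target instances are chosen, so this sketch assumes the target data is in chronological order and moves the earliest fraction; the function name, the `target_fraction` parameter, and the toy instances are hypothetical.

```python
def build_training_corpus(source, target, target_fraction=0.1):
    """Move the earliest fraction of target instances into the training corpus.

    Returns (training, testing). Moved instances are removed from the
    testing corpus so that no instance is used for both purposes.
    Assumption: `target` is ordered oldest first.
    """
    if not 0.0 <= target_fraction <= 1.0:
        raise ValueError("target_fraction must be between 0 and 1")
    n_moved = int(len(target) * target_fraction)
    moved, testing = list(target[:n_moved]), list(target[n_moved:])
    return list(source) + moved, testing


# Toy corpora: tuples of (project, work item index); real instances would
# carry features and a build co-change label.
source = [("eclipse", i) for i in range(20)]
target = [("mozilla", i) for i in range(10)]

training, testing = build_training_corpus(source, target, target_fraction=0.2)
```

With `target_fraction=0.2`, two of the ten target instances join the twenty source instances for training, and the remaining eight are kept for testing.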

Challenge 3: Build co-changes are the minority

Only 8%-17% of changes are build co-changing

Use training corpus to find an appropriate threshold

Set aside the testing corpus

[Figure: the classification model is applied back to the training corpus, and some instances are incorrectly classified]

[Figure: the training corpus yields classification models 1, 2, …, N]

Ensemble of models used on the testing corpus
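One way to picture the threshold search and the ensemble is the sketch below. It is not the authors' algorithm: it assumes each model emits a score per instance, picks the cut-off that maximises F1 on the training corpus, and combines models by majority vote. The selection criterion, the voting rule, and all names and numbers are assumptions for illustration.

```python
def f1_at(threshold, scores, labels):
    """F1 of the rule 'predict positive when score >= threshold'."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(scores, labels):
    """Try each distinct training score as a cut-off; keep the best F1."""
    return max(sorted(set(scores)), key=lambda t: f1_at(t, scores, labels))


def ensemble_predict(models, instance):
    """Majority vote over (score_function, threshold) pairs."""
    votes = sum(1 for score_fn, t in models if score_fn(instance) >= t)
    return votes * 2 > len(models)


# Invented training scores; positives are the minority and score below 0.5.
train_scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.60]
train_labels = [0, 0, 0, 0, 1, 0, 1, 1]

default_f1 = f1_at(0.5, train_scores, train_labels)
threshold = best_threshold(train_scores, train_labels)
tuned_f1 = f1_at(threshold, train_scores, train_labels)

# Three toy "models" that score a numeric instance differently.
models = [
    (lambda x: x, threshold),
    (lambda x: x * 0.5, threshold),
    (lambda x: x * 1.5, threshold),
]
vote_high = ensemble_predict(models, 0.4)
vote_low = ensemble_predict(models, 0.2)
```

On this toy data the default cut-off of 0.5 catches only one of the three positives, while the tuned cut-off of 0.35 catches all three at the cost of one false positive, which is the kind of adjustment a minority class calls for.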

Evaluating our approach

Relative performance

Training configuration sensitivity (Source, Target)

Our approach outperforms baseline cross-project approaches

[Figure: worst measured F-score on a 0 to 1 scale for Eclipse, Jazz, Lucene, Mozilla, and the Average; series: Our approach, Ordinary cross-project, AdaBoost, TrAdaBoost]

37%-42% improvement

Our approach achieves similar results to within-project models

[Figure: worst measured F-score on a 0 to 1 scale for Eclipse, Jazz, Lucene, Mozilla, and the Average; series: Our approach, Within-project]

Only a 7% drop in performance

Evaluating our approach

Relative performance: 37%-42% improvement over baseline; only 7% drop of within-project F-measure

Training configuration sensitivity (Source, Target)

Additional data from the target system slowly improves classifier performance

[Figure: F-score; axis labels: Source, Target]

Evaluating our approach

Relative performance: 37%-42% improvement over baseline; only 7% drop of within-project F-measure

Training configuration sensitivity: F-score tends to improve as more target system data becomes available