Post on 15-Feb-2016

Page 1: Efficiently Mining Source Code with  Boa

Efficiently Mining Source Code with Boa

Robert Dyer

Tien N. Nguyen, Hridesh Rajan, Hoan Anh Nguyen

The research activities described in this talk were supported in part by the US National Science Foundation (NSF) grants CCF-13-49153, CCF-13-20578, TWC-12-23828, CCF-11-17937, CCF-10-17334, and CCF-10-18600.

Page 2: Efficiently Mining Source Code with  Boa

What do I mean by software repository?

Page 3: Efficiently Mining Source Code with  Boa


Page 4: Efficiently Mining Source Code with  Boa


What features do they have?

Page 5: Efficiently Mining Source Code with  Boa

What do I mean by mining software repositories (MSR)?

Page 6: Efficiently Mining Source Code with  Boa


Page 7: Efficiently Mining Source Code with  Boa

What are some examples of software repository mining?

Page 8: Efficiently Mining Source Code with  Boa


What is the most used programming language?

Page 9: Efficiently Mining Source Code with  Boa

How many words are in commit messages?

Words[] = update, 30715
Words[] = cleanup, 19073
Words[] = updated, 18737
Words[] = refactoring, 11981
Words[] = fix, 11705
Words[] = test, 9428
Words[] = typo, 9288
Words[] = updates, 7746
Words[] = javadoc, 6893
Words[] = bugfix, 6295
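The slide does not show the query behind these counts, so the following is only a rough Python sketch of the general idea: tally how often each word occurs across commit messages. The function name `count_words` and the sample messages are invented for illustration and are not part of Boa or its dataset.

```python
from collections import Counter
import re


def count_words(messages):
    """Count how often each lowercase word occurs across commit messages."""
    counts = Counter()
    for message in messages:
        # Split on anything that is not a letter or digit; drop empty tokens.
        counts.update(w for w in re.split(r"[^a-z0-9]+", message.lower()) if w)
    return counts


# Hypothetical commit messages, invented for illustration only.
sample_messages = [
    "Fix typo in README",
    "update build script",
    "Fix failing test",
    "Update javadoc",
]

word_counts = count_words(sample_messages)
for word, n in word_counts.most_common(3):
    # Printed in the same "Words[] = word, count" shape as the slide.
    print(f"Words[] = {word}, {n}")
```

On a single machine this loop is trivial; the talk's point is that the same tally over millions of revisions needs infrastructure the analyst should not have to build.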

Page 10: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

JUnit 4 release

Page 11: Efficiently Mining Source Code with  Boa

What makes this ultra-large-scale mining?

Page 12: Efficiently Mining Source Code with  Boa


Previous examples queried...

Projects 699,331

Code Repositories 494,158

Revisions 15,063,073

Unique Files 69,863,970

File Snapshots 147,074,540

AST Nodes 18,651,043,23

Over 250GB of pre-processed data

Page 13: Efficiently Mining Source Code with  Boa

What does bringing BIGDATA to the masses mean?

Page 14: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

How can we solve this task?

Page 15: Efficiently Mining Source Code with  Boa

[Flowchart: for each project, mine project metadata. If the project has a repository, access the repository, mine revisions, find all source files, mine sources, and find all methods. If a method has @Test, add it to the results.]

Page 16: Efficiently Mining Source Code with  Boa

[Flowchart repeated from page 15]

Challenge: Volume

Page 17: Efficiently Mining Source Code with  Boa


Challenge: Volume

Projects 699,331

Code Repositories 494,158

Revisions 15,063,073

Unique Files 69,863,970

File Snapshots 147,074,540

AST Nodes 18,651,043,23

How do you:

Find such a large dataset?
Transform the data for analysis?
Access this data?
Efficiently analyze the data?
Store the data?

Page 18: Efficiently Mining Source Code with  Boa

[Flowchart repeated from page 15]

Challenge: Velocity

Page 19: Efficiently Mining Source Code with  Boa


Challenge: Velocity

Page 20: Efficiently Mining Source Code with  Boa


Challenge: Velocity

Page 21: Efficiently Mining Source Code with  Boa

[Flowchart repeated from page 15]

Challenge: Variety

Page 22: Efficiently Mining Source Code with  Boa


Challenge: Variety

Page 23: Efficiently Mining Source Code with  Boa

Ultra-large-scale Software Repository Mining

The Boa Experience

[ICSE'14] [ICSE'13] [GPCE'13] [SPLASH'13 SRC] [TOSEM] (under review)

Page 24: Efficiently Mining Source Code with  Boa

Boa's Architecture

Boa's Data Infrastructure: Replicate and Transform; cache; Stored on cluster

Boa's Computing Infrastructure: User submits query; Compiled into Hadoop program; Deployed and executed on cluster; Query result returned via web

Page 25: Efficiently Mining Source Code with  Boa

[Flowchart repeated from page 15]

Challenge: Volume

Challenge: Velocity

Challenge: Variety

Page 26: Efficiently Mining Source Code with  Boa

A better solution...

Tests: output sum[timestamp] of int;

cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Automatically parallelized

Analyzes 18 billion AST nodes in minutes

Only 10 lines of code

No external libraries

Page 27: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

Page 28: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

});

Page 29: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

before n: Modifier ->

});

Page 30: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

before n: Modifier -> if (n.kind == ModifierKind.ANNOTATION &&

});

Page 31: Efficiently Mining Source Code with  Boa


How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {

before n: Modifier -> if (n.kind == ModifierKind.ANNOTATION && match(`^(org\.junit\.)?Test$`, n.annotation_name))

});

Page 32: Efficiently Mining Source Code with  Boa

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

visit(input, visitor {
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Page 33: Efficiently Mining Source Code with  Boa

How has unit testing evolved over time?

Tests: output sum[timestamp] of int;

cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});
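For readers more used to general-purpose languages, here is a loose Python analogue of what the query computes: remember the commit time of the current revision, then add 1 to that timestamp's total for every annotation named Test or org.junit.Test. The dictionary layout of `revisions` and the function name are hypothetical stand-ins for Boa's Revision and Modifier nodes, and the loop runs sequentially, unlike the real query.

```python
from collections import defaultdict
import re

# Same pattern as the Boa query: "Test" or "org.junit.Test".
TEST_ANNOTATION = re.compile(r"^(org\.junit\.)?Test$")


def count_test_annotations(revisions):
    """Sum matching annotations per commit timestamp, like Tests[cur_time] << 1."""
    tests = defaultdict(int)
    for revision in revisions:
        cur_time = revision["commit_date"]  # mirrors: before n: Revision
        for annotation_name in revision["annotations"]:  # mirrors: before n: Modifier
            if TEST_ANNOTATION.match(annotation_name):
                tests[cur_time] += 1
    return dict(tests)


# Hypothetical, hand-written input; real Boa data comes from the hosted dataset.
revisions = [
    {"commit_date": 631152000, "annotations": ["Test", "Override"]},
    {"commit_date": 631154020, "annotations": ["org.junit.Test", "Test", "Deprecated"]},
]

tests = count_test_annotations(revisions)
print(tests)
```

The Python version has to spell out the iteration and the accumulator by hand; in the Boa query both are implied by `visit` and by the `sum` output variable.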

Page 34: Efficiently Mining Source Code with  Boa

Dataset

input = project1
input = project2
input = project3
. . .
input = projectn

Boa Program (one instance per input project)

Tests: output sum[timestamp] of int;

cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Emitted values

Tests[631152000] << 1;
Tests[631154020] << 1;

631152000, 1
631154020, 1
631152000, 1
631154020, 1
631154020, 1
631161103, 1

Output (Tests)

Tests[631152000] = 5
Tests[631154020] = 12
Tests[631161103] = 14
Tests[631172392] = 18
. . .
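The aggregation step in this diagram can be sketched in a few lines of Python: every program instance emits (timestamp, 1) pairs, and the `sum` output variable adds up the values that share a key. The function name `sum_aggregate` is illustrative only. The input is just the six pairs visible on the slide, so the totals are smaller than the slide's Output figures, which presumably reflect many more emitted values.

```python
from collections import defaultdict

# The six (timestamp, 1) pairs shown on the slide, as emitted by Tests[...] << 1.
emitted = [
    (631152000, 1),
    (631154020, 1),
    (631152000, 1),
    (631154020, 1),
    (631154020, 1),
    (631161103, 1),
]


def sum_aggregate(pairs):
    """Group emitted values by key and add them up, like a sum output variable."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


totals = sum_aggregate(emitted)
for timestamp in sorted(totals):
    print(f"Tests[{timestamp}] = {totals[timestamp]}")
```

Because addition is associative and commutative, the pairs can be summed in any order and in partial groups, which is what allows the emit step to run on many projects at once.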

Page 35: Efficiently Mining Source Code with  Boa

Automatic Parallelization

Tests: output sum[timestamp] of int;

cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Output variables with built-in aggregator functions: sum, mean, top(k), bottom(k), set, collection, etc.

Compiler generates Hadoop MapReduce code
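To give a feel for the aggregator names listed above, the sketch below applies rough Python counterparts to one list of values. It is an analogy by name only: Boa's actual aggregators run inside the generated MapReduce job and may differ in detail (for example in how top(k) and bottom(k) rank entries). The sample values are hypothetical.

```python
import heapq
from statistics import mean

# Hypothetical values sent to one output variable; not taken from the talk.
values = [5, 12, 14, 18, 12]

# Rough single-machine counterparts of the aggregator names on the slide.
aggregators = {
    "sum": sum,
    "mean": mean,
    "top(2)": lambda vs: heapq.nlargest(2, vs),
    "bottom(2)": lambda vs: heapq.nsmallest(2, vs),
    "set": lambda vs: sorted(set(vs)),  # duplicates removed
    "collection": list,                 # every value kept, in order
}

results = {name: fn(values) for name, fn in aggregators.items()}
for name, result in results.items():
    print(name, "->", result)
```

Choosing the aggregator in the output declaration is the only place the query author says how results are combined; the rest of the program just emits values.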

Page 36: Efficiently Mining Source Code with  Boa

Abstracting MSR with Types

Tests: output sum[timestamp] of int;

cur_time: timestamp;

visit(input, visitor {
    before n: Revision -> cur_time = n.commit_date;
    before n: Modifier ->
        if (n.kind == ModifierKind.ANNOTATION &&
            match(`^(org\.junit\.)?Test$`, n.annotation_name))
            Tests[cur_time] << 1;
});

Custom domain-specific types for mining software repositories

5 base types and 9 types for source code

No need to understand multiple data formats or APIs

Page 37: Efficiently Mining Source Code with  Boa


Abstracting MSR with Types

Project 1 -- 1..* CodeRepository

CodeRepository 1 -- * Revision

Revision 1 -- * ChangedFile

ChangedFile 1 -- 0..1 ASTRoot
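One way to picture this containment chain is as nested record types. The Python dataclasses below are a sketch under assumptions: the class names come from the slide and `commit_date` from the query, but every other field name is hypothetical, and the multiplicities follow one reading of the flattened diagram (a project has one or more repositories, a repository has many revisions, a revision has many changed files, and a changed file has at most one AST root).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ASTRoot:
    namespaces: List[str] = field(default_factory=list)  # placeholder field


@dataclass
class ChangedFile:
    name: str
    ast: Optional[ASTRoot] = None  # 0..1: a file may have no AST


@dataclass
class Revision:
    commit_date: int
    files: List[ChangedFile] = field(default_factory=list)  # *


@dataclass
class CodeRepository:
    url: str
    revisions: List[Revision] = field(default_factory=list)  # *


@dataclass
class Project:
    name: str
    code_repositories: List[CodeRepository] = field(default_factory=list)  # 1..*


# Hypothetical project with one repository, one revision, and two files.
project = Project("demo", [
    CodeRepository("http://example.org/repo", [
        Revision(631152000, [ChangedFile("A.java", ASTRoot()), ChangedFile("README")]),
    ]),
])

# Walk the chain and keep only the files that carry an AST.
parsed = [
    f.name
    for repo in project.code_repositories
    for rev in repo.revisions
    for f in rev.files
    if f.ast is not None
]
print(parsed)
```

The nested comprehension at the end is the kind of boilerplate walk that Boa's `visit` replaces.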

Page 38: Efficiently Mining Source Code with  Boa


Abstracting MSR with Types

[Class diagram of the source code types: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression. Containment edges are labeled with multiplicities 1, * and 1..*.]

Page 39: Efficiently Mining Source Code with  Boa


Challenge: How can we make mining source code easier?

Answer: Declarative Visitors

Page 40: Efficiently Mining Source Code with  Boa

Background: Visitor Pattern

Before (operations in each class):

Rectangle: draw(Graphics g), scale(int x, int y)
Triangle: draw(Graphics g), scale(int x, int y)
Circle: draw(Graphics g), scale(int x, int y)

After (operations in visitors):

Rectangle: accept(Visitor v)
Triangle: accept(Visitor v)
Circle: accept(Visitor v)

DrawVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
ScaleVisitor: visit(Rectangle r), visit(Circle c), visit(Triangle t)
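The classic pattern summarized on this slide can be sketched in Python as follows. Python has no method overloading, so the slide's overloaded visit(...) methods become `visit_rectangle`, `visit_circle`, and `visit_triangle`. The shape fields are hypothetical, and the draw visitor returns a text description rather than painting to a Graphics object.

```python
from dataclasses import dataclass


@dataclass
class Rectangle:
    width: float
    height: float

    def accept(self, visitor):
        return visitor.visit_rectangle(self)


@dataclass
class Circle:
    radius: float

    def accept(self, visitor):
        return visitor.visit_circle(self)


@dataclass
class Triangle:
    base: float
    height: float

    def accept(self, visitor):
        return visitor.visit_triangle(self)


class DrawVisitor:
    """Returns a text description instead of drawing to a Graphics object."""

    def visit_rectangle(self, r):
        return f"rectangle {r.width}x{r.height}"

    def visit_circle(self, c):
        return f"circle r={c.radius}"

    def visit_triangle(self, t):
        return f"triangle base={t.base} height={t.height}"


class ScaleVisitor:
    """Returns a scaled copy of the visited shape."""

    def __init__(self, x, y):
        self.x, self.y = x, y

    def visit_rectangle(self, r):
        return Rectangle(r.width * self.x, r.height * self.y)

    def visit_circle(self, c):
        # A circle has one radius, so this sketch only uses the x factor.
        return Circle(c.radius * self.x)

    def visit_triangle(self, t):
        return Triangle(t.base * self.x, t.height * self.y)


shapes = [Rectangle(2, 3), Circle(1), Triangle(4, 5)]
scaled = [s.accept(ScaleVisitor(2, 2)) for s in shapes]
descriptions = [s.accept(DrawVisitor()) for s in scaled]
print(descriptions)
```

Adding a third operation now means writing one new visitor class and leaving the shape classes untouched, which is the property the talk carries over to mining ASTs.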

Page 41: Efficiently Mining Source Code with  Boa

Easing Source Code Mining with Visitors

id := visitor {
    before T -> statement;
    after T -> statement;
};

visit(node, id);

Page 42: Efficiently Mining Source Code with  Boa

Easing Source Code Mining with Visitors

id := visitor {
    before id : T1 -> statement;
    before T2, T3 -> statement;
    before _ -> statement;
};
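The before/after clauses and the `_` wildcard can be mimicked in Python with a small tree walker. This is a sketch, not Boa's implementation: it assumes a depth-first default traversal, which the slides do not state explicitly, and the `Node` class, the dictionary-based visitor, and the rule that a type-specific clause takes priority over `_` are all illustrative choices.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str  # e.g. "Declaration", "Method", "Statement"
    children: List["Node"] = field(default_factory=list)


def visit(node, visitor):
    """Depth-first walk: run the 'before' clause, the children, then 'after'."""
    before = visitor.get("before", {})
    after = visitor.get("after", {})
    # A clause for the node's own kind wins; "_" is the wildcard fallback.
    handler = before.get(node.kind, before.get("_"))
    if handler:
        handler(node)
    for child in node.children:
        visit(child, visitor)
    handler = after.get(node.kind, after.get("_"))
    if handler:
        handler(node)


log = []
visitor = {
    "before": {
        "Method": lambda n: log.append("before Method"),
        "_": lambda n: log.append(f"before other ({n.kind})"),
    },
    "after": {"Method": lambda n: log.append("after Method")},
}

tree = Node("Declaration", [Node("Method", [Node("Statement")])])
visit(tree, visitor)
print(log)
```

The visitor only names the node types it cares about; the walker supplies the traversal, which is what makes the Boa syntax on this slide so short.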

Page 43: Efficiently Mining Source Code with  Boa

Easing Source Code Mining with Visitors

[Diagram of the AST type tree: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]

Page 44: Efficiently Mining Source Code with  Boa

Easing Source Code Mining with Visitors

[Diagram of the AST type tree: ASTRoot, Namespace, Declaration, Method, Variable, Type, Statement, Expression]

before n: Declaration -> {
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
}

before n: Declaration -> {
    foreach (i: int; n.fields[i])
        visit(n.fields[i]);
    stop;
}
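As I read the last fragment, the clause visits only a declaration's fields and then uses `stop` so that the default traversal does not also descend into the declaration's other children. A hedged Python sketch of that behaviour follows. The `Stop` exception, the `Node` layout, and the `fields_only` handler are hypothetical devices for the illustration, not Boa features.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    name: str = ""
    fields: List["Node"] = field(default_factory=list)   # only used by Declaration
    methods: List["Node"] = field(default_factory=list)  # only used by Declaration


class Stop(Exception):
    """Raised by a 'before' clause to skip the default traversal of a node."""


def visit(node, before, seen):
    seen.append(f"{node.kind}:{node.name}")
    handler = before.get(node.kind)
    if handler:
        try:
            handler(node, before, seen)
        except Stop:
            return  # the clause handled its own children
    for child in node.fields + node.methods:  # default: visit every child
        visit(child, before, seen)


def fields_only(node, before, seen):
    for f in node.fields:
        visit(f, before, seen)
    raise Stop  # mirrors Boa's `stop;`


decl = Node("Declaration", "C",
            fields=[Node("Variable", "x")],
            methods=[Node("Method", "m")])

seen = []
visit(decl, {"Declaration": fields_only}, seen)
print(seen)
```

In this sketch, dropping the `raise Stop` line would let the default traversal run after the handler, so the field would be visited a second time and the method would be visited as well.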

Page 45: Efficiently Mining Source Code with  Boa


Let's see it in action!

http://boa.cs.iastate.edu/boa/

Page 46: Efficiently Mining Source Code with  Boa

Summary

Ultra-large-scale software repository mining poses several challenges

Boa provides abstractions to address these challenges

Automatically parallelizes queries

Domain-specific language, types, and functions to make mining software repositories easier

Ultra-large-scale dataset with almost 700k projects

Page 47: Efficiently Mining Source Code with  Boa


Boa's Global Impact

90+ users from over 20 countries!

Page 48: Efficiently Mining Source Code with  Boa


Thank you!

http://boa.cs.iastate.edu/