Post on 18-Dec-2014

Supporting Program Comprehension with Source Code Summarization

Sonia Haiduc, Jairo Aponte, Andrian Marcus

Presented By: Mohammad Masudur Rahman

Contents

Why Code Summarization?

Thesis Statement

Research Questions about Summary

Research Questions about Tool

Automatic Code Summarization

Evaluation

Experiments Conducted

Pyramid Method

Important Findings

My Observation & Future Works

Why Code Summarization?

Program comprehension accounts for 50% of all maintenance work

Two extreme approaches – skim through and read thoroughly

Skim through – leads to misunderstanding

Read thoroughly – time consuming

An intermediate solution – a source code entity with a comprehensive textual description

Thesis Statement

New idea: use code summarization to support program comprehension (PC)

Apply text retrieval (TR) methods such as Latent Semantic Indexing (LSI) to source code summarization.

Combine structural information with the retrieved summary to make it useful in practice.

Research Questions of Code Summarization

Summary should be automatically generated

Generate summaries at different granularity levels – class, method, package, etc.

Shorter than the source code

Capture and preserve code semantics and structure – text as well as structure from the code

Consistent structure – important items first

Research Questions of Code Summarization

Summary should reflect the developer’s understanding of the code

Tool should let the user change a summary and should remember the user’s choices in future summaries

Tool should rebuild the summary if the code changes or developers provide feedback

Research Questions about Summarizer Tool

Which summarization technique works best for source code?

What type of structural info is necessary in a summary?

Will the summary differ for different types of maintenance tasks?

How long should it be?

How closely will it resemble an actual summary?

How do developers generate summaries?

Automatic Code Summarization

Generate extractive summary – the most important info extracted from the document

Automatic Code Summarization

Two types of info extracted – lexical and structural

Lexical info – identifiers and comments are extracted

Common English words and programming language (PL) keywords are removed

Identifiers are split into constituent words and stemming is performed.
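
The lexical steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the stop lists are tiny placeholders, `crude_stem` is a simplistic stand-in for a real stemmer such as Porter's, and all function names are hypothetical.

```python
import re

# Illustrative stop lists (assumptions; real lists would be much longer).
ENGLISH_STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in"}
PL_KEYWORDS = {"public", "private", "static", "void", "int", "return", "new", "class"}


def split_identifier(identifier):
    """Split a camelCase or snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W\d]+", identifier):
        # An acronym run (e.g. "XML") or a capitalized/lowercase word.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", part))
    return [w.lower() for w in words]


def crude_stem(word):
    """Very rough suffix stripping; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def extract_terms(identifiers, comments):
    """Collect words from identifiers and comments, filter them, then stem."""
    tokens = []
    for identifier in identifiers:
        tokens.extend(split_identifier(identifier))
    for comment in comments:
        tokens.extend(re.findall(r"[a-z]+", comment.lower()))
    stop = ENGLISH_STOP_WORDS | PL_KEYWORDS
    return [crude_stem(t) for t in tokens if t not in stop]


terms = extract_terms(["getPlayListName", "static"], ["Gets the name of the playlist"])
```

Here `terms` holds the words of `getPlayListName` and of the comment, with `static`, `the` and `of` filtered out and `gets` reduced to `get`.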

Automatic Code Summarization

The extracted lexical info forms the text corpus of the code, to which TR methods (e.g., LSI) are applied to get the n most important words.

Once retrieved, the n words are combined with structural info such as class name, method name, package name, parameter names and types, etc.

How to apply structural info to the auto-generated summary is an important part of the approach

Automatic Code Summarization

A method name reflects the description of what it does.

If the method name is ignored by the TR method, the tool can introduce it automatically

Additional info can be added, such as user tags
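
One way to picture this step is the small sketch below. It assumes the summary is a plain list of terms; the function name `build_summary` and the choice to put method-name words first are illustrative assumptions, not the behavior of the authors' tool.

```python
def build_summary(top_terms, method_name_words, user_tags=()):
    """Merge method-name words, TR-ranked terms, and user tags without duplicates.

    Method-name words come first (an assumption), so they are present
    even when the TR ranking did not select them.
    """
    summary = []
    for term in list(method_name_words) + list(top_terms) + list(user_tags):
        if term not in summary:
            summary.append(term)
    return summary


summary = build_summary(
    top_terms=["playlist", "song", "index"],
    method_name_words=["get", "song"],
    user_tags=["audio"],
)
```

In this example `get` is injected because the ranking missed it, while `song` is not duplicated.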

Evaluation

Two types – intrinsic and extrinsic

Intrinsic – content evaluation: how closely the summary depicts the document, or how close it is to a manually generated summary

Metrics – precision, recall, pyramid method

Extrinsic – how much utility and usability it has in supporting SE tasks – concept location, impact analysis, software reuse, traceability link recovery, etc.
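
For the intrinsic metrics, precision and recall over summary terms can be computed as in the sketch below. It assumes both summaries are treated as sets of terms; the function name and the sample terms are made up for illustration.

```python
def precision_recall(automatic_terms, manual_terms):
    """Compare an automatic summary with a manual one, both as term sets."""
    automatic, manual = set(automatic_terms), set(manual_terms)
    overlap = automatic & manual
    # Precision: share of automatic terms that also appear in the manual summary.
    precision = len(overlap) / len(automatic) if automatic else 0.0
    # Recall: share of manual terms that the automatic summary recovered.
    recall = len(overlap) / len(manual) if manual else 0.0
    return precision, recall


precision, recall = precision_recall(
    ["play", "song", "index", "file", "save"],
    ["play", "song", "list", "volume"],
)
```

Two of the five automatic terms match, and two of the four manual terms are recovered.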

Experiments Conducted

Pyramid method

aTunes open-source project, 12 methods

6 developers from different locations, undergraduate students, with 3 years of Java programming experience

Developers were given a list of terms and had to choose the 5 terms that best describe each method; 60 minutes in total

Experiments Conducted

Corpus contains the whole code vocabulary

Each method is a separate document

LSI indexes the corpus against each method’s terms

Cosine measure is computed between the corpus words and each method, and the corpus words are ranked

Top 5 words from the corpus are chosen
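
The ranking described above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the term-by-method matrix is a toy example, the number of latent dimensions `k` is arbitrary, and scaling both term and document vectors by the square root of the singular values is one common LSI convention (the slides do not say which one was used).

```python
import numpy as np


def lsi_top_terms(term_doc, vocabulary, doc_index, k=2, n=5):
    """Rank vocabulary terms by cosine similarity to one method in LSI space."""
    term_doc = np.asarray(term_doc, dtype=float)  # rows: terms, columns: methods
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    k = min(k, len(s))
    scale = np.sqrt(s[:k])
    term_vecs = U[:, :k] * scale         # one latent vector per term
    doc_vec = Vt[:k, doc_index] * scale  # latent vector of the chosen method
    norms = np.linalg.norm(term_vecs, axis=1) * np.linalg.norm(doc_vec)
    sims = (term_vecs @ doc_vec) / np.where(norms == 0, 1.0, norms)
    ranked = np.argsort(-sims)[:n]       # highest cosine first
    return [vocabulary[i] for i in ranked]


# Toy corpus: two methods, five terms (term counts are made up).
vocabulary = ["play", "song", "volume", "file", "save"]
term_doc = [[3, 0], [2, 0], [1, 0], [0, 3], [0, 2]]

top_for_first_method = lsi_top_terms(term_doc, vocabulary, doc_index=0, n=3)
```

With this toy matrix the three terms that occur only in the first method rank highest for it; their relative order is not meaningful because their cosine values tie.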

Pyramid method

Pyramid score = (Sum of A’s score / Total score A could make)
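
A minimal sketch of this score follows. It uses the usual pyramid-method convention, which is an assumption about what the formula above denotes: each term is weighted by the number of developers who chose it, and the denominator is the best total that any summary of the same size could reach. The sample data is invented.

```python
from collections import Counter


def pyramid_score(candidate_terms, human_summaries):
    """Score a candidate summary against several human-chosen term lists."""
    # Weight of a term = number of human summaries that contain it.
    weights = Counter()
    for summary in human_summaries:
        weights.update(set(summary))
    candidate = set(candidate_terms)
    achieved = sum(weights[t] for t in candidate)
    # Best possible total for a summary with the same number of terms.
    best = sum(sorted(weights.values(), reverse=True)[: len(candidate)])
    return achieved / best if best else 0.0


human_summaries = [
    ["play", "song", "list"],
    ["play", "song", "volume"],
    ["play", "file", "list"],
]
score = pyramid_score(["play", "volume", "save"], human_summaries)
```

Here the candidate earns 3 for `play`, 1 for `volume` and 0 for `save`, against a best possible total of 3 + 2 + 2 = 7.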

Pyramid Score

Important Findings

Pyramid scores between 0.1 and 0.5, considered encouraging

Words chosen by developers – 98.7% in method name, 88.9% in class name and 84.6% in parameter name

Automatic summary terms – 20% in method name, 12.9% in class name and 30.7% in parameter name

Structural info should be considered properly in the automatic summary

Comment text was not included in the summary

My Observation & Future Works

The corpus construction technique is not well specified; there is no specification of how redundancy is avoided.

LSI focuses on term frequency rather than structural info, which leads to poor scores.

During the cosine measurement, the structural role of a term in the method could be taken into account to get better results.

There should be some heuristic measure for structural info.
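
As a rough illustration of such a heuristic, the sketch below multiplies each term's cosine score by a weight for its structural role before ranking. The roles, the weight values and the function name are purely hypothetical; neither the paper nor these slides propose specific numbers.

```python
# Hypothetical weights per structural role (illustrative assumptions only).
ROLE_WEIGHTS = {
    "method_name": 2.0,
    "class_name": 1.5,
    "parameter_name": 1.5,
    "body": 1.0,
}


def rerank_with_structure(similarities, roles, n=5):
    """Re-rank terms by cosine score multiplied by a structural-role weight.

    similarities: term -> cosine score from the TR method
    roles: term -> structural role; terms without an entry count as "body"
    """
    scored = {
        term: sim * ROLE_WEIGHTS.get(roles.get(term, "body"), 1.0)
        for term, sim in similarities.items()
    }
    # Sort by weighted score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda t: (-scored[t], t))[:n]


ranked = rerank_with_structure(
    {"index": 0.9, "play": 0.6, "song": 0.5},
    {"play": "method_name", "song": "parameter_name"},
)
```

In this example `play` overtakes `index` because it occurs in the method name, even though its raw cosine score is lower.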

Thank You

Questions?