Supporting Program Comprehension with Source Code Summarization
Sonia Haiduc, Jairo Aponte, Andrian Marcus
Presented By: Mohammad Masudur Rahman
2
Contents
Why Code Summarization?
Thesis Statement
Research Questions of Code Summarization
Research Questions about the Summarizer Tool
Automatic Code Summarization
Evaluation
Experiments Conducted
Pyramid Method
Important Findings
My Observations & Future Work
3
Why Code Summarization?
Program comprehension accounts for about 50% of all maintenance work
Two extreme approaches – skimming through the code and reading it thoroughly
Skimming through – leads to misunderstanding
Reading thoroughly – time consuming
An intermediate solution – a source code entity accompanied by a comprehensive textual description
4
Thesis Statement
New idea: use code summarization to help with program comprehension (PC)
Apply text retrieval (TR) methods such as Latent Semantic Indexing to source code summarization
Combine structural information with the retrieved code summary to make it effective for realistic use
5
Research Questions of Code Summarization
The summary should be generated automatically
Summaries should be generated at different granularity levels – class, method, package, etc.
The summary should be shorter than the source code
It should capture and preserve code semantics and structure – the text as well as the structure of the code
It should have a consistent structure – the most important items come first
6
Research Questions of Code Summarization
The summary should reflect the developer's understanding of the code
The tool should allow the user to change the summary and remember the user's choices in future summaries
The tool should rebuild the summary when the code changes or developers provide feedback
7
Research Questions about the Summarizer Tool
Which summarization technique works best for source code?
What type of structural information is necessary in the summary?
Will the summary differ for different types of maintenance tasks?
How long should it be?
How closely will it resemble a manually written summary?
How do developers generate summaries?
8
Automatic Code Summarization
Generate an extractive summary – the most important information is extracted from the document itself
9
Automatic Code Summarization
Two types of information are extracted – lexical and structural
Lexical information – identifiers and comments are extracted
Common English words and programming-language keywords are removed
Identifiers are split into their constituent words, and stemming is performed (sketched below)
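A minimal sketch of this lexical extraction step, assuming Python with NLTK's Porter stemmer; the stop-word and keyword sets are illustrative subsets, not the tool's actual lists:

import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}        # common English words (illustrative subset)
PL_KEYWORDS = {"public", "void", "int", "return", "if"}  # PL keywords (illustrative subset)
stemmer = PorterStemmer()

def split_identifier(identifier):
    # Split camelCase and snake_case identifiers into constituent words
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return [w.lower() for w in re.split(r"[_\s]+", spaced) if w]

def lexical_terms(identifiers, comments):
    # Collect words from identifiers and comments, drop stop words and
    # keywords, then stem what remains
    words = []
    for ident in identifiers:
        words.extend(split_identifier(ident))
    for comment in comments:
        words.extend(w.lower() for w in re.findall(r"[A-Za-z]+", comment))
    return [stemmer.stem(w) for w in words
            if w not in STOP_WORDS and w not in PL_KEYWORDS]

# lexical_terms(["getFileName"], ["// returns the file name"])
# -> ['get', 'file', 'name', 'return', 'file', 'name']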
10
Automatic Code Summarization
The extracted lexical information forms the text corpus of the code, on which TR methods (e.g., LSI) are applied to retrieve the n most important words
Once retrieved, the n words are combined with structural information such as their class name, method name, package name, parameter names and types, etc.
How to apply structural information to the auto-generated summary is an important open question (sketched below)
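A small sketch of one way the retrieved terms might be combined with structural context; the build_summary helper and the structural_index mapping (term -> where it occurs) are hypothetical and not the paper's actual tool:

def build_summary(top_terms, structural_index, method_name, class_name):
    # Lead the summary with the method and class names, then annotate
    # each retrieved term with the structural element it came from
    lines = [f"Method {method_name} of class {class_name}:"]
    for term in top_terms:
        origin = structural_index.get(term, "body")   # e.g. "parameter", "method name"
        lines.append(f"  {term} ({origin})")
    return "\n".join(lines)

# Example with illustrative data:
# build_summary(["file", "name", "path"],
#               {"file": "parameter", "name": "method name"},
#               "getFileName", "AudioFile")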
11
Automatic Code Summarization
A method's name reflects what the method does
If the method name is missed by the TR method, the tool can add it to the summary automatically
Additional information can be added as well – e.g., user tags
12
Evaluation
Two types – intrinsic and extrinsic
Intrinsic – content evaluation: how well the summary depicts the document, or how close it is to a manually generated summary
Metrics – precision, recall, the pyramid method
Extrinsic – how much utility and usability the summary has for supporting SE tasks – concept location, impact analysis, software reuse, traceability link recovery, etc.
13
Experiments Conducted
Pyramid method
ATunes open-source project, 12 methods
6 developers from different locations, undergraduate students with 3 years of Java programming experience
Developers were given a list of terms and had to choose the 5 terms that best describe each method, with 60 minutes in total
14
Experiments Conducted
A corpus is built containing the whole code vocabulary
Each method is a separate document
LSI indexes the corpus against each method's terms
The cosine measure between each method and the corpus words is computed, and the corpus words are ranked
The top 5 corpus words are chosen (the whole pipeline is sketched below)
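A minimal sketch of this retrieval pipeline, assuming scikit-learn and approximating LSI with truncated SVD; the function name and parameters are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def top_terms_per_method(method_texts, n_concepts=2, n_terms=5):
    # Rank all corpus terms against each method and keep the top n_terms
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(method_texts)        # rows = methods, columns = corpus terms
    svd = TruncatedSVD(n_components=n_concepts)       # LSI approximated by truncated SVD
    method_vecs = svd.fit_transform(X)                # method vectors in the concept space
    term_vecs = svd.components_.T                     # corpus-term vectors in the concept space
    terms = vectorizer.get_feature_names_out()
    sims = cosine_similarity(method_vecs, term_vecs)  # cosine between each method and each term
    return [[terms[i] for i in row.argsort()[::-1][:n_terms]] for row in sims]

# Example with two tiny "methods" built from preprocessed lexical terms:
# top_terms_per_method(["get file name return file name",
#                       "play audio track start audio"])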
15
Pyramid method
Pyramid score = (sum of the weights of the terms in the automatic summary A) / (maximum total weight a summary of A's size could achieve)
Each term's weight is the number of developers who selected it
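A small sketch of how such a score could be computed, assuming (as in the standard pyramid method) that a term's weight is the number of developers who chose it; the function and data are illustrative:

from collections import Counter

def pyramid_score(auto_terms, developer_term_lists):
    # Weight each term by how many developers picked it, sum the weights the
    # automatic summary collected, and divide by the best weight any summary
    # of the same length could have collected
    weights = Counter(t for terms in developer_term_lists for t in set(terms))
    achieved = sum(weights.get(t, 0) for t in auto_terms)
    best_possible = sum(sorted(weights.values(), reverse=True)[:len(auto_terms)])
    return achieved / best_possible if best_possible else 0.0

# pyramid_score(["file", "name", "path"],
#               [["file", "name"], ["file", "path"], ["name", "play"]])
# weights: file=2, name=2, path=1, play=1 -> achieved 5, best 5 -> score 1.0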
16
Pyramid Score
17
Important Findings
Pyramid scores were between 0.1 and 0.5, which was marked as encouraging
Words chosen by developers – 98.7% appear in the method name, 88.9% in the class name, and 84.6% in parameter names
Automatic summary terms – 20% appear in the method name, 12.9% in the class name, and 30.7% in parameter names
Structural information should therefore be considered properly in automatic summaries
Comment text was not included in the summaries
18
My Observations & Future Work
The corpus construction technique is not well specified – nothing is said about protection against redundancy
LSI focuses on term frequency rather than structural information, which leads to poor scores
During the cosine measurement, the structural role of a term within the method could be considered to get better results
There should be some heuristic weighting for structural information
19
Thank You
Questions?