Supporting Program Comprehension with Source Code Summarization
Sonia Haiduc, Jairo Aponte, Andrian Marcus
Presented By: Mohammad Masudur Rahman
2
Contents
Why Code Summarization?
Thesis Statement
Research Questions of Code Summarization
Research Questions about the Summarizer Tool
Automatic Code Summarization
Evaluation
Experiments Conducted
Pyramid Method
Important Findings
My Observations & Future Work
3
Why Code Summarization?
Program comprehension accounts for about 50% of all maintenance work
Two extreme approaches – skimming through the code and reading it thoroughly
Skimming through – leads to misunderstanding
Reading thoroughly – time consuming
An intermediate solution – a source code entity accompanied by a comprehensive textual description
4
Thesis Statement
New idea: use code summarization to help with program comprehension (PC)
Apply text retrieval (TR) methods such as Latent Semantic Indexing to source code summarization
Combine structural information with the retrieved code summary to make it effective for realistic use
5
Research Questions of Code Summarization
The summary should be generated automatically
Summaries should be generated at different granularity levels – class, method, package, etc.
The summary should be shorter than the source code
It should capture and preserve code semantics and structure – the text as well as the structure of the code
It should have a consistent structure – the most important items come first
6
Research Questions of Code Summarization
The summary should reflect the developer's understanding of the code
The tool should allow the user to change the summary and remember the user's choices in future summaries
The tool should rebuild the summary when the code changes or developers provide feedback
7
Research Questions about the Summarizer Tool
Which summarization technique works best for source code?
What type of structural information is necessary in the summary?
Will the summary differ for different types of maintenance tasks?
How long should it be?
How closely will it resemble a manually written summary?
How do developers generate summaries?
8
Automatic Code Summarization
Generate an extractive summary – the most important information is extracted from the document itself
9
Automatic Code Summarization
Two types of information are extracted – lexical and structural
Lexical information – identifiers and comments are extracted
Common English words and programming-language keywords are removed
Identifiers are split into their constituent words, and stemming is performed (sketched below)
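A minimal sketch of this lexical extraction step, assuming Python with NLTK's Porter stemmer; the stop-word and keyword sets are illustrative subsets, not the tool's actual lists:

import re
from nltk.stem import PorterStemmer

STOP_WORDS = {"the", "a", "an", "of", "to", "is"}        # common English words (illustrative subset)
PL_KEYWORDS = {"public", "void", "int", "return", "if"}  # PL keywords (illustrative subset)
stemmer = PorterStemmer()

def split_identifier(identifier):
    # Split camelCase and snake_case identifiers into constituent words
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", identifier)
    return [w.lower() for w in re.split(r"[_\s]+", spaced) if w]

def lexical_terms(identifiers, comments):
    # Collect words from identifiers and comments, drop stop words and
    # keywords, then stem what remains
    words = []
    for ident in identifiers:
        words.extend(split_identifier(ident))
    for comment in comments:
        words.extend(w.lower() for w in re.findall(r"[A-Za-z]+", comment))
    return [stemmer.stem(w) for w in words
            if w not in STOP_WORDS and w not in PL_KEYWORDS]

# lexical_terms(["getFileName"], ["// returns the file name"])
# -> ['get', 'file', 'name', 'return', 'file', 'name']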
10
Automatic Code Summarization
The extracted lexical information forms the text corpus of the code, on which TR methods (e.g., LSI) are applied to retrieve the n most important words
Once retrieved, the n words are combined with structural information such as their class name, method name, package name, parameter names and types, etc.
How to apply structural information to the auto-generated summary is an important open question (sketched below)
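A small sketch of one way the retrieved terms might be combined with structural context; the build_summary helper and the structural_index mapping (term -> where it occurs) are hypothetical and not the paper's actual tool:

def build_summary(top_terms, structural_index, method_name, class_name):
    # Lead the summary with the method and class names, then annotate
    # each retrieved term with the structural element it came from
    lines = [f"Method {method_name} of class {class_name}:"]
    for term in top_terms:
        origin = structural_index.get(term, "body")   # e.g. "parameter", "method name"
        lines.append(f"  {term} ({origin})")
    return "\n".join(lines)

# Example with illustrative data:
# build_summary(["file", "name", "path"],
#               {"file": "parameter", "name": "method name"},
#               "getFileName", "AudioFile")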
11
Automatic Code Summarization
A method's name reflects what the method does
If the method name is missed by the TR method, the tool can add it to the summary automatically
Additional information can be added as well – e.g., user tags
12
Evaluation
Two types – intrinsic and extrinsic
Intrinsic – content evaluation: how well the summary depicts the document, or how close it is to a manually generated summary
Metrics – precision, recall, the pyramid method
Extrinsic – how much utility and usability the summary has for supporting SE tasks – concept location, impact analysis, software reuse, traceability link recovery, etc.
13
Experiments Conducted
Pyramid method
ATunes open-source project, 12 methods
6 developers from different locations, undergraduate students with 3 years of Java programming experience
Developers were given a list of terms and had to choose the 5 terms that best describe each method, with 60 minutes in total
14
Experiments Conducted
A corpus is built containing the whole code vocabulary
Each method is a separate document
LSI indexes the corpus against each method's terms
The cosine measure between each method and the corpus words is computed, and the corpus words are ranked
The top 5 corpus words are chosen (the whole pipeline is sketched below)
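A minimal sketch of this retrieval pipeline, assuming scikit-learn and approximating LSI with truncated SVD; the function name and parameters are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

def top_terms_per_method(method_texts, n_concepts=2, n_terms=5):
    # Rank all corpus terms against each method and keep the top n_terms
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(method_texts)        # rows = methods, columns = corpus terms
    svd = TruncatedSVD(n_components=n_concepts)       # LSI approximated by truncated SVD
    method_vecs = svd.fit_transform(X)                # method vectors in the concept space
    term_vecs = svd.components_.T                     # corpus-term vectors in the concept space
    terms = vectorizer.get_feature_names_out()
    sims = cosine_similarity(method_vecs, term_vecs)  # cosine between each method and each term
    return [[terms[i] for i in row.argsort()[::-1][:n_terms]] for row in sims]

# Example with two tiny "methods" built from preprocessed lexical terms:
# top_terms_per_method(["get file name return file name",
#                       "play audio track start audio"])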
15
Pyramid method
Pyramid score = (sum of the weights of the terms in the automatic summary A) / (maximum total weight a summary of A's size could achieve)
Each term's weight is the number of developers who selected it
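A small sketch of how such a score could be computed, assuming (as in the standard pyramid method) that a term's weight is the number of developers who chose it; the function and data are illustrative:

from collections import Counter

def pyramid_score(auto_terms, developer_term_lists):
    # Weight each term by how many developers picked it, sum the weights the
    # automatic summary collected, and divide by the best weight any summary
    # of the same length could have collected
    weights = Counter(t for terms in developer_term_lists for t in set(terms))
    achieved = sum(weights.get(t, 0) for t in auto_terms)
    best_possible = sum(sorted(weights.values(), reverse=True)[:len(auto_terms)])
    return achieved / best_possible if best_possible else 0.0

# pyramid_score(["file", "name", "path"],
#               [["file", "name"], ["file", "path"], ["name", "play"]])
# weights: file=2, name=2, path=1, play=1 -> achieved 5, best 5 -> score 1.0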
16
Pyramid Score
17
Important Findings
Pyramid scores were between 0.1 and 0.5, which was marked as encouraging
Words chosen by developers – 98.7% appear in the method name, 88.9% in the class name, and 84.6% in parameter names
Automatic summary terms – 20% appear in the method name, 12.9% in the class name, and 30.7% in parameter names
Structural information should therefore be considered properly in automatic summaries
Comment text was not included in the summaries
18
My Observations & Future Work
The corpus construction technique is not well specified – nothing is said about protection against redundancy
LSI focuses on term frequency rather than structural information, which leads to poor scores
During the cosine measurement, the structural role of a term within the method could be considered to get better results
There should be some heuristic weighting for structural information
19
Thank You
Questions?