Presentation 7 Summary

Cross Language Clone Analysis

Team 2

November 22, 2010

Agenda

• Feasibility Study
• Release Plan
• Architecture
• Parsing
• CodeDOM
• Clone Analysis
• Testing
• Demonstration
• Team Collaboration
• Path Forward

Our Team

• Allen Tucker
• Patricia Bradford
• Greg Rodgers
• Brian Bentley
• Ashley Chafin

Feasibility Study

Our evaluation of the project to determine the difficulty in carrying out the task.

Task Summary

Our Customers: Dr. Etzkorn and Dr. Kraft

Customer Request:
◦ A tool that will abstract programs in C++, C#, Java, and (Python or VB) to the Dagstuhl Middle Metamodel, Microsoft CodeDOM, or something similar, and detect cross-language clones.

Areas to Note:
◦ the user interface
◦ easy comparisons of clones
◦ visualization of clones
◦ sub-clones
◦ clone detection for large bodies of code

Task Summary (cont.)

To find clones across different programming languages, we must first convert the code from each language to a language-independent object model.

Some Language-Independent Object Models:
◦ Dagstuhl Middle Metamodel (DMM)
◦ Microsoft CodeDOM

Both of these models represent the structure of source code independently of the source language.

Task Understanding

Three Step Process
• Step 1: Code Translation (Source Files → Translator → Common Model)
• Step 2: Clone Detection (Common Model → Inspector → Detected Clones)
• Step 3: Visualization (Detected Clones → UI → Clone Visualization)
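The three steps can be pictured as a small pipeline. The sketch below is only an illustration in Python, not the project's implementation: `translate`, `detect`, `visualize`, and `run_pipeline` are hypothetical names, the "common model" is just a token list, and "detection" only groups files whose token lists are identical.

```python
from __future__ import annotations

import re
from collections import defaultdict


def translate(source: str) -> list[str]:
    """Step 1 stand-in: reduce source text to a whitespace-free token list."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", source)


def detect(models: dict[str, list[str]]) -> list[list[str]]:
    """Step 2 stand-in: group files whose common models are identical."""
    groups = defaultdict(list)
    for name, tokens in models.items():
        groups[tuple(tokens)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def visualize(clones: list[list[str]]) -> str:
    """Step 3 stand-in: render the detected clone groups as a text report."""
    return "\n".join("clone group: " + ", ".join(group) for group in clones)


def run_pipeline(sources: dict[str, str]) -> str:
    """Chain the three steps: source files -> common model -> clones -> report."""
    models = {name: translate(text) for name, text in sources.items()}
    return visualize(detect(models))


# Made-up input: two files that differ only in layout, and one that differs.
sources = {
    "Adder.java": "int add(int a, int b) { return a + b; }",
    "Adder.cs": "int add(int a, int b)\n{\n    return a + b;\n}",
    "Other.java": "int sub(int a, int b) { return a - b; }",
}
report = run_pipeline(sources)
```

Each stage only consumes the output of the previous one, which is the property the three-step split relies on: a translator for a new language can be added without touching detection or visualization.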

Benefits

• Fact: Modularity is a key characteristic in today’s software world.
• Why? It allows us to divide software into a decomposed separation of concerns.
◦ Contributes to maintainability, reusability, testability, and reliability
• Clone detection allows us to detect common software spread across large bodies of code.
◦ Identify code that is a candidate for further modularization

Features

• Clone Detection Software Suite
◦ Identifies software clones
◦ Tracks software clones
◦ Manages software clones
• Multi-language support
◦ C++
◦ C#
◦ Java

Features (cont.)

• Provides complete code coverage
• Multi-Application Support
◦ Stand-alone
◦ Plug-in based (Eclipse)
◦ Backend service (Ant task)
• Extensible
◦ Built on a Plug-in Framework
◦ Add new languages
• Easy to navigate between clones
• Persists clones for easy retrieval

Risk Analysis

• The complexity of the problem proves greater than initial estimates.
• The technology to be applied is either not well established or has yet to be developed.
• We are unable to complete the defined project scope within the schedule.
• Volatile user requirements lead to redefinition of project objectives.

Release Plan

Release Plan and User Stories

Re-tooled User Stories

• Published the original Release Plan on 9/15/20
• Due to customer wants/needs, we had to re-tool our user stories.
• Dr. Etzkorn’s main concerns:
◦ Load source code and translate to a language independent model
◦ Analyze the translated source code for clones
• Results from meeting:
◦ Created two new user stories (see next two slides)
◦ These two user stories have been pushed to the front of our card stack

CS 666 Studio I User Stories, Phase I

Source Code Load & Translate
Story ID: 017
Priority: 1
Estimate: 14 Days
As an analyst, I want to load and translate my source code projects so I can analyze the source for clones.

Source Code Analyze
Story ID: 018
Priority: 1
Estimate: 14 Days
As an analyst, I want to analyze my source code projects so I can see the clones.

Code Clone Highlights
Story ID: 002
Priority: 1
Estimate: 14 Days
As an analyst, I want the source code associated with clones to be highlighted within source files so that the clones are easy to identify.

Current Tasks

Requirements & Models

Current Tasks’ Requirements

Requirements modeling for the first user story “Source Code Load & Translate”:
◦ Load & parse C#, Java, C++ source code.
◦ Translate the parsed C#, Java, C++ source code to CodeDOM.
◦ Associate the CodeDOM to the original source code.

Requirements modeling for the second user story “Source Code Analyze”:
◦ Analyze CodeDOM for clones.

UML Model – Load & Parse

UML Model – Translate

UML Model – Associate

UML Model – Analyze

Architecture

Design and Architecture

Key Architecture Points

• Multi-language support
• Configurable for different platforms
◦ Stand-alone application
◦ Plug-in
◦ Backend service
• Extensible

Architecture

[Diagram: applications (User Interface, Service, Eclipse Plug-in, Web Interface, etc.) sit on top of the API; the Core contains the Code Model and the Clone Detection Algorithms; the Language Support interface connects the Core to the C# Service, Java Service, and C++ Service.]

Core Unit

• Code Model
◦ Stores the code in a common format
• Application Programming Interface
◦ Used to embed clone detection in applications
• Language Service Interface
◦ Communication layer between the core and the specific language services

[Diagram: Core containing the Code Model and Clone Detection Algorithms, with the API and the Language Service Interface.]

App Configuration

CRC Card Sampling

Class Responsibility Collaboration Cards

Java Parser CRC

Class: Java Parser
Responsibilities:
◦ Parse Java source code
◦ Construct Java token tree
Collaborators:
◦ LALRParser (GOLD Parser)

C# Parser CRC

Class: Parser
Responsibilities:
◦ Parse C# source code
◦ Construct C# token tree
Collaborators:
◦ LALRParser (GOLD Parser)

Language Service CRC

Class: LanguageService
Responsibilities:
◦ Defines standard interface for all language providers.
Collaborators:
◦ ILanguageService

Java Service CRC

Class: JavaService
Responsibilities:
◦ Reads Java source code
◦ Understands Java grammar production rules
◦ Construct CodeDOM compilation unit
Collaborators:
◦ Java Parser
◦ CloneDetection
◦ JavaCodeProvider
◦ ILanguageService

Cs Service CRC

Class: CsService
Responsibilities:
◦ Reads C# source code
◦ Understands C# grammar production rules
◦ Construct CodeDOM compilation unit
Collaborators:
◦ C# Parser
◦ CloneDetection
◦ CsCodeProvider
◦ ILanguageService

CloneDetection CRC

Class: CloneDetection
Responsibilities:
◦ Loads and manages language services.
◦ Controls parsing
◦ Establishes CodeDOM compilation units to source code file associations
◦ Compares code segments
◦ Provides bookkeeping for code segments
Collaborators:
◦ ILanguageService
◦ CodeDomComparer
◦ CodeDomSummary
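The CRC cards describe a core that loads language services behind one standard interface. A minimal sketch of that plug-in arrangement follows, written in Python for illustration only. The class names echo the cards, but the method names (`parse`, `register`, `load`), the `extensions` attribute, and the dictionary used as a stand-in for a CodeDOM compilation unit are all assumptions, not the project's actual API.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class LanguageService(ABC):
    """Hypothetical counterpart of the ILanguageService role on the cards."""

    extensions: tuple[str, ...] = ()

    @abstractmethod
    def parse(self, source: str) -> dict:
        """Return a language-neutral model of the source (a plain dict here)."""


class JavaService(LanguageService):
    extensions = (".java",)

    def parse(self, source: str) -> dict:
        return {"language": "Java", "length": len(source)}


class CsService(LanguageService):
    extensions = (".cs",)

    def parse(self, source: str) -> dict:
        return {"language": "C#", "length": len(source)}


class CloneDetection:
    """Loads and manages language services; picks one per file extension."""

    def __init__(self) -> None:
        self._services: dict[str, LanguageService] = {}

    def register(self, service: LanguageService) -> None:
        for ext in service.extensions:
            self._services[ext] = service

    def load(self, file_name: str, source: str) -> dict:
        ext = Path(file_name).suffix.lower()
        service = self._services.get(ext)
        if service is None:
            raise ValueError(f"no language service registered for {ext!r}")
        return service.parse(source)


core = CloneDetection()
core.register(JavaService())
core.register(CsService())
java_model = core.load("Foo.java", "class Foo {}")
```

Supporting another language in this arrangement means adding one more `LanguageService` subclass and registering it; the core itself does not change.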

Parsing

Our struggles and our successes.

Parsing Struggles & Successes

• We explored and conducted spikes on CSParser and CS CodeDOM Parser.
◦ They both had advantages and disadvantages.
◦ We concluded that neither of them was going to fit our needs.
• We explored and conducted a spike on GOLD Parser.
◦ We ultimately chose the GOLD Parser because it best fit our needs.
• This gave us a way to manage multiple language grammars with one engine.

GOLD Parsing System

GOLD Parsing Populating CodeDOM

How It Works (Block Structure)

[Diagram: Grammar → Builder → Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed Data]

How It Works (Process)

[Diagram: Grammar → Builder → Compiled Grammar Table (*.cgt) → Engine; Source Code → Engine → Parsed Data]

Typical output from engine: a long nested tree

Usage within CloneDigger

[Diagram: Compiled Grammar Table (*.cgt) and Source Code → Engine → Parsed Data (AST) → CodeDOM Conversion]

CodeDOM Conversion
• Need to write a routine to move data from the parsed tree to CodeDOM
• Parsed data trees from the parser are stored in a consistent data structure, but are based on rules defined within the grammars
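One way to picture the conversion routine is a table of handlers keyed by production rule, each turning one kind of parse tree node into a model object. The Python sketch below is an illustration under stated assumptions: the rule names (`ClassDecl`, `MethodDecl`), the `Node` shape, and the `CodeType`/`CodeMethod` classes are invented stand-ins and do not come from the GOLD grammars or from System.CodeDom.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Parse tree node: a production rule name plus children or token text."""
    rule: str
    children: list = field(default_factory=list)
    text: str = ""


@dataclass
class CodeMethod:
    """Stand-in for a method member in the language-neutral model."""
    name: str
    statements: list = field(default_factory=list)


@dataclass
class CodeType:
    """Stand-in for a type declaration in the language-neutral model."""
    name: str
    members: list = field(default_factory=list)


HANDLERS = {}


def handles(rule):
    """Register the decorated function as the handler for one production rule."""
    def register(fn):
        HANDLERS[rule] = fn
        return fn
    return register


def convert(node):
    """Dispatch on the node's production rule; unhandled rules fail loudly."""
    handler = HANDLERS.get(node.rule)
    if handler is None:
        raise NotImplementedError(f"no handler for production rule {node.rule!r}")
    return handler(node)


@handles("ClassDecl")
def convert_class(node):
    name, *members = node.children
    return CodeType(name.text, [convert(member) for member in members])


@handles("MethodDecl")
def convert_method(node):
    name, *statements = node.children
    return CodeMethod(name.text, [statement.text for statement in statements])


# Made-up parse tree for a class with one method.
tree = Node("ClassDecl", [
    Node("Identifier", text="Adder"),
    Node("MethodDecl", [
        Node("Identifier", text="add"),
        Node("Statement", text="return a + b;"),
    ]),
])
model = convert(tree)
```

Because the tree structure is consistent and only the rule names vary by grammar, each language would need its own handler table while `convert` stays the same.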

Grammar Updates

Bookkeeping for parsing the multiple grammars.

Grammar Updates

• Currently the grammars we have for the GOLD Parser are outdated.
• Current GOLD grammars
◦ C# version 2.0
◦ Java version 1.4
• Currently available language versions
◦ C# version 4.0
◦ Java version 6

Grammar Update Issues

• Grammars for C# and Java are very complex and require a lot of work to build.
• ANTLR and GOLD Parser grammars use completely different syntax.
• Positive note: Other development is not halted by the use of older grammars.

Our Bookkeeping

Bookkeeping for parsing the multiple grammars

Compiled Grammar Table

For Java, there are:
◦ 359 production rules
◦ 249 distinct symbols (terminal & non-terminal)

For C#, there are:
◦ 415 production rules
◦ 279 distinct symbols (terminal & non-terminal)

Production Rule Dependencies

Our Grammar Bookkeeping

Since there are so many production rules, we came up with the following bookkeeping:
• A spreadsheet of the compiled grammar table (for each language) with each production rule indexed. This spreadsheet covers:
◦ various aspects of the language
◦ what we have/have not handled from the parser
◦ what we have/have not implemented into CodeDOM
◦ percentage complete
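The percentage-complete column of such a spreadsheet amounts to a count over indexed rules. The sketch below shows that arithmetic in Python; the `RuleStatus` fields and the four sample rows are made up for illustration and are not the team's data.

```python
from dataclasses import dataclass


@dataclass
class RuleStatus:
    """One spreadsheet row: an indexed production rule and its two status flags."""
    index: int
    rule: str
    parsed: bool       # handled from the parser
    in_codedom: bool   # implemented into CodeDOM


def percent_complete(rows, attribute):
    """Percentage of rows whose given status flag is set."""
    if not rows:
        return 0.0
    done = sum(1 for row in rows if getattr(row, attribute))
    return 100.0 * done / len(rows)


# Hypothetical rows; the real tables hold hundreds of rules per language.
rows = [
    RuleStatus(0, "<ClassDecl>", True, True),
    RuleStatus(1, "<MethodDecl>", True, True),
    RuleStatus(2, "<IfStatement>", True, False),
    RuleStatus(3, "<ForStatement>", True, False),
]
parsed_pct = percent_complete(rows, "parsed")
codedom_pct = percent_complete(rows, "in_codedom")
```

For these sample rows, parsing comes out fully handled while the CodeDOM translation is half done, which is the kind of split the two status columns are meant to expose.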

Parsing & CodeDOM Status

Parsing Handlers’ Status:
◦ C# = 100% complete
◦ Java = 100% complete

CodeDOM

Language Independent Object Model

CodeDOM

• Document Object Model for Source Code
• API - [System.CodeDom]
• Only supports certain aspects of the language since it’s language agnostic
◦ Good Enough
• What Does it Do?
◦ Programmatically Constructs Code
• What Doesn’t it Do?
◦ Does NOT parse

CodeDOM Example

CodeCompileUnit
◦ CodeNamespace
    Imports
    Types
        Members
            Event
            Field
            Method
                Statements
                    Expression
            Property

Clone Analysis

Clones & Dr. Kraft’s Tool

Clone Types

3 Types of Clones (Definition of Similarity):
◦ Type 1: An exact copy without modifications (except for whitespace and comments)
◦ Type 2: A syntactically identical copy; only variable, type, or function identifiers have been changed
◦ Type 3: A copy with further modifications; statements have been changed, reordered, added, or removed
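The difference between Type 1 and Type 2 can be illustrated by comparing token streams before and after replacing identifiers with a placeholder. This Python sketch is a simplification under stated assumptions: the tokenizer is a small regular expression, `KEYWORDS` is a tiny made-up set, and `classify` is a hypothetical helper. It also maps every identifier to the same placeholder, so it does not check that renaming is consistent.

```python
import re

# Words kept as-is; everything else that looks like a name (including type
# names such as int or long) is treated as an identifier.
KEYWORDS = {"return", "if", "else", "for", "while"}
TOKEN = re.compile(r"//[^\n]*|[A-Za-z_]\w*|\d+|\S")


def tokens(source):
    """Tokenize, dropping whitespace and // line comments."""
    return [t for t in TOKEN.findall(source) if not t.startswith("//")]


def normalized(source):
    """Replace every non-keyword identifier with the placeholder ID."""
    return [
        "ID" if re.match(r"[A-Za-z_]", t) and t not in KEYWORDS else t
        for t in tokens(source)
    ]


def classify(a, b):
    """Label a pair of fragments using the slide's definitions of similarity."""
    if tokens(a) == tokens(b):
        return "Type 1"
    if normalized(a) == normalized(b):
        return "Type 2"
    return "Type 3 or not a clone"


original = "int add(int a, int b) { return a + b; }"
reformatted = "int add(int a, int b)\n{\n  // sum\n  return a + b;\n}"
renamed = "long plus(long x, long y) { return x + y; }"
changed = "int add(int a, int b) { log(a); return a + b; }"
```

Here `reformatted` differs from `original` only in whitespace and a comment, `renamed` differs only in identifiers, and `changed` adds a statement, which this sketch cannot tell apart from unrelated code.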

Clone Research

• Multi-Language Clone Detection
◦ Cutting Edge of Research
• Preliminary Research
◦ Dr. Kraft and Students at UAB
◦ C# and VB
• Publication
◦ Nicholas A. Kraft, Brandon W. Bonds, Randy K. Smith: Cross-language Clone Detection. SEKE 2008: 54-59
◦ Utilizes Mono Parsers (C#, VB)

Dr. Kraft Clone Analysis

• Performs comparisons of code files
• For each file, a CodeDOM tree is tokenized
• Uses Levenshtein distance calculation
◦ Minimum number of edits needed to transform one sequence into the other
• Distances calculated
◦ Distance determines probability of a clone
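The Levenshtein distance over two token sequences can be computed with the standard dynamic-programming recurrence. The Python sketch below shows that calculation; the `similarity` function is an assumption about how a distance might be scaled into a score between 0 and 1, since the slides do not say how distance is mapped to clone probability, and the token names are invented.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed scaling: 1.0 for identical sequences, 0.0 for entirely different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical token sequences from two tokenized trees.
first = ["Method", "Param", "Param", "Return", "Add"]
second = ["Method", "Param", "Param", "Assign", "Return", "Add"]
distance = levenshtein(first, second)
score = similarity(first, second)
```

The function works on any sequences whose items compare with `==`, so the same code handles strings and token lists. Comparing every file against every other file with this quadratic-time routine is what makes the brute-force approach expensive.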

Dr. Kraft Application

Limitations

• Only does file-to-file comparisons
◦ Does not detect clones in same source file
• Can only detect Type 1 and some Type 2 clones
• Not very efficient (brute force)

Enhancements

• Add support for same-file clone detection
• Add support for Type 3 clone detection
◦ Requires more research
• Provide a more efficient clone analysis algorithm

Testing

White Box & Black Box Testing

Testing Our Project

• White Box Testing:
◦ Unit Testing
• Black Box Testing:
◦ Production Rule Testing
    ▪ Allows us to test the robustness of our engine because we can force rule production errors.
    ▪ Regression Testing
    ▪ Automated
◦ Functional Testing

Unit Testing

Production Rule Test Input File Example

Functional Tests

Metrics

Project Metrics

SLOC For Our Project

As of Nov 8, 2010, SLOC:
◦ CS666_Client = 553 lines
◦ CS666_Core = 114 lines
◦ CS666_CppParser = 117 lines
◦ CS666_CsParser = 1678 lines
◦ CS666_JavaParser = 3350 lines
◦ CS666_LanguageSupport = 48 lines
◦ CS666_UnitTests = 3384 lines

Total = 9244 lines (including unit tests)

Demonstration

Demonstration of our progress.

Demonstration

These are the things we would like to show you today:
◦ GUI work
◦ Project setup
    ▪ Save project
    ▪ Load project
◦ Loading of source code
◦ Parsing of source code
◦ Translation of source code

Team Collaboration

Team 2 & Team 3

Team Collaboration

• Due to Team 3’s team size, we have taken responsibility for gathering & sharing grammars.
• Team 3 has responsibility for the C++ parsing.
• Both teams will…
◦ Use the same grammars & engines
    ▪ We will both have limitations based on this. Ex: the Java grammar is based on 1.4, so we are limited to using Java 1.4.
◦ Test the same grammars & engines
    ▪ We will have two test beds.

Team Collaboration (cont.)

• Both teams met Monday (11-8-10) after class and performed the required Pair Programming.
• Current Status:
◦ Team 2
    ▪ All project source code has been made available.
    ▪ We are researching and working to update the Java and C# grammars.
◦ Team 3
    ▪ Team 3 is working on C++ parsing.
    ▪ Looking into another parser, ELSA.

Path Forward

Current Status & Path Forward for Next Semester

Where we stand…

• Iteration 1: Parsing -> 85%
◦ Completed parsing for Java & C#
◦ No parsing for C++
    ▪ But we have a foundation and design to start from.
• Iteration 2: Translation to CodeDOM -> 60%
◦ We have the foundation and design completed.
◦ Now, it is a matter of turning the crank for the languages.
• Iteration 3: Clone Analysis -> 30%
◦ Ported majority of Dr. Kraft’s student project code.
◦ Started focusing on the GUI

Task Understanding

Three Step Process
• Step 1: Code Translation (Source Files → Translator → Common Model)
• Step 2: Clone Detection (Common Model → Inspector → Detected Clones)
• Step 3: Visualization (Detected Clones → UI → Clone Visualization)

Schedule

Path Forward

• Our next step is to re-evaluate where we currently stand.
◦ Revisit Release Plan
    ▪ Pull in Software Studio I work that was not completed.
◦ Revisit User Stories
◦ Start off strong with unit tests not completed.

http://www.extremeprogramming.org/map/iteration.html
