detecting clones across microsoft .net programming languages (wcre2012)

Detecting Clones across Microsoft .NET Programming Languages. Farouq Al-omari, Iman Keivanloo, Chanchal K. Roy, Juergen Rilling. Working Conference on Reverse Engineering, Canada, Kingston, 18 October 2012 – I. Keivanloo. Contact: [email protected] This is not the original version given in the WCRE 2012 conference (no animation etc.)


DESCRIPTION

This presentation was given at the Working Conference on Reverse Engineering (WCRE 2012). The paper title is "Detecting Clones across Microsoft .NET Programming Languages".

Abstract: The Microsoft .NET framework and its language family focus on multi-language development to support interoperability across several programming languages. The framework allows for the development of similar applications in different languages through the reuse of core libraries. As a result of such multi-language development, the identification and traceability of similar code fragments (clones) becomes a key challenge. In this paper, we present a clone detection approach for the .NET language family. The approach is based on the Common Intermediate Language, which is generated by the .NET compiler for the different languages within the .NET framework. In order to achieve an acceptable recall while maintaining the precision of our detection approach, we define a set of filtering processes to reduce noise in the raw data. We show that these filters are essential for Intermediate Language-based clone detection, without significantly affecting the precision of the detection approach. Finally, we study the quantitative and qualitative performance aspects of our clone detection approach. We evaluate the number of reported candidate clone-pairs, as well as the precision and recall (using manual validation) for several open source cross-language systems, to show the effectiveness of our proposed approach.

TRANSCRIPT

Page 1: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Detecting Clones across Microsoft .NET Programming Languages

Farouq Al-omari Iman Keivanloo Chanchal K. Roy

Juergen Rilling

Working Conference on Reverse Engineering, Canada, Kingston 18 October 2012 – I. Keivanloo

Contact: [email protected]

This is not the original version given in the WCRE 2012 conference (no animation etc.)

Page 2: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

[Figure: Clones (Mergesort). Mergesort implementations on "The C# planet" and on "Other Planets".]

Page 3: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages
General Solution

• C#
• VB.NET
• J#
• F#
• COBOL (.NET)
• Java

Compilation → Intermediate Language (IL) (low level)

The solution is to use the IL (instead of dealing with several languages).

Page 4: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Is there any chance it works?

• Up to 3 times more cloned fragments detected using IL

Input data type:   CIL                              Source Code
Dataset            # Clone Class  # Clone Fragment  # Clone Class  # Clone Fragment
ASXGUI             9              393               69             261
Mono               37             4373              369            1523
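The "up to 3 times" figure can be checked against the fragment counts reported for ASXGUI and Mono above. The following is only a small arithmetic sketch; the function name `il_gain` is made up for illustration.

```python
# Clone fragment counts copied from the slide: (CIL input, source-code input).
clone_fragments = {
    "ASXGUI": (393, 261),
    "Mono": (4373, 1523),
}


def il_gain(cil_count: int, source_count: int) -> float:
    """Return how many times more clone fragments were reported on CIL input."""
    return cil_count / source_count


for dataset, (cil, src) in clone_fragments.items():
    print(f"{dataset}: {il_gain(cil, src):.2f}x")
```

For Mono the ratio is about 2.87, which is where the "up to 3 times" claim comes from; for ASXGUI it is about 1.51.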

Page 5: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Observed Challenges (using an example)

VB.NET:

    Sub Main()
        Dim x As Integer
        x = 10
        If x < 0 Then
            x += 1
        Else
            Console.WriteLine("Positive number")
        End If
    End Sub

C#:

    static void Main(string[] args)
    {
        int x = 10;
        if (x < 0)
            x++;
        else
            Console.WriteLine("Positive number");
    }

Page 6: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Observed Challenges (using an example)

[Figure: the same VB.NET and C# example, each shown next to the IL generated from it (IL from VB, IL from C#).]

Page 7: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Observed Challenges

For the same VB.NET and C# example:

1. Larger, unpredictable size at the IL level [Keivanloo IWSC'12]
2. Higher dissimilarity at the IL level

Page 8: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Observed Challenges #2: High Dissimilarity
Noise

• Sample IL

    IL_000c: ldloc.0
    IL_000d: ldc.i4.1
    IL_000e: add.ovf
    IL_000f: stloc.0
    IL_0010: br.s IL_0024
    IL_0012: nop
    IL_0013: ldstr "Positive number"
    IL_0018: call void [mscorlib]System.Console::WriteLine(string)

Major noise types:
• Line numbers
• Pointers to line numbers
• Push, Pop …
• Detailed data type information

Page 9: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
The Core Solution

• The challenge: noise
• Solution: data cleansing (filtering out the noise)
• Why? (Answer: to increase recall)

Source Code → IL + noise → Filters → IL - noise

Page 10: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Filters for noise reduction
Our Filter Set

Filter    Before filtering          After filtering   Example description
Filter 1  IL_0003: stloc.0          stloc.0           IL_0003 is the instruction address
Filter 2  brtrue.s IL_0015          brtrue.s          IL_0015 is the address of the branch destination
Filter 3  ldarg 3                   ldarg             The values 3 and 1 represent the argument number
          starg 1                   starg
Filter 4  ldc.i4.s 10               ldc.i4.s          10 is the number (pushed to the stack)
Filter 5  ldstr "Positive number"   ldstr             "Positive number" is the printed string constant
Filter 6  stloc 7                   stloc             7 represents the variable index
Filter 7  ldc.i4.s 10               ldc               i4 represents the int32 data type in CIL and s stands for short

Filter 8, before filtering:

    IL_0011: add
    IL_0012: stloc.0
    IL_0013: br.s IL_0020
    IL_001a: call void [mscorlib]System.Console::WriteLine (string)

Filter 8, after filtering:

    add
    stloc
    br
    call

Note that Filter 8 is just a nickname. Refer to the Filter 8 description section for more details.
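The slides do not show how the filters are implemented. The following is a minimal sketch, assuming ILDASM-style text lines, that reproduces the combined effect shown in the Filter 8 example (address label, operands, and type suffixes all stripped). The names `filter_il_line` and `filter_il` and the regular expression are illustrative assumptions, not the authors' code.

```python
import re

# Leading ILDASM address label such as "IL_0003:" (what Filter 1 removes).
_ADDRESS_LABEL = re.compile(r"^\s*IL_[0-9A-Fa-f]+:\s*")


def filter_il_line(line: str) -> str:
    """Reduce one IL instruction line to its bare opcode family.

    Assumed behaviour, mirroring the Filter 8 example on the slide:
    drop the address label, drop every operand (branch targets,
    constants, strings, indexes, call signatures), and drop the
    dotted suffix that encodes type or short-form information.
    """
    without_label = _ADDRESS_LABEL.sub("", line)
    tokens = without_label.split(None, 1)
    if not tokens:
        return ""
    mnemonic = tokens[0]                # e.g. "ldc.i4.s", "br.s", "stloc.0"
    return mnemonic.split(".", 1)[0]    # e.g. "ldc", "br", "stloc"


def filter_il(lines):
    """Apply filter_il_line to every line and skip lines that end up empty."""
    filtered = (filter_il_line(line) for line in lines)
    return [item for item in filtered if item]


sample = [
    "IL_0011: add",
    "IL_0012: stloc.0",
    "IL_0013: br.s IL_0020",
    "IL_001a: call void [mscorlib]System.Console::WriteLine (string)",
]
print(filter_il(sample))
```

On the four sample lines from the slide this yields `add`, `stloc`, `br`, `call`. Applying the individual filters separately, as in the contribution studies on later slides, would need one function per filter instead of this single combined pass.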

Page 11: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Filtering Advantage: Recall Improvement

For the VB.NET and C# versions of the example:

• Before filtering noise: ~50% similarity
• After filtering: ~90% similarity
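To see why filtering raises similarity, here is a minimal sketch of a Levenshtein-based similarity over instruction sequences. The normalisation (1 minus distance divided by the longer length) and the short IL sequences are assumptions made for illustration; they are not the actual IL of the example and will not reproduce the ~50% and ~90% figures.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute cost 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # delete item_a
                               current[j - 1] + 1,       # insert item_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical raw IL lines for the "x += 1" / "x++" part of the example.
raw_vb = ["IL_0008: ldloc.0", "IL_0009: ldc.i4.1", "IL_000a: add.ovf",
          "IL_000b: stloc.0", "IL_000c: nop"]
raw_cs = ["IL_000c: ldloc.0", "IL_000d: ldc.i4.1", "IL_000e: add",
          "IL_000f: stloc.0"]

# The same lines after hand-applying the filters (opcode family only).
filtered_vb = ["ldloc", "ldc", "add", "stloc", "nop"]
filtered_cs = ["ldloc", "ldc", "add", "stloc"]

print(similarity(raw_vb, raw_cs))
print(similarity(filtered_vb, filtered_cs))
```

In this toy case no raw line matches because every line carries a different address, while the filtered sequences differ only by the extra `nop`.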

Page 12: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Disadvantage of Noise Reduction
Danger!

• Data loss
• What if we remove important data during data cleansing?
• Might mislead the detection by making non-cloned pairs identical

Possible negative effect on precision

[Figure: Filtering Color Data]

Page 13: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

RQ: Are They (Filters) Dangerous?
Evaluation Preparation

1. Filter Contribution Formula:

2. Dataset preparation:
– Controlled dataset (iText.NET J#), 25 pairs * 3 languages
   1. The Cloned Dataset (VB-C#, VB-J#, and C#-J#)
   2. The Noncloned Dataset (VB-C#, VB-J#, and C#-J#)

Page 14: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #1

• Are they harmful? (The answer is NO. Based on the following graphs, the filters do not remove a similar amount of data from actual clones and from noncloned code fragments.)

[Figure: graphs for the Cloned Dataset and the NonCloned Dataset, with annotated values 0.3 and 0.2. Caption: "A strong threshold for the Judge to decide".]

Page 15: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #2

• Are they useful? (The answer is YES. Based on the given figure, our filters help to discriminate between actual clones and noncloned fragments, so it is possible to separate them with high confidence using the chosen threshold.)

Page 16: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

RQ: Are They (Filters) Dangerous?
Filter Contribution - Study #3

• Does filtering make actual clone pairs and noncloned pairs similar? (We used Chernoff faces, a kind of glyph, to see whether the filters make noncloned pairs similar to cloned code. Each face represents a pair. As you can see, the faces in Group A differ from those in Group B in most cases.)

Final Conclusion:

Filters help to discriminate between cloned and noncloned fragments.

Page 17: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

An Interesting Unexpected Discovery
Language-dependency!!!

Corresponding faces in each group are not similar, even though all of them are extracted from a single language (IL). Look in particular at the C#-J# faces: all of them differ from those of the other groups. This is an interesting discovery: the original high-level programming languages affect similarity at the IL level.

Page 18: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Our Clone Detection Framework

1. Input: .NET Code
   Source Code → MS .NET → EXE & DLL → IlDasm.exe → CIL (plain text)
2. CIL Manipulation for Clone Detection
   • Proposed Filtering Mechanism
3. Clone Detection Algorithms
   • SimHash-based (from SimCad)
   • Levenshtein Distance-based
   • LCS-based (from NiCad)
4. Clone Analysis
   • Clone Clusters
   • Merging
   • Source Code Mapping
5. Reporting
   • Report (CIL)
   • Report (Src Code)
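The framework lists an LCS-based comparison (from NiCad) among its detection algorithms. As a rough illustration of that family of measures, here is a sketch of a longest-common-subsequence similarity over filtered instruction sequences. Normalising by the longer sequence is an assumption made here and is not necessarily the measure NiCad uses; the sample sequences are made up.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    previous = [0] * (len(b) + 1)
    for item_a in a:
        current = [0]
        for j, item_b in enumerate(b, start=1):
            if item_a == item_b:
                # Extend the best subsequence that ends before both items.
                current.append(previous[j - 1] + 1)
            else:
                # Otherwise carry over the better of the two neighbours.
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_similarity(a, b):
    """Share of the longer sequence that is covered by the LCS."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else lcs_length(a, b) / longest


# Hypothetical filtered instruction sequences from two methods.
method_a = ["ldloc", "ldc", "add", "stloc"]
method_b = ["ldloc", "nop", "ldc", "stloc"]

print(lcs_length(method_a, method_b), lcs_similarity(method_a, method_b))
```

Unlike the edit-distance view, LCS only counts instructions that appear in the same order in both methods, so extra instructions inserted by one compiler lower the score without breaking the alignment of the rest.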

Page 19: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

The Selected Datasets for Performance Evaluation

System       Language  Files  LOC     Methods
ASXGUI 2.5   VB.NET    47     32,594  303
ASXGUI 3.0   C#        19     2,088   78

System       Language  Files  LOC     Methods
Mono 2.10    VB.NET    375    -       -
Mono 2.10    C#        57     -       -
Total                  432    -       4998

System       Language  Files  LOC     Methods
iText        C#        -      -       -
iText.NET    J#        -      -       -
Total                  2.5 K  600 K

4th Dataset: iText.NET dataset from the 1st case study
We used part of the iText.NET library to create our last dataset. This dataset contains source code related to iText.NET API usage written in three languages (C#, J#, and VB.NET). This feature makes the dataset an important resource for our study, since it allowed us to create a small (75 clone pairs) but controlled dataset (i.e., all actual clones are aligned, tagged and known across the languages), creating a unique oracle for further analysis. We use this oracle to obtain precise recall and precision measures, since the number of actual clones is known. This is in contrast to the other datasets, where recall and precision cannot be computed as precisely, since the actual number of clone pairs is unknown.

Page 20: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Our Clone Detection Framework Performance

[Figure: framework performance. Pay attention to changes within 0.6 … 0.8.]

Page 21: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Clone Detection across Languages using IL
Our Clone Detection Framework

• 2K clone pairs manually investigated

Thresholds:
• 0.6 Normal
• 0.7 High
• 0.8 Extreme

Precision: The optimum, considering the trade-off between precision and recall, was achieved using Levenshtein Distance-based comparison with the High threshold (80% TP).

Recall (iText.NET API): 76% using the High threshold across the three languages (C#, J#, and VB.NET).

TP = {E and S}
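The relationship between the three thresholds and the precision and recall figures can be sketched as follows. This assumes a pair is reported when its similarity score is at or above the threshold, which the slide does not state explicitly; the scored pairs and the oracle below are toy data, not results from the study.

```python
# Threshold names and values as given on the slide.
THRESHOLDS = {"Normal": 0.6, "High": 0.7, "Extreme": 0.8}


def evaluate(scored_pairs, actual_clones, threshold):
    """Return (precision, recall) for pairs scoring at or above threshold.

    scored_pairs maps a candidate pair to its similarity score;
    actual_clones is the oracle set of pairs known to be clones.
    """
    reported = {pair for pair, score in scored_pairs.items() if score >= threshold}
    true_positives = reported & actual_clones
    precision = len(true_positives) / len(reported) if reported else 0.0
    recall = len(true_positives) / len(actual_clones) if actual_clones else 0.0
    return precision, recall


# Toy data: four candidate pairs, three of which are actual clones.
scores = {("a", "b"): 0.9, ("c", "d"): 0.75, ("e", "f"): 0.65, ("g", "h"): 0.4}
oracle = {("a", "b"), ("c", "d"), ("g", "h")}

for name, value in THRESHOLDS.items():
    precision, recall = evaluate(scores, oracle, value)
    print(f"{name} ({value}): precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold removes false positives such as the hypothetical pair `("e", "f")` but eventually drops real clones too, which is the trade-off the slide refers to. Recall needs a known oracle, which is why it is reported only for the controlled iText.NET dataset.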

Page 22: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

An Interesting Clone Detected by Our Approach

C#:

    private static string filename_nodir(string name)
    {
        int slash = -1, len = name.Length;
        for (int i = 0; i < len; i++)
        {
            string sub = name.Substring(i, 1);
            if (sub == "\\" || sub == "/")
                slash = i;
        }
        slash++;
        return name.Substring(slash, len - slash);
    }

VB.NET:

    Function Filename_Nodir() As String
        Dim intFileName As Integer, intSlash As Integer, strFilename As String
        strFileName = editvid.video
        For intFilename = 1 To len(strFileName)
            If mid(strfilename, intfilename, 1) = "\" Or mid(strfilename, intfilename, 1) = "/" Then
                intslash = intFilename
            End If
        Next
        Return mid(strFileName, intSlash + 1, len(strFilename) - intSlash)
    End Function

* The matching algorithm was limited to the content available within the boxes (it was NOT aware of the identical method names).

Page 23: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Summary

• The first comprehensive research focusing on (1) .NET clone detection, (2) across programming languages, and (3) using Intermediate Language

• Identified challenges in cross-language clone detection using IL


Page 24: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Related Publication

Iman Keivanloo, Chanchal K. Roy, Juergen Rilling, "Java Bytecode Clone Detection via Relaxation on Code Fingerprint and Semantic Web Reasoning," 6th International Workshop on Software Clones (IWSC), 2012.

Page 25: Detecting Clones across Microsoft .NET Programming Languages (WCRE2012)

Any questions?
Contact: [email protected]