Internet-scale Real-time Code Clone Search via Multi-level Indexing Iman Keivanloo Juergen Rilling Philippe Charland WCRE 2011 Working Conference on Reverse Engineering

Upload: imanmahsa

Post on 28-Nov-2014

DESCRIPTION

Paper Title: "Internet-scale Real-time Code Clone Search via Multi-level Indexing"

Conference: "WCRE 2011"

Abstract: Finding lines of code similar to a code fragment across large knowledge bases in fractions of a second is a new branch of code clone research also known as real-time code clone search. Among the requirements real-time code clone search has to meet are scalability, short response time, scalable incremental corpus updates, and support for type-1, type-2, and type-3 clones. We conducted a set of empirical studies on a large open source code corpus to gain insight about its characteristics. We used these results to design and optimize a multi-level indexing approach using hash table-based and binary search to improve Internet-scale real-time code clone search response time. Finally, we performed an evaluation on an Internet-scale corpus (1.5 million Java files and 266 MLOC). Our approach maintains a response time for 99.9% of clone searches in the microseconds range, while supporting the aforementioned requirements.

TRANSCRIPT

Page 1: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Internet-scale Real-time Code Clone Search via Multi-level Indexing

Iman Keivanloo

Juergen Rilling Philippe Charland

WCRE 2011 Working Conference on Reverse Engineering

Page 2: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Agenda

• Research Context

• Requirements and Objectives

• Characteristics of Source Code (step 1)

• Clone Search Approach (step 2)

• Performance Evaluation


Page 3: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Research Context

• Internet-scale Code Search “is searching the Internet for source code to help solve a software development problem” [GV09]

– e.g. SE-CodeSearch [KE10] & Sourcerer [BR09]

• Challenge – Long response time (slow)

• Idea – Exploiting Clone Search

[Figure: diagram with the labels Query, Code Search Engine, Partial Result-set, Clone Search Engine, and Full Result-set]

Page 4: Internet-scale Real-time Code Clone Search via Multi-level Indexing

The Input and its Clones (Definition)

• The input is – (1) a code fragment and

– (2) a target line which matches the relevant functionality

[Figure: an input code fragment and its clones]

Page 5: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Research Motivation

Can Clone Search be successfully applied for Internet-scale Code Search?


Page 6: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Background

• Definition: Clone Search – “Finding lines of code similar to a code fragment”

– Clone Search vs. Clone Detection

– aka Real-time, Instant, just-in-time …

• Related Work

– Hummel et al. [ICSM10] 128-bit hash-based indexing

• Research Opportunities:

– false positive rate

– Speed & granularity

– Type-3 Clone

– SeClone [ICPC11] clustering and IR to group false and true positives


Page 7: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Requirements, Objectives, & Approach

• Requirements:

– Scalability

– Short response time

– Scalable incremental corpus updates

– Type-1, type-2, and type-3 clone

• Objectives:

– Scalability & Speed & Granularity

– Type-3 Clone

• IJaDataset

– ~18,000 projects

– 1,500,000 unique Java classes

– ~300 MLOC

– The largest inter-project Java source code dataset for clone search

– Available online at http://aseg.cs.concordia.ca/seclone

Page 8: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Requirements, Objectives, & Approach (2)

• Requirements:

– Scalability

– Short response time

– Scalable incremental corpus updates

– Type-1, type-2, and type-3 clone

• Objectives:

– Scalability & Speed & Granularity

– Type-3 Clone

• Approach:

– Statistical analysis

– Algorithmic approach

Page 9: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Granularity Effect on Clone Search

• Three Level Similarity (TLS): set of similar three-line code fragments

• First Level Similarity (FLS): single-line patterns
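
The difference between the two granularities can be illustrated with a small grouping sketch. It is not the authors' implementation: the function names `fls_groups` and `tls_groups` are hypothetical, and "similar" is simplified here to exact equality after trimming whitespace, which is an assumption.

```python
from collections import defaultdict


def fls_groups(lines):
    """Group line positions by single-line pattern (FLS-style, assumed exact match)."""
    groups = defaultdict(list)
    for i, line in enumerate(lines):
        groups[line.strip()].append(i)
    return groups


def tls_groups(lines):
    """Group window start positions by three consecutive lines (TLS-style)."""
    stripped = [line.strip() for line in lines]
    groups = defaultdict(list)
    for i in range(len(stripped) - 2):
        groups[tuple(stripped[i:i + 3])].append(i)
    return groups


# Toy corpus, invented for illustration only.
corpus = [
    "int i = 0;", "i++;", "return i;",
    "int i = 0;", "i--;", "return i;",
    "int i = 0;", "i++;", "return i;",
]
fls = fls_groups(corpus)
tls = tls_groups(corpus)
largest_fls = max(len(g) for g in fls.values())
largest_tls = max(len(g) for g in tls.values())
print(len(fls), largest_fls, len(tls), largest_tls)
```

Even on this toy input, the three-line key spreads the candidates over more groups with fewer members each, which is the effect the following slides quantify on the real corpus.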


Page 10: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Granularity Effect on Clone Search (2)

• TLS groups with fewer than 2,000 members


Page 11: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Granularity Effect on Clone Search (3)

• Observation result:

– TLS distributes the candidates into 3.9 times more groups than FLS

– TLS groups are 6 times smaller than FLS groups

Page 12: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Granularity Effect on Clone Search (4)

• Conclusion:

– TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly

– Why?

• (1) each TLS group has 2.37 members on average

• (2) it distributes candidates in small-size groups

• (3) for each query, only one group must be evaluated
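
Point (3) can be sketched as a single hash-table lookup. This is a minimal illustration under stated assumptions, not the published design: `build_index`, `search`, and `window_key` are hypothetical names, the window is assumed to be centred on the target line, and matching is exact after trimming whitespace.

```python
from collections import defaultdict


def window_key(lines, center):
    """Three-line key centred on `center` (assumed window layout)."""
    return tuple(line.strip() for line in lines[center - 1:center + 2])


def build_index(files):
    """Map every three-line window in the corpus to its (file, line) locations."""
    index = defaultdict(list)
    for name, lines in files.items():
        for center in range(1, len(lines) - 1):
            index[window_key(lines, center)].append((name, center))
    return index


def search(index, fragment, target):
    """Return the one group that shares the query's three-line key."""
    if not 1 <= target <= len(fragment) - 2:
        return []  # the target line needs a neighbour on each side
    return index.get(window_key(fragment, target), [])


# Toy corpus, invented for illustration only.
files = {
    "A.java": ["open();", "read();", "close();"],
    "B.java": ["init();", "open();", "read();", "close();"],
    "C.java": ["open();", "write();", "close();"],
}
index = build_index(files)
hits = search(index, ["open();", "  read();", "close();"], target=1)
print(hits)
```

Because the query touches exactly one bucket, the cost depends on that bucket's size rather than on the corpus size, which is why small average group sizes matter.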


Page 13: Internet-scale Real-time Code Clone Search via Multi-level Indexing

What Does an Outlier Pattern Look Like?

• Outlier definition: patterns with more than 2,000 occurrences

• Observation result:

– Only ~1,000 patterns out of 30M

– ~0.01% of patterns

– Mostly insignificant code patterns
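
The outlier definition above translates directly into a frequency filter. The sketch below uses the 2,000-occurrence threshold from the slide; the function name `split_outliers` and the sample patterns are invented for illustration, and how the authors actually treat outliers once identified is not shown here.

```python
from collections import Counter

OUTLIER_THRESHOLD = 2000  # "more than 2,000 occurrences", per the slide


def split_outliers(patterns, threshold=OUTLIER_THRESHOLD):
    """Separate patterns into regular ones and outliers by occurrence count."""
    counts = Counter(patterns)
    outliers = {p for p, n in counts.items() if n > threshold}
    regular = set(counts) - outliers
    return regular, outliers


# Invented sample: a closing brace is a typical "insignificant" pattern.
patterns = ["}"] * 2500 + ["return null;"] * 2000 + ["foo.bar(x);"] * 3
regular, outliers = split_outliers(patterns)
print(sorted(regular), sorted(outliers))
```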


Page 14: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Sampling Effectiveness

• An alternative to address scalability and speed

– e.g., Barbour et al.

• Observation: distinct-pattern-per-file analysis

• Observation result: 33% contains 91% of popular patterns


Page 15: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Internet-scale Real-time Code Clone Search via Multi-level Indexing

• Based on SeClone [ICPC11] architecture

• Advantages

– Internet-scale & speed: 32-bit hash values

– Type-3 clone: multi-level indexing

– Customized for Internet-scale code search: special transformation rule
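
The slides do not spell out the index layout, so the following is only one possible reading of "multi-level indexing" that combines a hash table with binary search, as the abstract mentions. Every name here (`build`, `search`) and the two-level layout itself are assumptions: level 1 is a hash table keyed by the target line, level 2 is a per-bucket list sorted by the neighbouring lines and probed with binary search. Entries that share the target line but differ in context are returned separately as looser, type-3-style candidates.

```python
import bisect
from collections import defaultdict


def build(files):
    """Level 1: dict keyed by the middle line. Level 2: sorted (context, location) list."""
    index = defaultdict(list)
    for name, lines in files.items():
        s = [line.strip() for line in lines]
        for c in range(1, len(s) - 1):
            index[s[c]].append(((s[c - 1], s[c + 1]), (name, c)))
    for bucket in index.values():
        bucket.sort()  # enables binary search on the context
    return index


def search(index, prev_line, target, next_line):
    """Return (exact matches, same-target matches with a different context)."""
    bucket = index.get(target.strip(), [])
    ctx = (prev_line.strip(), next_line.strip())
    pos = bisect.bisect_left(bucket, (ctx,))  # first entry with context >= ctx
    exact = []
    while pos < len(bucket) and bucket[pos][0] == ctx:
        exact.append(bucket[pos][1])
        pos += 1
    relaxed = [loc for c, loc in bucket if c != ctx]
    return exact, relaxed


# Toy corpus, invented for illustration only.
files = {
    "A.java": ["a();", "b();", "c();"],
    "B.java": ["a();", "b();", "d();"],
    "C.java": ["x();", "b();", "c();"],
}
index = build(files)
exact, relaxed = search(index, "a();", "b();", "c();")
print(exact, relaxed)
```

The design choice illustrated is that the coarse level narrows the search to one bucket in constant time, and the fine level locates exact matches in logarithmic time within that bucket.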


Page 16: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Is a 32-bit Hash Code Strong Enough for Clone Search?

• Input data: IJaDataset (300 MLOC)

• Evaluation criterion: whether two distinct strings receive the same key

• Hash function: JDK standard hash function for the String type

• Observation result:

– A 32-bit hash code is strong enough

– Only 0.002% error rate

– For example, only 10 cases in which three distinct strings share the same key
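
The kind of collision check described above can be reproduced in a few lines. The sketch reimplements the documented `String.hashCode()` formula (s[0]*31^(n-1) + ... + s[n-1], in 32-bit signed arithmetic) in Python so that it stays in one language; `java_string_hash` and `find_collisions` are hypothetical helper names, and the sample strings are not from IJaDataset.

```python
from collections import defaultdict


def java_string_hash(s):
    """32-bit signed hash following the String.hashCode() formula.

    Assumes characters in the Basic Multilingual Plane, where one Python
    character corresponds to one Java char.
    """
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h


def find_collisions(strings):
    """Return {hash: set of distinct strings} for keys shared by 2+ strings."""
    by_key = defaultdict(set)
    for s in strings:
        by_key[java_string_hash(s)].add(s)
    return {k: v for k, v in by_key.items() if len(v) > 1}


collisions = find_collisions(["Aa", "BB", "hello", "Aa"])
print(collisions)
```

"Aa" and "BB" are a well-known colliding pair for this hash function; repeated occurrences of the same string are not counted as collisions.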


Page 17: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Customized Transformation Rule

• Input data: Koders one-year query log (~10M records)

• Observation purpose: importance of method names

• Observation result: 98% success rate vs. 69%

• Result interpretation: in this context, method names are a more reliable source of information than class or variable names

• Observation conclusion: method names must be preserved during transformation to increase precision
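
A transformation rule of this kind could look like the sketch below. It is a simplified, regex-based illustration and not the rule used in the paper: the `normalize` function, the `ID` placeholder, and the abbreviated keyword list are assumptions, and "method name" is approximated as any identifier directly followed by an opening parenthesis (which also keeps constructor names). String literals and numeric literals with letters are not handled.

```python
import re

# Abbreviated keyword list; a real rule would need the full Java set.
JAVA_KEYWORDS = {
    "boolean", "break", "byte", "case", "catch", "char", "class", "continue",
    "do", "double", "else", "final", "float", "for", "if", "int", "long",
    "new", "null", "private", "public", "return", "short", "static", "this",
    "throw", "try", "void", "while",
}
IDENTIFIER = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*")


def normalize(line):
    """Replace identifiers with ID, but keep keywords and method names."""
    text = line.strip()

    def replace(match):
        word = match.group(0)
        if word in JAVA_KEYWORDS:
            return word
        if text[match.end():].lstrip().startswith("("):
            return word  # treated as a method name and preserved
        return "ID"

    return IDENTIFIER.sub(replace, text)


a = normalize("int total = list.size();")
b = normalize("  int count = items.size();")
c = normalize("int n = items.length();")
print(a, "|", c)
```

Under this rule the first two lines map to the same pattern despite different variable names, while the third stays distinct because it calls a different method.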


Page 18: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Performance Evaluation (Response Time)

• Traditional clone detection

– 21 minutes for all clone pairs

– Under 3 minutes (excluding outliers)

• Regular queries: 25 microseconds

• 99.99% of queries: 900 microseconds


Page 19: Internet-scale Real-time Code Clone Search via Multi-level Indexing

Conclusion

Step 1

• Studied characteristics of source code on the Internet

– unique patterns distribution (sampling application)

– Pattern frequencies (multi-level search)

– 32-bit hashing strength (code pattern)

– Outlier patterns

Step 2

• Designed an Internet-scale clone search

– Customized for code search (precision)

– Fine granularity

– Multi-level Indexing approach (Type-3 clone)

– Microsecond range response time (up to 10 times faster)


Page 20: Internet-scale Real-time Code Clone Search via Multi-level Indexing

http://aseg.cs.concordia.ca/seclone

Questions?