internet-scale real-time code clone search via multi-level indexing
DESCRIPTION
Paper Title: "Internet-scale Real-time Code Clone Search via Multi-level Indexing"
Conference: "WCRE 2011"
Abstract: Finding lines of code similar to a code fragment across large knowledge bases in fractions of a second is a new branch of code clone research also known as real-time code clone search. Among the requirements real-time code clone search has to meet are scalability, short response time, scalable incremental corpus updates, and support for type-1, type-2, and type-3 clones. We conducted a set of empirical studies on a large open source code corpus to gain insight about its characteristics. We used these results to design and optimize a multi-level indexing approach using hash table-based and binary search to improve Internet-scale real-time code clone search response time. Finally, we performed an evaluation on an Internet-scale corpus (1.5 million Java files and 266 MLOC). Our approach maintains a response time for 99.9% of clone searches in the microseconds range, while supporting the aforementioned requirements.
TRANSCRIPT
Internet-scale Real-time Code Clone Search via Multi-level Indexing
Iman Keivanloo
Juergen Rilling Philippe Charland
WCRE 2011 Working Conference on Reverse Engineering
Agenda
• Research Context
• Requirements and Objectives
• Characteristics of Source Code (step 1)
• Clone search Approach (step 2)
• Performance Evaluation
Research Context
• Internet-scale Code Search “is searching the Internet for source code to help solve a software development problem” [GV09]
– e.g. SE-CodeSearch [KE10] & Sourcerer [BR09]
• Challenge – Long response time (slow)
• Idea – Exploiting Clone Search
[Figure: diagram relating Query, Code Search Engine, Partial Result-set, Clone Search Engine, and Full Result-set]
The Input and its Clones (Definition)
• The input is – (1) a code fragment and
– (2) a target line which matches the relevant functionality
[Figure: an input code fragment and its clones]
Research Motivation
Can Clone Search be successfully applied for Internet-scale Code Search?
Background
• Definition: Clone Search – “Finding lines of code similar to a code fragment”
– Clone Search vs. Clone Detection
– aka Real-time, Instant, just-in-time …
• Related Work
– Hummel et al. [ICSM10] 128-bit hash-based indexing
• Research Opportunities:
– false positive rate
– Speed & granularity
– Type-3 Clone
– SeClone [ICPC11] clustering and IR to group false and true positives
Requirements, Objectives, & Approach
• Requirements:
– Scalability
– Short response time
– Scalable incremental corpus updates
– Type-1, type-2, and type-3 clones
• Objectives:
– Scalability & Speed & Granularity
– Type-3 clones
• IJaDataset:
– ~18,000 projects
– 1,500,000 unique Java classes
– ~300 MLOC
– The largest inter-project Java source code dataset for clone search
– Available online at http://aseg.cs.concordia.ca/seclone
Requirements, Objectives, & Approach (2)
• Requirements:
– Scalability
– Short response time
– Scalable incremental corpus updates
– Type-1, type-2, and type-3 clones
• Objectives:
– Scalability & Speed & Granularity
– Type-3 clones
• Approach:
– Statistical analysis
– Algorithmic approach
Granularity Effect on Clone Search
• Three Level Similarity (TLS): set of similar three-line code fragments
• First Level Similarity (FLS): single-line patterns
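The difference between the two granularities can be illustrated with a minimal Java sketch. It assumes that lines are already normalized and that a group is identified by the `String.hashCode()` of the window text; the class name `GranularityGroups` and the method `group` are hypothetical and are not taken from the paper.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class GranularityGroups {
    // Groups start positions by the hash of `window` consecutive lines.
    // window = 1 gives single-line (FLS-style) groups,
    // window = 3 gives three-line (TLS-style) groups.
    static Map<Integer, List<Integer>> group(List<String> lines, int window) {
        Map<Integer, List<Integer>> groups = new HashMap<>();
        for (int i = 0; i + window <= lines.size(); i++) {
            // Assumption: the group key is the hash of the joined window text.
            String key = String.join("\n", lines.subList(i, i + window));
            groups.computeIfAbsent(key.hashCode(), k -> new ArrayList<>()).add(i);
        }
        return groups;
    }
}
```

For the toy input `a, b, c, a, b, d`, the single-line grouping puts both occurrences of `a` into one group, while every three-line window lands in a group of its own, which is the effect the following slides quantify on the real corpus.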
Granularity Effect on Clone Search (2)
• TLS groups with fewer than 2,000 members
Granularity Effect on Clone Search (3)
• Observation result:
– TLS distributes the candidates into 3.9 times more groups
– Its groups are 6 times smaller than those of FLS
Granularity Effect on Clone Search (4)
• Conclusion:
– TLS heuristic is practical for real-time clone search, as long as the outliers are handled properly
– Why?
• (1) each TLS group has 2.37 members on average
• (2) it distributes candidates in small-size groups
• (3) for each query, only one group must be evaluated
What Does an Outlier Pattern Look Like?
• Outlier Definition: patterns with more than 2,000 occurrences
• Observation result:
– Only ~1,000 patterns out of 30M
– ~0.01% of patterns
– Mostly insignificant code patterns
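A minimal Java sketch of how such outliers could be separated from regular patterns is given here. The 2,000-occurrence threshold comes from the slide; the class `OutlierFilter`, the method `outliers`, and the idea of counting pattern strings in a hash map are assumptions made for illustration only.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class OutlierFilter {
    // Threshold taken from the slide: more than 2,000 occurrences.
    static final int OUTLIER_THRESHOLD = 2000;

    // Returns the patterns that occur more than `threshold` times.
    static Set<String> outliers(List<String> patterns, int threshold) {
        Map<String, Integer> counts = new HashMap<>();
        for (String p : patterns) {
            counts.merge(p, 1, Integer::sum);
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            if (e.getValue() > threshold) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With a small threshold for demonstration, a corpus in which `}` appears three times and every other line once yields `}` as the only outlier, in line with the observation that outliers are mostly insignificant patterns.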
Sampling Effectiveness
• An alternative way to address scalability and speed
– e.g., Barbour et al.
• Observation:
– Analysis of distinct patterns per file
• Observation result:
– 33% of the files contain 91% of the popular patterns
Internet-scale Real-time Code Clone Search via Multi-level Indexing
• Based on SeClone [ICPC11] architecture
• Advantages
– Internet-scale & speed: 32-bit hash values
– Type-3 clones: multi-level indexing
– Customized for Internet-scale code search: special transformation rule
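The slides do not spell out the index layout, so the following Java sketch is only one way to combine a hash table with binary search, as the abstract describes. It assumes that level 1 maps the 32-bit hash of a normalized line to a sorted array of positions, and that level 2 is a binary search in that array; tolerating a gap of a few lines between two matched lines stands in for type-3 support. The class `TwoLevelIndex`, its methods, and the position encoding are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TwoLevelIndex {
    // Level 1: 32-bit line hash -> sorted array of encoded positions.
    private final Map<Integer, long[]> index = new HashMap<>();

    // Assumed encoding: file id in the high 32 bits, line number in the low 32 bits.
    static long position(int fileId, int lineNo) {
        return ((long) fileId << 32) | lineNo;
    }

    TwoLevelIndex(List<List<String>> files) {
        Map<Integer, List<Long>> buckets = new HashMap<>();
        for (int f = 0; f < files.size(); f++) {
            List<String> lines = files.get(f);
            for (int l = 0; l < lines.size(); l++) {
                buckets.computeIfAbsent(lines.get(l).hashCode(), k -> new ArrayList<>())
                       .add(position(f, l));
            }
        }
        for (Map.Entry<Integer, List<Long>> e : buckets.entrySet()) {
            long[] sorted = e.getValue().stream().mapToLong(Long::longValue).sorted().toArray();
            index.put(e.getKey(), sorted);
        }
    }

    // Returns positions of `first` that are followed by `second` in the same
    // file, with at most `maxGap` other lines in between.
    List<Long> search(String first, String second, int maxGap) {
        List<Long> hits = new ArrayList<>();
        long[] starts = index.getOrDefault(first.hashCode(), new long[0]);
        long[] followers = index.get(second.hashCode());
        if (followers == null) {
            return hits;
        }
        for (long start : starts) {
            for (int gap = 0; gap <= maxGap; gap++) {
                // Level 2: binary search for the expected follower position.
                if (Arrays.binarySearch(followers, start + 1 + gap) >= 0) {
                    hits.add(start);
                    break;
                }
            }
        }
        return hits;
    }
}
```

With `maxGap = 0` only exactly adjacent lines match; raising `maxGap` to 1 also accepts a candidate in which one extra line was inserted between the two query lines.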
Is a 32-bit Hash Code Strong Enough for Clone Search?
• Input data:
– IJaDataset (300 MLOC)
• Evaluation criterion:
– Whether two distinct strings receive the same key
• Hash function:
– JDK standard hash function for the String type
• Observation result:
– A 32-bit hash code is strong enough
– Only a 0.002% error rate
– For example, only 10 cases in which three distinct strings share the same key
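The evaluation criterion above can be sketched in Java with the JDK's own `String.hashCode()`. The class `HashStrength` and the method `collidingKeys` are hypothetical names; the sketch only counts how many 32-bit keys are shared by at least two distinct strings and does not reproduce the study's data.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class HashStrength {
    // Counts the 32-bit keys that are shared by two or more distinct strings.
    static int collidingKeys(List<String> patterns) {
        Map<Integer, Set<String>> byKey = new HashMap<>();
        for (String p : patterns) {
            byKey.computeIfAbsent(p.hashCode(), k -> new HashSet<>()).add(p);
        }
        int colliding = 0;
        for (Set<String> distinct : byKey.values()) {
            if (distinct.size() > 1) {
                colliding++;
            }
        }
        return colliding;
    }
}
```

The strings `"Aa"` and `"BB"` are a well-known pair with the same `String.hashCode()` value (2112), so a list containing both reports one colliding key; repeated occurrences of the same string are not counted as a collision.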
Customized Transformation Rule
• Input data: Koders one-year query log
– ~10M records
• Observation purpose:
– Importance of method names
• Observation result:
– 98% success rate vs. 69%
• Result interpretation:
– In this context, method names are a more reliable source of information than class names, variable names, ...
• Observation conclusion:
– They must be preserved during transformation to increase precision
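A transformation that keeps method names while abstracting other identifiers might look like the following Java sketch. The actual rule is not shown in the slides, so everything here is an assumption: the class `MethodNamePreservingNormalizer`, the placeholder `id`, the small keyword list, and the heuristic that an identifier directly followed by `(` is a method name. String literals and constructor calls are not treated specially.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MethodNamePreservingNormalizer {
    // Identifier that does not start in the middle of another token (e.g. not the L in 1L).
    private static final Pattern IDENT =
        Pattern.compile("(?<![A-Za-z0-9_$])[A-Za-z_$][A-Za-z0-9_$]*");
    // Assumption: a small illustrative subset of Java keywords and literals.
    private static final Set<String> KEYWORDS = new HashSet<>(Arrays.asList(
        "int", "long", "boolean", "double", "void", "return", "new",
        "if", "else", "for", "while", "null", "true", "false"));

    static String normalize(String line) {
        StringBuilder out = new StringBuilder();
        Matcher m = IDENT.matcher(line);
        int last = 0;
        while (m.find()) {
            out.append(line, last, m.start());
            String token = m.group();
            boolean keep = KEYWORDS.contains(token) || isFollowedByParen(line, m.end());
            // Keywords and method names survive; other identifiers become "id".
            out.append(keep ? token : "id");
            last = m.end();
        }
        out.append(line, last, line.length());
        return out.toString();
    }

    // True if the next non-whitespace character at or after `from` is '('.
    private static boolean isFollowedByParen(String line, int from) {
        int i = from;
        while (i < line.length() && Character.isWhitespace(line.charAt(i))) {
            i++;
        }
        return i < line.length() && line.charAt(i) == '(';
    }
}
```

Under these assumptions, `int count = list.size();` and `int total = items.size();` both normalize to `int id = id.size();`, while `int n = xs.length();` stays distinct because the method name differs.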
Performance Evaluation (Response Time)
• Traditional clone detection
– 21 minutes for all clone pairs
– Under 3 minutes (excluding outliers)
• Regular queries – 25 microseconds
• 99.99% of queries – 900 microseconds
Conclusion
Step 1
• Studied characteristics of source code on the Internet
– unique patterns distribution (sampling application)
– Pattern frequencies (multi-level search)
– 32-bit hashing strength (code pattern)
– Outlier patterns
Step 2
• Designed an Internet-scale clone search
– Customized for code search (precision)
– Fine granularity
– Multi-level Indexing approach (Type-3 clone)
– Microsecond range response time (up to 10 times faster)
http://aseg.cs.concordia.ca/seclone
Question