Scalable Clone Detection and Elimination for Erlang Programs
Huiqing Li, Simon Thompson
University of Kent, Canterbury, UK
Overview
Erlang
Wrangler
Clone detection
Clone elimination
Case studies
Conclusions and future work
Erlang
• Dynamically typed functional programming language.
• Built-in support for concurrency, distribution and fault-tolerance.
• Some eccentricities: multiple binding occurrences, bound variables in patterns, multiple usages of atoms, side-effects, ...
%% Factorial in Erlang.
-module(fac).
-export([fac/1]).
fac(0) -> 1;
fac(N) when N > 0 -> N * fac(N-1).
Wrangler
Basic refactorings: structural, macro, process and test-framework related
Clone detection + removal
Improve module structure
Clone Detection
• The Wrangler clone detector
– Report clone classes whose members are
identical or similar
– No false positives
– High recall rate
– Scalable.
What is ‘identical’ code?
X+4        Y+5
Both are instances of: variable+number
Identical if values of literals and variables are ignored, but respecting binding structure.
What is ‘similar’ code?
(X+3)+4        4+(5-(3*X))
Both are instances of: X+Y
Anti-unification gives the (most specific) common generalisation.
$\text{Similarity} = \min\left( \frac{\|X+Y\|}{\|(X+3)+4\|},\ \frac{\|X+Y\|}{\|4+(5-(3*X))\|} \right)$
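The anti-unification and similarity computation can be sketched on a toy expression type. This is a minimal illustration, not Wrangler's implementation: the tuple representation, the module and function names, and the choice of "number of nodes" as the size measure are all assumptions made for this sketch.

```erlang
-module(au_sketch).
-export([anti_unify/2, expr_size/1, similarity/2]).

%% Toy expression type (an assumption of this sketch):
%%   {var, Name} | {int, N} | {op, Op, Left, Right}
%% {hole, K} marks a position generalised by anti-unification.

%% Returns {Generalisation, [{K, SubExprOf1, SubExprOf2}]}.
%% The substitution list holds what each hole stands for in each expression.
anti_unify(E1, E2) ->
    {G, Subst, _Next} = au(E1, E2, [], 1),
    {G, lists:reverse(Subst)}.

%% Equal subtrees are kept as they are.
au(E, E, S, N) ->
    {E, S, N};
%% Same operator: generalise the operands pairwise.
au({op, Op, L1, R1}, {op, Op, L2, R2}, S0, N0) ->
    {L, S1, N1} = au(L1, L2, S0, N0),
    {R, S2, N2} = au(R1, R2, S1, N1),
    {{op, Op, L, R}, S2, N2};
%% Otherwise introduce a hole, reusing one if the same pair was seen before.
au(E1, E2, S, N) ->
    case [K || {K, A, B} <- S, A =:= E1, B =:= E2] of
        [K | _] -> {{hole, K}, S, N};
        []      -> {{hole, N}, [{N, E1, E2} | S], N + 1}
    end.

%% Size measure assumed here: number of nodes, each hole counting as one.
expr_size({op, _Op, L, R}) -> 1 + expr_size(L) + expr_size(R);
expr_size(_Leaf) -> 1.

%% Ratio of the generalisation's size to each expression's size; take the minimum.
similarity(E1, E2) ->
    {G, _Subst} = anti_unify(E1, E2),
    SG = expr_size(G),
    min(SG / expr_size(E1), SG / expr_size(E2)).
```

On the slide's example, (X+3)+4 and 4+(5-(3*X)) generalise to a sum of two holes, which corresponds to X+Y. Under the node-count assumption the score is min(3/5, 3/7), well below a 0.8 threshold, whereas X+4 and Y+5 score 1.0.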
Clone Detection
• All clones in a project meeting the threshold
parameters.
• Thresholds:
– minimum number of expressions,
– minimum number of tokens,
– minimum number of duplications,
– maximum number of new parameters, and
– minimum similarity score.
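One way to picture the thresholds is as a filter over summaries of candidate clone classes. The sketch below is hypothetical: the map keys and function names are invented for illustration, and it assumes that all five thresholds must hold at once and that the example values follow the order of the list above.

```erlang
-module(threshold_sketch).
-export([example_thresholds/0, meets/2]).

%% Example values 1, 40, 2, 4, 0.8, read in the order of the list above
%% (the ordering is an assumption of this sketch).
example_thresholds() ->
    #{min_exprs => 1, min_tokens => 40, min_dups => 2,
      max_new_params => 4, min_similarity => 0.8}.

%% A clone class summary is a map with hypothetical keys:
%%   exprs, tokens, dups, new_params, similarity.
meets(#{exprs := E, tokens := T, dups := D, new_params := P, similarity := S},
      #{min_exprs := ME, min_tokens := MT, min_dups := MD,
        max_new_params := MP, min_similarity := MS}) ->
    E >= ME andalso T >= MT andalso D >= MD
        andalso P =< MP andalso S >= MS.
```

Raising the minimum sizes and lowering the number of allowed new parameters reports fewer but more clear-cut clones.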
Clone result with threshold values: 1, 40, 2, 4, 0.8:
Clone result with threshold values: 3, 20, 2, 2, 0.8:
Implementation
• Clone detection in an incremental way.
– Initial clone detection.
– Incremental clone detection.
• AST-based two-phase clone detection.
Source Erlang programs
  → Parse program, annotate and serialise AST
Serialised AAST
  → Generalise and hash expression
Hashed expression sequences
  → Clone detection using generalised suffix tree
Initial clone candidates
  → Examination of clone candidates using anti-unification
Final clones
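The candidate-finding step looks for repeated runs inside the integer sequences produced by hashing. The sketch below deliberately swaps the generalised suffix tree for a naive windowing approach, which is far less scalable but shows the idea; the module and function names are hypothetical.

```erlang
-module(repeat_sketch).
-export([repeated/2]).

%% repeated(Seqs, Len) -> [{Window, [{SeqIndex, Offset}]}]
%% Every run of exactly Len integers that occurs at least twice
%% across the given sequences, with the positions where it occurs.
%% Naive stand-in for a generalised suffix tree; overlapping
%% occurrences inside one sequence are not filtered out.
repeated(Seqs, Len) ->
    Indexed = lists:zip(lists:seq(1, length(Seqs)), Seqs),
    Windows = [{lists:sublist(Seq, Off, Len), {I, Off}}
               || {I, Seq} <- Indexed,
                  length(Seq) >= Len,
                  Off <- lists:seq(1, length(Seq) - Len + 1)],
    Grouped = lists:foldl(
                fun({W, Pos}, Acc) ->
                        maps:update_with(W, fun(Ps) -> [Pos | Ps] end, [Pos], Acc)
                end, #{}, Windows),
    lists:sort([{W, lists:reverse(Ps)}
                || {W, Ps} <- maps:to_list(Grouped), length(Ps) >= 2]).
```

Each reported group is only a candidate: equal hashes mean equal generalised shape, so the candidates still have to be examined by anti-unification.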
The Initial Detection Algorithm
• Bypasses the Erlang pre-processor.
• Location information is included in the AST.
• Static semantic information is added to the AST, giving the annotated AST (AAST).
• The AAST is traversed and expression sequences are collected.
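As a rough illustration of parsing without the pre-processor while keeping location information, the standard `erl_scan` and `erl_parse` modules can turn an expression sequence into abstract syntax whose nodes carry {Line, Column} annotations. This is a sketch using the OTP standard library, not Wrangler's own parser; the module and function names are hypothetical.

```erlang
-module(parse_sketch).
-export([parse_exprs/1, start_lines/1]).

%% Scan and parse a comma-separated expression sequence ending in a dot.
%% No pre-processing takes place: erl_scan and erl_parse work directly
%% on the source text. Starting at {1, 1} makes every annotation a
%% {Line, Column} pair.
parse_exprs(Source) ->
    {ok, Tokens, _EndLoc} = erl_scan:string(Source, {1, 1}),
    {ok, Exprs} = erl_parse:parse_exprs(Tokens),
    Exprs.

%% The annotation is the second element of every abstract-format node.
start_lines(Exprs) ->
    [erl_anno:line(element(2, E)) || E <- Exprs].
```

For whole files, `epp_dodger` in the syntax_tools application is a standard-library parser that also avoids pre-processing, so macros and directives stay visible in the tree.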
• Capture structural similarity between expressions while keeping a structural skeleton of the original.
• Replace certain subtrees with a placeholder, but only if sensible to do so.
• Each expression statement is hashed and mapped to an integer; therefore each expression sequence is mapped to a sequence of integers.
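A minimal sketch of generalise-and-hash, assuming that variables and numeric, character and string literals are the subtrees replaced by placeholders while atoms and operators are kept. Which subtrees Wrangler actually replaces is not stated here, so that choice, like the module and function names, is an assumption.

```erlang
-module(gh_sketch).
-export([generalised/1, hash_seq/1]).

%% Parse an expression sequence, erase locations, and replace variables
%% and literals with placeholders so that only the skeleton remains.
generalised(Source) ->
    {ok, Tokens, _} = erl_scan:string(Source),
    {ok, Exprs} = erl_parse:parse_exprs(Tokens),
    [gen(erl_parse:map_anno(fun(_) -> erl_anno:new(0) end, E)) || E <- Exprs].

%% One integer per expression, so a sequence of expressions becomes
%% a sequence of integers.
hash_seq(Source) ->
    [erlang:phash2(G) || G <- generalised(Source)].

gen({var, A, _Name}) ->
    {var, A, '_'};
gen({Lit, A, _Value}) when Lit =:= integer; Lit =:= float;
                           Lit =:= char; Lit =:= string ->
    {literal, A};
gen(T) when is_tuple(T) ->
    list_to_tuple([gen(X) || X <- tuple_to_list(T)]);
gen(L) when is_list(L) ->
    [gen(X) || X <- L];
gen(Other) ->
    Other.
```

With this sketch X + 4 and Y + 5 hash to the same integer, while X * 4 keeps a different skeleton. Binding structure is ignored at this stage, which is one reason the candidates need a second, more precise check.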
• Check each candidate clone class by anti-unification; the check returns none, one or more clone classes.
• Generation of the anti_unifier function.
• Generation of application instances.
The Initial Detection Algorithm
• Designed with incremental clone detection in
mind.
– Use relative locations, every function starts from
location {1, 1};
– Intermediate information cached: AAST, static
semantic information, hash information, clone
table.
The Incremental Detection Algorithm
• Follow the same steps as the initial detection
algorithm, but reuse and incrementally update
the information cached from the previous run
of the clone detection.
• Take a function, instead of a file, as a unit to
track changes.
• Track the change of clones, mark each clone
class as new, unchanged, changed+, changed-,
or changed+-.
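Function-level change tracking with relative locations can be sketched as follows. Each function form is fingerprinted after its line numbers are rebased so that the function starts at line 1, which means a function that merely moved within the file keeps its fingerprint. The module and function names and the use of `erlang:phash2/1` as the fingerprint are assumptions of this sketch; it tracks lines only, not columns.

```erlang
-module(incr_sketch).
-export([parse_forms/1, fingerprints/1, changed/2]).

%% Split the token stream at each dot and parse every chunk as a form.
parse_forms(Source) ->
    {ok, Tokens, _} = erl_scan:string(Source),
    [begin {ok, F} = erl_parse:parse_form(Ts), F end || Ts <- split(Tokens, [], [])].

split([], _Cur, Acc) ->
    lists:reverse(Acc);
split([{dot, _} = D | Rest], Cur, Acc) ->
    split(Rest, [], [lists:reverse([D | Cur]) | Acc]);
split([T | Rest], Cur, Acc) ->
    split(Rest, [T | Cur], Acc).

%% Map {Name, Arity} to a hash of the function with relative line numbers.
fingerprints(Forms) ->
    maps:from_list([{{Name, Arity}, erlang:phash2(relative(F))}
                    || {function, _, Name, Arity, _} = F <- Forms]).

%% Rebase every line annotation so the function starts at line 1.
relative({function, Anno, _, _, _} = F) ->
    Start = erl_anno:line(Anno),
    erl_parse:map_anno(
      fun(A) -> erl_anno:set_line(erl_anno:line(A) - Start + 1, A) end, F).

%% Functions that are new or whose fingerprint differs from the cached one.
changed(Old, New) ->
    lists:sort([K || {K, H} <- maps:to_list(New),
                     maps:get(K, Old, undefined) =/= H]).
```

Only the functions reported by `changed/2` would need to be re-annotated and re-hashed; cached information for the rest can be reused.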
Clone Elimination
• Fully automatic clone elimination is not desirable in practice.
– The clones to remove need to be chosen.
– The functionality of the clone needs to be examined.
– The anti-unification function of a clone class, and its parameters, need to be renamed.
– A host module for the anti-unification function needs to be selected.
Clone Elimination with Wrangler
• Copy and paste the anti_unification function into a suitable Erlang module.
• Modify the anti_unification function if necessary.
• Rename the function.
• Rename its variables.
• Re-order the function parameters.
• Apply ‘fold expressions against a function definition’ to the new function.
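The effect of these steps can be illustrated with a small hand-written before/after pair. The code is hypothetical and is not tool output: the function and parameter names are the kind a developer might pick after renaming.

```erlang
-module(elim_sketch).
-export([before_a/1, before_b/1, after_a/1, after_b/1]).

%% Before: two similar expression sequences that differ only in literals.
before_a(X) ->
    Y = X + 3,
    {ok, Y * 2}.

before_b(X) ->
    Y = X + 10,
    {ok, Y * 5}.

%% The generalisation, after it has been placed in a host module and the
%% function and its parameters have been given meaningful names.
scale_after_offset(X, Offset, Factor) ->
    Y = X + Offset,
    {ok, Y * Factor}.

%% After folding: each clone instance becomes a call to the new function.
after_a(X) -> scale_after_offset(X, 3, 2).
after_b(X) -> scale_after_offset(X, 10, 5).
```

The differing literals become parameters of the new function, which is why the maximum number of new parameters is a useful threshold: clones needing many parameters tend to yield awkward abstractions.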
Case Study 1
Incremental vs. Standalone Clone Detection
Case Study 2
SIP case study
Session Initiation Protocol
SIP message processing allows rewriting rules to transform messages.
SIP message manipulation (SMM) is tested by smm_SUITE.erl, 2658 LOC.
Clone detection
Reducing the case study
Step  LOC     Step  LOC     Step  LOC
1     2658    6     2218    11    2131
2     2342    7     2203    12    2097
3     2231    8     2201    13    2042
4     2217    9     2183    …     …
5     2216    10    2149
Case Study 3
Conclusions
• Efficient clone detection on medium-sized projects.
• Possible to improve code using these techniques, but only with expert involvement.
• A mechanism for clone detection to contribute to the daily reports from incremental nightly builds; case study for this with LambdaStream.
Future Work
• To extend the tool to detect expression sequences which are similar up to insertion or deletion of some expressions.
• To check client code against libraries.
http://www.cs.kent.ac.uk/projects/wrangler/
Thank you!