Page 1: CS590 Z  Matching Program Versions

CS590 Z

Matching Program Versions

Xiangyu Zhang

Page 2: CS590 Z  Matching Program Versions

Problem Statement

Suppose a program P’ is created by modifying P. Determine the difference between P and P’. For an artifact c’ in P’, decide whether c’ belongs to the difference; if not, find the correspondence of c’ in P.

• Static mapping
• Non-trivial

Name comparison? What if

• Clone analysis, comparison checking

Page 3: CS590 Z  Matching Program Versions

Motivations

Validate compiler transformations

Facilitate regression testing

Reverse obfuscation

Information propagation

Debugging

Code plagiarism detection

Information Assurance

Page 4: CS590 Z  Matching Program Versions

Approaches

Static Approaches
• Entity name based
• String based (MOSS)
• AST based (DECKARD)
• CFG based (JDIFF)
• PDG based (PDIFF)
• Binary based (BMAT)
• Log based (editor plugin, comparison checking)

Dynamic Approaches (not today)

Page 5: CS590 Z  Matching Program Versions

Static Approaches

Entity name matching
• Model a function/field as tuples
• Coarse grained matching

String matching
• Diff (CVS, Subversion)
• Longest common subsequence (LCS)
    Available operations are addition and deletion
    Matched pairs cannot cross one another
    Programs are far more complicated than strings
    Copy, paste, move
• CP-Miner (scales to Linux kernel clone detection)
    Frequent subsequence mining
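
A minimal sketch of LCS-based line matching, in the spirit of diff. The function name `lcs_pairs` and the sample program lines are hypothetical, and real diff tools use more refined algorithms. The example illustrates the point that matched pairs cannot cross: a line that merely moved is reported as one deletion plus one addition.

```python
def lcs_pairs(old, new):
    """Return index pairs (i, j) with old[i] == new[j] forming one longest common subsequence."""
    n, m = len(old), len(new)
    # dp[i][j] = length of the LCS of old[i:] and new[j:]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if old[i] == new[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the top-left corner to recover the matched pairs.
    pairs = []
    i = j = 0
    while i < n and j < m:
        if old[i] == new[j]:
            pairs.append((i, j))
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1  # old[i] is treated as deleted
        else:
            j += 1  # new[j] is treated as added
    return pairs


old_version = ["a = 1", "b = 2", "c = a + b", "print(c)"]
new_version = ["b = 2", "a = 1", "c = a + b", "print(c)"]  # first two lines swapped

matched = lcs_pairs(old_version, new_version)
unmatched_old = [line for i, line in enumerate(old_version)
                 if i not in {p[0] for p in matched}]
```

Here `"a = 1"` appears in both versions but stays unmatched, because pairing it would cross the pair for `"b = 2"`.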

Page 6: CS590 Z  Matching Program Versions

MOSS

Code plagiarism detection
• It also handles other digital contents

Challenges
• White space (variable name)
• Noise (“the”, “int i”)
• Order scrambling (paragraph reorders)

Problem statement
• Given a set of documents, identify substring matches that satisfy two properties:
    If there is a substring match at least as long as the guarantee threshold, t, then this match is detected;
    Do not detect any matches shorter than the noise threshold, k.

Page 7: CS590 Z  Matching Program Versions

MOSS

k-gram
• A contiguous substring of length k
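
A small sketch of k-gram extraction. The `normalize` step (dropping whitespace and lowercasing) is an assumed, simplified answer to the white space challenge; both function names are hypothetical.

```python
def normalize(text: str) -> str:
    """Drop all whitespace and lowercase the text (an assumed, simple normalization)."""
    return "".join(text.split()).lower()


def kgrams(text: str, k: int) -> list[str]:
    """Return every contiguous substring of length k, ordered by start position."""
    if k <= 0:
        raise ValueError("k must be positive")
    return [text[i:i + k] for i in range(len(text) - k + 1)]


grams = kgrams(normalize("A do run run"), 5)
# grams == ['adoru', 'dorun', 'orunr', 'runru', 'unrun']
```

A text of length n yields n - k + 1 k-grams, so almost every position starts one.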

Page 8: CS590 Z  Matching Program Versions

MOSS

Incremental hashing
• Hashing strings of length k is expensive for large k.
• “Rolling” hash function
    The (i+1)th k-gram hash = F (the ith k-gram hash, …)
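
The slide does not say which rolling hash is used. One common choice, assumed here, is a Karp-Rabin style polynomial hash, where each new hash is derived from the previous one in constant time. The `base` and `mod` values and the function names are illustrative assumptions.

```python
def direct_hash(s: str, base: int = 256, mod: int = 1_000_000_007) -> int:
    """Hash a string from scratch; costs O(len(s))."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % mod
    return h


def rolling_hashes(text: str, k: int, base: int = 256, mod: int = 1_000_000_007) -> list[int]:
    """Hash every k-gram of text, updating each hash from the previous one in O(1)."""
    if k <= 0 or len(text) < k:
        return []
    high = pow(base, k - 1, mod)          # weight of the character leaving the window
    h = direct_hash(text[:k], base, mod)  # only the first k-gram is hashed from scratch
    hashes = [h]
    for i in range(k, len(text)):
        h = (h - ord(text[i - k]) * high) % mod  # remove the outgoing character
        h = (h * base + ord(text[i])) % mod      # shift and add the incoming character
        hashes.append(h)
    return hashes


text = "adorunrunrunadorunrun"
hashes = rolling_hashes(text, 5)
```

The rolling values agree with hashing each k-gram from scratch, and equal k-grams (for example the two occurrences of `adoru`) receive equal hashes.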

Page 9: CS590 Z  Matching Program Versions

MOSS

Fingerprint selection
• A subset of hash values
• Our goals: find all matching substrings at least as long as t; ignore matches shorter than k
• Every t-th hash value
• Hash values that are 0 mod p
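
The two selection rules listed above can be sketched as follows. The hash values are made up for illustration and the function names are hypothetical. The sketch also measures the longest run of unselected hashes, which is one way to see why neither rule guarantees detection on its own: selecting by position is sensitive to insertions, and selecting hashes that are 0 mod p puts no bound on the gap between fingerprints.

```python
def select_every(hashes, t):
    """Keep the hashes at positions 0, t, 2t, ... (depends on position only)."""
    return [(i, h) for i, h in enumerate(hashes) if i % t == 0]


def select_mod_p(hashes, p):
    """Keep the hashes whose value is 0 mod p (depends on content only)."""
    return [(i, h) for i, h in enumerate(hashes) if h % p == 0]


def largest_gap(n, positions):
    """Longest run of consecutive unselected positions among n hashes."""
    gap = 0
    previous = -1
    for pos in list(positions) + [n]:
        gap = max(gap, pos - previous - 1)
        previous = pos
    return gap


sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]

by_value = select_mod_p(sample, 4)
gap = largest_gap(len(sample), [i for i, _ in by_value])

# Inserting one hash at the front shifts every position by one.
shifted = [99] + sample
same_values = [h for _, h in select_mod_p(shifted, 4)] == [h for _, h in by_value]
```

With these values, 0 mod 4 keeps only two hashes and leaves a run of eight unselected ones, while its selected values are unaffected by the insertion at the front.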

Page 10: CS590 Z  Matching Program Versions

MOSS

Winnowing
• Observation: given a sequence of hashes h1, …, hn, if n > t-k, then at least one of the hi must be chosen
• Have a sliding window with size w = t-k+1
• In each window select the minimum hash value; break ties by selecting the rightmost occurrence.
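
A minimal sketch of the window rule described above, assuming the k-gram hashes are already computed. The function name `winnow` and the sample hash values are illustrative; each fingerprint is recorded once together with its position.

```python
def winnow(hashes, w):
    """Select (position, hash) fingerprints: the rightmost minimum of every window of size w."""
    if w <= 0:
        raise ValueError("window size must be positive")
    selected = []
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        smallest = min(window)
        # Rightmost occurrence of the minimum inside this window.
        offset = w - 1 - window[::-1].index(smallest)
        fingerprint = (start + offset, smallest)
        # Consecutive windows often pick the same position; record it only once.
        if not selected or selected[-1] != fingerprint:
            selected.append(fingerprint)
    return selected


sample = [77, 74, 42, 17, 98, 50, 17, 98, 8, 88, 67, 39, 77, 74, 42, 17, 98]
t, k = 8, 5                        # assumed guarantee and noise thresholds
fingerprints = winnow(sample, t - k + 1)
# fingerprints == [(3, 17), (6, 17), (8, 8), (11, 39), (15, 17)]
```

Because every window contributes a fingerprint, no run of w consecutive hashes is left without a selected value, which is what ties the window size to the guarantee threshold t.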

Page 11: CS590 Z  Matching Program Versions

MOSS

Algorithm
• Build an index mapping fingerprints to locations for all documents.
• Each document is fingerprinted a second time and the selected fingerprints are looked up in the index; this gives the list of all matching fingerprints for each document.
• Sort (d, d1, fx), (d, d2, fy) by the first two elements.
• Matches between documents are rank-ordered by size (number of fingerprints)
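
The steps above can be sketched as follows. This is a simplified illustration, not the MOSS implementation: the index maps each fingerprint to document names only (no positions), and the document names and fingerprint values are invented.

```python
from collections import defaultdict


def rank_matches(fingerprints_by_doc):
    """Rank document pairs by the number of fingerprints they share."""
    # Step 1: index every fingerprint to the documents that contain it.
    index = defaultdict(set)
    for doc, fingerprints in fingerprints_by_doc.items():
        for f in fingerprints:
            index[f].add(doc)

    # Step 2: look each document's fingerprints up in the index and
    # group the resulting (doc, other, fingerprint) triples by document pair.
    shared = defaultdict(set)
    for doc, fingerprints in fingerprints_by_doc.items():
        for f in fingerprints:
            for other in index[f]:
                if other != doc:
                    shared[tuple(sorted((doc, other)))].add(f)

    # Step 3: rank pairs by match size, largest first (ties broken by name).
    ranked = [(pair, len(fs)) for pair, fs in shared.items()]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


docs = {
    "d1": {1, 2, 3, 4},
    "d2": {3, 4, 5},
    "d3": {4, 9},
}
ranking = rank_matches(docs)
# ranking == [(('d1', 'd2'), 2), (('d1', 'd3'), 1), (('d2', 'd3'), 1)]
```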

Page 12: CS590 Z  Matching Program Versions

MOSS

Advantages
• Guaranteed to detect any substring match at least as long as t

Limitations
• Minor edits defeat MOSS.
    x = a*b + c vs. z = c + a*b
• Insertion, deletion

Page 13: CS590 Z  Matching Program Versions

AST based matching

[YANG, 1991, Software Practice and Experience]
• Given two functions, build the ASTs
• Match the roots
• If the roots match, apply LCS to align the subtrees
• Continue recursively

Fragile
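
A sketch of the recursive root-then-children matching outlined above. The tree encoding (a `(label, children)` tuple) and the name `match_size` are assumptions made for illustration; the child alignment is an order-preserving, LCS-style dynamic program.

```python
def match_size(a, b):
    """Number of node pairs matched between trees a and b, each given as (label, children)."""
    label_a, kids_a = a
    label_b, kids_b = b
    if label_a != label_b:
        return 0  # roots differ: nothing below them is matched
    n, m = len(kids_a), len(kids_b)
    # dp[i][j] = best total for aligning the first i children of a with the first j of b,
    # keeping the children in order (matched pairs may not cross).
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1] + match_size(kids_a[i - 1], kids_b[j - 1]),
            )
    return 1 + dp[n][m]


def leaf(name):
    return (name, [])


product = ("*", [leaf("a"), leaf("b")])
# x = a*b + c
tree1 = ("=", [leaf("x"), ("+", [product, leaf("c")])])
# z = c + a*b
tree2 = ("=", [leaf("z"), ("+", [leaf("c"), product])])

matched_nodes = match_size(tree1, tree2)  # 5 of 7 nodes
```

Swapping the operands of `+` leaves `c` unmatched even though it is present in both trees, and a root mismatch would discard the whole subtree, which is one reading of why the approach is called fragile.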

Page 14: CS590 Z  Matching Program Versions

DECKARD (ICSE 2007)

Page 15: CS590 Z  Matching Program Versions

DECKARD

Advantages
• Scalability
• Insensitive to minor structural changes such as reordering, insertion, deletion

Limitations
• Structural similarity only
• Insertion that incurs structure change.

Page 16: CS590 Z  Matching Program Versions

CFG matching

Hammock graph (JDIFF, ASE 2004)
• Match classes by names
• Match fields by types
• Match methods by signatures
• Match instructions in methods by hammock graphs
    A hammock is a single entry single exit subgraph of a CFG.
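
A small check of the single entry single exit property on a CFG stored as an adjacency dictionary. The convention assumed here is that the entry node lies inside the candidate subgraph and the exit node is the single outside node that all leaving edges reach; definitions differ on this detail. The function name and the example CFG are hypothetical.

```python
def is_hammock(cfg, nodes):
    """Check that the node set has exactly one entry node and exactly one exit target."""
    nodes = set(nodes)
    # Nodes inside the set that are reached by an edge from outside.
    entries = {v for u, succs in cfg.items() if u not in nodes
               for v in succs if v in nodes}
    # Nodes outside the set that are reached by an edge from inside.
    exits = {v for u in nodes for v in cfg.get(u, ()) if v not in nodes}
    return len(entries) == 1 and len(exits) == 1


# CFG of: entry; if (cond) then-branch else else-branch; join; exit
cfg = {
    "entry": ["cond"],
    "cond": ["then", "else"],
    "then": ["join"],
    "else": ["join"],
    "join": ["exit"],
    "exit": [],
}

whole_if = is_hammock(cfg, {"cond", "then", "else"})  # entered at cond, left to join
two_arms = is_hammock(cfg, {"then", "else"})          # two separate entries
half_if = is_hammock(cfg, {"cond", "then"})           # leaves to both else and join
```

Only the complete if-statement qualifies, which is why hammocks give natural units for matching instructions between two versions of a method.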

Page 17: CS590 Z  Matching Program Versions

CFG matching

Pros
• Orthogonal
    Can be combined with other matching techniques
• Simple

Cons
• Coarse grained matching only
    Not good at clone detection
• In case of code transformation

Page 18: CS590 Z  Matching Program Versions

Semantic Based Matching

Using PDG (SAS’01)

Page 19: CS590 Z  Matching Program Versions

Semantic Based

Page 20: CS590 Z  Matching Program Versions

Semantic Based

Pros
• Non-contiguous, intertwined, reordered
• Insensitive to code transformations.

Cons
• Scalability
    Points-to analysis
• Starting from a matching pair seems to be a problem

Page 21: CS590 Z  Matching Program Versions

Wrap Up

For clone detection
• Maybe structural / text similarity is a good idea

For whole program matching / method matching with code transformations
• Semantic based is more appropriate

Scalability
• PDG < CFG | AST < STRING < NAME