Pathways to Technology Transfer and Adoption: Achievements and Challenges

Pathways to Technology Transfer and Adoption: Achievements and Challenges. Dongmei Zhang (Microsoft Research Asia) and Tao Xie (North Carolina State University). ICSE 2013 SEIP Mini-Tutorial, May 23, 2013.


DESCRIPTION

Dongmei Zhang and Tao Xie. Pathways to Technology Transfer and Adoption: Achievements and Challenges. In Proceedings of the 35th International Conference on Software Engineering (ICSE 2013), Software Engineering in Practice (SEIP), Mini-Tutorial, San Francisco, CA, May 2013. http://people.engr.ncsu.edu/txie/publications/icse13seip-techtransfer.pdf

TRANSCRIPT

Page 1: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Pathways to Technology Transfer and Adoption: Achievements and Challenges

Dongmei Zhang

Microsoft Research Asia

Tao Xie

North Carolina State University

ICSE 2013 SEIP Mini-Tutorial

May 23, 2013

taoxie@gmail.com, dongmeiz@microsoft.com

Successful Samples: Research → Practice
• MSR SAGE
• ASTRÉE
• Statechart
• MSRA
• SPIN
• …

ACM SIGSOFT Impact Project

http://www.sigsoft.org/impact

Goals of the Impact Project
• Scholarly, objective, case-based evaluation
• Deliverables
  • peer-reviewed papers
  • presentation materials and outreach activities
  • expertise
• Community building
• Prospective for future research investment
• Lessons learned for "successful" research
  • but only with respect to transfer into practice (there are other measures of research success)
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

An Argument: Research/Product Timing (SCM)
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

Impact Trace Graph: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf


ICSE Papers: Industry vs. Academia
Source: © Carlo Ghezzi
OSDI 2008: 26% vs. SE: x% (developers, programmers, and architects among all attendees)

[Keynote slides shown: ICSM '11, ICSE '09, MSR '12, MSR '11, SCAM '12]

Mindset Change Is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness


Redwine and Riddle Study (1985)
• From idea to "the point it can be popularized and disseminated to the technical community at large"
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability
Source: © S. L. Pfleeger
Sam Redwine, Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.


Technology Maturation: Middleware
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf
15-20 years between first publication of an idea and widespread availability in products
Shall we just stay in our comfort zone and wait 15-20 years for our research to produce (or not produce) impact on practice? How about the research that we did 15-20 years ago?
[Caveat: don't forget the need for long-term/blue-sky research]

NSF Workshop on Formal Methods
• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational and infrastructural, and in transitioning ideas from research labs to developer tools
http://goto.ucsd.edu/~rjhala/NSFWorkshop/
Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.
December 2012


Researcher's View: SCM Impact Study Findings
• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as "engineering common sense"
Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher's Observation in the HCI Research Community
• "The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks."
• "This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper."
• "When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?"
• "We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics."
"I give up on CHI/UIST" by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay


Does our research community have similar issues?

Evaluation of Design/PL: "Research in Programming Languages"
• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." – PHP, JavaScript, Python, Ruby
• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"
• "reverse the trend of placing software research under the auspices of science and engineering [alone]"
Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what's needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling

Industrial Evaluations ≠ Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?
Need to value real adoption (e.g., in reviewing papers)


MS Academic Search: "Pointer Analysis"

"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind PASTE'01]
"During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, "Haven't we solved this problem yet?" With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind

"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind PASTE'01]
Section 4.3: Designing an Analysis for a Client's Needs
"Barbara Ryder expands on this topic: "… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.""
Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: "Clone Detection"

Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature after n years of effort on the intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
• CP-Miner: Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/


Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability
• Academia
  • Rarely asks "When scale is up, will my solution still work?"
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engines, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
  • Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA'02], KLEE [Cadar et al. OSDI'08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]
  • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'10]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenue)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group
• Mission: utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf


Research topics & technology pillars
• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing


Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools

Creating real impact
• Code Clone Analysis [Dang et al. ACSAC'12]
  • Detecting near-duplicated code
  • Released with Visual Studio 2012
• StackMine [Han et al. ICSE'12]
  • Performance debugging in the large via mining millions of stack traces
  • Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…

What does "getting real" mean?
• Solving real problems
• Building real technologies
• Making real impact
Software engineering is naturally tied with software development practice

Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability
• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
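The slide names the four steps but not their internals. As a rough illustration only, the sketch below wires up one plausible reading of such a pipeline in Python. Everything in it is an assumption made for illustration rather than XIAO's actual algorithm: the function names, the line-per-statement tokenizer, the shared-statement filter used for coarse matching, and the use of `difflib.SequenceMatcher` for fine matching.

```python
import re
from collections import Counter, defaultdict
from difflib import SequenceMatcher
from itertools import combinations

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")
KEYWORDS = {"for", "if", "else", "while", "return"}


def preprocess(source):
    """Step 1 (assumed): one normalized token string per non-empty line."""
    statements = []
    for line in source.splitlines():
        tokens = []
        for tok in TOKEN_RE.findall(line):
            if tok in KEYWORDS:
                tokens.append(tok)
            elif tok[0].isalpha() or tok[0] == "_":
                tokens.append("ID")    # abstract away identifier names
            elif tok.isdigit():
                tokens.append("NUM")   # abstract away literal values
            else:
                tokens.append(tok)
        if tokens:
            statements.append(" ".join(tokens))
    return statements


def coarse_match(units, min_shared=2):
    """Step 2 (assumed): cheap filter keeping pairs that share enough statements."""
    index = defaultdict(set)
    for name, stmts in units.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = Counter()
    for names in index.values():
        shared.update(combinations(sorted(names), 2))
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(units, candidates):
    """Step 3 (assumed): order-aware similarity of the two statement sequences."""
    return {
        (a, b): SequenceMatcher(None, units[a], units[b], autojunk=False).ratio()
        for a, b in candidates
    }


def prune(scores, threshold=0.7):
    """Step 4 (assumed): drop pairs whose similarity is below the threshold."""
    return {pair: s for pair, s in scores.items() if s >= threshold}


def find_clones(sources, threshold=0.7):
    """Run the four steps over a mapping of unit name to source text."""
    units = {name: preprocess(src) for name, src in sources.items()}
    return prune(fine_match(units, coarse_match(units)), threshold)
```

Because `preprocess` looks at one unit at a time, it could be mapped over partitions of the source tree independently, which is one way to read the "parallelizable based on source code partition" bullet; the sketch keeps everything sequential for brevity.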

What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • # of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
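To make the tuning knobs concrete, here is a small Python sketch of a similarity score with two adjustable parameters. It is an illustration under assumptions, not XIAO's metric: `stmt_threshold` (how alike two statements must be to count as a match) and `order_weight` (how much statement order matters) are hypothetical names, and the token-level comparison uses `difflib.SequenceMatcher` from the standard library.

```python
import re
from difflib import SequenceMatcher


def tokens(statement):
    """Split a statement into identifier, number, and punctuation tokens."""
    return re.findall(r"\w+|\S", statement)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, between 0 and 1."""
    return SequenceMatcher(None, tokens(s1), tokens(s2), autojunk=False).ratio()


def ordered_matches(a, b, same):
    """Longest common subsequence length: matches that preserve statement order."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if same(x, y) else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def unordered_matches(a, b, same):
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(b)
    count = 0
    for x in a:
        for y in unused:
            if same(x, y):
                unused.remove(y)
                count += 1
                break
    return count


def snippet_similarity(a, b, stmt_threshold=0.75, order_weight=0.5):
    """Blend order-preserving and order-free matches into a score in [0, 1]."""
    if not a and not b:
        return 1.0

    def same(x, y):
        return statement_similarity(x, y) >= stmt_threshold

    matched = (order_weight * ordered_matches(a, b, same)
               + (1 - order_weight) * unordered_matches(a, b, same))
    return 2 * matched / (len(a) + len(b))
```

Applied to the loop bodies of the two snippets on the slide, `order_weight=1.0` counts four order-preserving matches and scores 8/11 (about 0.73), while `order_weight=0.0` also accepts the moved `c = foo(a, b)` statement and scores 10/11 (about 0.91). Raising `stmt_threshold` to 1.0 stops `e = a + c` from matching `e = a + d`.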


Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models
• Pull
• Push
• Join

Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0-day vulnerability disclosure
[Timeline figure; labels: initial vulnerability reported in product A; patch release of product B; potential 0-day attack; security bulletin released; similar vulnerability found in product B by attackers]

Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
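The search scenario differs from batch detection in that the codebases are indexed once and each query is a single snippet. A minimal sketch of that idea follows, assuming an inverted index over normalized token k-grams; the class `CloneIndex`, the `shingles` helper, the containment score, and the file names in the usage note are all hypothetical and are not taken from the MSRC service.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=5):
    """Set of normalized token k-grams: identifiers become ID, numbers NUM."""
    norm = []
    for tok in TOKEN_RE.findall(code):
        if tok[0].isalpha() or tok[0] == "_":
            norm.append("ID")
        elif tok.isdigit():
            norm.append("NUM")
        else:
            norm.append(tok)
    return {tuple(norm[i:i + k]) for i in range(len(norm) - k + 1)}


class CloneIndex:
    """Inverted index from k-gram to the locations that contain it."""

    def __init__(self, k=5):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one code unit under a caller-chosen location label."""
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_containment=0.8):
        """Return (location, score) pairs; score is the fraction of the
        snippet's k-grams that also occur at that location."""
        grams = shingles(snippet, self.k)
        if not grams:
            return []
        hits = Counter()
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        results = [(loc, n / len(grams)) for loc, n in hits.items()
                   if n / len(grams) >= min_containment]
        return sorted(results, key=lambda r: (-r[1], r[0]))
```

For example, after indexing `len = read_u16(buf); memcpy(dst, buf + 2, len);` under the made-up label `product_a/parse.c`, a query with renamed variables such as `size = read_u16(p); memcpy(q, p + 2, size);` finds it with a score of 1.0, because only the token structure is compared. Lookup cost depends on the snippet size and posting-list lengths rather than on a scan of the indexed code, which is the property a near-real-time service needs.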

Vulnerability investigation workflow
[Figure: workflow stages: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix. Variants finding is a manual & ad hoc investigation: a code snippet goes from MSRC to Teams A, B, and C, and code clones come back.]

Vulnerability investigation workflow
[Figure: the same stages, with variants finding as an automated investigation: a code snippet goes to the clone search service, and code clones come back.]
• Completeness is the key
• Web service API for automation

More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense blog about this bulletin:
"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-of-band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)
• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community
• Searching for similar snippets so that a bug is fixed once, everywhere it appears
• Finding refactoring opportunities

Summary
• Mindset change needed for the community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 2: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Successful Samples Research Practice

ICSE 2013 SEIP 2

hellip

MSR SAGE

ASTREacuteE

Statechart

MSRA MSRA

SPIN

ACM SIGSOFT Impact Project

httpwwwsigsoftorgimpact

Goals of the Impact Projectbull Scholarly objective case-based evaluation

bull Deliverablesbull peer-reviewed papersbull presentation materials and outreach activitiesbull expertise

bull Community building

bull Prospective for future research investment

bull Lessons learned for ldquosuccessfulrdquo researchbull but only with respect to transfer into practice

(there are other measures of research success)

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

An Argument ResearchProduct Timing SCM

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

Impact Trace Graph Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

ICSE Papers Industry vs Academia

Sourcecopy Carlo Ghezzi

ICSE Papers Industry vs Academia

Sourcecopy Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

ICSE Papers Industry vs Academia

Sourcecopy Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

ICSM 11 KeynoteICSE 09 Keynote

MSR 12 KeynoteMSR 11 Keynote

SCAM 12 Keynote

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Redwine and Riddle Study (1985)

bull From idea to ldquothe point it can be popularized and disseminated to the technical community at largerdquobull Worst case 23 yearsbull Best case 11 yearsbull Mean 17 years

bull75 years from developed technology to wide availability

SourcecopyS L Pfleeger

Sam Redwine Jr William Riddle Software Technology Maturation In Proc ICSE 1985

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in products

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in productsShall we just stay in our comfort zone

to wait for 15-20 years for our research to (or not to) produce

practice impact How about the research that we did

15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]

NSF Workshop on Formal Methods

December 2012

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice

• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop/

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone

• Need to value (and pursue) “realness”

• Need to aim for ultimate tasks

• Need to value (and pursue) tech readiness

Researcher’s View: SCM Impact Study Findings

• Researchers tend to consider that…

  • precedence

  • concepts

  • prototypes

• are sufficient as impact, and ignore…

  • efficiency

  • usability

  • reliability

• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in the HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”

• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”

• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”

• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby

• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”

• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/. Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.

• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.

  • Sometimes designers fail with the simplest of things, like documentation for their language

  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/

Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?

  • Authors themselves instead of third parties

  • Non-target users (such as students)

  • Target users but not developers of the industrial code

  • Developers of the industrial code

• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone

• Need to value (and pursue) “realness”

• Need to aim for ultimate tasks

• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, ‘Haven’t we solved this problem yet?’ With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: ‘… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.’”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

http://patterninsight.com

http://www.blackducksoftware.com

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone

• Need to value (and pursue) “realness”

• Need to aim for ultimate tasks

• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability

• Complexity

• Applicability

• Usability (human in the loop)

• Cost-Benefit Analysis

Scalability

• Academia

  • Rarely ask “When scale is up, will my solution still work?”

  • Tend to focus on small or toy scale problems

• Real-world (e.g., search engine, code analysis, …)

  • Often demand a scalable solution

  • Ideal: sophisticated and scalable solution

  • But in practice, simple solution tends to be scalable (performance, maintenance, …)

  • Academia tend to value sophistication > simplicity

• Ex.: Echelon/MS [Srivastava and Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187

http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia

  • Tend to make assumptions to simplify problems, or to tackle one problem at a time (indeed relaxing assumptions over time)

  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry

• Real-world

  • Often has high complexity, violating these assumptions

• Example: OO unit test generation

  • Isolated simple classes; isolated complex data structures; real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=1595725

http://dl.acm.org/citation.cfm?id=2048083

Applicability

• Academia

  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution

  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution

• Real-world

  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)

• Examples

  • Integration of our Fitnex in Pex [Xie et al. DSN’09]

  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight

  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374

http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia

  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

  • Tend not to spend effort on improving tool usability

    • tool usability would be valued more in HCI than in SE

    • too much to include both the approach/tool itself and usability/its evaluation in a single paper

• Real-world

  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)

• Examples

  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]

  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467

http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research

  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia

  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)

• Real-world

  • Consider many dimensions of measurement

    • Cost, e.g., human cost (inspecting false positives)

    • Benefit, e.g., bug severity

  • Killer apps, e.g.,

    • MSR SLAM: device driver verification

    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]

    • PatternInsight/MSRA XIAO: known-bug detection

  • Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf

http://dl.acm.org/citation.cfm?id=1831738
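To make the cost-benefit point concrete, here is a minimal sketch in Python. It is not taken from the talk: the two tools, the report data, the inspection cost, and the hours-saved values are all invented for illustration. It shows how two tools with identical precision can differ widely once the human cost of inspecting reports and the severity of the bugs found are both counted.

```python
# Hypothetical warning reports from two analysis tools. Every number here is
# an assumption made up for illustration, not a measurement from the talk.


def precision(reports):
    """Fraction of reported warnings that are real bugs."""
    return sum(1 for r in reports if r["is_bug"]) / len(reports)


def net_benefit(reports, inspect_hours, hours_saved_by_severity):
    """Value of the real bugs found minus the human cost of inspecting every report."""
    cost = inspect_hours * len(reports)
    benefit = sum(
        hours_saved_by_severity[r["severity"]] for r in reports if r["is_bug"]
    )
    return benefit - cost


# Assumed engineer-hours saved by fixing one bug of each severity.
HOURS_SAVED = {"low": 1.0, "critical": 20.0}

tool_x = [
    {"is_bug": True, "severity": "low"},
    {"is_bug": True, "severity": "low"},
    {"is_bug": False, "severity": None},
    {"is_bug": False, "severity": None},
]
tool_y = [
    {"is_bug": True, "severity": "critical"},
    {"is_bug": True, "severity": "critical"},
    {"is_bug": False, "severity": None},
    {"is_bug": False, "severity": None},
]

if __name__ == "__main__":
    for name, reports in (("tool_x", tool_x), ("tool_y", tool_y)):
        print(name, precision(reports), net_benefit(reports, 0.5, HOURS_SAVED))
```

Both hypothetical tools report at 50% precision, yet under these assumed values one breaks even while the other pays back its inspection cost many times over, which is the gap a precision-only evaluation would miss.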

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)

• Academia (research innovations) vs. Industry (likely involving engineering efforts)

• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)

• Industry: problems, infrastructures, data, evaluation testbeds, …

• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research Topics (vertical)

• Software Development Process

• Software Systems

• Software Users

Technology Pillars (horizontal)

• Information Visualization

• Analysis Algorithms

• Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice

• Getting real: real data, real problems, real users, real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]

• Detecting near-duplicated code

• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]

• Performance debugging in the large via mining millions of stack traces

• Helping improve Windows performance

http://research.microsoft.com/jump/175199

http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset

• Technical readiness

• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…

• The scale of data is huge…

• We do not have all the time in the world to compute…

• The machines are not powerful enough…

• End users are “impatient”…

• Product teams are always busy…

• Product teams do not commit before seeing everything working…

• Product teams change plans and priorities…

• Product teams speak “different languages”…

• More…

What does “getting real” mean?

• Making real impact

• Building real technologies

• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions

• Scalability

• Complexity

• Usability

• Cost-Benefit Analysis

• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique

• Characteristics

  • High tunability

  • High scalability

  • High compatibility

  • High explorability

• Technology transfers

  • Visual Studio 2012

  • Code clone search service within Microsoft

• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning

• Easily parallelizable based on source code partition
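The four steps named on this slide can be sketched roughly as follows. This is a toy illustration under stated assumptions, not XIAO's actual implementation: the whitespace normalization, the shared-statement filter, the use of `difflib.SequenceMatcher` as the fine-grained score, the thresholds, and the choice to parallelize only the pre-processing of each source partition are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(partition):
    """Step 1: normalize whitespace in every statement of every function."""
    return {
        name: [" ".join(stmt.split()) for stmt in body if stmt.strip()]
        for name, body in partition.items()
    }


def coarse_match(functions, min_shared=2):
    """Step 2: cheap filter keeping pairs that share enough identical statements."""
    index = defaultdict(set)  # statement -> names of functions containing it
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


def fine_match(functions, candidates, threshold=0.7):
    """Step 3: score each candidate pair on its full statement sequence."""
    scored = []
    for a, b in candidates:
        score = SequenceMatcher(None, functions[a], functions[b]).ratio()
        if score >= threshold:
            scored.append((a, b, score))
    return scored


def prune(functions, clones, min_statements=3):
    """Step 4: drop clones too short to be worth an engineer's attention."""
    return [
        (a, b, score)
        for a, b, score in clones
        if min(len(functions[a]), len(functions[b])) >= min_statements
    ]


def detect_clones(partitions):
    """Pre-process source partitions in parallel, then match across the merged result."""
    functions = {}
    with ThreadPoolExecutor() as pool:
        for processed in pool.map(preprocess, partitions):
            functions.update(processed)
    candidates = coarse_match(functions)
    return prune(functions, fine_match(functions, candidates))


# Hypothetical partitions: each maps a function name to its statements.
PARTITION_1 = {
    "copy_a": ["x = read();", "y = x + 1;", "write(y);", "close();"],
}
PARTITION_2 = {
    "copy_b": ["x = read();", "y  =  x + 2;", "write(y);", "close();"],
    "other": ["return 0;"],
}

if __name__ == "__main__":
    print(detect_clones([PARTITION_1, PARTITION_2]))
```

The coarse step exists so that the more expensive fine-grained comparison only runs on pairs that already share statements; a thread pool stands in here for whatever process or machine-level parallelism a real deployment would use.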

What you tune is what you get

• Intuitive similarity metric

  • Effective control of the degree of syntactical differences between two code snippets

• Tunable at fine granularity

  • Statement similarity

  • # of inserted/deleted/modified statements

  • Balance between code structure and disordered statements

    for (i = 0; i < n; i ++)
    a ++;
    b ++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;

    for (i = 0; i < n; i ++)
    c = foo(a, b);
    a ++;
    b ++;
    d = bar(a, b, c);
    e = a + d;
    e ++;
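One way to picture the three tuning knobs on this slide is the sketch below, applied to the slide's two snippets with their punctuation restored. It is an assumed formulation rather than XIAO's published metric: `stmt_threshold` decides when two statements count as the same (statement similarity), the unmatched remainder reflects inserted/deleted/modified statements, and `order_weight` balances in-order matches (code structure) against matches found in any order (disordered statements).

```python
from difflib import SequenceMatcher

SNIPPET_A = [
    "for (i = 0; i < n; i ++)",
    "a ++;",
    "b ++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]
SNIPPET_B = [
    "for (i = 0; i < n; i ++)",
    "c = foo(a, b);",
    "a ++;",
    "b ++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e ++;",
]


def stmt_sim(a, b):
    """Token-level similarity of two statements (whitespace-separated tokens)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def unordered_matches(xs, ys, threshold):
    """Count statements of xs that have a distinct partner in ys, ignoring order."""
    used, count = set(), 0
    for x in xs:
        for j, y in enumerate(ys):
            if j not in used and stmt_sim(x, y) >= threshold:
                used.add(j)
                count += 1
                break
    return count


def ordered_matches(xs, ys, threshold):
    """Longest common subsequence length, with fuzzy statement equality."""
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs, 1):
        for j, y in enumerate(ys, 1):
            if stmt_sim(x, y) >= threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def clone_similarity(xs, ys, stmt_threshold=0.75, order_weight=0.5):
    """Blend order-preserving and order-free matching into one score in [0, 1]."""
    longest = max(len(xs), len(ys))
    in_order = ordered_matches(xs, ys, stmt_threshold) / longest
    any_order = unordered_matches(xs, ys, stmt_threshold) / longest
    return order_weight * in_order + (1 - order_weight) * any_order


if __name__ == "__main__":
    print(clone_similarity(SNIPPET_A, SNIPPET_B))
    print(clone_similarity(SNIPPET_A, SNIPPET_B, order_weight=0.0))
    print(clone_similarity(SNIPPET_A, SNIPPET_B, stmt_threshold=0.9))
```

With the default settings the modified statement (`e = a + c;` vs. `e = a + d;`) still counts as a match and the moved `c = foo(a, b);` costs only part of the score; raising `stmt_threshold` makes the modified statement count as a difference, and setting `order_weight` to 0 makes the reordering free.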

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy

2. Pivoting of folder level statistics

3. Folder level statistics

4. Clone function list in selected folder

5. Clone function filters

6. Sorting by bug or refactoring potential

7. Tagging

How to quickly review a pair of clones?

1. Block correspondence

2. Block types

3. Block navigation

4. Copying

5. Bug filing

6. Tagging

Collaboration

• Collaboration models

• Communication

• Champion in product teams

• Getting engineering support

Collaboration models

• Pull

• Push

• Join

Communication – getting connected

• Reaching out to practitioners

• Understanding their business

• Speaking practitioners’ languages

• Finding out their pain points

  • Understanding their scenarios

  • Experiencing their pain

  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals

• Setting the right expectation

• Building a roadmap

• Forming virtual team (creating an email alias)

• Adopting a milestone approach

• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years

• 6 years of International Workshop on Software Clones (IWSC) since 2006

• Dagstuhl Seminars

  • Software Clone Management towards Industrial Application (2012)

  • Duplication, Redundancy, and Similarity in Software (2006)

• No code clone analysis tools in MS

• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior

• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest

• Posting XIAO on internal website

• Active “selling” to various teams

• What we gained

  • Opportunities to run XIAO on different codebases and produce rich results

  • Feedback to improve both algorithm and system

  • Expanded network

Reaching out (2)

• What did not land well internally

  • Wide interest but no concrete takers

• Why no takers?

  • What exactly is the value proposition?

  • Long way to go from code clones to bugs

  • High cost for code refactoring

  • Product prioritization

• Lessons learned

  • Killer scenarios needed for value proposition

  • Security is a big stick

Potential 0-day vulnerability disclosure

1. Initial vulnerability reported in product A

2. Security bulletin released

3. Similar vulnerability found in product B by attackers

4. Potential 0-day attack

5. Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario

  • Code snippet as input

  • Much larger scale of codebases

  • Near-real-time response

• Code clone search service

  • Indexed ~600 million LOC across multiple codebases

  • Deployed in, used by, and transferred to MSRC

• Champion in MSRC worked with us all the way

  • Providing feedback and update

  • Promoting within MSRC
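The difference between the search scenario and the detection scenario can be illustrated with a small index-once, query-many sketch. It is a simplification built on assumptions, not the service deployed at MSRC: the class name, the function ids, and the exact-statement matching after whitespace normalization are all hypothetical, and a real service would need fuzzy matching and a far more compact index.

```python
from collections import defaultdict


class CloneSearchIndex:
    """Toy inverted index from normalized statements to the functions containing them."""

    def __init__(self):
        self._postings = defaultdict(set)  # statement -> function ids

    @staticmethod
    def _normalize(stmt):
        # Collapse whitespace so formatting differences do not block a match.
        return " ".join(stmt.split())

    def add(self, function_id, statements):
        """Index one function; done once, ahead of any query."""
        for stmt in statements:
            if stmt.strip():
                self._postings[self._normalize(stmt)].add(function_id)

    def search(self, snippet, min_coverage=0.6):
        """Return (function id, fraction of snippet statements found), best first."""
        query = {self._normalize(s) for s in snippet if s.strip()}
        if not query:
            return []
        hits = defaultdict(int)
        for stmt in query:
            for function_id in self._postings.get(stmt, ()):
                hits[function_id] += 1
        ranked = [
            (function_id, count / len(query))
            for function_id, count in hits.items()
            if count / len(query) >= min_coverage
        ]
        return sorted(ranked, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical codebases and a hypothetical vulnerable snippet.
index = CloneSearchIndex()
index.add("product_a/font.c:parse",
          ["len = read_u16(p);", "buf = alloc(len);", "copy(buf, p, len);"])
index.add("product_b/gfx.c:load",
          ["len = read_u16(p);", "buf  =  alloc(len);", "copy(buf, p, len);", "log(len);"])
index.add("product_c/util.c:sum", ["total += x;"])

VULNERABLE_SNIPPET = ["len = read_u16(p);", "buf = alloc(len);", "copy(buf, p, len);"]

if __name__ == "__main__":
    print(index.search(VULNERABLE_SNIPPET))
```

Because the index is built ahead of time, each query only touches the postings for the statements in the snippet, which is what makes a near-real-time response plausible even when the indexed codebases are large.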

Vulnerability investigation workflow

Manual & ad hoc investigation

• Workflow steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix

• MSRC passes a code snippet to Team A, Team B, and Team C, who look for code clones

Automated investigation

• Same workflow steps, with a clone search service taking the code snippet and returning code clones

• Completeness is the key

• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC

• Automated laborious manual efforts

• Faster response time, critical in security context

• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

• Insufficient bounds check within the font parsing subsystem of win32k.sys

• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts

  • Out-Of-Band (OOB) release

  • Power Tool

• Two reorgs in Visual Studio

• Lessons learned

  • No integration story, felt like a “separate” tool

  • Not on the release path of VS

• Accumulated assets

  • Solidified algorithm and system

  • Trusted partners

    • One program manager in VS

    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm

  • Strong support from general manager of VSU

  • Concrete scenarios identified

  • Easy sell at VS 2012 planning meeting

• Virtual team

  • Researchers (MSRA SA)

  • Developers (MSRA IEG, VS)

  • Program manager (VS)

  • Tester (VS)

• Active planning as part of VS 2012 release

• Weekly sync-up

• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once

• Finding refactoring opportunity

Summary

• Mindset changing needed for community

  • Get out of comfort zone

  • Value (and pursue) “realness”

  • Aim for ultimate tasks

  • Value (and pursue) tech readiness

• Experience sharing of successful tech transfer on Software Analytics

  • Getting-real mindset

  • Technical readiness

  • Collaboration

ICSM 11 KeynoteICSE 09 Keynote

MSR 12 KeynoteMSR 11 Keynote

SCAM 12 Keynote

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Redwine and Riddle Study (1985)

bull From idea to ldquothe point it can be popularized and disseminated to the technical community at largerdquobull Worst case 23 yearsbull Best case 11 yearsbull Mean 17 years

bull75 years from developed technology to wide availability

SourcecopyS L Pfleeger

Sam Redwine Jr William Riddle Software Technology Maturation In Proc ICSE 1985

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in products

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in productsShall we just stay in our comfort zone

to wait for 15-20 years for our research to (or not to) produce

practice impact How about the research that we did

15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]

NSF Workshop on Formal Methods

bull Goal to identify the future directions in research in formal methods and its transition to industrial practice

bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools

httpgotoucsdedu~rjhalaNSFWorkshop

Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too

December 2012

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Researcherrsquos View -SCM Impact Study Findings

bullResearchers tend to consider thathellipbull precedence

bull concepts

bull prototypes

bull are sufficient as impact and ignorehellipbull efficiency

bull usability

bull reliability

bulldismissing them as ldquoengineering common senserdquo

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

A Researchers Observation in HCI Research Community

bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

Does our research community

have similar issues

Evaluation of DesignPLldquoResearch in Programming Languagesrdquo

bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby

bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo

bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo

Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes

Why Do Some Programming Languages Live and Others Die

bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem

bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like

documentation for their language

bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

httpwwwwiredcomwiredenterprise201206berkeley-programming-languages

Wiredcom

SourcecopyC Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations that apply tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code

• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, ‘Haven’t we solved this problem yet?’ With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: ‘… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.’”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). (Source: © M. Hind & B. Ryder)

MS Academic Search: “Clone Detection”

Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com

http://www.blackducksoftware.com

http://research.microsoft.com/en-us/groups/sa/

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
    • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
    • Academia tends to value sophistication > simplicity
  • Ex.: Echelon@MS [Srivastava & Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12] / PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

• Research topics (vertical): Software Development Process, Software Systems, Software Users
• Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real: Real Data, Real Problems, Real Users, Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  ¤ High tunability
  ¤ High scalability
  ¤ High compatibility
  ¤ High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
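The four steps named on the slide can be illustrated with a small sketch. This is not XIAO's implementation: the statement-level hashing, the use of `difflib.SequenceMatcher`, and the parameters `min_shared` and `threshold` are all assumptions chosen only to show how a cheap coarse filter can feed a more expensive fine comparison before pruning.

```python
import hashlib
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source: str) -> list[str]:
    """Step 1 (pre-processing): one normalized statement per non-empty line."""
    statements = []
    for line in source.splitlines():
        normalized = re.sub(r"\s+", "", line)  # drop all whitespace
        if normalized:
            statements.append(normalized)
    return statements


def coarse_match(functions: dict[str, list[str]], min_shared: int = 2):
    """Step 2 (coarse matching): pair functions sharing enough statement hashes."""
    index = defaultdict(set)  # statement hash -> names of functions containing it
    for name, statements in functions.items():
        for stmt in statements:
            index[hashlib.sha1(stmt.encode()).hexdigest()].add(name)
    shared = defaultdict(int)  # (name, name) -> number of distinct shared statements
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(a: list[str], b: list[str]) -> float:
    """Step 3 (fine matching): order-aware similarity of two statement lists."""
    return SequenceMatcher(None, a, b).ratio()


def find_clones(sources: dict[str, str], threshold: float = 0.6):
    """Run all four steps; step 4 (pruning) drops pairs below the threshold."""
    functions = {name: preprocess(src) for name, src in sources.items()}
    scored = [
        (x, y, fine_match(functions[x], functions[y]))
        for x, y in coarse_match(functions)
    ]
    return [(x, y, round(score, 2)) for x, y, score in scored if score >= threshold]
```

Because each function is pre-processed independently and candidate pairs are scored independently, the work could be split by source partition, which is the property the slide highlights; the sketch itself runs sequentially.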

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Snippet 1:

for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;

Snippet 2:

for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
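To make the idea of a tunable metric concrete, the sketch below compares the two snippets from this slide (with their punctuation restored, which is an assumption about the original slide). It is not XIAO's metric: the greedy, order-insensitive matching and the two knobs `stmt_threshold` (how similar two statements must be to count as a modified pair) and `max_changed` (how many inserted, deleted, or modified statements are tolerated) are hypothetical stand-ins for the tuning parameters the slide lists.

```python
from difflib import SequenceMatcher

SNIPPET_1 = [
    "for (i = 0; i < n; i++)",
    "a++;",
    "b++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]
SNIPPET_2 = [
    "for (i = 0; i < n; i++)",
    "c = foo(a, b);",
    "a++;",
    "b++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e++;",
]


def statement_similarity(s1: str, s2: str) -> float:
    """Similarity of two statements over whitespace-separated tokens."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def compare(a: list[str], b: list[str], stmt_threshold: float = 0.7) -> dict:
    """Greedily pair each statement of `a` with its best unmatched partner in `b`.

    Matching ignores statement order, so reordered statements still pair up.
    """
    unmatched_b = list(range(len(b)))
    identical = modified = 0
    for stmt in a:
        best_j, best_score = None, 0.0
        for j in unmatched_b:
            score = statement_similarity(stmt, b[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= stmt_threshold:
            unmatched_b.remove(best_j)
            if best_score == 1.0:
                identical += 1
            else:
                modified += 1
    return {
        "identical": identical,
        "modified": modified,
        "inserted": len(unmatched_b),               # only in b
        "deleted": len(a) - identical - modified,   # only in a
    }


def is_clone(a, b, stmt_threshold: float = 0.7, max_changed: int = 2) -> bool:
    """Report a clone when the number of changed statements stays within budget."""
    diff = compare(a, b, stmt_threshold)
    return diff["modified"] + diff["inserted"] + diff["deleted"] <= max_changed
```

With the default settings the two snippets differ by one modified statement (`e = a + c;` vs. `e = a + d;`) and one inserted statement (`e++;`), so `is_clone(SNIPPET_1, SNIPPET_2)` accepts the pair while `max_changed=1` rejects it. A metric that also penalizes reordering, as the slide's "balance between code structure and disordered statements" suggests, would need an additional order-sensitive term that this sketch omits.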

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of the International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure


Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
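The search scenario differs from detection in that the input is a single code snippet and the answer must come back quickly from a pre-built index. The sketch below shows one simple way such a lookup could work; it is an assumption for illustration only and not the design of the deployed service. The class `CloneIndex`, the `min_coverage` parameter, and the codebase and function names in the usage note are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(statement: str) -> str:
    """Remove whitespace so formatting differences do not affect lookup."""
    return re.sub(r"\s+", "", statement)


class CloneIndex:
    """Inverted index from normalized statements to the functions containing them."""

    def __init__(self) -> None:
        self._postings = defaultdict(set)

    def add(self, codebase: str, function: str, statements: list[str]) -> None:
        """Index one function; done offline, ahead of any query."""
        for stmt in statements:
            key = normalize(stmt)
            if key:
                self._postings[key].add((codebase, function))

    def search(self, snippet: list[str], min_coverage: float = 0.5):
        """Return (location, coverage) pairs, best first.

        Coverage is the fraction of distinct query statements found in a function.
        """
        query = {normalize(s) for s in snippet if normalize(s)}
        hits = Counter()
        for key in query:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        results = [
            (location, count / len(query))
            for location, count in hits.items()
            if count / len(query) >= min_coverage
        ]
        return sorted(results, key=lambda item: (-item[1], item[0]))
```

For example, after indexing a function `func_x` in `product_a` and a partially copied `func_y` in `product_b`, searching with the statements of `func_x` would return both locations ranked by coverage. Only exact statement matches are counted here; tolerating modified statements, as a tunable similarity metric would, is left out to keep the lookup a plain dictionary access.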

Vulnerability investigation workflow

[Figure: MSRC workflow with the stages Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Variants finding is a manual & ad hoc investigation: code snippets go to Team A, Team B, and Team C, and code clones come back.]

Vulnerability investigation workflow (with clone search service)

[Figure: the same stages, with variants finding handled by automated investigation: code snippets go to the clone search service and code clones come back.]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in a security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

Does our research community

have similar issues

Evaluation of DesignPLldquoResearch in Programming Languagesrdquo

bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby

bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo

bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo

Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes

Why Do Some Programming Languages Live and Others Die

bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem

bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like

documentation for their language

bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

httpwwwwiredcomwiredenterprise201206berkeley-programming-languages

Wiredcom

SourcecopyC Garling

Industrial Evaluations= Real Adoption

bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties

bull Non-target users (such as students)

bull Target users but not developers of the industrial code

bull Developers of the industrial code

bull Apply one-time (hitamprun) or continuous adoption

Need to value real adoption (eg in reviewing papers)

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12] / PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers; only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
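
The cost and benefit dimensions listed on this slide can be made concrete with a small, purely illustrative calculation. The function name `net_benefit`, the minute-based units, and all numbers below are assumptions made up for this sketch; none of them come from the talk.

```python
def net_benefit(reports, inspect_minutes=10.0, minutes_saved_per_severity_point=60.0):
    """Return benefit minus cost for a list of tool reports.

    reports: list of (is_real_bug, severity) pairs; severity is 0 for false positives.
    All weights are hypothetical and only serve the illustration.
    """
    # Cost: every report, true or false, has to be inspected by a human.
    cost = inspect_minutes * len(reports)
    # Benefit: only real bugs pay off, weighted by their severity.
    benefit = sum(minutes_saved_per_severity_point * severity
                  for is_real_bug, severity in reports if is_real_bug)
    return benefit - cost


# Tool A: 100% precision, two minor bugs.
precise_tool = [(True, 1), (True, 1)]
# Tool B: 10% precision, but its single true positive is a severe bug.
noisy_tool = [(True, 5)] + [(False, 0)] * 9

print(net_benefit(precise_tool), net_benefit(noisy_tool))
```

Under these made-up weights the noisier tool still comes out ahead, which is the kind of trade-off that precision or recall alone would not reveal.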

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf


Research topics & technology pillars

Research topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing


Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  ¤ High tunability      ¤ High scalability
  ¤ High compatibility   ¤ High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
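
The four step names above can be lined up in a toy pipeline. This is a minimal sketch, not XIAO's implementation: what each step does internally (whitespace normalization, an inverted index on statements, `difflib` sequence comparison, a size filter) is an assumption chosen for illustration, and all function names are hypothetical. Only the pre-processing step is run per partition here.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(partition):
    """Step 1 (assumed): turn each function body into whitespace-free statements."""
    return {name: ["".join(line.split()) for line in body.splitlines() if line.strip()]
            for name, body in partition.items()}


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap candidate pairs via an inverted index on statements."""
    index = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates, threshold=0.7):
    """Step 3 (assumed): a more precise comparison of the statement sequences."""
    clones = []
    for a, b in candidates:
        score = SequenceMatcher(None, functions[a], functions[b]).ratio()
        if score >= threshold:
            clones.append((a, b, score))
    return clones


def prune(functions, clones, min_statements=3):
    """Step 4 (assumed): drop clone pairs that are too small to be interesting."""
    return [(a, b, s) for a, b, s in clones
            if min(len(functions[a]), len(functions[b])) >= min_statements]


def detect_clones(partitions):
    """Pre-process each source partition in parallel, then run the remaining steps."""
    functions = {}
    with ThreadPoolExecutor() as pool:
        for part in pool.map(preprocess, partitions):
            functions.update(part)
    return prune(functions, fine_match(functions, coarse_match(functions)))


partitions = [
    {"f": "a++\nb++\nc = foo(a, b)\nd = bar(a, b, c)"},
    {"g": "a++\nb++\nc = foo(a, b)\nd = baz(a, b, c)", "h": "return 0"},
]
clones = detect_clones(partitions)
print(clones)
```

The cheap coarse step keeps the more expensive fine comparison away from most function pairs, which is one common way such a staged design helps with scale.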

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
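
To show how the three knobs on this slide could interact, here is a small sketch applied to the loop bodies of the two snippets. It is not XIAO's actual metric: the parameter names `stmt_threshold`, `max_diff_ratio`, and `order_weight`, the formula combining them, and the use of `difflib` are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def tokens(stmt):
    # Identifiers, numbers, and single punctuation characters.
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", stmt)


def stmt_similarity(a, b):
    return SequenceMatcher(None, tokens(a), tokens(b)).ratio()


def matched_in_order(xs, ys, stmt_threshold):
    # Longest common subsequence where "equal" means "similar enough".
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            if stmt_similarity(x, y) >= stmt_threshold:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[len(xs)][len(ys)]


def matched_any_order(xs, ys, stmt_threshold):
    # Greedy one-to-one matching that ignores statement order.
    remaining = list(ys)
    count = 0
    for x in xs:
        for y in remaining:
            if stmt_similarity(x, y) >= stmt_threshold:
                remaining.remove(y)
                count += 1
                break
    return count


def diff_ratio(xs, ys, stmt_threshold=0.75, order_weight=0.5):
    # order_weight = 1.0 respects code structure only; 0.0 tolerates any reordering.
    ordered = matched_in_order(xs, ys, stmt_threshold)
    unordered = matched_any_order(xs, ys, stmt_threshold)
    matched = order_weight * ordered + (1 - order_weight) * unordered
    return 1 - matched / max(len(xs), len(ys))


def is_clone(xs, ys, stmt_threshold=0.75, max_diff_ratio=0.3, order_weight=0.5):
    return diff_ratio(xs, ys, stmt_threshold, order_weight) <= max_diff_ratio


left = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
right = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

print(is_clone(left, right))                      # default settings
print(is_clone(left, right, order_weight=1.0))    # strict about statement order
print(is_clone(left, right, stmt_threshold=0.9))  # strict about statement similarity
```

In this sketch the same pair is accepted or rejected depending on the settings, which is the point of the slide: the engineer, not the tool, decides how much difference still counts as a clone.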


Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management Towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

[Figure: timeline. Initial vulnerability reported in product A; security bulletin released; similar vulnerability found in product B by attackers; potential 0day attack; patch release of product B.]

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
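
The search scenario (a code snippet as input, answers from a pre-built index) can be sketched with a tiny inverted index. This is only an illustration under assumptions: token 3-gram shingling is a technique swapped in for the example, not the indexing scheme of the actual service, and `CloneIndex`, `shingles`, the file paths, and the code strings are hypothetical.

```python
import re
from collections import defaultdict


def shingles(code, k=3):
    """Return the set of k-token windows of a code string."""
    toks = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


class CloneIndex:
    """Hypothetical inverted index: shingle -> locations that contain it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_overlap=0.5):
        grams = shingles(snippet, self.k)
        hits = defaultdict(int)
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        # Score = fraction of the query's shingles found at that location.
        scored = [(loc, n / len(grams)) for loc, n in hits.items()
                  if n / len(grams) >= min_overlap]
        return sorted(scored, key=lambda item: (-item[1], item[0]))


index = CloneIndex()
index.add("productA/font.c", "if (len > size) return ERR; memcpy(dst, src, len);")
index.add("productB/glyph.c", "if (n > size) return ERR; memcpy(dst, src, n);")
index.add("productC/util.c", "for (i = 0; i < n; i++) total += v[i];")

results = index.search("if (len > size) return ERR; memcpy(dst, src, len);")
print(results)
```

Because the index is built ahead of time, a query only touches the postings of its own shingles, which is one plausible way to get near-real-time answers over a large codebase.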

Vulnerability investigation workflow

[Figure: MSRC workflow. Steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Variants finding is a manual & ad hoc investigation: MSRC sends a code snippet to Teams A, B, and C, and code clones come back.]

Vulnerability investigation workflow

[Figure: the same workflow with the clone search service. Variants finding becomes an automated investigation: a code snippet goes to the clone search service and code clones come back. Callouts: completeness is the key; Web service API for automation.]

More secure Microsoft products

• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in a security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

3 publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-of-band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing a bug once

Finding refactoring opportunities

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 5: Pathways to Technology Transfer and Adoption: Achievements and Challenges

An Argument: Research/Product Timing, SCM

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

Impact Trace Graph: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

ICSE Papers: Industry vs. Academia

Source: © Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

ICSM 11 Keynote
ICSE 09 Keynote
MSR 12 Keynote
MSR 11 Keynote
SCAM 12 Keynote

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness


Redwine and Riddle Study (1985)

• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.


Technology Maturation: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone and wait 15-20 years for our research to produce (or not produce) practical impact? How about the research that we did 15-20 years ago?

[Caveat: don’t forget the need for long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Researcher’s View: SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in the HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke, and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead, we can do 8 weeks of work on an idea piece, or create a new interaction technique and test it tightly in 8-12 weeks, and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations of applying tools on industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one time (hit & run) or adopted continuously?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer Analysis: Haven’t We Solved This Problem Yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: A Tool for Finding Copy-Paste and Related Bugs in Operating System Code. In Proc. OSDI 2004.

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC’12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE’12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199

http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Phases: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  1. Pre-processing
  2. Coarse Matching
  3. Fine Matching
  4. Pruning
• Easily parallelizable based on source code partition
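The four steps named on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not XIAO's implementation: the identifier normalization, the set-overlap test used for coarse matching, the `difflib` ratio used for fine matching, and the minimum-size pruning rule are all stand-ins chosen for the example, and every function name and threshold here is hypothetical.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

KEYWORDS = {"for", "if", "else", "while", "return"}


def normalize(statement):
    """Step 1 (pre-processing): map identifiers to ID and numbers to NUM."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)
    out = []
    for tok in tokens:
        if tok in KEYWORDS:
            out.append(tok)
        elif re.match(r"[A-Za-z_]", tok):
            out.append("ID")
        elif tok.isdigit():
            out.append("NUM")
        else:
            out.append(tok)
    return " ".join(out)


def coarse_match(functions, min_shared=0.5):
    """Step 2: cheap candidate filter based on shared normalized statements."""
    as_sets = {name: set(stmts) for name, stmts in functions.items()}
    candidates = []
    for a, b in combinations(sorted(functions), 2):
        smaller = min(len(as_sets[a]), len(as_sets[b]))
        if smaller and len(as_sets[a] & as_sets[b]) / smaller >= min_shared:
            candidates.append((a, b))
    return candidates


def fine_match(functions, candidates, min_similarity=0.7):
    """Step 3: order-aware comparison of the surviving candidate pairs."""
    clones = []
    for a, b in candidates:
        ratio = SequenceMatcher(None, functions[a], functions[b]).ratio()
        if ratio >= min_similarity:
            clones.append((a, b, ratio))
    return clones


def prune(functions, clones, min_statements=3):
    """Step 4: drop pairs in which either function is too small to matter."""
    return [
        (a, b, ratio)
        for a, b, ratio in clones
        if min(len(functions[a]), len(functions[b])) >= min_statements
    ]


def detect_clones(raw_functions):
    """Run the four steps over a dict of function name -> list of statements."""
    functions = {
        name: [normalize(s) for s in stmts] for name, stmts in raw_functions.items()
    }
    return prune(functions, fine_match(functions, coarse_match(functions)))


# Hypothetical input: three tiny "functions" given as lists of statements.
EXAMPLE = {
    "f": ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"],
    "g": ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"],
    "h": ["return 0"],
}
clones = detect_clones(EXAMPLE)
```

Because coarse matching only compares pairs of functions, the pair list could in principle be split by source partition and processed on separate workers, which is one way to read the "easily parallelizable" bullet; the sketch above does not implement that.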

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Example pair of near-duplicate snippets:

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
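To make the tuning idea concrete, here is a small sketch of a similarity score with two knobs, applied to the two loop bodies on this slide. It is not XIAO's metric. The parameter names `statement_threshold` (how alike two statements must be to count as matching) and `order_weight` (how much statement order matters) are hypothetical, and the scoring formula is an assumption made for illustration.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def ordered_matches(a, b, threshold):
    """Longest common subsequence length under fuzzy statement equality."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, sa in enumerate(a, 1):
        for j, sb in enumerate(b, 1):
            if statement_similarity(sa, sb) >= threshold:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]


def unordered_matches(a, b, threshold):
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(b)
    count = 0
    for sa in a:
        for sb in unused:
            if statement_similarity(sa, sb) >= threshold:
                unused.remove(sb)
                count += 1
                break
    return count


def snippet_similarity(a, b, statement_threshold=1.0, order_weight=0.5):
    """Blend order-preserving and order-insensitive match rates."""
    if not a or not b:
        return 0.0
    longest = max(len(a), len(b))
    ordered = ordered_matches(a, b, statement_threshold) / longest
    unordered = unordered_matches(a, b, statement_threshold) / longest
    return order_weight * ordered + (1 - order_weight) * unordered


# The two loop bodies from the slide, pre-tokenized with spaces.
first = ["a ++", "b ++", "c = foo ( a , b )", "d = bar ( a , b , c )", "e = a + c"]
second = ["c = foo ( a , b )", "a ++", "b ++", "d = bar ( a , b , c )",
          "e = a + d", "e ++"]

strict = snippet_similarity(first, second, statement_threshold=1.0, order_weight=1.0)
any_order = snippet_similarity(first, second, statement_threshold=1.0, order_weight=0.0)
fuzzy = snippet_similarity(first, second, statement_threshold=0.75, order_weight=0.5)
```

Under these assumptions, the strict setting counts only exact statements in the same order. Setting `order_weight` to 0 also credits the moved `c = foo(a, b)` call. Lowering `statement_threshold` lets `e = a + c` match `e = a + d`, so the score rises as the knobs are relaxed.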

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

[Figure: two annotated screenshots whose numbered callouts correspond to the lists above]

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Timeline:

1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0day attack
5. Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
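The search scenario differs from batch detection in that the input is a single code snippet and the answer has to come back quickly from a very large body of code. One common way to get that behavior is to build an index ahead of time and answer queries by lookup. The sketch below shows the idea with an inverted index from normalized statements to functions. It is only an illustration and makes no claim about how the MSRC service works: the `CloneIndex` class, the whitespace-only normalization, the coverage score, and the file and function names are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(statement):
    """Strip whitespace so formatting differences do not affect lookup."""
    return re.sub(r"\s+", "", statement)


class CloneIndex:
    """Inverted index: normalized statement -> ids of functions containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_function(self, function_id, statements):
        """Index one function (done offline, once per codebase snapshot)."""
        for statement in statements:
            self._postings[normalize(statement)].add(function_id)

    def search(self, snippet, min_coverage=0.6):
        """Return (function_id, coverage) pairs, best coverage first.

        Coverage is the share of distinct query statements found in a function.
        """
        keys = {normalize(s) for s in snippet}
        if not keys:
            return []
        hits = Counter()
        for key in keys:
            for function_id in self._postings.get(key, ()):
                hits[function_id] += 1
        results = [
            (function_id, count / len(keys))
            for function_id, count in hits.items()
            if count / len(keys) >= min_coverage
        ]
        return sorted(results, key=lambda r: (-r[1], r[0]))


# Hypothetical codebases; none of these names come from the talk.
index = CloneIndex()
index.add_function(
    "productA/font.c:parse",
    ["len = read_u16(buf)", "if (len > max) return -1", "copy(dst, buf, len)"],
)
index.add_function(
    "productB/render.c:load",
    ["len = read_u16(buf)", "copy(dst, buf, len)", "draw(dst)"],
)
index.add_function(
    "productC/util.c:sum",
    ["total = 0", "total += x", "return total"],
)

query = ["len = read_u16( buf )", "copy(dst, buf, len)"]
matches = index.search(query)
```

Lookup cost here depends on the size of the query and of the posting lists it touches rather than on the total amount of indexed code, which is the property a near-real-time search service needs. A real system would also need fuzzier statement matching than exact lookup provides.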

Vulnerability investigation workflow

Workflow steps: Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix

Manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, and receives code clones back.

Automated investigation: MSRC sends a code snippet to the clone search service and receives code clones back.

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 6: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Impact Trace Graph: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

ICSE Papers: Industry vs. Academia

Source: © Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

ICSM’11 Keynote, ICSE’09 Keynote, MSR’12 Keynote, MSR’11 Keynote, SCAM’12 Keynote

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Redwine and Riddle Study (1985)

• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.

Technology Maturation: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone to wait for 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?

[Caveat: don’t forget the need of long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Researcher’s View - SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them!”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages. Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages. Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Source: © M. Hind

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Source: © M. Hind & B. Ryder

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field already grows mature with n years of efforts on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

http://patterninsight.com

http://www.blackducksoftware.com

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187

http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083

http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374

http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467

http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: Device driver verification
    • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf

http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

Example project: XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0-day vulnerability disclosure

1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0-day attack
5. Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
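The search scenario (a code snippet as input, a pre-built index over many codebases, a fast lookup) can be sketched with an inverted index over token shingles. This is only an illustration under assumed names (shingles, CloneIndex, min_overlap); the slides do not describe how the actual service indexes its ~600 million LOC.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")  # identifiers, numbers, single symbols


def shingles(code, k=3):
    """Return the set of k-token windows (shingles) in a code fragment."""
    toks = TOKEN.findall(code)
    return {tuple(toks[i:i + k]) for i in range(max(len(toks) - k + 1, 1))}


class CloneIndex:
    """Hypothetical inverted index: shingle -> locations that contain it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        # Indexing is done once, ahead of time, for every known function.
        for sh in shingles(code, self.k):
            self.postings[sh].add(location)

    def search(self, snippet, min_overlap=0.5):
        # A query only touches the postings of its own shingles,
        # so lookup cost does not grow with the number of indexed functions.
        query = shingles(snippet, self.k)
        hits = defaultdict(int)
        for sh in query:
            for location in self.postings.get(sh, ()):
                hits[location] += 1
        scored = [(loc, n / len(query)) for loc, n in hits.items()]
        return sorted((s for s in scored if s[1] >= min_overlap),
                      key=lambda s: (-s[1], s[0]))


index = CloneIndex()
index.add("productA/font.c", "if (len > size) return -1; memcpy(dst, src, len);")
index.add("productB/render.c", "if (n > size) return -1; memcpy(dst, src, n);")
index.add("productC/util.c", "for (i = 0; i < n; i++) total += v[i];")
hits = index.search("if (len > size) return -1; memcpy(dst, src, len);")
```

Here the query finds the original in the made-up productA/font.c and the renamed copy in productB/render.c, while the unrelated loop is not returned. Lowering min_overlap trades precision for the completeness that matters in a security setting.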

Vulnerability investigation workflow

Workflow steps:
1. Issue reproducing
2. Root cause investigation & source location
3. Variants finding
4. Design/Implement/Test fix

Before: MSRC passes a code snippet to Team A, Team B, and Team C, which return code clones through manual & ad hoc investigation.

After: MSRC passes the code snippet to the clone search service, which returns code clones (automated investigation).
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC
  • Automated laborious manual efforts
  • Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example: MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
  • Insufficient bounds check within the font parsing subsystem of win32k.sys
  • Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


ICSE Papers: Industry vs. Academia

Source: © Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

Keynotes: ICSM’11, ICSE’09, MSR’12, MSR’11, SCAM’12

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Redwine and Riddle Study (1985)

• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine Jr., William Riddle. Software Technology Maturation. In Proc. ICSE 1985.

Technology Maturation: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone to wait for 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?
[Caveat: don’t forget the need of long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Related fields (e.g., formal methods) have recently looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Researcher’s View: SCM Impact Study Findings

• Researchers tend to consider that...
  • precedence
  • concepts
  • prototypes
• are sufficient as impact and ignore...
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” (PHP, JavaScript, Python, Ruby)
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “... We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of efforts on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
  http://patterninsight.com
• http://www.blackducksoftware.com
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
  http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, ...)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, simple solution tends to be scalable (performance, maintenance, ...)
  • Academia tends to value sophistication > simplicity
  • Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, ...)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: Device driver verification
    • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: Known-bug detection
  • Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, ...
• Academia: educating students, ...

MSRA Software Analytics Group

• Mission: utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

• Research topics
  • Software Users
  • Software Development Process
  • Software Systems
• Technology pillars
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing

[Figure: research topics and technology pillars laid out along vertical and horizontal directions]

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty...

• Data is incomplete and noisy...
• The scale of data is huge...
• We do not have all the time in the world to compute...
• The machines are not powerful enough...
• End users are “impatient”...
• Product teams are always busy...
• Product teams do not commit before seeing everything working...
• Product teams change plans and priorities...
• Product teams speak “different languages”...
• More...

What does “getting real” mean?

• Solving real problems
• Building real technologies
• Making real impact

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project: XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development → Early adoption → Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  • Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
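The four steps and the partition-based parallelism can be sketched as below. What each step does here (statement normalization, a shared-statement filter, Jaccard similarity, a minimum-size rule) is an assumption chosen to keep the example short, as are all function and parameter names; the slide names the steps but does not define them. A thread pool stands in for whatever distribution mechanism a real deployment would use, and the statement splitting on ";" is deliberately naive.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations_with_replacement


def preprocess(functions):
    """Step 1: turn each function body into a list of whitespace-free statements."""
    return {name: [re.sub(r"\s+", "", s) for s in body.split(";") if s.strip()]
            for name, body in functions.items()}


def coarse_match(part_a, part_b):
    """Step 2: cheap filter; keep pairs that share at least one identical statement."""
    return {tuple(sorted((x, y))) for x in part_a for y in part_b
            if x != y and set(part_a[x]) & set(part_b[y])}


def fine_match(stmts_x, stmts_y):
    """Step 3: more precise score; here, Jaccard similarity of statement sets."""
    sx, sy = set(stmts_x), set(stmts_y)
    return len(sx & sy) / len(sx | sy)


def detect_clones(functions, n_parts=2, threshold=0.6, min_statements=3):
    stmts = preprocess(functions)
    names = sorted(stmts)
    # Partition the code; every pair of partitions is an independent unit of work.
    parts = [{n: stmts[n] for n in names[i::n_parts]} for i in range(n_parts)]

    def analyze(pair):
        part_a, part_b = pair
        return {(x, y) for x, y in coarse_match(part_a, part_b)
                if fine_match(stmts[x], stmts[y]) >= threshold}

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(analyze, combinations_with_replacement(parts, 2)))
    clones = set().union(*results)
    # Step 4: prune clone pairs that are too small to be interesting.
    return sorted(p for p in clones
                  if min(len(stmts[p[0]]), len(stmts[p[1]])) >= min_statements)


functions = {
    "f1": "a++; b++; c = foo(a, b); d = bar(a, b, c);",
    "f2": "c = foo(a, b); a++; b++; d = bar(a, b, c); e++;",
    "f3": "x = 1; y = 2; z = x + y;",
}
clones = detect_clones(functions)
```

Because each partition pair is analyzed independently, the result does not depend on how many partitions are used; only the amount of work that can run side by side changes.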


Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

bull Demonstrating XIAO at TechFest

bull Posting XIAO at internal website

bull Active ldquosellingrdquo to various teams

bull What we gainedbull Opportunities to run XIAO on different codebases and

produce rich results

bull Feedback to improve both algorithm and system

bull Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

bull What did not land well internallybull Wide interest but no concrete takers

bull Why no takersbull What exactly is the valuable proposition

bull Long way to go from code clones to bugs

bull High cost for code refactoring

bull Product prioritization

bull Lessons learnedbull Killer scenarios needed for value proposition

bull Security is a big stick

ICSE 2013 SEIP 57

Potential 0day vulnerability disclosure

ICSE 2013 SEIP 58

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

bull Search scenario vs detection scenariobull Code snippet as input

bull Much larger scale of codebases

bull Near-real-time response

bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases

bull Deployed in used by and transferred to MSRC

bull Champion in MSRC worked with us all the waybull Providing feedback and update

bull Prompting within MSRC

ICSE 2013 SEIP 59

Microsoft Security Response Center

Vulnerability investigation workflow

ICSE 2013 SEIP 60

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

Team A

MSRC

Manual amp ad hoc investigation

Code snippet

Team B

Team C

Code clones

Vulnerability investigation workflow

ICSE 2013 SEIP 61

Clone search service

Completeness is the key Web service API for

automation

Code snippet

Code clones

Automated Investigation

Code snippet

Code clones

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

More secure Microsoft products

ICSE 2013 SEIP 62

Automated laborious manual efforts

Faster response time critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified

and addressed

Example ndash MS security bulletin MS12-034Combined security update for Microsoft Office Windows NET Framework and Silverlight published Tuesday May 08 2012

3 publicly disclosed vulnerabilities and seven privately reported involved Specifically one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32ksys

Cloned copy in gdiplusdll ogldll (office) Silverlight and Windows Journal viewer

Microsoft Security Research amp Defense Blog about this bulletin

ldquoHowever we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base To that end we have been working with Microsoft Research to develop a ldquoCloned Code Detectionrdquo system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034rdquo

ICSE 2013 SEIP 63httpblogstechnetcombsrdarchive20120508ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surfaceaspx

Transfer to Visual Studio (1)

bull Unsuccessful effortsbull Out-Of-Band (OOB) releasebull Power Tool

bull Two reorgs in Visual Studio

bull Lessons learnedbull No integration story felt like a ldquoseparaterdquo toolbull Not on the release path of VS

bull Accumulated assetsbull Solidified algorithm and systembull Trusted partners

bull One program manager in VSbull MSRA Innovation Engineering Group

ICSE 2013 SEIP 64

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching for similar snippets so a bug is fixed once, everywhere it was copied

Finding refactoring opportunities

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

ICSE Papers: Industry vs. Academia

Source: © Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

[Slide images: ICSE ’09 Keynote, ICSM ’11 Keynote, MSR ’11 Keynote, MSR ’12 Keynote, SCAM ’12 Keynote]

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Redwine and Riddle Study (1985)

• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine Jr., William Riddle. Software Technology Maturation. In Proc. ICSE 1985

Technology Maturation: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?

[Caveat: don’t forget the need for long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop/

Related fields (e.g., formal methods) have recently looked into transitioning research to industrial practice. Time for us to do so too.

Researcher’s View - SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/. Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/. Source: © C. Garling

Industrial Evaluations = Real Adoption?

• Papers on industrial studies/evaluations of applying tools to industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Work typically focuses and is evaluated on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004

MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012

http://patterninsight.com

http://www.blackducksoftware.com

http://research.microsoft.com/en-us/groups/sa/

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava, Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or to tackle one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa/

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research Topics: Software Development Process, Software Systems, Software Users

Technology Pillars: Information Visualization, Analysis Algorithms, Large-scale Computing

[Figure labels: Vertical, Horizontal]

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real: Real Data, Real Problems, Real Users, Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
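To make the four step names concrete, here is a minimal Python sketch of a pipeline with the same shape. It is not XIAO's implementation: the function names, the Jaccard filter used for coarse matching, the `difflib` ratio used for fine matching, and all thresholds are assumptions chosen only for illustration. Each partition is a hypothetical dict mapping a function name to its list of statements.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1: normalize each function into a tuple of whitespace-free statements."""
    return {name: tuple("".join(stmt.split()) for stmt in body if stmt.strip())
            for name, body in functions.items()}


def coarse_match(a, b, threshold=0.5):
    """Step 2: cheap filter on the share of distinct statements in common (Jaccard)."""
    sa, sb = set(a), set(b)
    return bool(sa | sb) and len(sa & sb) / len(sa | sb) >= threshold


def fine_match(a, b):
    """Step 3: order-aware similarity of the two statement sequences, in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def prune(pairs, normalized, min_statements=3, min_similarity=0.7):
    """Step 4: drop candidates that are too short or too dissimilar to report."""
    return [(x, y, score) for x, y, score in pairs
            if score >= min_similarity
            and min(len(normalized[x]), len(normalized[y])) >= min_statements]


def detect_clones(partitions):
    # Pre-processing is independent per partition, so partitions run in parallel.
    normalized = {}
    with ThreadPoolExecutor() as pool:
        for part in pool.map(preprocess, partitions):
            normalized.update(part)
    candidates = [(x, y) for x, y in combinations(sorted(normalized), 2)
                  if coarse_match(normalized[x], normalized[y])]
    scored = [(x, y, fine_match(normalized[x], normalized[y]))
              for x, y in candidates]
    return prune(scored, normalized)
```

The cheap coarse filter runs before the more expensive sequence comparison, which is one plausible reason to split matching into two stages; the slide itself does not say how the real steps are implemented.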

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Percentage of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
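The tuning knobs named on this slide can be illustrated with a small Python sketch. This is an assumption-laden illustration, not XIAO's actual metric: `stmt_threshold` and `order_weight` are hypothetical parameters standing in for "statement similarity" and the "balance between code structure and disordered statements", and the scoring formula is invented for the example.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Character-level similarity of two statements, ignoring whitespace."""
    return SequenceMatcher(None, "".join(s1.split()), "".join(s2.split()),
                           autojunk=False).ratio()


def clone_similarity(block_a, block_b, stmt_threshold=0.8, order_weight=0.5):
    """Score two statement lists in [0, 1].

    stmt_threshold: how alike two statements must be to count as a match.
    order_weight: 1.0 counts only in-order matches, 0.0 ignores statement order.
    """
    total = len(block_a) + len(block_b)
    if total == 0:
        return 1.0
    scores = [[statement_similarity(x, y) for y in block_b] for x in block_a]

    # Order-insensitive part: greedily pair each statement with its best unused partner.
    used, matched = set(), 0
    for row in scores:
        best = max((j for j in range(len(block_b)) if j not in used),
                   key=lambda j: row[j], default=None)
        if best is not None and row[best] >= stmt_threshold:
            used.add(best)
            matched += 1

    # Order-sensitive part: longest common subsequence under the same threshold.
    lcs = [[0] * (len(block_b) + 1) for _ in range(len(block_a) + 1)]
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            if score >= stmt_threshold:
                lcs[i + 1][j + 1] = lcs[i][j] + 1
            else:
                lcs[i + 1][j + 1] = max(lcs[i][j + 1], lcs[i + 1][j])

    unordered = 2 * matched / total
    ordered = 2 * lcs[-1][-1] / total
    return order_weight * ordered + (1 - order_weight) * unordered
```

Applied to the two loop bodies on the slide, lowering `order_weight` makes the score more forgiving of the moved `c = foo(a, b)` statement, while raising `stmt_threshold` stops `e = a + c` and `e = a + d` from counting as the same statement.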

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
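The search scenario (a code snippet as input, answered from a pre-built index) can be sketched in Python as a tiny inverted index. This is only an illustration under stated assumptions: the `CloneSearchIndex` class, its exact-match normalization, and the `min_coverage` parameter are hypothetical and say nothing about how the MSRC service is actually built.

```python
from collections import Counter, defaultdict


def normalize(line):
    """Strip all whitespace so formatting differences do not block a match."""
    return "".join(line.split())


class CloneSearchIndex:
    """Hypothetical inverted index: normalized statement -> files containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_file(self, name, source):
        # Indexing happens once, ahead of time, so queries stay fast.
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add(name)

    def search(self, snippet, min_coverage=0.6):
        """Return (file, coverage) pairs, best first.

        coverage is the share of distinct snippet statements found in the file.
        """
        keys = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = Counter()
        for key in keys:
            for name in self._postings.get(key, ()):
                hits[name] += 1
        ranked = [(name, count / len(keys)) for name, count in hits.items()
                  if count / len(keys) >= min_coverage]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Because the expensive work is done when files are added, a query only touches the postings for the statements in the snippet, which is the general idea behind answering in near real time over a large codebase.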

Vulnerability investigation workflow

[Figure: workflow stages Issue reproducing, Root cause investigation & source location, Variants finding, and Design/Implement/Test fix. For variants finding, a code snippet goes from MSRC to Team A, Team B, and Team C; manual & ad hoc investigation yields code clones.]

Vulnerability investigation workflow

[Figure: the same workflow stages (Issue reproducing, Root cause investigation & source location, Variants finding, Design/Implement/Test fix), with variants finding served by the clone search service: automated investigation takes a code snippet and returns code clones.]

Completeness is the key. Web service API for automation.

More secure Microsoft products


Page 9: Pathways to Technology Transfer and Adoption: Achievements and Challenges

ICSE Papers Industry vs Academia

Sourcecopy Carlo Ghezzi

OSDI 2008 26 vs xSE Developers Programmers Architects Among All Attendees

ICSM 11 KeynoteICSE 09 Keynote

MSR 12 KeynoteMSR 11 Keynote

SCAM 12 Keynote

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Redwine and Riddle Study (1985)

bull From idea to ldquothe point it can be popularized and disseminated to the technical community at largerdquobull Worst case 23 yearsbull Best case 11 yearsbull Mean 17 years

bull75 years from developed technology to wide availability

SourcecopyS L Pfleeger

Sam Redwine Jr William Riddle Software Technology Maturation In Proc ICSE 1985

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in products

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in productsShall we just stay in our comfort zone

to wait for 15-20 years for our research to (or not to) produce

practice impact How about the research that we did

15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]

NSF Workshop on Formal Methods

bull Goal to identify the future directions in research in formal methods and its transition to industrial practice

bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools

httpgotoucsdedu~rjhalaNSFWorkshop

Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too

December 2012

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Researcherrsquos View -SCM Impact Study Findings

bullResearchers tend to consider thathellipbull precedence

bull concepts

bull prototypes

bull are sufficient as impact and ignorehellipbull efficiency

bull usability

bull reliability

bulldismissing them as ldquoengineering common senserdquo

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

A Researchers Observation in HCI Research Community

bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay


Does our research community have similar issues?

Evaluation of Design/PL: "Research in Programming Languages"

• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." (PHP, JavaScript, Python, Ruby)
• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"
• "reverse the trend of placing software research under the auspices of science and engineering [alone]"

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful.
• Sometimes designers fail with the simplest of things, like documentation for their language.
• Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations that apply tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: "Pointer Analysis"

"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind, PASTE'01]

"During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, 'Haven't we solved this problem yet?' With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."

Section 4.3: Designing an Analysis for a Client's Needs

"Barbara Ryder expands on this topic: '… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.'"

Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder

MS Academic Search: "Clone Detection"

Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already matured after n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• CP-Miner: Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks "When scale is up, will my solution still work?"
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engines, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan, ISSTA'02], Klee [Cadar et al., OSDI'08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to address them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, the focus of our recent work [Thummalapenta et al., ESEC/FSE'09, OOPSLA'11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) rather than a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN'09]
  • Coverity [Bessey et al., CACM'10] vs. MSRA XIAO [Dang et al., ACSAC'12] / PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA'06] vs. Daikon [Ernst et al., ICSE'99]
  • Debugging user study [Parnin & Orso, ISSTA'11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso, ISSTA'11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: device driver verification
  • MSR SAGE: security testing of binaries [Godefroid et al., NDSS'08]
  • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA'10]

http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
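The gap between a single measurement dimension and a cost-benefit view can be illustrated with a small arithmetic sketch. Everything here is an assumption made for illustration: the function names, the idea of expressing both cost and benefit in the same unit (minutes), and the numbers themselves. None of it comes from the cited tools or studies.

```python
def precision(true_positives, warnings):
    """Fraction of reported warnings that are real bugs."""
    return true_positives / warnings if warnings else 0.0


def net_benefit(warnings, true_positive_values, minutes_per_inspection):
    """Value of the confirmed bugs minus the human time spent inspecting
    every warning, false positives included (all in minutes, hypothetical)."""
    return sum(true_positive_values) - warnings * minutes_per_inspection


# Hypothetical tool A: many warnings, low-severity bugs (10 minutes saved each).
tool_a = net_benefit(warnings=100, true_positive_values=[10] * 50,
                     minutes_per_inspection=15)

# Hypothetical tool B: few warnings, high-severity bugs (600 minutes saved each).
tool_b = net_benefit(warnings=10, true_positive_values=[600] * 5,
                     minutes_per_inspection=15)

# Both tools have the same precision, yet very different net benefit.
same_precision = precision(50, 100) == precision(5, 10)
```

In this made-up setting the two tools are indistinguishable on precision alone, while the inspection cost and bug severity separate them clearly.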

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…

What does "getting real" mean?

• Solving real problems
• Building real technologies
• Making real impact

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project - XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
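The shape of such a four-step process can be sketched as follows. This is a toy illustration, not XIAO's algorithm: the slide names only the four steps, so the statement-level normalization, the shared-statement index used for coarse matching, the Jaccard score used for fine matching, the size-based pruning rule, and all function and parameter names are assumptions.

```python
import re
from collections import defaultdict
from itertools import combinations


def preprocess(source):
    """Step 1: split a function body into whitespace-normalized statements."""
    statements = []
    for line in source.splitlines():
        normalized = re.sub(r"\s+", "", line)
        if normalized:
            statements.append(normalized)
    return statements


def coarse_match(functions):
    """Step 2: cheaply propose candidate pairs sharing at least one statement."""
    index = defaultdict(set)
    for name, statements in functions.items():
        for statement in statements:
            index[statement].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(functions, candidates, threshold):
    """Step 3: keep candidates whose statement overlap reaches the threshold."""
    matches = {}
    for left, right in candidates:
        a, b = set(functions[left]), set(functions[right])
        score = len(a & b) / len(a | b)
        if score >= threshold:
            matches[(left, right)] = score
    return matches


def prune(functions, matches, min_statements):
    """Step 4: drop clone pairs too small to be worth an engineer's attention."""
    return {
        pair: score
        for pair, score in matches.items()
        if min(len(functions[pair[0]]), len(functions[pair[1]])) >= min_statements
    }


def detect_clones(sources, threshold=0.6, min_statements=3):
    functions = {name: preprocess(text) for name, text in sources.items()}
    candidates = coarse_match(functions)
    return prune(functions, fine_match(functions, candidates, threshold),
                 min_statements)


# Hypothetical input: function name -> function body.
example_sources = {
    "f1": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);",
    "f2": "c = foo(a, b);\na ++;\nb ++;\nd = bar(a, b, c);\ne ++;",
    "f3": "return 0;",
}
clones = detect_clones(example_sources)
```

In this sketch the pre-processing step touches each function independently, which is the kind of property that makes partitioning the source code across workers straightforward; the cheap coarse step exists so the more expensive fine comparison runs only on plausible pairs.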

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Snippet 1:

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

Snippet 2:

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
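One way to see what "balance between code structure and disordered statements" could mean is to score the two loop bodies from the slide twice: once respecting statement order and once ignoring it. The sketch below is an assumption-laden illustration, not XIAO's metric: it uses Python's `difflib.SequenceMatcher` for the ordered score, a multiset overlap for the unordered score, and a hypothetical `order_weight` knob to blend them.

```python
from collections import Counter
from difflib import SequenceMatcher

# Loop bodies of the two snippets, one statement per entry.
SNIPPET_1 = [
    "a++;",
    "b++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]
SNIPPET_2 = [
    "c = foo(a, b);",
    "a++;",
    "b++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e++;",
]


def ordered_similarity(a, b):
    """Share of statements matched while preserving their relative order."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def unordered_similarity(a, b):
    """Share of statements matched when order is ignored (multiset overlap)."""
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b))


def blended_similarity(a, b, order_weight=0.5):
    """Hypothetical knob: 1.0 respects order fully, 0.0 ignores it."""
    return (order_weight * ordered_similarity(a, b)
            + (1 - order_weight) * unordered_similarity(a, b))


ordered = ordered_similarity(SNIPPET_1, SNIPPET_2)      # 6/11: three statements align in order
unordered = unordered_similarity(SNIPPET_1, SNIPPET_2)  # 8/11: four statements are shared
```

Moving `c = foo(a, b);` to the top costs the ordered score one match but leaves the unordered score untouched, so a user who cares about reordered copies would lower `order_weight`, while a user who wants structurally faithful clones would raise it.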


Explorability

How to navigate through the large number of detected clones? (screenshot callouts)
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones? (screenshot callouts)
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication - getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication - forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project - XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Actively "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Timeline (figure):
1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0day attack
5. Patch release of product B

bull Search scenario vs detection scenariobull Code snippet as input

bull Much larger scale of codebases

bull Near-real-time response

bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases

bull Deployed in used by and transferred to MSRC

bull Champion in MSRC worked with us all the waybull Providing feedback and update

bull Prompting within MSRC

ICSE 2013 SEIP 59

Microsoft Security Response Center
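The search scenario differs from batch detection in that a code snippet is the query and the codebases are indexed ahead of time so that answers come back quickly. A minimal sketch of that idea is shown below; it is not the service described on the slide. The `CloneSearchIndex` class, the statement-level inverted index, the `min_coverage` parameter, and the file names and code in the usage example are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(line):
    """Strip all whitespace so formatting differences do not hide a match."""
    return re.sub(r"\s+", "", line)


class CloneSearchIndex:
    """Toy inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, source):
        """Index every non-empty statement of one source unit."""
        for line in source.splitlines():
            statement = normalize(line)
            if statement:
                self._postings[statement].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best coverage first, where
        coverage is the share of the snippet's statements found there."""
        statements = {normalize(line) for line in snippet.splitlines()} - {""}
        if not statements:
            return []
        hits = Counter()
        for statement in statements:
            for location in self._postings.get(statement, ()):
                hits[location] += 1
        results = [
            (location, count / len(statements))
            for location, count in hits.items()
            if count / len(statements) >= min_coverage
        ]
        return sorted(results, key=lambda item: (-item[1], item[0]))


# Hypothetical codebases and query snippet.
index = CloneSearchIndex()
index.add("productA/font.c", "if (len > max)\n    return;\ncopy(dst, src, len);")
index.add("productB/parse.c", "if (len > max)\n  return;\ncopy(dst, src, len);\nlog();")
index.add("productC/ui.c", "draw();")

variants = index.search("if (len > max)\n        return;\ncopy(dst, src, len);")
```

Building the index once and answering each query from postings lists is what keeps the lookup fast in this sketch; a production service would also need near-miss matching rather than exact statement equality.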

Vulnerability investigation workflow

Workflow (figure): issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix
• Variants finding is a manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, which report code clones back

Vulnerability investigation workflow (with clone search service)

Workflow (figure): same steps, with variants finding turned into an automated investigation
• MSRC submits a code snippet to the clone search service and receives code clones
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example - MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a 'Cloned Code Detection' system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness


Redwine and Riddle Study (1985)

• From idea to "the point it can be popularized and disseminated to the technical community at large"
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine, Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.

Technology Maturation: Middleware

15-20 years between first publication of an idea and widespread availability in products

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago?

[Caveat: don't forget the need for long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop/

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Researcherrsquos View -SCM Impact Study Findings

bullResearchers tend to consider thathellipbull precedence

bull concepts

bull prototypes

bull are sufficient as impact and ignorehellipbull efficiency

bull usability

bull reliability

bulldismissing them as ldquoengineering common senserdquo

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

A Researchers Observation in HCI Research Community

bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

Does our research community

have similar issues

Evaluation of DesignPLldquoResearch in Programming Languagesrdquo

bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby

bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo

bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo

Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes

Why Do Some Programming Languages Live and Others Die

bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem

bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like

documentation for their language

bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

httpwwwwiredcomwiredenterprise201206berkeley-programming-languages

Wiredcom

SourcecopyC Garling

Industrial Evaluations= Real Adoption

bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties

bull Non-target users (such as students)

bull Target users but not developers of the industrial code

bull Developers of the industrial code

bull Apply one-time (hitamprun) or continuous adoption

Need to value real adoption (eg in reviewing papers)

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

Scalability

• Academia

  • Rarely asks "When scale is up, will my solution still work?"

  • Tends to focus on small or toy-scale problems

• Real world (e.g., search engines, code analysis, …)

  • Often demands a scalable solution

  • Ideal: a sophisticated and scalable solution

    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)

    • Academia tends to value sophistication over simplicity

    • Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA'02], KLEE [Cadar et al. OSDI'08]

http://dl.acm.org/citation.cfm?id=566187

http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia

  • Tends to make assumptions to simplify problems, or to tackle one assumption at a time (indeed relaxing assumptions over time)

  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry

• Real world

  • Often has high complexity, violating these assumptions

• Example: OO unit test generation

  • Isolated simple classes, then isolated complex data structures, then real-world classes as focused on by our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]

http://dl.acm.org/citation.cfm?id=2048083

http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia

  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution

  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution

• Real world

  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)

• Examples

  • Integration of our Fitnex in Pex [Xie et al. DSN'09]

  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12] / PatternInsight

  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374

http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia

  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)

  • Tends not to spend effort on improving tool usability

    • Tool usability would be valued more in HCI than in SE

    • It is too much to include both the approach/tool itself and usability/its evaluation in a single paper

• Real world

  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)

• Examples

  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]

  • Debugging user study [Parnin & Orso ISSTA'11]

http://dl.acm.org/citation.cfm?id=302467

http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research

  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA'11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia

  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)

• Real world

  • Considers many dimensions of measurement

    • Cost, e.g., human cost (inspecting false positives)

    • Benefit, e.g., bug severity

  • Killer apps, e.g.,

    • MSR SLAM: device driver verification

    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]

    • PatternInsight / MSRA XIAO: known-bug detection

• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'10]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf

http://dl.acm.org/citation.cfm?id=1831738
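The cost and benefit dimensions named on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the function name `net_benefit`, the cost unit, the severity values, and the two tool profiles are all made-up assumptions, not data from any of the cited studies. It shows that a tool with lower precision can still come out ahead once human inspection cost and bug severity are both counted.

```python
def net_benefit(warnings, inspect_cost, severity_value):
    """Return benefit minus cost for a batch of analysis warnings.

    warnings: list of (is_true_bug, severity) tuples, one per reported warning.
    inspect_cost: assumed human cost of inspecting one warning (arbitrary units).
    severity_value: assumed value of fixing one true bug, keyed by severity.
    """
    # Every warning costs inspection time, whether or not it is a real bug.
    cost = inspect_cost * len(warnings)
    # Only true bugs yield benefit, weighted by how severe they are.
    benefit = sum(severity_value[sev] for is_bug, sev in warnings if is_bug)
    return benefit - cost


# Hypothetical value scale and tool outputs, invented for this example.
SEVERITY_VALUE = {"low": 2, "high": 30}
precise_tool = [(True, "low")] * 9 + [(False, "low")] * 1     # 90% precision
noisy_tool = [(True, "high")] * 5 + [(False, "low")] * 95     # 5% precision

print(net_benefit(precise_tool, 1, SEVERITY_VALUE))
print(net_benefit(noisy_tool, 1, SEVERITY_VALUE))
```

Under these invented numbers the noisy tool wins because its few true findings are severe, which is the kind of trade-off a precision-only evaluation would hide.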

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)

• Academia (research innovations) vs. Industry (likely involving engineering efforts)

• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)

• Industry: problems, infrastructures, data, evaluation testbeds, …

• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research topics (vertical): Software Development Process, Software Systems, Software Users

Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice

• Getting real: real data, real problems, real users, real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC'12]

• Detecting near-duplicated code

• Released with Visual Studio 2012

StackMine [Han et al. ICSE'12]

• Performance debugging in the large via mining millions of stack traces

• Helping improve Windows performance

http://research.microsoft.com/jump/175199

http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset

• Technical readiness

• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…

• The scale of data is huge…

• We do not have all the time in the world to compute…

• The machines are not powerful enough…

• End users are "impatient"…

• Product teams are always busy…

• Product teams do not commit before seeing everything working…

• Product teams change plans and priorities…

• Product teams speak "different languages"…

• More…

What does "getting real" mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions

• Scalability

• Complexity

• Usability

• Cost-Benefit Analysis

• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique

• Characteristics

  • High tunability

  • High scalability

  • High compatibility

  • High explorability

• Technology transfers

  • Visual Studio 2012

  • Code clone search service within Microsoft

• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning

• Easily parallelizable based on source code partition
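A minimal sketch of how a four-step clone analysis of this shape could be organized. It is not XIAO's implementation: the function names, the identifier normalization, the statement-hash index used for coarse matching, the use of Python's `difflib` for fine matching, and every threshold are assumptions chosen for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1 (assumed form): split code into statements and normalize tokens."""
    statements = []
    for raw in re.split(r"[;\n]", source):
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\s\w]", raw)
        # Identifiers become ID and numbers become NUM so renamed variables still match.
        normalized = [
            "ID" if re.match(r"[A-Za-z_]", t) else "NUM" if t.isdigit() else t
            for t in tokens
        ]
        if normalized:
            statements.append(" ".join(normalized))
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2: cheap candidate generation through an inverted statement index."""
    index = defaultdict(set)
    for name, statements in functions.items():
        for stmt in set(statements):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates, threshold=0.7):
    """Step 3: score each candidate pair on its full statement sequences."""
    scored = []
    for left, right in candidates:
        matcher = SequenceMatcher(None, functions[left], functions[right], autojunk=False)
        ratio = matcher.ratio()
        if ratio >= threshold:
            scored.append((left, right, ratio))
    return scored


def prune(functions, matches, min_statements=3):
    """Step 4: drop matches between functions too short to be interesting."""
    return [
        m for m in matches
        if min(len(functions[m[0]]), len(functions[m[1]])) >= min_statements
    ]


def detect_clones(sources, min_shared=2, threshold=0.7, min_statements=3):
    """Run the four steps over a mapping of function name to source text."""
    functions = {name: preprocess(code) for name, code in sources.items()}
    candidates = coarse_match(functions, min_shared)
    return prune(functions, fine_match(functions, candidates, threshold), min_statements)
```

In this sketch the expensive sequence comparison only runs on pairs that survive the index lookup, and pre-processing touches each function independently, which is one plausible reason a pipeline of this kind partitions well across a codebase.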

What you tune is what you get

• Intuitive similarity metric

  • Effective control of the degree of syntactical differences between two code snippets

• Tunable at fine granularity

  • Statement similarity

  • Inserted/deleted/modified statements

  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
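To make the tuning knobs on this slide concrete, the sketch below scores the slide's two snippets under different settings. It is an assumption-laden illustration, not XIAO's metric: `min_stmt_sim` stands in for statement similarity, unmatched statements stand in for insertions and deletions, `order_weight` stands in for the balance between code structure and disordered statements, and the greedy matching and scoring formula are invented here.

```python
from bisect import bisect_left
from difflib import SequenceMatcher

LOOP = "for (i = 0; i < n; i++)"
LEFT = [LOOP, "a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
RIGHT = [LOOP, "c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]


def statement_similarity(s, t):
    """Character-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s, t, autojunk=False).ratio()


def match_statements(left, right, min_stmt_sim):
    """Greedily pair each left statement with its most similar unused right one."""
    unused = list(range(len(right)))
    pairs = []
    for i, s in enumerate(left):
        best = max(unused, key=lambda j: statement_similarity(s, right[j]), default=None)
        if best is not None and statement_similarity(s, right[best]) >= min_stmt_sim:
            pairs.append((i, best))
            unused.remove(best)
    return pairs


def longest_increasing_run(values):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for v in values:
        k = bisect_left(tails, v)
        if k == len(tails):
            tails.append(v)
        else:
            tails[k] = v
    return len(tails)


def clone_similarity(left, right, min_stmt_sim=0.8, order_weight=0.5):
    """Blend order-sensitive and order-insensitive matches into one score."""
    if not left or not right:
        return 0.0
    pairs = match_statements(left, right, min_stmt_sim)
    matched = len(pairs)
    # Matched statements that also appear in the same relative order.
    in_order = longest_increasing_run([j for _, j in pairs])
    blended = order_weight * in_order + (1.0 - order_weight) * matched
    return blended / max(len(left), len(right))


strict = clone_similarity(LEFT, RIGHT, min_stmt_sim=1.0, order_weight=1.0)
lenient = clone_similarity(LEFT, RIGHT, min_stmt_sim=0.8, order_weight=0.0)
print(strict, lenient)
```

With exact statement matching and full weight on ordering, the moved `c = foo(a, b)` and the modified last assignment both count against the pair; relaxing either knob raises the score, which is the behavior the slide's title describes.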


Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy

2. Pivoting of folder-level statistics

3. Folder-level statistics

4. Clone function list in selected folder

5. Clone function filters

6. Sorting by bug or refactoring potential

7. Tagging

How to quickly review a pair of clones?

1. Block correspondence

2. Block types

3. Block navigation

4. Copying

5. Bug filing

6. Tagging

Collaboration

• Collaboration models

• Communication

• Champion in product teams

• Getting engineering support

Collaboration models

• Pull

• Push

• Join

Communication – getting connected

• Reaching out to practitioners

• Understanding their business

• Speaking practitioners' languages

• Finding out their pain points

  • Understanding their scenarios

  • Experiencing their pain

  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals

• Setting the right expectation

• Building a roadmap

• Forming a virtual team (creating an email alias)

• Adopting a milestone approach

• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years

• 6 years of International Workshop on Software Clones (IWSC) since 2006

• Dagstuhl Seminar

  • Software Clone Management towards Industrial Application (2012)

  • Duplication, Redundancy, and Similarity in Software (2006)

• No code clone analysis tools in MS

• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior

• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest

• Posting XIAO on an internal website

• Active "selling" to various teams

• What we gained

  • Opportunities to run XIAO on different codebases and produce rich results

  • Feedback to improve both algorithm and system

  • Expanded network

Reaching out (2)

• What did not land well internally

  • Wide interest but no concrete takers

• Why no takers?

  • What exactly is the value proposition?

  • Long way to go from code clones to bugs

  • High cost for code refactoring

  • Product prioritization

• Lessons learned

  • Killer scenarios needed for value proposition

  • Security is a big stick

Potential 0day vulnerability disclosure

Timeline (figure):

1. Initial vulnerability reported in product A

2. Security bulletin released

3. Similar vulnerability found in product B by attackers

4. Potential 0day attack

5. Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario

  • Code snippet as input

  • Much larger scale of codebases

  • Near-real-time response

• Code clone search service

  • Indexed ~600 million LOC across multiple codebases

  • Deployed in, used by, and transferred to MSRC

• Champion in MSRC worked with us all the way

  • Providing feedback and updates

  • Promoting within MSRC
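The search scenario differs from batch detection in that a single snippet is the query and the answer must come back quickly over a large, pre-built index. The sketch below shows one common way to get that shape, an inverted index over token n-grams. It is an assumption for illustration and not a description of the deployed service: the class name `CloneIndex`, the n-gram size, the coverage threshold, and the file names are all hypothetical.

```python
import re
from collections import defaultdict


def shingles(code, size=3):
    """Return the set of token n-grams in a code string (whitespace-insensitive)."""
    tokens = re.findall(r"\w+|[^\s\w]", code)
    return {tuple(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}


class CloneIndex:
    """Toy inverted index: token n-gram -> names of code units containing it."""

    def __init__(self, size=3):
        self.size = size
        self.postings = defaultdict(set)

    def add(self, name, code):
        """Index one code unit; done once, ahead of any query."""
        for gram in shingles(code, self.size):
            self.postings[gram].add(name)

    def search(self, snippet, min_coverage=0.6):
        """Rank indexed units by the fraction of the snippet's n-grams they contain."""
        grams = shingles(snippet, self.size)
        if not grams:
            return []
        hits = defaultdict(int)
        for gram in grams:
            for name in self.postings.get(gram, ()):
                hits[name] += 1
        ranked = [(name, count / len(grams)) for name, count in hits.items()]
        kept = [r for r in ranked if r[1] >= min_coverage]
        return sorted(kept, key=lambda r: (-r[1], r[0]))


# Hypothetical codebase contents and query, invented for this example.
index = CloneIndex()
index.add("libA/parse.c", "n = read(fd); memcpy(buf, src, len); use(buf);")
index.add("libB/copy.c", "memcpy(dst, src, len);")
index.add("libC/main.c", "return 0;")

query = "memcpy(buf, src, len);"
print(index.search(query))
print(index.search(query, min_coverage=0.5))
```

Query time here depends on the snippet's n-grams and their posting lists rather than on the total size of the indexed code, which is the property a near-real-time search over many codebases needs; lowering `min_coverage` trades precision for the completeness that variant finding asks for.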

Vulnerability investigation workflow

MSRC workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix

Before: manual & ad hoc investigation. A code snippet goes from MSRC to Team A, Team B, and Team C, and code clones come back.

After: automated investigation through the clone search service. A code snippet goes in and code clones come back.

Completeness is the key; web service API for automation

More secure Microsoft products

Code clone search service integrated into the vulnerability investigation process of MSRC

Automated laborious manual efforts

Faster response time, critical in security context

Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts

  • Out-Of-Band (OOB) release

  • Power Tool

• Two reorgs in Visual Studio

• Lessons learned

  • No integration story: felt like a "separate" tool

  • Not on the release path of VS

• Accumulated assets

  • Solidified algorithm and system

  • Trusted partners

    • One program manager in VS

    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm

  • Strong support from general manager of VSU

  • Concrete scenarios identified

  • Easy sell at VS 2012 planning meeting

• Virtual team

  • Researchers (MSRA SA)

  • Developers (MSRA IEG, VS)

  • Program manager (VS)

  • Tester (VS)

• Active planning as part of VS 2012 release

• Weekly sync-up

• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing a bug once

Finding refactoring opportunities

Summary

• Mindset changing needed for community

  • Get out of comfort zone

  • Value (and pursue) "realness"

  • Aim for ultimate tasks

  • Value (and pursue) tech readiness

• Experience sharing of successful tech transfer on Software Analytics

  • Getting-real mindset

  • Technical readiness

  • Collaboration


Redwine and Riddle Study (1985)

• From idea to "the point it can be popularized and disseminated to the technical community at large"

  • Worst case: 23 years

  • Best case: 11 years

  • Mean: 17 years

• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.

Technology Maturation: Middleware

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone and wait 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago? [Caveat: don't forget the need for long-term/blue-sky research]

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice

• The workshop brought together researchers and identified primary challenges in the field, both foundational and infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.


Researcher's View: SCM Impact Study Findings

• Researchers tend to consider that…

  • precedence

  • concepts

  • prototypes

• are sufficient as impact and ignore…

  • efficiency

  • usability

  • reliability

• dismissing them as "engineering common sense"

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher's Observation in HCI Research Community

• "The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks."

• "This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper."

• "When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?"

• "We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics."

"I give up on CHI/UIST" by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: "Research in Programming Languages"

• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." (PHP, JavaScript, Python, Ruby)

• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"

• "reverse the trend of placing software research under the auspices of science and engineering [alone]"

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem.

• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful.

  • Sometimes designers fail with the simplest of things, like documentation for their language

  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations of applying tools on industrial code: who applies them?

  • Authors themselves instead of third parties

  • Non-target users (such as students)

  • Target users but not developers of the industrial code

  • Developers of the industrial code

• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)


MS Academic Search: "Pointer Analysis"

"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind PASTE'01]

"During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, 'Haven't we solved this problem yet?' With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."

Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

• Software Analytics is naturally tied with software development practice

• Getting real: Real Data, Real Problems, Real Users, Real Tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC’12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE’12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199

http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: Prototype development → Early adoption → Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning

• Easily parallelizable based on source code partition
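
A four-step process of this kind can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not XIAO's actual implementation: the function names (`preprocess`, `coarse_match`, `fine_match`, `prune`, `detect_clones`), the normalization rules, and all thresholds are hypothetical. Pre-processing runs per function in a thread pool only to hint at how partitioning the source code lets the work proceed in parallel.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def preprocess(source):
    """Step 1 (assumed): turn source text into normalized statements."""
    statements = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        line = re.sub(r"[A-Za-z_]\w*", "ID", line)  # any name or keyword -> ID
        line = re.sub(r"\d+", "NUM", line)          # any number -> NUM
        statements.append(re.sub(r"\s+", "", line))
    return statements


def coarse_match(normalized, min_shared=2):
    """Step 2 (assumed): candidate pairs share >= min_shared distinct statements."""
    index = defaultdict(set)
    for name, statements in normalized.items():
        for stmt in set(statements):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                shared[(first, second)] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(normalized, pairs, threshold=0.7):
    """Step 3 (assumed): score candidates by statement-sequence similarity."""
    matches = []
    for first, second in pairs:
        score = SequenceMatcher(None, normalized[first], normalized[second]).ratio()
        if score >= threshold:
            matches.append((first, second, score))
    return matches


def prune(normalized, matches, min_statements=3):
    """Step 4 (assumed): drop clones that are too short to be interesting."""
    return [
        (first, second, score)
        for first, second, score in matches
        if len(normalized[first]) >= min_statements
        and len(normalized[second]) >= min_statements
    ]


def detect_clones(functions, workers=2):
    """Run the four steps; `functions` maps a function name to its source text."""
    names = list(functions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        statements = list(pool.map(preprocess, (functions[n] for n in names)))
    normalized = dict(zip(names, statements))
    pairs = coarse_match(normalized)
    return prune(normalized, fine_match(normalized, pairs))
```

The cheap coarse step keeps the expensive pairwise comparison limited to candidates that already share statements, which is one plausible reason to separate coarse from fine matching.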

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
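
The tuning knobs listed on this slide can be illustrated with a small sketch applied to the two loop snippets shown. It is not the metric defined in the XIAO paper: the parameter names `alpha` (minimum statement similarity), `beta` (maximum fraction of unmatched statements), and `allow_disorder`, as well as the greedy matching strategy, are assumptions chosen for the example.

```python
import re
from difflib import SequenceMatcher

SNIPPET_A = [
    "for (i = 0; i < n; i++)",
    "a++;",
    "b++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]

SNIPPET_B = [
    "for (i = 0; i < n; i++)",
    "c = foo(a, b);",
    "a++;",
    "b++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e++;",
]


def tokens(statement):
    """Split a statement into identifiers, numbers, and single symbols."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)


def statement_similarity(first, second):
    """Token-level similarity in [0, 1] between two statements."""
    return SequenceMatcher(None, tokens(first), tokens(second)).ratio()


def match_statements(a, b, alpha):
    """Greedily pair each statement in `a` with its most similar unused
    statement in `b`, keeping only pairs with similarity >= alpha."""
    used = set()
    pairs = []
    for i, stmt in enumerate(a):
        best_j, best_sim = None, None
        for j, other in enumerate(b):
            if j in used:
                continue
            sim = statement_similarity(stmt, other)
            if sim >= alpha and (best_sim is None or sim > best_sim):
                best_j, best_sim = j, sim
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def is_clone(a, b, alpha=0.8, beta=0.3, allow_disorder=True):
    """Decide whether two statement lists count as clones.

    Unmatched statements on either side are treated as inserted or deleted;
    matched pairs below similarity 1.0 are treated as tolerated modifications.
    """
    pairs = match_statements(a, b, alpha)
    unmatched = (len(a) - len(pairs)) + (len(b) - len(pairs))
    difference_ratio = unmatched / (len(a) + len(b))
    in_order = all(p[1] < q[1] for p, q in zip(pairs, pairs[1:]))
    return difference_ratio <= beta and (allow_disorder or in_order)
```

With the defaults, the two snippets are reported as clones; requiring the same statement order (`allow_disorder=False`) or tightening both `alpha` and `beta` rejects them, which is the sense in which the result follows the tuning.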

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
• Understanding their scenarios
• Experiencing their pain
• Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
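
The search scenario differs from batch detection in that a code snippet is the query and a prebuilt index answers it quickly. A minimal sketch of that idea follows, assuming a simple inverted index from normalized statements to function identifiers. The class `CloneIndex`, its methods, the overlap score, and the example function identifiers are hypothetical and are not taken from the deployed service.

```python
import re
from collections import defaultdict


def normalize(statement):
    """Remove all whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", "", statement)


class CloneIndex:
    """Inverted index: normalized statement -> ids of functions containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, function_id, statements):
        """Index one function, given as a list of statement strings."""
        for statement in statements:
            if statement.strip():
                self._postings[normalize(statement)].add(function_id)

    def search(self, snippet, min_overlap=0.5):
        """Return (function_id, overlap) pairs, best match first.

        `overlap` is the fraction of distinct snippet statements that also
        occur in the indexed function.
        """
        keys = {normalize(s) for s in snippet if s.strip()}
        if not keys:
            return []
        hits = defaultdict(int)
        for key in keys:
            for function_id in self._postings.get(key, ()):
                hits[function_id] += 1
        ranked = [
            (function_id, count / len(keys))
            for function_id, count in hits.items()
            if count / len(keys) >= min_overlap
        ]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Because only the postings of the query's own statements are visited, lookup cost depends on the snippet rather than on the total size of the indexed codebases, which is the property a near-real-time search needs.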

Vulnerability investigation workflow

Steps (figure labels): Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix

Manual & ad hoc investigation (figure labels: MSRC, Team A, Team B, Team C; code snippet, code clones)

Automated investigation (figure labels: Clone search service; code snippet, code clones)

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 12: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Redwine and Riddle Study (1985)

• From idea to “the point it can be popularized and disseminated to the technical community at large”
  • Worst case: 23 years
  • Best case: 11 years
  • Mean: 17 years
• 7.5 years from developed technology to wide availability

Source: © S. L. Pfleeger

Sam Redwine, Jr. and William Riddle. Software Technology Maturation. In Proc. ICSE 1985.

Technology Maturation: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone to wait for 15-20 years for our research to (or not to) produce practice impact? How about the research that we did 15-20 years ago? [Caveat: don’t forget the need of long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Researcher’s View – SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”

• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”

• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”

• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby

• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”

• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages. Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages

Wired.com. Source: © C. Garling

Industrial Evaluations = Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who apply?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field already grows mature with n years of efforts on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com

http://www.blackducksoftware.com

http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tend to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan, ISSTA’02], KLEE [Cadar et al., OSDI’08]

http://dl.acm.org/citation.cfm?id=566187

http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consult/work w/ industry)
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → Isolated complex data structures → Real-world classes, as focused by our recent work [Thummalapenta et al., ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083

http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising too much other situations)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN’09]
  • Coverity [Bessey et al., CACM’10] vs. MSRA XIAO [Dang et al., ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374

http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA’06] vs. Daikon [Ernst et al., ICSE’99]
  • Debugging user study [Parnin & Orso, ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467

http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso, ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: Device driver verification
  • MSR SAGE: Security testing of binaries [Godefroid et al., NDSS’08]
  • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA’09]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf

http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of the International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Actively "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0-day vulnerability disclosure

[Timeline diagram]

1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0-day attack
5. Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
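The search scenario (a code snippet in, matching locations out, with near-real-time response) is commonly served by a prebuilt index rather than by rescanning the codebases. The Python sketch below illustrates that general idea only. The class `CloneSearchIndex`, its methods, and the file-level ranking are assumptions made for illustration and do not describe the service deployed at MSRC.

```python
from collections import defaultdict


class CloneSearchIndex:
    """Toy inverted index: normalized statement -> places where it occurs."""

    def __init__(self):
        self._postings = defaultdict(set)  # key -> {(path, line number)}

    @staticmethod
    def _normalize(line):
        # Remove all whitespace so layout changes do not hide a match.
        return "".join(line.split())

    def add_file(self, path, source):
        """Index every non-empty line of one source file."""
        for lineno, line in enumerate(source.splitlines(), start=1):
            key = self._normalize(line)
            if key:
                self._postings[key].add((path, lineno))

    def search(self, snippet, min_ratio=0.6):
        """Return (ratio, path) pairs for files that contain at least
        min_ratio of the snippet's distinct statements, best first."""
        keys = {self._normalize(line) for line in snippet.splitlines()} - {""}
        if not keys:
            return []
        matched = defaultdict(set)
        for key in keys:
            for path, _lineno in self._postings.get(key, ()):
                matched[path].add(key)
        ranked = [(len(found) / len(keys), path)
                  for path, found in matched.items()
                  if len(found) / len(keys) >= min_ratio]
        return sorted(ranked, reverse=True)


if __name__ == "__main__":
    index = CloneSearchIndex()
    index.add_file("a.c", "x = read();\nif (x > len) return;\ncopy(buf, x);\n")
    index.add_file("b.c", "x = read();\ncopy(buf, x);\nlog(x);\n")
    index.add_file("c.c", "y = 1;\n")
    print(index.search("x = read();\ncopy(buf, x);"))
```

Because the lookup touches only the postings of the query's statements, response time depends on the snippet size rather than on the total indexed code, which is what makes a near-real-time response plausible at large scale.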

Vulnerability investigation workflow (before)

[Diagram: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C; their manual & ad hoc investigation returns code clones.]

Vulnerability investigation workflow (with clone search service)

[Diagram: the same four stages. For variants finding, the code snippet goes to the clone search service, and automated investigation returns code clones.]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in the security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal Viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching for similar snippets to fix a bug once
• Finding refactoring opportunities

Summary

• Mindset change needed for the community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer in Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 13: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Technology Maturation: Middleware

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

15-20 years between first publication of an idea and widespread availability in products

Shall we just stay in our comfort zone and wait 15-20 years for our research to produce (or not produce) practical impact? How about the research that we did 15-20 years ago?

[Caveat: don't forget the need for long-term/blue-sky research]

NSF Workshop on Formal Methods (December 2012)

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field: foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Related fields (e.g., formal methods) have recently looked into transitioning research to industrial practice. Time for us to do so too.

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) "realness"
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Researcher's View: SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as "engineering common sense"

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher's Observation in the HCI Research Community

• "The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks."
• "This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper."
• "When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?"
• "We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics."

"I give up on CHI/UIST" by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: "Research in Programming Languages"

• "Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun." – PHP, JavaScript, Python, Ruby
• "one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them"
• "reverse the trend of placing software research under the auspices of science and engineering [alone]"

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don't always have practical objectives. There's a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what's needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations of applying tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)


MS Academic Search: "Pointer Analysis"

"Pointer Analysis: Haven't We Solved This Problem Yet?" [Hind, PASTE'01]

"During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, "Haven't we solved this problem yet?" With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems."

Section 4.3: Designing an Analysis for a Client's Needs

"Barbara Ryder expands on this topic: "… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.""

Michael Hind. Pointer analysis: haven't we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: "Clone Detection"

Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa


Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks "When scale is up, will my solution still work?"
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan, ISSTA'02], KLEE [Cadar et al., OSDI'08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice without consulting/working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al., ESEC/FSE'09, OOPSLA'11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN'09]
  • Coverity [Bessey et al., CACM'10] vs. MSRA XIAO [Dang et al., ACSAC'12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA'06] vs. Daikon [Ernst et al., ICSE'99]
  • Debugging user study [Parnin & Orso, ISSTA'11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso, ISSTA'11] http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al., NDSS'08]
    • PatternInsight/MSRA XIAO: known-bug detection
  • Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA'09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
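As a small worked illustration of why precision alone does not settle adoption, the Python sketch below weighs the human cost of inspecting every warning against the value of the real bugs found. The function name, the unit (engineer minutes), and all numbers are hypothetical and chosen only to show the arithmetic. They are not measurements from any of the tools cited above.

```python
def net_benefit(warnings, precision, minutes_per_inspection, minutes_saved_per_bug):
    """Net engineer minutes gained by triaging a tool's warnings.

    Every warning costs inspection time; only true positives pay back.
    """
    inspection_cost = warnings * minutes_per_inspection
    benefit = warnings * precision * minutes_saved_per_bug
    return benefit - inspection_cost


if __name__ == "__main__":
    # Same tool output, same inspection cost, different bug severity.
    low_severity = net_benefit(100, 0.10, 5, 30)    # cheap-to-fix style issues
    high_severity = net_benefit(100, 0.10, 5, 600)  # security-critical bugs
    print(low_severity, high_severity)
```

With identical precision, the result flips from a net loss to a net gain once the payoff per real bug is large enough, which is one way to read the "killer apps" bullet: the same analysis becomes worth adopting in a scenario where each true positive matters a great deal.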

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

[Diagram: research topics (vertical): Software Development Process, Software Systems, Software Users. Technology pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing.]

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC'12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE'12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…

What does "getting real" mean?

• Solving real problems
• Building real technologies
• Making real impact

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development → Early adoption → Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  • Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
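The slide names the four steps but not what each one does, so the Python sketch below is only one plausible reading: a cheap candidate filter followed by a more expensive comparison and a final cleanup. All function names, thresholds, and per-step behavior are assumptions made for illustration, not XIAO's implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1 (assumed): turn each function body into normalized statements."""
    return {name: ["".join(line.split())
                   for line in source.splitlines() if line.strip()]
            for name, source in functions.items()}


def coarse_match(stmts_by_fn, min_shared=2):
    """Step 2 (assumed): cheap filter that keeps only function pairs
    sharing at least min_shared identical statements."""
    owners = defaultdict(set)
    for name, stmts in stmts_by_fn.items():
        for stmt in set(stmts):
            owners[stmt].add(name)
    shared = defaultdict(int)
    for names in owners.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(stmts_by_fn, candidates, threshold=0.7):
    """Step 3 (assumed): order-aware comparison on surviving pairs only."""
    clones = []
    for a, b in candidates:
        score = SequenceMatcher(None, stmts_by_fn[a], stmts_by_fn[b],
                                autojunk=False).ratio()
        if score >= threshold:
            clones.append((a, b, score))
    return clones


def prune(clones, stmts_by_fn, min_statements=3):
    """Step 4 (assumed): drop clones too small to be worth reviewing."""
    return [(a, b, score) for a, b, score in clones
            if min(len(stmts_by_fn[a]), len(stmts_by_fn[b])) >= min_statements]


def detect_clones(functions):
    stmts = preprocess(functions)
    return prune(fine_match(stmts, coarse_match(stmts)), stmts)


if __name__ == "__main__":
    sample = {
        "f": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);",
        "g": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);\ne++;",
        "h": "return 0;",
    }
    print(detect_clones(sample))
```

In this reading, pre-processing handles each function independently and fine matching handles each candidate pair independently, so both could be distributed across source code partitions, which is consistent with the "easily parallelizable" bullet.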



Page 14: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Technology Maturation Middleware

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

15-20 years between first

publication of an idea and widespread availability in productsShall we just stay in our comfort zone

to wait for 15-20 years for our research to (or not to) produce

practice impact How about the research that we did

15-20 years ago[Caveat donrsquot forget the need of long-termblue-sky research]

NSF Workshop on Formal Methods

• Goal: to identify the future directions in research in formal methods and its transition to industrial practice
• The workshop brought together researchers and identified primary challenges in the field, both foundational, infrastructural, and in transitioning ideas from research labs to developer tools

http://goto.ucsd.edu/~rjhala/NSFWorkshop

Recently, related fields (e.g., formal methods) have already looked into transitioning research to industrial practice. Time for us to do so too.

December 2012

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Researcher’s View: SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages Source: © C. Garling

Industrial Evaluations= Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who applies?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, ‘Haven’t we solved this problem yet?’ With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: ‘… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.’”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com

http://www.blackducksoftware.com

http://research.microsoft.com/en-us/groups/sa

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava, Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or to address them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • From isolated simple classes, to isolated complex data structures, to real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083

http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

• Research Topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology Pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project: XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
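The slide names only the four steps, so the following is a minimal sketch of how such a pipeline might be laid out, not XIAO's actual implementation. Every detail is an assumption made for illustration: the identifier normalization in `preprocess`, the shared-statement inverted index in `coarse_match`, the use of `difflib` in `fine_match`, the threshold in `prune`, and all function and parameter names.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical keyword list; a real tool would use the language's grammar.
KEYWORDS = {"for", "if", "else", "while", "return"}


def preprocess(source):
    """Step 1: turn each non-empty line into a normalized statement string."""
    statements = []
    for line in source.splitlines():
        tokens = re.findall(r"[A-Za-z_]\w*|\d+|[^\sA-Za-z_\d]", line)
        if not tokens:
            continue
        normalized = []
        for tok in tokens:
            if tok in KEYWORDS:
                normalized.append(tok)
            elif re.fullmatch(r"[A-Za-z_]\w*", tok):
                normalized.append("ID")    # identifiers are abstracted away
            elif tok.isdigit():
                normalized.append("NUM")   # so are numeric literals
            else:
                normalized.append(tok)
        statements.append(" ".join(normalized))
    return statements


def coarse_match(units, min_shared=2):
    """Step 2: cheap filter; keep pairs sharing enough identical statements."""
    index = defaultdict(set)               # statement -> names of units using it
    for name, stmts in units.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(units, candidates):
    """Step 3: more expensive sequence comparison on the candidates only."""
    return {
        (a, b): SequenceMatcher(None, units[a], units[b]).ratio()
        for a, b in candidates
    }


def prune(scores, threshold=0.6):
    """Step 4: drop candidate pairs whose similarity is too low."""
    return {pair: score for pair, score in scores.items() if score >= threshold}


def detect_clones(sources):
    """Run the four steps over a mapping of unit name -> source text."""
    units = {name: preprocess(text) for name, text in sources.items()}
    return prune(fine_match(units, coarse_match(units)))
```

Because `preprocess` works on one unit at a time and `coarse_match` only needs the per-unit statement sets, the first steps could be run per source partition and merged afterwards, which is one way to read the "easily parallelizable" bullet.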

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • # of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i ++)
    a ++;
    b ++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;

for (i = 0; i < n; i ++)
    c = foo(a, b);
    a ++;
    b ++;
    d = bar(a, b, c);
    e = a + d;
    e ++;
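To make the tunable dimensions on this slide concrete, here is a small sketch that classifies the statements of two snippets as unchanged, reordered, modified, inserted, or deleted, and then applies user-set limits. It is an illustration under assumptions, not XIAO's metric: the function names, the token-level `difflib` similarity, and the parameters `min_stmt_sim`, `max_changed`, and `max_reordered` are all hypothetical.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def diff_profile(a, b, min_stmt_sim=0.7):
    """Count how the statements of snippets a and b relate to each other."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    used_a, used_b = set(), set()
    for block in matcher.get_matching_blocks():
        used_a.update(range(block.a, block.a + block.size))
        used_b.update(range(block.b, block.b + block.size))
    in_order = len(used_a)

    rest_a = [i for i in range(len(a)) if i not in used_a]
    rest_b = [j for j in range(len(b)) if j not in used_b]

    # Identical statements that appear at a different position.
    reordered = 0
    for i in list(rest_a):
        for j in rest_b:
            if a[i] == b[j]:
                rest_a.remove(i)
                rest_b.remove(j)
                reordered += 1
                break

    # Remaining statements that are similar enough count as modified.
    modified = 0
    for i in list(rest_a):
        best = max(rest_b, key=lambda j: statement_similarity(a[i], b[j]),
                   default=None)
        if best is not None and statement_similarity(a[i], b[best]) >= min_stmt_sim:
            rest_a.remove(i)
            rest_b.remove(best)
            modified += 1

    return {"in_order": in_order, "reordered": reordered, "modified": modified,
            "deleted": len(rest_a), "inserted": len(rest_b)}


def is_clone(a, b, min_stmt_sim=0.7, max_changed=2, max_reordered=1):
    """Apply the (hypothetical) tuning knobs to decide whether a and b are clones."""
    p = diff_profile(a, b, min_stmt_sim)
    changed = p["modified"] + p["deleted"] + p["inserted"]
    return changed <= max_changed and p["reordered"] <= max_reordered


# The two snippets from the slide, one statement per list element.
first = ["for (i = 0; i < n; i ++)", "a ++", "b ++",
         "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
second = ["for (i = 0; i < n; i ++)", "c = foo(a, b)", "a ++", "b ++",
          "d = bar(a, b, c)", "e = a + d", "e ++"]
```

For the slide's pair, `diff_profile(first, second)` reports one reordered statement (`c = foo(a, b)`), one modified statement (`e = a + c` versus `e = a + d`), and one inserted statement (`e ++`), so tightening `max_changed` from 2 to 1 flips the verdict from clone to non-clone.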

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication: getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication: forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project: XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Figure labels:
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
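The search scenario differs from batch detection in that the codebase is indexed once and each query is a short code snippet. The sketch below shows one simple way such a service could answer snippet queries from a prebuilt index. It is an assumption-laden illustration, not the deployed service: the class `CloneIndex`, its methods, and the choice of exact two-statement shingles are all hypothetical, and a real service would need tolerant matching rather than exact shingles.

```python
from collections import Counter, defaultdict


def normalize(line):
    """Collapse whitespace so formatting differences do not affect matching."""
    return " ".join(line.split())


class CloneIndex:
    """Toy inverted index from statement shingles to the files containing them."""

    def __init__(self, window=2):
        self.window = window
        self.postings = defaultdict(set)   # shingle -> {(path, first line number)}

    def _shingles(self, lines):
        stmts = [(n, normalize(text))
                 for n, text in enumerate(lines, start=1) if text.strip()]
        for k in range(len(stmts) - self.window + 1):
            chunk = stmts[k:k + self.window]
            yield chunk[0][0], tuple(text for _, text in chunk)

    def add_file(self, path, source):
        """Index one file; done once, ahead of any query."""
        for line_no, shingle in self._shingles(source.splitlines()):
            self.postings[shingle].add((path, line_no))

    def search(self, snippet, min_hits=1):
        """Return (path, hit count) pairs, best match first."""
        hits = Counter()
        for _, shingle in self._shingles(snippet.splitlines()):
            for path, _ in self.postings.get(shingle, ()):
                hits[path] += 1
        return [(path, n) for path, n in hits.most_common() if n >= min_hits]


index = CloneIndex()
index.add_file("a.c", "int parse(char *p, int n) {\n  if (n > MAX)\n    return -1;\n  copy(p, n);\n}")
index.add_file("b.c", "int main() {\n  return 0;\n}")
results = index.search("if (n > MAX)\n   return -1;\ncopy(p, n);")
```

Since a query only looks up the shingles of the snippet, response time depends on the snippet size and the number of matching locations rather than on the total indexed code, which is the property a near-real-time search over hundreds of millions of lines would need.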

Vulnerability investigation workflow

Figure (before): manual & ad hoc investigation
• Stages shown: Design/Implement/Test fix; Variants finding; Root cause investigation & source location; Issue reproducing
• Parties shown: MSRC, Team A, Team B, Team C
• Exchanged: code snippet, code clones

Vulnerability investigation workflow

Figure (after): automated investigation
• Stages shown: Design/Implement/Test fix; Variants finding; Root cause investigation & source location; Issue reproducing
• Clone search service: code snippet in, code clones out
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example: MS security bulletin MS12-034
• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story: felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 15: Pathways to Technology Transfer and Adoption: Achievements and Challenges

NSF Workshop on Formal Methods

bull Goal to identify the future directions in research in formal methods and its transition to industrial practice

bull The workshop brought together researchers and identified primary challenges in the field both foundational infrastructural and in transitioning ideas from research labs to developer tools

httpgotoucsdedu~rjhalaNSFWorkshop

Recently related fields (eg formal methods) have already looked into transitioning research to industrial practice Time for us to do too

December 2012

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Researcherrsquos View -SCM Impact Study Findings

bullResearchers tend to consider thathellipbull precedence

bull concepts

bull prototypes

bull are sufficient as impact and ignorehellipbull efficiency

bull usability

bull reliability

bulldismissing them as ldquoengineering common senserdquo

SourcecopyA WolfhttpwwwsigsoftorgimpactdocsImpactWolfBCS2008pdf

A Researchers Observation in HCI Research Community

bull ldquoThe reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasksrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

Does our research community

have similar issues

Evaluation of DesignPLldquoResearch in Programming Languagesrdquo

bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby

bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo

bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo

Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes

Why Do Some Programming Languages Live and Others Die

bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem

bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like

documentation for their language

bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

httpwwwwiredcomwiredenterprise201206berkeley-programming-languages

Wiredcom

SourcecopyC Garling

Industrial Evaluations= Real Adoption

bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties

bull Non-target users (such as students)

bull Target users but not developers of the industrial code

bull Developers of the industrial code

bull Apply one-time (hitamprun) or continuous adoption

Need to value real adoption (eg in reviewing papers)

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava/Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
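
The contrast between a single metric and a cost-benefit view can be made concrete with a little arithmetic. The sketch below is only an illustration: the `Report` type, the `severity` units, and the flat `inspection_cost` are assumptions, not measurements from any of the tools cited on this slide. It weighs the benefit of confirmed bugs against the human cost of inspecting every report, false positives included.

```python
from dataclasses import dataclass


@dataclass
class Report:
    is_real_bug: bool
    severity: float = 0.0   # assumed benefit of fixing the bug, in arbitrary units


def precision(reports):
    """Fraction of reports that are real bugs (the single-metric view)."""
    return sum(r.is_real_bug for r in reports) / len(reports) if reports else 0.0


def net_benefit(reports, inspection_cost=1.0):
    """Benefit of confirmed bugs minus the human cost of inspecting every report."""
    benefit = sum(r.severity for r in reports if r.is_real_bug)
    cost = inspection_cost * len(reports)
    return benefit - cost


# Hypothetical tool A: precise, but it only finds low-severity issues.
tool_a = [Report(True, 1.0)] * 8 + [Report(False)] * 2
# Hypothetical tool B: noisy, but the few real bugs it finds are severe.
tool_b = [Report(True, 20.0)] * 2 + [Report(False)] * 8
```

Under these made-up numbers, tool A has the better precision but tool B has the better net benefit, which is the kind of trade-off a precision/recall table alone does not show.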

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research topics (vertical):
• Software Development Process
• Software Systems
• Software Users

Technology pillars (horizontal):
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

MSR 2012 39

• Software Analytics is naturally tied with software development practice
• Getting real: real data, real problems, real users, real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  ¤ High tunability ¤ High scalability
  ¤ High compatibility ¤ High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
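
The four steps named on this slide can be pictured as a small pipeline. What follows is a minimal sketch under stated assumptions, not XIAO's implementation: the function names (`preprocess`, `coarse_match`, `fine_match`, `prune`), the exact-statement index, the Jaccard measure, and every threshold are invented for illustration. The intent is only to show why a cheap coarse stage before an expensive fine stage keeps the work manageable.

```python
from collections import defaultdict
from itertools import combinations


def preprocess(functions):
    """Step 1 (assumed): normalize each function body into a list of statements."""
    return {
        name: [" ".join(line.split()) for line in body.splitlines() if line.strip()]
        for name, body in functions.items()
    }


def coarse_match(statements, min_shared=2):
    """Step 2 (assumed): pair up functions that share enough identical statements."""
    index = defaultdict(set)      # statement text -> names of functions containing it
    for name, stmts in statements.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)     # (name_a, name_b) -> number of shared statements
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(statements, candidates, threshold=0.6):
    """Step 3 (assumed): keep candidate pairs with high Jaccard statement overlap."""
    clones = []
    for a, b in candidates:
        sa, sb = set(statements[a]), set(statements[b])
        similarity = len(sa & sb) / len(sa | sb)
        if similarity >= threshold:
            clones.append((a, b, similarity))
    return clones


def prune(statements, clones, min_statements=3):
    """Step 4 (assumed): drop clone pairs that are too short to be interesting."""
    return [
        (a, b, sim) for a, b, sim in clones
        if min(len(statements[a]), len(statements[b])) >= min_statements
    ]


def detect_clones(functions):
    """Run the four stages in order on a {function name: source text} mapping."""
    statements = preprocess(functions)
    return prune(statements, fine_match(statements, coarse_match(statements)))
```

Because `preprocess` treats every function independently, a partition of the source tree could be handed to separate workers and the per-partition results merged before coarse matching; that distribution step is left out of the sketch.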

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i ++) {
        a ++;
        b ++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i ++) {
        c = foo(a, b);
        a ++;
        b ++;
        d = bar(a, b, c);
        e = a + d;
        e ++;
    }
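
One way to picture the "balance between code structure and disordered statements" is a metric with a single knob that blends an order-sensitive score with an order-insensitive one. The sketch below is an assumption-laden illustration, not the metric defined in the XIAO paper: `similarity`, `order_weight`, and the use of exact statement equality are all hypothetical, and per-statement similarity is not modeled. `LEFT` and `RIGHT` hold the loop bodies of the two snippets on this slide.

```python
from collections import Counter


def lcs_length(a, b):
    """Length of the longest common subsequence of two statement lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(a, b, order_weight=0.5):
    """Blend order-sensitive and order-insensitive statement overlap.

    order_weight=1.0 counts only statements matched in the same order;
    order_weight=0.0 ignores statement order entirely.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    ordered = 2 * lcs_length(a, b) / total
    unordered = 2 * sum((Counter(a) & Counter(b)).values()) / total
    return order_weight * ordered + (1 - order_weight) * unordered


def is_clone(a, b, threshold, order_weight=0.5):
    """Report a clone when the blended similarity reaches the tuned threshold."""
    return similarity(a, b, order_weight) >= threshold


LEFT = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
RIGHT = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]
```

For these two bodies, ignoring order gives 8/11 (about 0.73) while full order sensitivity gives 6/11 (about 0.55), so with a threshold of 0.7 the pair is reported as a clone only when the knob is turned toward tolerating reordered statements.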

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
• Experiencing their pain
• Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

MSRC: Microsoft Security Response Center
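
The search scenario above differs from batch detection in that the codebases are indexed ahead of time and a query snippet must be answered quickly. The sketch below illustrates that idea with a toy inverted index; it is an assumption for illustration, not the deployed service. The class name `CloneSearchIndex`, its methods, and the ranking by share of matched statements are all hypothetical.

```python
from collections import Counter, defaultdict


class CloneSearchIndex:
    """Toy inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _normalize(snippet):
        # Collapse whitespace so formatting differences do not affect matching.
        return {" ".join(line.split()) for line in snippet.splitlines() if line.strip()}

    def add(self, location, source):
        """Index one function; `location` is any hashable id, e.g. (codebase, name)."""
        for stmt in self._normalize(source):
            self._postings[stmt].add(location)

    def search(self, snippet, min_ratio=0.5):
        """Return (location, ratio) pairs, best match first.

        ratio is the share of the query's distinct statements found at that location.
        """
        query = self._normalize(snippet)
        if not query:
            return []
        hits = Counter()
        for stmt in query:
            for location in self._postings.get(stmt, ()):
                hits[location] += 1
        ranked = [(loc, count / len(query)) for loc, count in hits.items()]
        ranked = [item for item in ranked if item[1] >= min_ratio]
        return sorted(ranked, key=lambda item: (-item[1], str(item[0])))
```

A query only touches the posting lists of its own statements, so response time depends on the snippet rather than on the total indexed code size, which is the property a near-real-time search over very large codebases needs.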

Vulnerability investigation workflow

[Figure: workflow stages (issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix). Labels: MSRC; Team A, Team B, Team C; code snippet; code clones; manual & ad hoc investigation.]

Vulnerability investigation workflow

[Figure: the same workflow stages (issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix). Labels: clone search service; code snippet; code clones; automated investigation.]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

Automated laborious manual efforts

Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a ‘Cloned Code Detection’ system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Researcher’s View - SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact, and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

A Researcher’s Observation in HCI Research Community

• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

A Researcher’s Observation in HCI Research Community

• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”

• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html. Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby

• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”

• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages. Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.

• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages. Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?

Need to value real adoption (e.g., in reviewing papers)

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development
  • Early adoption
  • Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  • Pre-processing
  • Coarse matching
  • Fine matching
  • Pruning
• Easily parallelizable based on source code partition
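The four steps named on this slide can be sketched as a small pipeline. This is only an illustrative sketch, not XIAO's implementation: the function names (`preprocess`, `coarse_match`, `fine_match`, `prune`), the whitespace-only normalization, the shared-statement filter, and every threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1: normalize each statement by dropping blank lines and all whitespace."""
    return ["".join(line.split()) for line in source.splitlines() if line.strip()]


def coarse_match(functions, min_shared=2):
    """Step 2: cheap filter; keep pairs that share enough identical statements."""
    candidates = []
    for (name_a, a), (name_b, b) in combinations(sorted(functions.items()), 2):
        if len(set(a) & set(b)) >= min_shared:
            candidates.append((name_a, name_b))
    return candidates


def fine_match(functions, candidates, threshold=0.6):
    """Step 3: more expensive sequence comparison, run only on coarse candidates."""
    results = []
    for name_a, name_b in candidates:
        ratio = SequenceMatcher(None, functions[name_a], functions[name_b]).ratio()
        if ratio >= threshold:
            results.append((name_a, name_b, ratio))
    return results


def prune(results, functions, min_statements=3):
    """Step 4: drop clone pairs that are too small to be worth reporting."""
    return [r for r in results
            if min(len(functions[r[0]]), len(functions[r[1]])) >= min_statements]


def detect_clones(sources):
    """Run the four steps over a mapping of function name -> source text."""
    functions = {name: preprocess(text) for name, text in sources.items()}
    return prune(fine_match(functions, coarse_match(functions)), functions)
```

Because `preprocess` works on one function at a time, the first step could be mapped over partitions of the source tree independently, which is one plausible reading of "parallelizable based on source code partition".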

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
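To make the tuning knobs concrete, here is a minimal sketch of how a statement-similarity threshold and a cap on the share of inserted/deleted/modified statements could decide whether two snippets count as clones. It is not XIAO's actual metric: `tokenize`, `statement_similarity`, `match_statements`, `is_clone`, the use of `difflib`, and the default thresholds are all assumptions for illustration. The sketch ignores statement order entirely, so the third knob (balance between code structure and disordered statements) is left out.

```python
import re
from difflib import SequenceMatcher


def tokenize(statement):
    """Split a statement into identifiers, numbers, and single punctuation characters."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)


def statement_similarity(s1, s2):
    """Token-level similarity in [0, 1]; 1.0 means identical token sequences."""
    return SequenceMatcher(None, tokenize(s1), tokenize(s2)).ratio()


def match_statements(snippet_a, snippet_b, min_stmt_sim):
    """Greedily pair each statement of A with its best unused match in B, ignoring order."""
    unused = list(range(len(snippet_b)))
    pairs = []
    for i, stmt in enumerate(snippet_a):
        best = max(unused,
                   key=lambda j: statement_similarity(stmt, snippet_b[j]),
                   default=None)
        if best is not None and statement_similarity(stmt, snippet_b[best]) >= min_stmt_sim:
            pairs.append((i, best))
            unused.remove(best)
    return pairs


def is_clone(snippet_a, snippet_b, min_stmt_sim=0.8, max_diff_ratio=0.3):
    """Clone if the share of unmatched (inserted/deleted/modified) statements is small enough."""
    pairs = match_statements(snippet_a, snippet_b, min_stmt_sim)
    total = len(snippet_a) + len(snippet_b)
    unmatched = total - 2 * len(pairs)
    return unmatched / total <= max_diff_ratio
```

With the default (assumed) thresholds, the two loop bodies shown on the slide are accepted as a clone pair; tightening both knobs, for example `min_stmt_sim=0.9` and `max_diff_ratio=0.2`, rejects them.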

Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
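The search scenario differs from batch detection in that a code snippet is the query and the codebases are indexed ahead of time so that answers come back quickly. A minimal sketch of that idea follows. It is an assumption-laden illustration, not the MSRC service: the `CloneIndex` class, the exact-match-after-whitespace-removal normalization, the `min_coverage` parameter, and the location strings are all hypothetical.

```python
from collections import defaultdict


def normalize(statement):
    """Strip all whitespace so formatting differences do not affect lookups."""
    return "".join(statement.split())


class CloneIndex:
    """In-memory inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_function(self, location, statements):
        """Index one function; `location` is any hashable label chosen by the caller."""
        for stmt in statements:
            self._postings[normalize(stmt)].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best first, for functions that
        contain at least `min_coverage` of the snippet's distinct statements."""
        wanted = {normalize(s) for s in snippet if s.strip()}
        if not wanted:
            return []
        hits = defaultdict(int)
        for stmt in wanted:
            for location in self._postings.get(stmt, ()):
                hits[location] += 1
        scored = [(loc, count / len(wanted)) for loc, count in hits.items()]
        return sorted((s for s in scored if s[1] >= min_coverage),
                      key=lambda s: (-s[1], s[0]))
```

Indexing is done once per codebase, while each query only touches the postings of the statements in the snippet, which is what makes a near-real-time response plausible at large scale. A production service would also need approximate statement matching, which this sketch omits.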

Vulnerability investigation workflow

MSRC workflow
1. Issue reproducing
2. Root cause investigation & source location
3. Variants finding
4. Design/implement/test fix

Manual & ad hoc investigation
• Code snippet from MSRC to Team A, Team B, Team C
• Code clones back to MSRC

Vulnerability investigation workflow

MSRC workflow
1. Issue reproducing
2. Root cause investigation & source location
3. Variants finding
4. Design/implement/test fix

Automated investigation
• Code snippet to the clone search service
• Code clones back from the clone search service
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 17: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Researcher’s View - SCM Impact Study Findings

• Researchers tend to consider that…
  • precedence
  • concepts
  • prototypes
• are sufficient as impact and ignore…
  • efficiency
  • usability
  • reliability
• dismissing them as “engineering common sense”

Source: © A. Wolf, http://www.sigsoft.org/impact/docs/ImpactWolfBCS2008.pdf

A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight controlled studies on these new techniques with small artificial tasks.”
• “This attitude is a joke and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling

Industrial Evaluations ≠ Real Adoption

• Papers on industrial studies/evaluations on applying tools on industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Apply one-time (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava and Thiagarajan, ISSTA’02], Klee [Cadar et al., OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or to relax them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al., ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN’09]
  • Coverity [Bessey et al., CACM’10] vs. MSRA XIAO [Dang et al., ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA’06] vs. Daikon [Ernst et al., ICSE’99]
  • Debugging user study [Parnin & Orso, ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso, ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al., NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

• Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Vulnerability investigation workflow

ICSE 2013 SEIP 60

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

Team A

MSRC

Manual amp ad hoc investigation

Code snippet

Team B

Team C

Code clones

Vulnerability investigation workflow

ICSE 2013 SEIP 61

Clone search service

Completeness is the key Web service API for

automation

Code snippet

Code clones

Automated Investigation

Code snippet

Code clones

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in a security context
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from the general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


A Researcher’s Observation in HCI Research Community

• “The reviewers simply do not value the difficulty of building real systems and how hard controlled studies are to run on real systems for real tasks. This is in contrast with how easy it is to build new interaction techniques and then to run tight, controlled studies on these new techniques with small, artificial tasks.”
• “This attitude is a joke, and it offers researchers no incentive to do systems work. Why should they? Why should we put 3-4 person years into every CHI publication? Instead we can do 8 weeks of work on an idea piece, or create a new interaction technique and test it tightly in 8-12 weeks, and get a full CHI paper.”
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

Source: “I give up on CHI/UIST” by James Landay, http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Source: Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Source: Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations that apply tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 PhD theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Source: Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature after n years of effort on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
    • Academia tends to value sophistication > simplicity
  • Ex.: Echelon (MS) [Srivastava and Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
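The cost and benefit dimensions on this slide can be read as a back-of-the-envelope calculation: every warning costs inspection time, and only true positives pay off, in proportion to bug severity. The sketch below is an assumption-laden illustration, not a model from the talk; the function `net_benefit`, its parameters, and every number are made up.

```python
def net_benefit(warnings, precision, minutes_per_inspection,
                hourly_cost, value_per_true_bug):
    """Expected value of triaging a tool's warnings, in currency units.

    Every warning costs inspection time; only true positives pay off.
    """
    true_bugs = warnings * precision
    inspection_cost = warnings * minutes_per_inspection / 60 * hourly_cost
    return true_bugs * value_per_true_bug - inspection_cost

# Same tool output in two deployment contexts (all figures are hypothetical).
low_severity = net_benefit(200, 0.10, 15, 80, value_per_true_bug=150)
security = net_benefit(200, 0.10, 15, 80, value_per_true_bug=5000)
```

With identical precision and inspection cost, the low-severity context comes out negative while the security context comes out strongly positive, which is one way to see why a killer scenario matters more than an extra point of precision.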

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

[Figure: research topics (Software Development Process, Software Systems, Software Users) and technology pillars (Information Visualization, Analysis Algorithms, Large-scale Computing); axis labels: Vertical, Horizontal]

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real: Real Data, Real Problems, Real Users, Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: Prototype development → Early adoption → Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
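The four steps named on this slide can be sketched as a small pipeline. The slide gives only the step names, so what each step does here is an assumption for illustration: statements are split on `;`, coarse matching keeps pairs that share at least one statement, fine matching scores statement sequences with Python's `difflib`, and pruning drops tiny functions. Only pre-processing is parallelized over partitions in this sketch, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

def preprocess(partition):
    """Step 1: normalize each (name, code) pair into a list of statements."""
    return [(name, [s.strip() for s in code.split(";") if s.strip()])
            for name, code in partition]

def coarse_match(functions):
    """Step 2: cheap filter; keep pairs that share at least one statement."""
    pairs = []
    for (n1, s1), (n2, s2) in combinations(functions, 2):
        if set(s1) & set(s2):
            pairs.append(((n1, s1), (n2, s2)))
    return pairs

def fine_match(pairs, threshold):
    """Step 3: score the order-aware statement similarity of each candidate pair."""
    scored = []
    for (n1, s1), (n2, s2) in pairs:
        ratio = SequenceMatcher(None, s1, s2).ratio()
        if ratio >= threshold:
            scored.append((n1, n2, ratio))
    return scored

def prune(scored, sizes, min_statements):
    """Step 4: drop clone pairs whose functions are too small to matter."""
    return [(a, b, r) for a, b, r in scored
            if sizes[a] >= min_statements and sizes[b] >= min_statements]

def detect_clones(partitions, threshold=0.6, min_statements=3):
    # Pre-processing is independent per partition, so it can run in parallel.
    with ThreadPoolExecutor() as pool:
        functions = [f for part in pool.map(preprocess, partitions) for f in part]
    sizes = {name: len(stmts) for name, stmts in functions}
    return prune(fine_match(coarse_match(functions), threshold), sizes, min_statements)

# Two made-up source partitions; f and g differ by one inserted statement.
partitions = [
    [("a.c:f", "x = 1; y = x + 2; z = foo(x, y); return z;")],
    [("b.c:g", "x = 1; y = x + 2; z = foo(x, y); log(z); return z;"),
     ("b.c:h", "return 0;")],
]
clones = detect_clones(partitions)
```

The point of the coarse step is to keep the expensive pairwise comparison away from pairs that cannot be clones; a real system would replace the all-pairs loop with hashing so that the coarse step itself scales.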

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Snippet 1:

    for (i = 0; i < n; i++)
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;

Snippet 2:

    for (i = 0; i < n; i++)
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
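To make the tuning idea concrete, the two loop bodies from the slide can be compared with a toy metric that exposes two knobs: how similar two statements must be to count as a match, and how much statement order matters. This is a sketch under assumptions and not XIAO's actual metric; `statement_similarity`, `snippet_similarity`, `stmt_threshold`, and `order_weight` are hypothetical names, and the statements are pre-tokenized by hand with spaces.

```python
from difflib import SequenceMatcher

def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()

def snippet_similarity(a, b, stmt_threshold=0.8, order_weight=0.5):
    """Blend an order-insensitive and an order-sensitive match ratio.

    stmt_threshold: how similar two statements must be to count as a match.
    order_weight:   0.0 ignores statement order, 1.0 requires the same order.
    """
    # Order-insensitive: greedily pair each statement of a with an unused one of b.
    unused = list(b)
    matched = 0
    for s in a:
        best = max(unused, key=lambda t: statement_similarity(s, t), default=None)
        if best is not None and statement_similarity(s, best) >= stmt_threshold:
            unused.remove(best)
            matched += 1
    unordered = 2 * matched / (len(a) + len(b))
    # Order-sensitive: in-order matching of identical statements.
    ordered = SequenceMatcher(None, a, b).ratio()
    return (1 - order_weight) * unordered + order_weight * ordered

# Loop bodies from the slide, one statement per entry.
left = ["a ++", "b ++", "c = foo ( a , b )", "d = bar ( a , b , c )", "e = a + c"]
right = ["c = foo ( a , b )", "a ++", "b ++", "d = bar ( a , b , c )",
         "e = a + d", "e ++"]

loose = snippet_similarity(left, right, stmt_threshold=0.8, order_weight=0.0)
strict = snippet_similarity(left, right, stmt_threshold=1.0, order_weight=1.0)
```

With the loose settings, the reordered `foo` call and the modified `e = a + d` statement still count as matches, so the pair scores higher than under the strict settings, where reordering and modification both reduce the score.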


Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both the algorithm and the system
  • Expanded network


Page 19: Pathways to Technology Transfer and Adoption: Achievements and Challenges

bull ldquoThis attitude is a joke and it offers researchers no incentive to do systems work Why should they Why should we put 3-4 person years into every CHI publication Instead we can do 8 weeks of work on an idea piece or create a new interaction technique and test it tightly in 8-12 weeks and get a full CHI paperrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

A Researchers Observation in HCI Research Community

bull ldquoWhen will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industryrdquo

bull ldquoWe are our own worst enemies I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statisticsrdquo

ldquoI give up on CHIUISTrdquo by James Landayhttpdubfutureblogspotcom200911i-give-up-on-chiuisthtml SourcecopyJ Landay

Does our research community

have similar issues

Evaluation of DesignPLldquoResearch in Programming Languagesrdquo

bull ldquoSince the 90s a considerable percentage of new languages that ended up being very popular were designed by lone programmers some of them kids with no research inclination some as a side hobby and without any grand goal other than either making some routine activities easier or for plain hacking funrdquo ndash PHP JavaScript Python Ruby

bull ldquoone striking commonality in all modern programming languages especially the popular ones is how little innovation there is in themrdquo

bull ldquoreverse the trend of placing software research under the auspices of science and engineering [alone]rdquo

Crista Lopes httptagidecomblog201203research-in-programming-languagesSourcecopyC Lopes

Why Do Some Programming Languages Live and Others Die

bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem

bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like

documentation for their language

bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

httpwwwwiredcomwiredenterprise201206berkeley-programming-languages

Wiredcom

SourcecopyC Garling

Industrial Evaluations= Real Adoption

bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties

bull Non-target users (such as students)

bull Target users but not developers of the industrial code

bull Developers of the industrial code

bull Apply one-time (hitamprun) or continuous adoption

Need to value real adoption (eg in reviewing papers)

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search: “Clone Detection”
Research typically focuses (and is evaluated) on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even after the field has matured through n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]
Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
MSRA XIAO
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa/

Mindset Changing is Needed for Our Community
• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness
• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability
• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon (MS) [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]
http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting/working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]
http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools
http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and its usability evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?
• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers
[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]
http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
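One way to read the cost and benefit dimensions on this slide is as an expected-value calculation per class of warning. The sketch below is only an illustration under stated assumptions: the Finding fields, the hour-based units, the example numbers, and the formula (precision × severity − inspection cost) are all invented for this example and are not taken from the slide or from any of the cited tools.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    # Every field here is hypothetical and exists only for illustration.
    checker: str
    precision: float      # chance that a reported warning is a real bug
    severity: float       # benefit of fixing one real bug, in engineer-hours saved
    inspect_hours: float  # human cost of inspecting one warning


def net_benefit(f: Finding) -> float:
    """Expected hours saved per warning, minus the hours spent inspecting it."""
    return f.precision * f.severity - f.inspect_hours


def worth_inspecting(findings):
    """Keep warning classes with positive net benefit, best first."""
    keep = [f for f in findings if net_benefit(f) > 0]
    return sorted(keep, key=net_benefit, reverse=True)


findings = [
    Finding("unused-variable", precision=0.95, severity=0.2, inspect_hours=0.1),
    Finding("buffer-overflow", precision=0.30, severity=40.0, inspect_hours=1.0),
    Finding("null-deref-heuristic", precision=0.02, severity=5.0, inspect_hours=0.5),
]
ranked = [f.checker for f in worth_inspecting(findings)]
print(ranked)  # ['buffer-overflow', 'unused-variable']
```

With these made-up numbers, the checker with the highest precision is not the one ranked first, and the noisiest checker costs more to inspect than it returns. That is the kind of trade-off a precision/recall table alone does not show.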

Industry-Academia Collaboration
• Academia (research recognition, e.g., papers) vs. industry (company revenue)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group
Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
Founded: May 2009
Group members: 12
http://research.microsoft.com/en-us/groups/sa/
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics
Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.
Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars
Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users
Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice
• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact
Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing
• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…
• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?
• Making real impact
• Building real technologies
• Solving real problems
Software engineering is naturally tied with software development practice

Technical readiness
• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project – XIAO
• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability
• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
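The slide names the four steps but does not say what each one does, so the following is a minimal sketch under assumptions rather than XIAO's actual implementation. Here pre-processing normalizes statements, coarse matching is a cheap set-overlap filter, fine matching uses Python's difflib for an order-aware score, and pruning drops very small functions. All function names, thresholds, and the choice to parallelize only the pre-processing step per partition are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(partition):
    """Step 1: turn each function body into a list of normalized statements."""
    return {
        name: [" ".join(line.split()) for line in body.splitlines() if line.strip()]
        for name, body in partition.items()
    }


def coarse_match(a, b, min_shared=0.5):
    """Step 2: cheap filter on the share of identical statements."""
    shared = len(set(a) & set(b))
    return shared / max(len(set(a)), len(set(b)), 1) >= min_shared


def fine_match(a, b):
    """Step 3: order-aware similarity of the two statement sequences."""
    return SequenceMatcher(None, a, b).ratio()


def prune(clones, functions, min_statements=3):
    """Step 4: drop clone pairs that involve very small functions."""
    return [c for c in clones
            if min(len(functions[c[0]]), len(functions[c[1]])) >= min_statements]


def find_clones(partitions, threshold=0.7):
    # Pre-processing looks at one source partition at a time,
    # so the partitions can be handled in parallel.
    functions = {}
    with ThreadPoolExecutor() as pool:
        for processed in pool.map(preprocess, partitions):
            functions.update(processed)
    clones = []
    for (name_a, a), (name_b, b) in combinations(sorted(functions.items()), 2):
        if coarse_match(a, b):
            score = fine_match(a, b)
            if score >= threshold:
                clones.append((name_a, name_b, score))
    return prune(clones, functions)


# Two hypothetical partitions (e.g., two source folders).
partitions = [
    {"f": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);"},
    {"g": "a++;\nb++;\nc = foo(a, b);\nd = bar(a, b, c);\ne++;", "h": "return 0;"},
]
clones = find_clones(partitions)
print(clones)
```

The point of the coarse step is that the expensive fine comparison runs only on pairs that already share statements; in this toy input only the pair (f, g) reaches it.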

What you tune is what you get
• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
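To make the tuning knobs on this slide concrete, here is a small sketch that scores the two loop bodies shown on the slide. It is not XIAO's metric, whose definition is not given here. The parameter names stmt_threshold and disorder_weight, the token-level difflib comparison, and the scoring formula are assumptions chosen to mirror the three bullets: statement similarity, tolerated inserted/deleted/modified statements, and the balance between code structure and disordered statements.

```python
import re
from difflib import SequenceMatcher


def tokenize(statement):
    """Split a statement into identifiers/numbers and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", statement)


def statement_similarity(s, t):
    return SequenceMatcher(None, tokenize(s), tokenize(t)).ratio()


def clone_similarity(a, b, stmt_threshold=0.8, disorder_weight=0.5):
    """Score two statement lists; both keyword parameters are hypothetical knobs."""
    n, m = len(a), len(b)
    similar = [[statement_similarity(s, t) >= stmt_threshold for t in b] for s in a]
    # lcs[i][j]: most statements of a[i:] and b[j:] that can be paired in order.
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])
            if similar[i][j]:
                lcs[i][j] = max(lcs[i][j], 1 + lcs[i + 1][j + 1])
    # Walk the table to recover which statements were paired in order.
    used_a, used_b = set(), set()
    i = j = 0
    while i < n and j < m:
        if similar[i][j] and lcs[i][j] == 1 + lcs[i + 1][j + 1]:
            used_a.add(i)
            used_b.add(j)
            i, j = i + 1, j + 1
        elif lcs[i + 1][j] == lcs[i][j]:
            i += 1
        else:
            j += 1
    # Leftover statements may still have a counterpart at a different position.
    disordered = 0
    for i in range(n):
        if i in used_a:
            continue
        for j in range(m):
            if j not in used_b and similar[i][j]:
                used_b.add(j)
                disordered += 1
                break
    return (len(used_a) + disorder_weight * disordered) / max(n, m, 1)


# Loop bodies from the slide (loop headers omitted).
first = ["a++;", "b++;", "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
second = ["c = foo(a, b);", "a++;", "b++;", "d = bar(a, b, c);", "e = a + d;", "e++;"]

default_score = clone_similarity(first, second)
order_blind = clone_similarity(first, second, disorder_weight=1.0)
strict = clone_similarity(first, second, stmt_threshold=0.9)
print(default_score, order_blind, strict)
```

Raising disorder_weight gives the moved `c = foo(a, b);` full credit, while raising stmt_threshold stops `e = a + d;` from counting as a modified version of `e = a + c;`. Each knob changes the score in a direction an engineer can predict, which is the sense of "what you tune is what you get" illustrated here.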


Explorability
How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging
How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration
• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models
• Pull
• Push
• Join

Communication – getting connected
• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership
• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO
• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering
Source: http://www.dagstuhl.de/12071
Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation
• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)
• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure
• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
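The search scenario above (a code snippet as input, many codebases, near-real-time response) is commonly served by building an index ahead of time so that a query touches only matching entries. The sketch below shows one such design, a token n-gram inverted index. It is an assumption made for illustration, not a description of how the XIAO search service is built, and the class name, parameters, and function ids (product_a, product_b, product_c) are all hypothetical.

```python
import re
from collections import defaultdict


def shingles(code, k=4):
    """All runs of k consecutive tokens in a piece of code."""
    toks = re.findall(r"\w+|[^\w\s]", code)
    return {tuple(toks[i:i + k]) for i in range(len(toks) - k + 1)}


class CloneIndex:
    """Hypothetical inverted index from token shingles to function ids."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)  # shingle -> ids of functions containing it

    def add(self, function_id, code):
        for gram in shingles(code, self.k):
            self.postings[gram].add(function_id)

    def search(self, snippet, min_overlap=0.6):
        """Return (function id, share of the snippet's shingles found), best first."""
        grams = shingles(snippet, self.k)
        hits = defaultdict(int)
        for gram in grams:
            for function_id in self.postings.get(gram, ()):
                hits[function_id] += 1
        scored = [(fid, count / len(grams)) for fid, count in hits.items()]
        return sorted((s for s in scored if s[1] >= min_overlap),
                      key=lambda s: (-s[1], s[0]))


index = CloneIndex()
index.add("product_a/parse_font",
          "len = read_u16(buf); memcpy(dst, buf + 2, len); return dst;")
index.add("product_b/load_font",
          "init(); len = read_u16(buf); memcpy(dst, buf + 2, len); cleanup();")
index.add("product_c/add", "x = 1; y = x + 2; return y;")

# Query with the vulnerable snippet; both copies should come back.
results = index.search("len = read_u16(buf); memcpy(dst, buf + 2, len);")
print(results)
```

Indexing is paid once per codebase, while each query costs roughly the number of shingles in the snippet plus the size of the matching posting lists. This sketch matches only verbatim token runs; tolerating renamed identifiers would need token normalization, which is left out here.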

Vulnerability investigation workflow
• Steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Variants finding: manual & ad hoc investigation, in which MSRC sends a code snippet to Team A, Team B, and Team C and gets code clones back

Vulnerability investigation workflow
• Steps: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Variants finding: automated investigation, in which a code snippet goes to the clone search service and code clones come back
• Completeness is the key
• Web service API for automation

More secure Microsoft products
• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and 7 privately reported vulnerabilities involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community
• Searching similar snippets to fix a bug once
• Finding refactoring opportunities

Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 20: Pathways to Technology Transfer and Adoption: Achievements and Challenges

A Researcher’s Observation in HCI Research Community
• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”
“I give up on CHI/UIST” by James Landay. http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html Source: © J. Landay
Does our research community have similar issues?

Evaluation of Design/PL
“Research in Programming Languages”
• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them!”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”
Crista Lopes. http://tagide.com/blog/2012/03/research-in-programming-languages/ Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?
• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it
http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages/
Wired.com. Source: © C. Garling

Industrial Evaluations != Real Adoption
• Papers on industrial studies/evaluations on applying tools on industrial code: who applies?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• Applied one-time (hit & run) or adopted continuously
Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
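A minimal sketch of how a four-step pipeline of this shape could look. This is not XIAO's implementation: the regex normalization, the shared-statement candidate index, the `difflib`-based score, the 0.7 threshold, and all function names are assumptions made for illustration.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1 (pre-processing): split source into normalized statements.

    Hypothetical normalization: every identifier (keywords included) becomes
    ID and every number becomes NUM, so renamed variables still match.
    """
    statements = []
    for line in source.splitlines():
        line = line.strip()
        if not line:
            continue
        line = re.sub(r"[A-Za-z_]\w*", "ID", line)
        line = re.sub(r"\d+", "NUM", line)
        statements.append(re.sub(r"\s+", "", line))
    return statements


def coarse_match(functions):
    """Step 2 (coarse matching): pair functions that share any statement."""
    index = defaultdict(set)
    for name, statements in functions.items():
        for statement in statements:
            index[statement].add(name)
    candidates = set()
    for names in index.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(a, b):
    """Step 3 (fine matching): statement-level similarity score in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def detect_clones(sources, threshold=0.7):
    """Run all four steps over a dict of {function name: source text}."""
    functions = {name: preprocess(src) for name, src in sources.items()}
    results = []
    for x, y in sorted(coarse_match(functions)):
        score = fine_match(functions[x], functions[y])
        # Step 4 (pruning): discard candidate pairs below the threshold.
        if score >= threshold:
            results.append((x, y, score))
    return results
```

Each candidate pair is scored independently, so the fine-matching loop could be split across partitions of the source tree, which is one way to read the parallelization point on the slide.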

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
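One way to picture the "balance between code structure and disordered statements" knob is a score that blends order-sensitive and order-insensitive matching. The sketch below is an illustration only, not XIAO's metric: the blend formula, the `order_weight` parameter, and the function names are assumptions, and statements are compared by exact text rather than by a statement-similarity threshold.

```python
from collections import Counter
from difflib import SequenceMatcher


def ordered_matches(a, b):
    """Count statements that match in the same relative order."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


def unordered_matches(a, b):
    """Count statements that match regardless of position."""
    return sum((Counter(a) & Counter(b)).values())


def similarity(a, b, order_weight=0.5):
    """Blend both counts; order_weight=1.0 respects structure strictly,
    order_weight=0.0 ignores statement order entirely."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    ordered = ordered_matches(a, b) / longest
    unordered = unordered_matches(a, b) / longest
    return order_weight * ordered + (1 - order_weight) * unordered


# Loop bodies of the two snippets shown on the slide.
snippet_a = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
snippet_b = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

for weight in (1.0, 0.5, 0.0):
    print(weight, round(similarity(snippet_a, snippet_b, weight), 3))
```

Lowering `order_weight` raises the score for this pair, because the moved `c = foo(a, b)` statement counts as a match once order is ignored. A tunable threshold on the resulting score would then decide whether the pair is reported.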

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

Pull

Push

Join

Communication - getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication - forming partnership

• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project - XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
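The search scenario (a code snippet as input, many indexed codebases, a fast answer) can be pictured with a toy inverted index. This is a sketch under stated assumptions, not the deployed service: the class name, the whitespace-only normalization, the `min_overlap` parameter, and the example locations are all hypothetical.

```python
from collections import Counter, defaultdict


class CloneSearchIndex:
    """Toy inverted index from normalized statements to code locations."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _normalize(code):
        # Whitespace-insensitive statements; a real system would tokenize.
        return ["".join(line.split()) for line in code.splitlines() if line.strip()]

    def add(self, location, code):
        """Index every statement of a function under its location label."""
        for statement in self._normalize(code):
            self._postings[statement].add(location)

    def search(self, snippet, min_overlap=0.5):
        """Return (location, fraction of query statements found), best first."""
        query = set(self._normalize(snippet))
        if not query:
            return []
        hits = Counter()
        for statement in query:
            for location in self._postings.get(statement, ()):
                hits[location] += 1
        ranked = [
            (location, count / len(query))
            for location, count in hits.items()
            if count / len(query) >= min_overlap
        ]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneSearchIndex()
index.add("productA/font.c:parse", "len = read(buf)\ncopy(dst, buf, len)\nreturn dst")
index.add("productB/glyph.c:load", "n = count()\ncopy(dst, buf, len)\nreturn dst")
index.add("productC/ui.c:draw", "paint()")

for location, overlap in index.search("copy(dst, buf, len)\nreturn dst"):
    print(location, overlap)
```

Indexing is done once per codebase, so a query only touches the postings for its own statements. That is what makes a near-real-time response plausible even when the index is large.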

Vulnerability investigation workflow

[Figure: workflow] Issue reproducing -> Root cause investigation & source location -> Variants finding -> Design/Implement/Test fix

Figure labels: MSRC; Team A, Team B, Team C; Code snippet; Code clones; Manual & ad hoc investigation

Vulnerability investigation workflow

[Figure: workflow] Issue reproducing -> Root cause investigation & source location -> Variants finding -> Design/Implement/Test fix

Figure labels: Clone search service; Code snippet; Code clones; Automated investigation

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example - MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

A Researcher’s Observation in HCI Research Community

• “When will this community wake up and understand that they are going to run out any work on creating new systems (rather than small pieces of systems) and cede that important endeavor to industry?”
• “We are our own worst enemies. I think we have been blinded by the perception that true scientific research is only found in controlled experiments and nice statistics.”

“I give up on CHI/UIST” by James Landay
http://dubfuture.blogspot.com/2009/11/i-give-up-on-chiuist.html
Source: © J. Landay

Does our research community have similar issues?

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun.” - PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages/
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem.
• Academics are so often determined to build a language that stands out from the crowd without thinking about what’s needed to actually make it useful.
  • Sometimes designers fail with the simplest of things, like documentation for their language.
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it.

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations that apply tools to industrial code: who applies the tools?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users but not developers of the industrial code
  • Developers of the industrial code
• One-time application (hit & run) or continuous adoption?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, ‘Haven’t we solved this problem yet?’ With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind, PASTE’01]

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: ‘… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.’”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001).
Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
• Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava and Thiagarajan, ISSTA’02], KLEE [Cadar et al., OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or tackles one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes as focused on by our recent work [Thummalapenta et al., ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al., DSN’09]
  • Coverity [Bessey et al., CACM’10] vs. MSRA XIAO [Dang et al., ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al., ISSTA’06] vs. Daikon [Ernst et al., ICSE’99]
  • Debugging user study [Parnin & Orso, ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso, ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Considers many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al., NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA’10]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

• Research topics (vertical): Software Development Process; Software Systems; Software Users
• Technology pillars (horizontal): Information Visualization; Analysis Algorithms; Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real: Real Data, Real Problems, Real Users, Real Tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC’12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE’12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Page 22: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Evaluation of Design/PL: “Research in Programming Languages”

• “Since the 90s, a considerable percentage of new languages that ended up being very popular were designed by lone programmers, some of them kids with no research inclination, some as a side hobby, and without any grand goal other than either making some routine activities easier or for plain hacking fun” – PHP, JavaScript, Python, Ruby
• “one striking commonality in all modern programming languages, especially the popular ones, is how little innovation there is in them”
• “reverse the trend of placing software research under the auspices of science and engineering [alone]”

Crista Lopes, http://tagide.com/blog/2012/03/research-in-programming-languages
Source: © C. Lopes

Why Do Some Programming Languages Live and Others Die?

• Part of the problem is that language designers don’t always have practical objectives. There’s a tendency in academics of trying to solve a problem when no one actually ever had that problem
• Academics are so often determined to build a language that stands out from the crowd, without thinking about what’s needed to actually make it useful
  • Sometimes designers fail with the simplest of things, like documentation for their language
  • Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

Wired.com, http://www.wired.com/wiredenterprise/2012/06/berkeley-programming-languages
Source: © C. Garling

Industrial Evaluations != Real Adoption

• Papers on industrial studies/evaluations of applying tools on industrial code: who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one time (hit & run) or adopted continuously?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Research typically focuses on and evaluates intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature with n years of effort on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], KLEE [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or to address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
• N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

[Figure: research topics crossed with technology pillars]
• Research Topics (vertical): Software Development Process, Software Systems, Software Users
• Technology Pillars (horizontal): Information Visualization, Analysis Algorithms, Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: Prototype development → Early adoption → Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
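
A hedged sketch of how a four-step pipeline of this shape could be organized, with each source partition processed independently. Only the step names and the partition-based parallelism come from the slide. What each step does here (whitespace stripping, a shared-statement filter, a sequence-ratio check, a minimum-size cut), every function name, and every threshold are assumptions made for illustration, not XIAO's implementation.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions: dict[str, str]) -> dict[str, list[str]]:
    """Step 1 (assumed): strip whitespace from each line, drop empty lines."""
    return {
        name: [re.sub(r"\s+", "", line) for line in body.splitlines() if line.strip()]
        for name, body in functions.items()
    }


def coarse_match(stmts, min_shared=2):
    """Step 2 (assumed): cheap filter keeping pairs that share enough statements."""
    return [
        (x, y)
        for x, y in combinations(sorted(stmts), 2)
        if len(set(stmts[x]) & set(stmts[y])) >= min_shared
    ]


def fine_match(stmts, pairs, threshold=0.6):
    """Step 3 (assumed): costlier sequence comparison on the surviving pairs."""
    scored = []
    for x, y in pairs:
        ratio = SequenceMatcher(None, stmts[x], stmts[y], autojunk=False).ratio()
        if ratio >= threshold:
            scored.append((x, y, ratio))
    return scored


def prune(scored, stmts, min_statements=3):
    """Step 4 (assumed): discard clones too small to be worth reviewing."""
    return [
        (x, y, r)
        for x, y, r in scored
        if min(len(stmts[x]), len(stmts[y])) >= min_statements
    ]


def analyze_partition(functions):
    """Run the four steps on one source partition."""
    stmts = preprocess(functions)
    return prune(fine_match(stmts, coarse_match(stmts)), stmts)


def analyze(partitions):
    """Process partitions concurrently and merge the per-partition results."""
    # Threads keep the sketch simple; CPU-bound work would favor processes.
    with ThreadPoolExecutor() as pool:
        results = pool.map(analyze_partition, partitions)
    return [clone for part in results for clone in part]


# Hypothetical input: two partitions, each mapping function name -> body.
partitions = [
    {
        "f": "a++\nb++\nc = foo(a, b)\nd = bar(a, b, c)",
        "g": "a ++\nb ++\nc = foo(a, b)\nd = baz(a)",
    },
    {"h": "x = 1", "k": "y = 2"},
]
clones = analyze(partitions)
```

As written, the sketch only reports clones inside a single partition; how cross-partition clones are handled is not stated on the slide and is left out here.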

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
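
The idea of a tunable similarity metric can be illustrated with a minimal Python sketch applied to the loop bodies of the two snippets on this slide. This is not XIAO's metric (the ACSAC 2012 paper defines that); the function names, the whitespace normalization, and the single `order_weight` knob are assumptions chosen to show how one parameter can trade off code structure (statements matched in order) against disordered statements (statements matched regardless of position).

```python
import re
from collections import Counter
from difflib import SequenceMatcher


def normalize(stmt: str) -> str:
    """Remove whitespace so that 'a ++' and 'a++' compare equal."""
    return re.sub(r"\s+", "", stmt)


def similarity(a: list[str], b: list[str], order_weight: float = 0.5) -> float:
    """Score two statement lists in [0, 1].

    order_weight = 1.0 counts only statements matched in the same order;
    order_weight = 0.0 ignores order (multiset overlap of statements).
    """
    a_norm = [normalize(s) for s in a]
    b_norm = [normalize(s) for s in b]
    longest = max(len(a_norm), len(b_norm))
    if longest == 0:
        return 1.0
    # In-order matches: total size of the matching blocks difflib finds.
    matcher = SequenceMatcher(None, a_norm, b_norm, autojunk=False)
    ordered = sum(block.size for block in matcher.get_matching_blocks())
    # Order-insensitive matches: multiset intersection of the statements.
    unordered = sum((Counter(a_norm) & Counter(b_norm)).values())
    return (order_weight * ordered + (1 - order_weight) * unordered) / longest


# Loop bodies of the two snippets shown on the slide.
first = ["a ++", "b ++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
second = ["c = foo(a, b)", "a ++", "b ++", "d = bar(a, b, c)", "e = a + d", "e ++"]

strict = similarity(first, second, order_weight=1.0)
loose = similarity(first, second, order_weight=0.0)
```

For these inputs the strict setting yields 0.5 (three statements match in order out of six), while the loose setting yields about 0.67 because the moved `c = foo(a, b)` statement is also counted. A threshold on the score then decides whether the pair is reported as a clone.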

Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models


Pull

Push

Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure


Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
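
The search scenario (a code snippet as input, matching locations as output, near-real-time response) is commonly served by an index built ahead of time. The sketch below shows one conventional way to do that, an inverted index over whitespace-normalized statements. It is an assumption for illustration, not a description of the service deployed at MSRC; the class name, the `min_hits` parameter, and the file paths are hypothetical.

```python
import re
from collections import defaultdict


def normalize(line: str) -> str:
    """Remove all whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", "", line)


class CloneIndex:
    """Inverted index: normalized statement -> set of (file, line number)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_file(self, path: str, source: str) -> None:
        """Index every non-empty line of one source file."""
        for lineno, line in enumerate(source.splitlines(), start=1):
            key = normalize(line)
            if key:
                self._postings[key].add((path, lineno))

    def search(self, snippet: str, min_hits: int = 2) -> dict[str, int]:
        """Map each file to the number of distinct snippet statements it contains,
        keeping only files with at least min_hits of them."""
        hits = defaultdict(int)
        keys = {normalize(line) for line in snippet.splitlines()} - {""}
        for key in keys:
            for path in {p for p, _ in self._postings.get(key, ())}:
                hits[path] += 1
        return {path: n for path, n in hits.items() if n >= min_hits}


# Hypothetical codebases; the same unchecked copy appears in two products.
index = CloneIndex()
index.add_file("productA/font.c", "if (len > max)\n    return;\ncopy(dst, src, len);")
index.add_file("productB/render.c", "if (len > max)\n  return;\ncopy(dst, src, len);\nfree(src);")
index.add_file("productC/io.c", "open(f);\nclose(f);")

matches = index.search("if (len > max)\n return;\ncopy(dst, src, len);")
```

Because indexing is done once and a query only touches the postings of its own statements, lookup cost depends on the snippet size rather than on the size of the indexed codebases, which is what makes a near-real-time response plausible at large scale. Exact statement matching is a simplification; a production service would need to tolerate renamed identifiers and edited statements.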

Vulnerability investigation workflow

[Figure: manual & ad hoc investigation. Labels: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix; MSRC; Team A; Team B; Team C; Code snippet; Code clones]

Vulnerability investigation workflow

[Figure: automated investigation with the clone search service. Labels: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix; Clone search service; Code snippet; Code clones]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

Automated laborious manual efforts

Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx


Page 23: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Why Do Some Programming Languages Live and Others Die

bull Part of the problem is that language designers donrsquot always have practical objectives Therersquos a tendency in academics of trying to solve a problem when no one actually ever had that problem

bull Academics are so often determined to build a language that stands out from the crowd without thinking about whatrsquos needed to actually make it useful bull Sometimes designers fail with the simplest of things like

documentation for their language

bull Sometimes designers keep adding new features to a language and effectively overload the engineers who are trying to use it

httpwwwwiredcomwiredenterprise201206berkeley-programming-languages

Wiredcom

SourcecopyC Garling

Industrial Evaluations= Real Adoption

bull Papers on industrial studiesevaluations on applying tools on industrial code who applybull Authors themselves instead of third parties

bull Non-target users (such as students)

bull Target users but not developers of the industrial code

bull Developers of the industrial code

bull Apply one-time (hitamprun) or continuous adoption

Need to value real adoption (eg in reviewing papers)

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

• Solving real problems
• Building real technologies
• Making real impact

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
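A four-step pipeline of this shape, run partition by partition, can be sketched as follows. Only the four step names and the idea of partitioning the source come from the slide. The tokenizer, the n-gram index used for coarse matching, the `difflib` ratio used for fine matching, the length-based pruning rule, and all names and thresholds are illustrative assumptions, not XIAO's actual algorithms.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

KEYWORDS = {"for", "if", "else", "while", "return", "int"}  # assumed, tiny on purpose

def preprocess(src):
    """Step 1 (assumed): tokenize; replace identifiers and numbers by placeholders."""
    norm = []
    for tok in re.findall(r"[A-Za-z_]\w*|\d+|\S", src):
        if tok in KEYWORDS:
            norm.append(tok)
        elif tok[0].isalpha() or tok[0] == "_":
            norm.append("ID")
        elif tok.isdigit():
            norm.append("NUM")
        else:
            norm.append(tok)
    return tuple(norm)

def preprocess_partition(partition):
    """Pre-process one source partition: {function name: source text}."""
    return {name: preprocess(src) for name, src in partition.items()}

def coarse_match(tokens_by_name, n=4):
    """Step 2 (assumed): candidate pairs share at least one token n-gram."""
    index = defaultdict(set)
    for name, toks in tokens_by_name.items():
        for i in range(len(toks) - n + 1):
            index[toks[i:i + n]].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

def fine_match(pairs, tokens_by_name, threshold=0.8):
    """Step 3 (assumed): keep candidates whose token sequences are similar enough."""
    kept = []
    for a, b in sorted(pairs):
        matcher = SequenceMatcher(None, tokens_by_name[a], tokens_by_name[b], autojunk=False)
        ratio = matcher.ratio()
        if ratio >= threshold:
            kept.append((a, b, ratio))
    return kept

def prune(matches, tokens_by_name, min_tokens=10):
    """Step 4 (assumed): drop pairs where either function is trivially short."""
    return [(a, b, r) for a, b, r in matches
            if min(len(tokens_by_name[a]), len(tokens_by_name[b])) >= min_tokens]

def find_clones(partitions, workers=4):
    """Pre-process partitions concurrently, then run the remaining steps."""
    tokens_by_name = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(preprocess_partition, partitions):
            tokens_by_name.update(part)
    candidates = coarse_match(tokens_by_name)
    return prune(fine_match(candidates, tokens_by_name), tokens_by_name)
```

The thread pool only illustrates the partition-wise structure; CPU-bound work at real scale would more plausibly use separate processes or machines, one per partition.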

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Snippet 1:

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

Snippet 2:

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
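To make the idea of tuning at statement granularity concrete, here is a minimal sketch that counts matched, inserted, deleted, and replaced statements between the two snippets and applies a threshold. It is not XIAO's metric, which the slides do not define. It assumes a snippet is a list of statement strings and uses Python's `difflib.SequenceMatcher`; `statement_diff`, `is_clone`, and `max_changed` are hypothetical names.

```python
from difflib import SequenceMatcher

def normalize(stmt):
    """Collapse whitespace so formatting differences do not count as changes."""
    return " ".join(stmt.split())

def statement_diff(left, right):
    """Count equal / inserted / deleted / replaced statements between two snippets."""
    a = [normalize(s) for s in left]
    b = [normalize(s) for s in right]
    counts = {"equal": 0, "insert": 0, "delete": 0, "replace": 0}
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            counts["equal"] += i2 - i1
        elif tag == "insert":
            counts["insert"] += j2 - j1
        elif tag == "delete":
            counts["delete"] += i2 - i1
        else:  # "replace": count the longer side of the replaced span
            counts["replace"] += max(i2 - i1, j2 - j1)
    return counts

def is_clone(left, right, max_changed=3):
    """Hypothetical tunable rule: accept if few statements were changed."""
    c = statement_diff(left, right)
    return c["insert"] + c["delete"] + c["replace"] <= max_changed

SNIPPET_1 = ["for (i = 0; i < n; i++)", "a++;", "b++;",
             "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
SNIPPET_2 = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
             "d = bar(a, b, c);", "e = a + d;", "e++;"]

if __name__ == "__main__":
    print(statement_diff(SNIPPET_1, SNIPPET_2))
    print(is_clone(SNIPPET_1, SNIPPET_2, max_changed=4))
```

Note that a plain sequence diff treats the moved statement `c = foo(a, b);` as one deletion plus one insertion. A metric that balances code structure against disordered statements, as the slide describes, would need to handle reordering explicitly.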

Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Timeline diagram labels:
• Initial vulnerability reported in product A
• Security bulletin released
• Similar vulnerability found in product B by attackers
• Potential 0day attack
• Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
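The search scenario differs from batch detection in that the codebases are indexed once and each query is a single snippet. A minimal sketch of that shape follows. The `CloneIndex` class, its methods, and the coverage threshold are hypothetical, and exact matching of whitespace-stripped lines is a deliberate simplification. The actual service finds near-duplicates over roughly 600 million LOC, which this sketch does not attempt.

```python
from collections import Counter, defaultdict

def normalize(line):
    """Drop all whitespace so formatting differences are ignored."""
    return "".join(line.split())

class CloneIndex:
    """Hypothetical in-memory index: normalized statement -> files containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add_file(self, codebase, path, source):
        """Index every non-empty line of one source file."""
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add((codebase, path))

    def search(self, snippet, min_coverage=0.6):
        """Return (location, coverage) for files holding enough snippet lines."""
        keys = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = Counter()
        for key in keys:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        results = [(loc, n / len(keys)) for loc, n in hits.items()
                   if n / len(keys) >= min_coverage]
        # Highest coverage first; ties broken by location for stable output.
        return sorted(results, key=lambda r: (-r[1], r[0]))

if __name__ == "__main__":
    index = CloneIndex()
    index.add_file("A", "font.c", "len = read();\nif (len > MAX) return;\ncopy(buf, src, len);\n")
    index.add_file("B", "parse.c", "len = read();\ncopy(buf, src, len);\nlog();\n")
    print(index.search("len = read();\ncopy(buf, src, len);"))
```

Building the index up front is what makes a near-real-time response plausible: a query touches only the posting lists of the snippet's own lines instead of rescanning every codebase.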

Vulnerability investigation workflow

Workflow steps (diagram labels): issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix

Before: manual & ad hoc investigation
• MSRC passes a code snippet to Team A, Team B, and Team C to look for code clones

After: automated investigation
• MSRC submits a code snippet to the clone search service, which returns code clones
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and 7 privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 24: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Industrial Evaluations = Real Adoption?

• Papers report industrial studies/evaluations that apply tools to industrial code. Who applies them?
  • Authors themselves instead of third parties
  • Non-target users (such as students)
  • Target users, but not developers of the industrial code
  • Developers of the industrial code
• Applied one time (hit & run) or continuously adopted?

Need to value real adoption (e.g., in reviewing papers)

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

MS Academic Search: “Pointer Analysis”

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic, one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already grown mature after n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

• Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.
• MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engines, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
    • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava and Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consulting/working with industry)
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • From isolated simple classes, to isolated complex data structures, to real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

• Mission: utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Page 25: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search: “Clone Detection”

Research typically focuses and evaluates on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field has already matured after n years of effort on intermediate steps.

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

MSRA XIAO: Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com

http://www.blackducksoftware.com

http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of the comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability
• Academia
  • Rarely asks “When the scale goes up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engines, code analysis, …)
  • Often demands a scalable solution
• Ideal: a sophisticated and scalable solution
  • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity
• Academia
  • Tends to make assumptions to simplify problems, or to tackle one problem at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, the focus of our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability
• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12] / PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability
• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis
• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry-Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenue)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011. http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research Topics
• Software Development Process
• Software Systems
• Software Users

Technology Pillars
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Vertical

Horizontal

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

The real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

• Solving real problems
• Building real technologies
• Making real impact

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Three-year journey from research to impact
  • Prototype development → Early adoption → Technology transfer
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  • Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
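The four steps and the partition-based parallelism can be sketched roughly as follows. This is only an illustrative toy, not XIAO's implementation: the function names, the statement-set Jaccard score used for fine matching, and both thresholds are assumptions made for the example, and the sketch only finds clones inside a single partition.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def preprocess(functions):
    """Step 1: normalize each function body into a tuple of statements."""
    return {
        name: tuple(" ".join(line.split()) for line in body.splitlines() if line.strip())
        for name, body in functions.items()
    }


def coarse_match(pre, min_shared=2):
    """Step 2: cheap filter keeping pairs that share enough identical statements."""
    return [
        (a, b)
        for a, b in combinations(sorted(pre), 2)
        if len(set(pre[a]) & set(pre[b])) >= min_shared
    ]


def fine_match(pre, pairs):
    """Step 3: score each candidate pair (Jaccard over statements, a stand-in metric)."""
    scored = []
    for a, b in pairs:
        sa, sb = set(pre[a]), set(pre[b])
        scored.append((a, b, len(sa & sb) / len(sa | sb)))
    return scored


def prune(scored, threshold=0.6):
    """Step 4: drop pairs whose score falls below the threshold."""
    return [(a, b, s) for a, b, s in scored if s >= threshold]


def analyze_partition(functions):
    """Run all four steps on one source partition (hypothetical unit of work)."""
    pre = preprocess(functions)
    return prune(fine_match(pre, coarse_match(pre)))


def analyze(partitions):
    """Process partitions independently, then concatenate their results in order."""
    with ThreadPoolExecutor() as pool:
        results = pool.map(analyze_partition, partitions)
    return [clone for part in results for clone in part]
```

Because each partition is analyzed without shared state, the same structure could in principle be moved to separate processes or machines; how XIAO handles clones that span partitions is not described on the slide.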

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Snippet 1:

for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;

Snippet 2:

for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
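To make the idea of a tunable similarity metric concrete, here is a minimal sketch applied to the two loop bodies from the slide. It is not XIAO's metric: the token-level scoring via `difflib.SequenceMatcher`, the greedy order-insensitive pairing, and the `stmt_threshold` parameter are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def snippet_similarity(a, b, stmt_threshold=0.8):
    """Share of statements that can be paired up, ignoring statement order.

    stmt_threshold is the tunable knob: raising it demands closer statement
    matches, lowering it tolerates more modified statements.
    """
    if not a and not b:
        return 1.0
    unmatched = list(b)
    matched = 0
    for stmt in a:
        best = max(
            unmatched,
            key=lambda other: statement_similarity(stmt, other),
            default=None,
        )
        if best is not None and statement_similarity(stmt, best) >= stmt_threshold:
            matched += 1
            unmatched.remove(best)
    return 2 * matched / (len(a) + len(b))


left = ["a ++", "b ++", "c = foo ( a , b )", "d = bar ( a , b , c )", "e = a + c"]
right = ["c = foo ( a , b )", "a ++", "b ++", "d = bar ( a , b , c )", "e = a + d", "e ++"]

loose = snippet_similarity(left, right, stmt_threshold=0.8)
strict = snippet_similarity(left, right, stmt_threshold=0.9)
```

With the looser threshold, the modified statement `e = a + c` still pairs with `e = a + d`, so `loose` comes out higher than `strict`; the reordered statements do not lower either score because pairing ignores order.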

Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management Towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for the value proposition
  • Security is a big stick

Potential 0-day vulnerability disclosure

• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0-day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
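The search scenario (a code snippet as input, answered from a pre-built index rather than a full detection run) can be illustrated with a small inverted index over token windows. This is a hypothetical sketch, not the service built for MSRC: the `CloneIndex` class, the 3-token shingles, and the `min_overlap` cutoff are assumptions for the example.

```python
from collections import defaultdict


def shingles(code, k=3):
    """Set of k-token windows over whitespace-separated tokens."""
    tokens = code.split()
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Toy inverted index: shingle -> ids of indexed functions containing it."""

    def __init__(self, k=3):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, func_id, code):
        """Index one function; done ahead of time so queries stay cheap."""
        for gram in shingles(code, self.k):
            self.postings[gram].add(func_id)

    def search(self, snippet, min_overlap=0.5):
        """Return (func_id, fraction of query shingles found), best first."""
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for gram in query:
            for func_id in self.postings.get(gram, ()):
                hits[func_id] += 1
        ranked = [
            (func_id, count / len(query))
            for func_id, count in hits.items()
            if count / len(query) >= min_overlap
        ]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneIndex()
index.add("A.parse", "if ( len > max ) return ; copy ( dst , src , len ) ;")
index.add("B.load", "if ( len > max ) return ; copy ( out , src , len ) ;")
index.add("C.other", "x = y + z ;")
results = index.search("if ( len > max ) return ;")
```

Query cost here depends on the size of the snippet and its posting lists rather than on the total indexed code, which is the property a near-real-time search service needs; a production system at the scale quoted on the slide would need considerably more than this.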

Vulnerability investigation workflow (manual)

• Stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• MSRC sends a code snippet to Team A, Team B, and Team C
• Code clones are found through manual & ad hoc investigation

Vulnerability investigation workflow (with clone search service)

• Stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix
• Clone search service: code snippet in, code clones out
• Automated investigation
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into the vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in the security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034: Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-of-band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from the general manager of VSU
  • Concrete scenarios identified
  • Easy sell at the VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of the VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting the developer community

• Searching for similar snippets to fix a bug once
• Finding refactoring opportunities

Summary

• Mindset change needed for the community
  • Get out of the comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer in Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 26: Pathways to Technology Transfer and Adoption: Achievements and Challenges

MS Academic Search ldquoPointer Analysisrdquo

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

23

ldquoDuring the past 21 years over 75 papers and 9 PhD theses have been published on pointer analysis Given the tones of work on this topic one may wonder ldquoHavent we solved this problem yet With input from many researchers in the field this paper describes issues related to pointer analysis and remaining open problemsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM Hind

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0-day vulnerability disclosure

• Initial vulnerability reported in product A
• Patch release of product B
• Potential 0-day attack
• Security bulletin released
• Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
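The search scenario (a code snippet as input, a prebuilt index over large codebases, a fast answer) can be sketched in Python as an inverted index over pairs of adjacent statements. This is a toy illustration under stated assumptions, not the deployed MSRC service: the `CloneIndex` class, the shingle size, the scoring rule, and the file names are all hypothetical.

```python
from collections import defaultdict


def shingles(code, k=2):
    """Set of k consecutive whitespace-normalized statements (lines)."""
    stmts = [" ".join(line.split()) for line in code.splitlines() if line.strip()]
    if not stmts:
        return set()
    if len(stmts) < k:
        return {tuple(stmts)}
    return {tuple(stmts[i:i + k]) for i in range(len(stmts) - k + 1)}


class CloneIndex:
    """Hypothetical inverted index: shingle -> locations containing it."""

    def __init__(self, k=2):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        # Indexing happens ahead of time, so queries only do lookups.
        for shingle in shingles(code, self.k):
            self.postings[shingle].add(location)

    def search(self, snippet, min_score=0.5):
        """Rank locations by the fraction of query shingles they contain."""
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for shingle in query:
            for location in self.postings.get(shingle, ()):
                hits[location] += 1
        ranked = [(loc, n / len(query)) for loc, n in hits.items()]
        ranked = [(loc, score) for loc, score in ranked if score >= min_score]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Made-up statements standing in for a vulnerable code fragment.
vulnerable = "len = read_len(buf)\ncopy(dst, buf, len)\nreturn dst"

index = CloneIndex()
index.add("product_a/parse.c", "init()\n" + vulnerable)
index.add("product_b/render.c",
          "len = read_len(buf)\ncopy(dst, buf, len)\nlog(len)\nreturn dst")
index.add("product_c/other.c", "open(path)\nclose(path)")

results = index.search(vulnerable)
for location, score in results:
    print(location, score)
```

Because the expensive work is done when code is indexed, a query touches only the postings for the snippet's own shingles, which is one way a near-real-time response over very large codebases becomes plausible.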

Vulnerability investigation workflow

[Diagram: MSRC workflow stages (Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix). For variants finding, code snippets go from MSRC to Team A, Team B, and Team C, and code clones come back through manual & ad hoc investigation.]

Vulnerability investigation workflow

[Diagram: the same workflow stages, with variants finding handled by the clone search service. A code snippet goes in and code clones come back, as automated investigation.]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• 3 publicly disclosed vulnerabilities and seven privately reported involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 27: Pathways to Technology Transfer and Adoption: Achievements and Challenges

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

“During the past 21 years, over 75 papers and 9 Ph.D. theses have been published on pointer analysis. Given the tomes of work on this topic one may wonder, “Haven’t we solved this problem yet?” With input from many researchers in the field, this paper describes issues related to pointer analysis and remaining open problems.”

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind

“Pointer Analysis: Haven’t We Solved This Problem Yet?” [Hind PASTE’01]

Section 4.3: Designing an Analysis for a Client’s Needs

“Barbara Ryder expands on this topic: “… We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract. However, this does not accomplish the key goal, which is to design and engineer pointer analyses that are useful for solving real software problems for realistic programs.””

Michael Hind. Pointer analysis: haven’t we solved this problem yet? In Proc. ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001). Source: © M. Hind & B. Ryder

MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field already grows mature with n years of efforts on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com
http://www.blackducksoftware.com
http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
    • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tend to value sophistication > simplicity
  • Ex.: Echelon/MS [Srivastava/Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice: not consult/work w/ industry
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → Isolated complex data structures → Real world classes, as focused by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]
http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: Device driver verification
  • MSR SAGE: Security testing of binaries [Godefroid et al. NDSS’08]
  • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
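A small worked example in Python of why human inspection cost and bug severity matter alongside the raw number of bugs found. All numbers, names, and the linear cost model are made-up assumptions for illustration; they are not measurements from any of the tools cited on this slide.

```python
def net_benefit(warnings, minutes_per_inspection, minutes_saved_per_severity_point):
    """Net minutes saved by acting on a tool's warnings.

    warnings: list of (is_real_bug, severity) pairs. Every warning costs
    inspection time, but only real bugs pay anything back.
    """
    cost = len(warnings) * minutes_per_inspection
    benefit = sum(
        severity * minutes_saved_per_severity_point
        for is_real_bug, severity in warnings
        if is_real_bug
    )
    return benefit - cost


# Hypothetical tool A: few warnings, mostly real bugs.
tool_a = [(True, 3)] * 8 + [(False, 0)] * 2
# Hypothetical tool B: finds more real bugs, buried in false positives.
tool_b = [(True, 3)] * 20 + [(False, 0)] * 80

net_a = net_benefit(tool_a, minutes_per_inspection=15, minutes_saved_per_severity_point=20)
net_b = net_benefit(tool_b, minutes_per_inspection=15, minutes_saved_per_severity_point=20)
print(net_a, net_b)
```

Under these assumed numbers, tool B finds more bugs than tool A yet comes out negative, because engineers pay the inspection cost for every false positive.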

Industry-Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

• Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services
• Founded: May 2009
• Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

• Research Topics (vertical)
  • Software Development Process
  • Software Systems
  • Software Users
• Technology Pillars (horizontal)
  • Information Visualization
  • Analysis Algorithms
  • Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development
  • Early adoption
  • Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
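The four steps can be sketched as a small Python pipeline. Only the step names come from the slide. What each step does here (whitespace normalization, a shared-statement index, a `difflib` comparison, a similarity cutoff) and every name and parameter are assumptions for illustration, not XIAO's implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1: normalize whitespace and drop blank lines."""
    return {
        name: [" ".join(line.split()) for line in body.splitlines() if line.strip()]
        for name, body in functions.items()
    }


def coarse_match(normalized, min_shared=2):
    """Step 2: cheap filter; keep pairs sharing enough identical statements."""
    index = defaultdict(set)
    for name, statements in normalized.items():
        for statement in set(statements):
            index[statement].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(normalized, candidates):
    """Step 3: costlier statement-sequence comparison on candidates only."""
    return {
        (x, y): SequenceMatcher(None, normalized[x], normalized[y]).ratio()
        for x, y in candidates
    }


def prune(scored, min_similarity=0.5):
    """Step 4: discard pairs below the similarity cutoff."""
    return {pair: score for pair, score in scored.items() if score >= min_similarity}


def detect_clones(functions):
    normalized = preprocess(functions)
    return prune(fine_match(normalized, coarse_match(normalized)))


# Hypothetical function bodies, keyed by made-up names.
functions = {
    "f1": "a++\nb++\nc = foo(a, b)\nd = bar(a, b, c)\ne = a + c",
    "f2": "c = foo(a, b)\na++\nb++\nd = bar(a, b, c)\ne = a + d\ne++",
    "f3": "print(x)\nreturn y",
}
clones = detect_clones(functions)
print(clones)
```

The coarse step keeps the expensive comparison away from most pairs. In a partitioned setup, each source partition could be pre-processed and indexed independently, which is one reading of the parallelization claim on the slide; the sketch does not implement that part.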

Page 28: Pathways to Technology Transfer and Adoption: Achievements and Challenges

ldquoPointer Analysis Havenrsquot We Solved This Problem Yetrdquo [Hind PASTErsquo01]

24

Section 43 Designing an Analysis for a Clientrsquos Needs

ldquoBarbara Ryder expands on this topic ldquohellip We can all write an unbounded number of papers that compare different pointer analysis approximations in the abstract However this does not accomplish the key goal which is

to design and engineer pointer analyses that are useful for solving real software problems for realistic programsrdquo

Michael Hind Pointer analysis havent we solved this problem yet In Proc ACM SIGPLAN-SIGSOFT Workshop on Program Analysis for Software Tools and Engineering (PASTE 2001) SourcecopyM HindampB Ryder

MS Academic Search ldquoClone Detectionrdquo

Typically focusevaluate on intermediate steps (eg clone detection) instead of ultimate tasks (eg bug detection or refactoring) even when the field already grows mature with n years of efforts on

intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics & technology pillars

Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

MSR 2012 39

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

ICSE 2013 SEIP 42

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
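The four steps named on this slide can be sketched as a small pipeline. This is only an illustration under stated assumptions: the slide does not say what each step does internally, so the token-set overlap filter, the `difflib`-based fine matching, the thresholds, and all function names below are hypothetical stand-ins rather than XIAO's algorithm. Only the pre-processing step is run per source partition here, to show where partition-based parallelism could plug in.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def preprocess(partition):
    """Step 1: tokenize every function in one source partition."""
    return {name: TOKEN.findall(body) for name, body in partition.items()}


def coarse_match(tokens, min_shared=0.5):
    """Step 2: cheap filter, keep pairs whose token sets overlap enough."""
    pairs = []
    for a, b in combinations(sorted(tokens), 2):
        sa, sb = set(tokens[a]), set(tokens[b])
        if sa and sb and len(sa & sb) / len(sa | sb) >= min_shared:
            pairs.append((a, b))
    return pairs


def fine_match(tokens, pairs, threshold=0.7):
    """Step 3: order-sensitive token similarity, on candidate pairs only."""
    clones = []
    for a, b in pairs:
        score = SequenceMatcher(None, tokens[a], tokens[b],
                                autojunk=False).ratio()
        if score >= threshold:
            clones.append((a, b, score))
    return clones


def prune(tokens, clones, min_tokens=8):
    """Step 4: drop clones that are too short to be interesting."""
    return [(a, b, s) for a, b, s in clones
            if min(len(tokens[a]), len(tokens[b])) >= min_tokens]


def detect_clones(partitions, workers=4):
    """Run the four steps; partitions is a list of {function name: source}."""
    tokens = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for part in pool.map(preprocess, partitions):
            tokens.update(part)
    return prune(tokens, fine_match(tokens, coarse_match(tokens)))


partitions = [
    {"f": "int f(int a, int b) { int c = a + b; return c * 2; }"},
    {"g": "int g(int x, int y) { int c = x + y; return c * 2; }",
     "h": "void h() { log(1); }"},
]
print(detect_clones(partitions))
```

The point of the coarse step in this sketch is that the expensive comparison runs only on pairs that survive a cheap filter, which is one common way to keep a clone detector usable on large codebases.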

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
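The two loops on this slide differ by one moved statement, one modified statement, and one inserted statement. The sketch below shows how three tunable knobs of the kind listed above could decide whether they count as clones. It is a hedged illustration, not XIAO's metric: the parameter names (`min_stmt_sim`, `max_changed`, `max_reordered`), the greedy pairing, and the use of `difflib` are all assumptions.

```python
import re
from difflib import SequenceMatcher

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\+\+|\S")


def stmt_sim(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, TOKEN.findall(s1), TOKEN.findall(s2),
                           autojunk=False).ratio()


def unordered_matches(a, b, same):
    """Greedily pair each statement of a with an unused similar one in b."""
    used, count = set(), 0
    for x in a:
        for j, y in enumerate(b):
            if j not in used and same(x, y):
                used.add(j)
                count += 1
                break
    return count


def ordered_matches(a, b, same):
    """Longest common subsequence length under the similarity predicate."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            best = max(dp[i][j + 1], dp[i + 1][j])
            if same(x, y):
                best = max(best, dp[i][j] + 1)
            dp[i + 1][j + 1] = best
    return dp[-1][-1]


def is_clone(a, b, min_stmt_sim=0.75, max_changed=0.2, max_reordered=1):
    """Apply the three hypothetical knobs to two statement lists."""
    def same(x, y):
        return stmt_sim(x, y) >= min_stmt_sim
    matched = unordered_matches(a, b, same)
    reordered = matched - ordered_matches(a, b, same)
    changed = 1 - matched / max(len(a), len(b))
    return changed <= max_changed and reordered <= max_reordered


left = ["for (i = 0; i < n; i++)", "a++", "b++",
        "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
right = ["for (i = 0; i < n; i++)", "c = foo(a, b)", "a++", "b++",
         "d = bar(a, b, c)", "e = a + d", "e++"]

print(is_clone(left, right))                    # tolerant settings
print(is_clone(left, right, min_stmt_sim=1.0))  # exact statements only
```

With the tolerant defaults the modified statement `e = a + d` still pairs with `e = a + c`, so only the inserted `e++` counts as a change; requiring exact statement matches pushes the changed fraction past the limit, so the same pair is no longer reported.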


Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure


Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and update
  • Promoting within MSRC
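The search scenario takes a code snippet as input and must answer quickly over a very large codebase, which suggests indexing ahead of time rather than comparing the snippet against everything at query time. The sketch below is one generic way to do that, an inverted index over token trigrams; it is an assumption made for illustration, not a description of the deployed service. The `CloneIndex` class, the file locations, and the overlap threshold are all hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, n=3):
    """Set of consecutive token n-grams in a piece of code."""
    toks = TOKEN.findall(code)
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


class CloneIndex:
    """Inverted index: token n-gram -> locations that contain it."""

    def __init__(self, n=3):
        self.n = n
        self.postings = defaultdict(set)

    def add(self, location, code):
        for sh in shingles(code, self.n):
            self.postings[sh].add(location)

    def search(self, snippet, min_overlap=0.6):
        """Locations sharing at least min_overlap of the snippet's n-grams."""
        query = shingles(snippet, self.n)
        if not query:
            return []
        hits = defaultdict(int)
        for sh in query:
            for loc in self.postings.get(sh, ()):
                hits[loc] += 1
        ranked = [(loc, count / len(query)) for loc, count in hits.items()
                  if count / len(query) >= min_overlap]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneIndex()
# Hypothetical locations and code, invented for this example.
index.add("productA/font.c:parse",
          "if (len > size) return; memcpy(dst, src, len);")
index.add("productB/glyph.c:load",
          "if (len > size) return; memcpy(out, src, len);")
index.add("productC/util.c:sum",
          "for (i = 0; i < n; i++) total += v[i];")

for loc, overlap in index.search(
        "if (len > size) return; memcpy(dst, src, len);"):
    print(loc, round(overlap, 2))
```

Query cost in this sketch depends on the snippet's size and on how many locations share its n-grams, not on the total size of the indexed code, which is what makes near-real-time response plausible at this scale.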

Vulnerability investigation workflow

MSRC: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix

Variants finding: manual & ad hoc investigation
• Code snippet → Team A, Team B, Team C → Code clones

Vulnerability investigation workflow (with clone search service)

Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix

Variants finding: automated investigation
• Code snippet → Clone search service → Code clones
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Involved 3 publicly disclosed and 7 privately reported vulnerabilities; one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


MS Academic Search: “Clone Detection”

Typically focus/evaluate on intermediate steps (e.g., clone detection) instead of ultimate tasks (e.g., bug detection or refactoring), even when the field already grows mature with n years of efforts on intermediate steps

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

Zhenmin Li, Shan Lu, Suvda Myagmar, and Yuanyuan Zhou. CP-Miner: a tool for finding copy-paste and related bugs in operating system code. In Proc. OSDI 2004.

MSRA XIAO

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.

http://patterninsight.com

http://www.blackducksoftware.com

http://research.microsoft.com/en-us/groups/sa

Mindset Changing is Needed for Our Community

• Need to get out of comfort zone
• Need to value (and pursue) “realness”
• Need to aim for ultimate tasks
• Need to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely ask “When scale is up, will my solution still work?”
  • Tend to focus on small or toy scale problems
• Real-world (e.g., search engine, code analysis, …)
  • Often demand a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
  • Ex.: Echelon (MS) [Srivastava & Thiagarajan, ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or address them one at a time (indeed, relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consulting/working with industry)
• Real-world
  • Often has high complexity, violating these assumptions
• Example: OO Unit Test Generation
  • Isolated simple classes → Isolated complex data structures → Real-world classes, as focused by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083

http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real-world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso, ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445


Page 30: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Some Success Stories of Applying Clone Detection [Focus on Ultimate Tasks]

26

Zhenmin Li Shan Lu Suvda Myagmar and Yuanyuan Zhou CP-Miner a tool for finding copy-paste and related bugs in operating system code In Proc OSDI 2004

MSRAXIAO

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu and Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice In Proc ACSAC 2012

httppatterninsightcom

httpwwwblackducksoftwarecom

httpresearchmicrosoftcomen-usgroupssa

Mindset Changing is Needed for Our Community

bullNeed to get out of comfort zone

bullNeed to value (and pursue) ldquorealnessrdquo

bullNeed to aim for ultimate tasks

bullNeed to value (and pursue) tech readiness

Example Dimensions of Tech Readiness

bull Scalability

bullComplexity

bullApplicability

bullUsability (human in the loop)

bullCost-Benefit Analysis

Scalability

• Academia
  • Rarely asks “When scale is up, will my solution still work?”
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engine, code analysis, …)
  • Often demands a scalable solution
  • Ideal: sophisticated and scalable solution
  • But in practice, a simple solution tends to be scalable (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
• Ex.: Echelon@MS [Srivastava & Thiagarajan ISSTA’02], Klee [Cadar et al. OSDI’08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice (not consulting/working with industry)
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE’09, OOPSLA’11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN’09]
  • Coverity [Bessey et al. CACM’10] vs. MSRA XIAO [Dang et al. ACSAC’12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’09]

http://research.microsoft.com/en-us/projects/slam/
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission

Utilize a data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded

May 2009

Group members

12

http://research.microsoft.com/en-us/groups/sa/

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  • Pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
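The four steps named on this slide can be pictured with a small Python sketch. It is an illustration under stated assumptions, not XIAO's implementation: the function names (`preprocess`, `coarse_match`, `fine_match`, `prune`, `detect_clones`), the statement-level normalization, and every threshold are hypothetical choices made here. Only pre-processing is run per partition, using a thread pool to show the shape of partition-based parallelism.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(partition):
    """Step 1 (assumed): turn each function body into whitespace-free statements."""
    functions = {}
    for name, source in partition.items():
        parts = re.split(r"[;{}\n]", source)
        functions[name] = [re.sub(r"\s+", "", p) for p in parts if p.strip()]
    return functions


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap filter keeping pairs that share identical statements."""
    index = defaultdict(set)
    for name, statements in functions.items():
        for statement in set(statements):
            index[statement].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


def fine_match(functions, candidates, threshold=0.5):
    """Step 3 (assumed): score each candidate pair by statement-sequence similarity."""
    scored = []
    for a, b in candidates:
        matcher = SequenceMatcher(None, functions[a], functions[b], autojunk=False)
        ratio = matcher.ratio()
        if ratio >= threshold:
            scored.append((a, b, ratio))
    return scored


def prune(functions, scored, min_statements=3):
    """Step 4 (assumed): drop clone pairs too short to be worth reviewing."""
    return [(a, b, r) for a, b, r in scored
            if min(len(functions[a]), len(functions[b])) >= min_statements]


def detect_clones(partitions):
    """Pre-process partitions concurrently, then match across all of them."""
    functions = {}
    with ThreadPoolExecutor() as pool:
        for result in pool.map(preprocess, partitions):
            functions.update(result)
    return prune(functions, fine_match(functions, coarse_match(functions)))


# Two hypothetical source partitions; "f" and "g" are near-duplicates.
partitions = [
    {"f": "a++; b++; c = foo(a, b); d = bar(a, b, c); e = a + c;"},
    {"g": "c = foo(a, b); a++; b++; d = bar(a, b, c); e = a + d; e++;",
     "h": "x = 1; y = 2; return x * y;"},
]
clones = detect_clones(partitions)
print(clones)
```

The coarse step exists so that the more expensive fine comparison only runs on pairs that already share something. In this sketch only the pair ("f", "g") survives all four steps.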

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
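To make the tuning knobs on this slide concrete, here is a minimal Python sketch of a tunable clone similarity. It is not XIAO's metric: the names `statement_threshold` and `allow_reorder`, the token regex, and the greedy matching are assumptions chosen for illustration. One parameter controls how similar two statements must be to count as the same, the other controls whether disordered statements are tolerated.

```python
import re
from difflib import SequenceMatcher


def tokens(statement):
    """Split a statement into identifiers, numbers, and single punctuation marks."""
    return re.findall(r"[A-Za-z_]\w*|\d+|\S", statement)


def statement_similarity(s, t):
    """Token-level similarity of two statements, between 0.0 and 1.0."""
    return SequenceMatcher(None, tokens(s), tokens(t), autojunk=False).ratio()


def matched_in_order(a, b, same):
    """Longest common subsequence length under the fuzzy equality `same`."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, s in enumerate(a, 1):
        for j, t in enumerate(b, 1):
            if same(s, t):
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def matched_any_order(a, b, same):
    """Greedy one-to-one matching that ignores statement order."""
    unused = list(b)
    matched = 0
    for s in a:
        best = max(unused, key=lambda t: statement_similarity(s, t), default=None)
        if best is not None and same(s, best):
            unused.remove(best)
            matched += 1
    return matched


def clone_similarity(a, b, statement_threshold=0.75, allow_reorder=True):
    """Share of statements in both snippets that have a counterpart in the other."""
    if not a and not b:
        return 1.0

    def same(s, t):
        return statement_similarity(s, t) >= statement_threshold

    count = (matched_any_order if allow_reorder else matched_in_order)(a, b, same)
    return 2 * count / (len(a) + len(b))


# Loop bodies of the two snippets shown on the slide.
left = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
right = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

print(clone_similarity(left, right))
print(clone_similarity(left, right, statement_threshold=0.9))
print(clone_similarity(left, right, allow_reorder=False))
```

With the defaults, the modified statement (`e = a + c` vs. `e = a + d`) and the moved statement (`c = foo(a, b)`) both still count as matches. Raising `statement_threshold` rejects the modified statement, and `allow_reorder=False` rejects the moved one, so each knob lowers the score for a different kind of difference.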


Explorability

How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, and Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. In Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

[Figure: timeline]
• Initial vulnerability reported in product A
• Security bulletin released
• Similar vulnerability found in product B by attackers
• Potential 0day attack
• Patch release of product B

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

MSRC: Microsoft Security Response Center
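The search scenario differs from detection in that the input is one code snippet and the answer is needed quickly, which suggests building an index ahead of time. The Python sketch below shows that idea with a plain inverted index. It is a simplified assumption made for illustration, not the deployed service: the class `CloneIndex`, its methods, the exact-line matching, and the function identifiers are all hypothetical, and a real system would also need to tolerate near-duplicate statements.

```python
import re
from collections import Counter, defaultdict


def normalize(line):
    """Strip all whitespace so formatting differences do not affect lookup."""
    return re.sub(r"\s+", "", line)


class CloneIndex:
    """Inverted index from normalized source lines to the functions containing them."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, function_id, source):
        """Index every non-empty line of one function."""
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add(function_id)

    def search(self, snippet, min_hit_ratio=0.5):
        """Return (function_id, share of snippet lines found), best match first."""
        keys = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = Counter()
        for key in keys:
            for function_id in self._postings.get(key, ()):
                hits[function_id] += 1
        ranked = [(f, n / len(keys)) for f, n in hits.items()
                  if n / len(keys) >= min_hit_ratio]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


# Hypothetical codebases: the same copy logic appears in two products.
index = CloneIndex()
index.add("product_a/parse_font",
          "len = read_u16(buf);\ncopy(dst, buf, len);\nreturn dst;")
index.add("product_b/parse_font",
          "len = read_u16(buf);\ncopy(dst, buf, len);\nlog(len);\nreturn dst;")
index.add("product_c/sum", "total = 0;\nreturn total;")

vulnerable = "len = read_u16(buf);\ncopy(dst, buf, len);"
results = index.search(vulnerable)
print(results)
```

Because the index is built once, each query touches only the posting lists for the lines in the snippet, which is what makes a near-real-time answer plausible even when the indexed code is large.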

Vulnerability investigation workflow

[Figure: workflow stages] Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix

Manual & ad hoc investigation
• MSRC: code snippet → Team A, Team B, Team C → code clones

Automated investigation
• Code snippet → clone search service → code clones
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones are involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opens a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-of-band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 79

Example Dimensions of Tech Readiness

• Scalability
• Complexity
• Applicability
• Usability (human in the loop)
• Cost-Benefit Analysis

Scalability

• Academia
  • Rarely asks "When the scale goes up, will my solution still work?"
  • Tends to focus on small or toy-scale problems
• Real world (e.g., search engines, code analysis, …)
  • Often demands a scalable solution
  • Ideal: a sophisticated and scalable solution
    • But in practice, a simple solution tends to be the scalable one (performance, maintenance, …)
  • Academia tends to value sophistication > simplicity
  • Ex.: Echelon/MS [Srivastava & Thiagarajan ISSTA'02], Klee [Cadar et al. OSDI'08]

http://dl.acm.org/citation.cfm?id=566187
http://dl.acm.org/citation.cfm?id=1855756

Complexity

• Academia
  • Tends to make assumptions to simplify problems, or to tackle them one at a time (relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice when not consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes → isolated complex data structures → real-world classes, as focused on by our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]

http://dl.acm.org/citation.cfm?id=2048083
http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tends to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case falls within the applicable scope of the solution
• Real world
  • Needs a comprehensive solution that works generally (at least without compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12]/PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374
http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf
http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tends to leave the human out of the loop (involving humans makes evaluations difficult to conduct or write up)
  • Tends not to spend effort on improving tool usability
    • Tool usability would be valued more in HCI than in SE
    • Too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has a human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA'11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tends to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real world
  • Considers many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]
    • PatternInsight/MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'09]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738
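The cost and benefit dimensions above can be put into a back-of-the-envelope calculation. The sketch below is only an illustration: the function name `net_benefit`, the minute-based unit, and all numbers are hypothetical and do not come from the talk.

```python
def net_benefit(reports, true_positives, minutes_per_report, minutes_saved_per_bug):
    """Net value of a bug-finding tool in engineer minutes (hypothetical model).

    Cost: every report, including false positives, must be inspected by a human.
    Benefit: each true bug found early saves some effort, more for severe bugs.
    """
    cost = reports * minutes_per_report
    benefit = true_positives * minutes_saved_per_bug
    return benefit - cost


# Same tool, same 10% precision, different bug severity (made-up numbers).
low_severity = net_benefit(1000, 100, 5, 30)
high_severity = net_benefit(1000, 100, 5, 600)
print(low_severity, high_severity)
```

With these made-up numbers the same precision is a net loss for low-severity bugs and a clear win for severe ones, which is why precision or recall alone may not settle adoption.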

Industry Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user-friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

[Diagram]
Research topics: Software Development Process, Software Systems, Software Users
Technology pillars: Information Visualization, Analysis Algorithms, Large-scale Computing
Axis labels: Vertical, Horizontal

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC'12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE'12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

ICSE 2013 SEIP 42

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More …

What does "getting real" mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  ¤ High tunability      ¤ High scalability
  ¤ High compatibility   ¤ High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  Prototype development → Early adoption → Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition
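The four steps and the partition-based parallelism can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not XIAO's implementation: only the step names and the idea of partitioning the source code come from the slide. The function names, the shared-statement coarse filter, the `difflib`-based fine match, and the thresholds are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import partial


def preprocess(source):
    """Step 1: reduce a function body to a tuple of stripped, non-empty statements."""
    return tuple(line.strip() for line in source.splitlines() if line.strip())


def coarse_match(a, b, min_shared=2):
    """Step 2: cheap filter; keep pairs sharing at least `min_shared` statements."""
    return len(set(a) & set(b)) >= min_shared


def fine_match(a, b):
    """Step 3: order-aware similarity of two statement sequences, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def prune(scored_pairs, threshold):
    """Step 4: drop pairs whose similarity falls below the threshold."""
    return [p for p in scored_pairs if p[2] >= threshold]


def match_partition(names, functions, threshold):
    """Compare the functions of one partition against the whole code base."""
    scored = []
    for a in names:
        for b in functions:
            # a < b reports each unordered pair exactly once across all partitions.
            if a < b and coarse_match(functions[a], functions[b]):
                scored.append((a, b, fine_match(functions[a], functions[b])))
    return prune(scored, threshold)


def find_clones(partitions, threshold=0.6, workers=4):
    """Run the four steps with one source partition per unit of work."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Pre-processing is independent per partition.
        preprocessed = pool.map(
            lambda part: {n: preprocess(s) for n, s in part.items()}, partitions)
        functions = {n: s for part in preprocessed for n, s in part.items()}
        # Matching and pruning are independent per partition as well.
        work = partial(match_partition, functions=functions, threshold=threshold)
        results = pool.map(work, [list(p) for p in partitions])
        return sorted(pair for part in results for pair in part)


f1 = "a++\nb++\nc = foo(a, b)\nd = bar(a, b, c)"
f2 = f1 + "\ne++"
f3 = "x = 1\nreturn x"
clones = find_clones([{"f1": f1, "f3": f3}, {"f2": f2}])
print(clones)
```

The sketch assumes function names are unique across partitions. Threads keep it short; in CPython a process pool or separate machines would be needed for real CPU parallelism.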

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i ++) {
      a ++;
      b ++;
      c = foo(a, b);
      d = bar(a, b, c);
      e = a + c; }

    for (i = 0; i < n; i ++) {
      c = foo(a, b);
      a ++;
      b ++;
      d = bar(a, b, c);
      e = a + d;
      e ++; }
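To make the tuning knobs concrete, here is a small sketch of a tunable snippet similarity applied to the two loop bodies above. It is an assumption-laden illustration, not XIAO's actual metric: `stmt_threshold` stands in for the statement-similarity knob, and `order_weight` for the balance between code structure and disordered statements. Both names and the formula are hypothetical.

```python
from difflib import SequenceMatcher


def statement_similarity(s, t):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s.split(), t.split()).ratio()


def ordered_matches(a, b, same):
    """Longest common subsequence length under a fuzzy `same` predicate."""
    prev = [0] * (len(b) + 1)
    for s in a:
        cur = [0]
        for j, t in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if same(s, t) else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def unordered_matches(a, b, same):
    """Greedy one-to-one matching that ignores statement order."""
    unused, count = list(b), 0
    for s in a:
        match = next((t for t in unused if same(s, t)), None)
        if match is not None:
            unused.remove(match)
            count += 1
    return count


def snippet_similarity(a, b, stmt_threshold=0.7, order_weight=0.5):
    """Blend order-aware and order-insensitive statement matching into [0, 1]."""
    def same(s, t):
        return statement_similarity(s, t) >= stmt_threshold

    total = len(a) + len(b)
    if total == 0:
        return 1.0
    ordered = 2 * ordered_matches(a, b, same) / total
    unordered = 2 * unordered_matches(a, b, same) / total
    return order_weight * ordered + (1 - order_weight) * unordered


left = ["a ++", "b ++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
right = ["c = foo(a, b)", "a ++", "b ++", "d = bar(a, b, c)", "e = a + d", "e ++"]
print(snippet_similarity(left, right))
print(snippet_similarity(left, right, order_weight=1.0))
print(snippet_similarity(left, right, stmt_threshold=0.9))
```

Raising `order_weight` penalizes the reordered `c = foo(a, b)` statement, and raising `stmt_threshold` stops treating `e = a + c` and `e = a + d` as the same statement, so each knob changes the score in a predictable direction.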

Explorability

How to navigate through the large number of detected clones?
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

Pull

Push

Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
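The search scenario (a code snippet as input, a pre-built index, fast lookup) can be sketched with a small inverted index over token windows. Everything here is an assumption made for illustration: the class `CloneIndex`, the k-token shingling, the overlap score, and the function names `product_a/parse_font` and `product_b/parse_font` are hypothetical and say nothing about how the MSRC service is built.

```python
import re
from collections import defaultdict

# Identifiers, numbers, or any single non-space character.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k):
    """Return the set of k-token windows of `code` (whitespace-insensitive)."""
    tokens = TOKEN.findall(code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Inverted index from token windows to the functions that contain them."""

    def __init__(self, k=5):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, name, code):
        """Index one function; done offline, ahead of any query."""
        for window in shingles(code, self.k):
            self.postings[window].add(name)

    def search(self, snippet, min_overlap=0.5):
        """Return (name, score) pairs; score is the share of query windows found."""
        query = shingles(snippet, self.k)
        hits = defaultdict(int)
        for window in query:
            for name in self.postings.get(window, ()):
                hits[name] += 1
        scored = [(name, n / len(query)) for name, n in hits.items()]
        scored = [item for item in scored if item[1] >= min_overlap]
        return sorted(scored, key=lambda item: (-item[1], item[0]))


index = CloneIndex(k=3)
index.add("product_a/parse_font", "if (len > size) return; memcpy(dst, src, len);")
index.add("product_b/parse_font", "if ( len > size ) return;\nmemcpy(dst, src, len);")
index.add("product_c/unrelated", "x = y + 1;")
results = index.search("memcpy(dst, src, len);")
print(results)
```

Because the expensive work happens when the index is built, a query only touches the postings for its own windows, which is what makes a near-real-time response plausible at large scale.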

Vulnerability investigation workflow

[Workflow diagram: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. For variants finding, MSRC sends a code snippet to Team A, Team B, and Team C, whose manual & ad hoc investigation returns code clones.]

Vulnerability investigation workflow

[Workflow diagram: Issue reproducing → Root cause investigation & source location → Variants finding → Design/Implement/Test fix. Variants finding is now an automated investigation: a code snippet goes to the clone search service, which returns code clones. Callouts: "Completeness is the key"; "Web service API for automation".]

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• 3 publicly disclosed vulnerabilities and seven privately reported ones involved. Specifically, one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset change needed for the community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 33: Pathways to Technology Transfer and Adoption: Achievements and Challenges

ScalabilitybullAcademia

bull Rarely ask ldquoWhen scale is up will my solution still workrdquo

bull Tend to focus on small or toy scale problems

bullReal-world (eg search engine code analysis hellip)bull Often demand a scalable solution

bull Ideal sophisticated and scalable solutionbull But in practice simple solution tends to be scalable

(performance maintenance hellip)

bull Academia tend to value sophistication gt simplicity

bull Ex EchelonMS [SrivastavaThiagarajan ISSTArsquo02] Klee [Cadar et al OSDIrsquo08]

httpdlacmorgcitationcfmid=566187httpdlacmorgcitationcfmid=1855756

ComplexitybullAcademia

bull Tend to make assumptions to simplify problems or one at a time (indeed relaxing assumptions over time)

bull May not be able to assess the relevancefeasibility of assumptions in practice not consultwork w industry

bullReal-worldbull Often has high complexity violating these assumptions

bull Example OO Unit Test Generationbull Isolated simple classes Isolated complex data

structures Real world classes as focused by our recent work [Thummalapenta et al ESECFSErsquo09 OOPSLArsquo11]

httpdlacmorgcitationcfmid=2048083

httpdlacmorgcitationcfmid=1595725

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

MSRC: Microsoft Security Response Center
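The search scenario above (code snippet in, matching locations out, over a pre-built index) can be illustrated with a small sketch. This is not XIAO's implementation: the class name `CloneSearchIndex`, the `min_coverage` parameter, and the statement-level inverted index are all assumptions made for illustration, and a real service would need tolerance for modified statements as well.

```python
from collections import Counter, defaultdict


class CloneSearchIndex:
    """Hypothetical inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _statements(code):
        # One statement per non-empty line, with whitespace collapsed.
        return [" ".join(line.split()) for line in code.splitlines() if line.strip()]

    def add(self, location, code):
        """Index every statement of a code unit under its location."""
        for stmt in self._statements(code):
            self._postings[stmt].add(location)

    def search(self, snippet, min_coverage=0.6):
        """Return (location, coverage) pairs, where coverage is the fraction
        of distinct snippet statements found at that location."""
        stmts = set(self._statements(snippet))
        hits = Counter()
        for stmt in stmts:
            for location in self._postings.get(stmt, ()):
                hits[location] += 1
        ranked = [
            (location, count / len(stmts))
            for location, count in hits.items()
            if count / len(stmts) >= min_coverage
        ]
        # Best coverage first; ties broken by location name for determinism.
        return sorted(ranked, key=lambda item: (-item[1], item[0]))
```

Because the index is built ahead of time, a query only touches the posting lists of the snippet's own statements, which is what makes a near-real-time response plausible at large scale.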

Vulnerability investigation workflow

[Diagram: workflow stages "Issue reproducing", "Root cause investigation & source location", "Variants finding", and "Design/Implement/Test fix". MSRC sends a code snippet to Team A, Team B, and Team C and receives code clones back through manual & ad hoc investigation.]

Vulnerability investigation workflow

[Diagram: the same workflow stages, with the clone search service taking a code snippet and returning code clones (automated investigation). Callouts: "Completeness is the key" and "Web service API for automation".]

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example - MS security bulletin MS12-034: Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community


Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Complexity

• Academia
  • Tend to make assumptions to simplify problems, or to address them one at a time (indeed relaxing assumptions over time)
  • May not be able to assess the relevance/feasibility of assumptions in practice without consulting or working with industry
• Real world
  • Often has high complexity, violating these assumptions
• Example: OO unit test generation
  • Isolated simple classes, then isolated complex data structures, then real-world classes as focused on by our recent work [Thummalapenta et al. ESEC/FSE'09, OOPSLA'11]

http://dl.acm.org/citation.cfm?id=2048083

http://dl.acm.org/citation.cfm?id=1595725

Applicability

• Academia
  • Tend to focus on a solution optimized for one of many situations (likely worse for others) vs. a comprehensive solution
  • May not be able to tell ahead of time whether a given case would fall into the applicable scope of the solution
• Real world
  • Need a comprehensive solution that would work generally (at least not compromising other situations too much)
• Examples
  • Integration of our Fitnex in Pex [Xie et al. DSN'09]
  • Coverity [Bessey et al. CACM'10] vs. MSRA XIAO [Dang et al. ACSAC'12] / PatternInsight
  • Industry adoption of open source tools

http://dl.acm.org/citation.cfm?id=1646374

http://research.microsoft.com/pubs/81089/dsn09-fitnex%5B1%5D.pdf

http://research.microsoft.com/jump/175199

Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write up)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA'06] vs. Daikon [Ernst et al. ICSE'99]
  • Debugging user study [Parnin & Orso ISSTA'11]

http://dl.acm.org/citation.cfm?id=302467

http://dl.acm.org/citation.cfm?id=1146258

http://dl.acm.org/citation.cfm?id=2001445

"Are Automated Debugging [Research] Techniques Actually Helping Programmers?" [Parnin & Orso ISSTA'11]

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real world
  • Consider many dimensions of measurement
    • Cost, e.g., human cost (inspecting false positives)
    • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS'08]
    • PatternInsight / MSRA XIAO: known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA'09]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf

http://dl.acm.org/citation.cfm?id=1831738

Industry Academia Collaboration

• Academia (research recognition, e.g., papers) vs. industry (company revenues)
• Academia (research innovations) vs. industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-the-box thinking) vs. industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services.

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.

http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Research Topics (vertical)

• Software Users
• Software Development Process
• Software Systems

Technology Pillars (horizontal)

• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC'12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE'12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199

http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…

What does "getting real" mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project - XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
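To make the four-step shape concrete, here is a toy pipeline with the same stage names. It is only a sketch under stated assumptions: the function names, the statement-per-line input, the Jaccard score in the fine-matching step, and the `threshold` parameter are all illustrative choices, not the algorithms XIAO uses. Pre-processing runs per source unit, so it is the step that parallelizes naturally over a source code partition.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def preprocess(source: str) -> list[str]:
    """Step 1: one statement per non-empty line, whitespace collapsed."""
    return [" ".join(line.split()) for line in source.splitlines() if line.strip()]


def coarse_match(units: dict[str, list[str]]) -> set[tuple[str, str]]:
    """Step 2: cheap candidate generation, pairing units that share a statement."""
    by_statement = defaultdict(set)
    for name, statements in units.items():
        for stmt in statements:
            by_statement[stmt].add(name)
    candidates = set()
    for names in by_statement.values():
        candidates.update(combinations(sorted(names), 2))
    return candidates


def fine_match(left: list[str], right: list[str]) -> float:
    """Step 3: Jaccard similarity over statement sets (an assumed metric)."""
    a, b = set(left), set(right)
    return len(a & b) / len(a | b)


def prune(scored: dict[tuple[str, str], float], threshold: float):
    """Step 4: drop candidate pairs scoring below the threshold."""
    return {pair: score for pair, score in scored.items() if score >= threshold}


def detect_clones(sources: dict[str, str], threshold: float = 0.5):
    # Each source unit is pre-processed independently, hence the worker pool.
    with ThreadPoolExecutor() as pool:
        units = dict(zip(sources, pool.map(preprocess, sources.values())))
    candidates = coarse_match(units)
    scored = {(x, y): fine_match(units[x], units[y]) for x, y in candidates}
    return prune(scored, threshold)
```

The coarse step exists so that the more expensive fine comparison runs only on pairs with some evidence of overlap, rather than on every pair of units.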

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

Snippet 1:

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

Snippet 2:

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
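The effect of tuning can be shown with a small sketch applied to the two loop bodies above. The metric here is an assumption for illustration, not XIAO's actual similarity metric: statements are paired regardless of order, `stmt_threshold` (hypothetical) controls how different two statements may be and still count as the same, and `snippet_threshold` (hypothetical) controls how many unpaired statements are tolerated.

```python
from difflib import SequenceMatcher


def normalize(stmt: str) -> str:
    """Collapse whitespace so formatting differences do not count."""
    return " ".join(stmt.split())


def statements_match(a: str, b: str, stmt_threshold: float) -> bool:
    """Two statements match when their character-level ratio meets the threshold."""
    return SequenceMatcher(None, a, b).ratio() >= stmt_threshold


def snippet_similarity(left, right, stmt_threshold=0.8):
    """Share of statements that can be paired up, ignoring statement order."""
    left = [normalize(s) for s in left]
    unmatched = [normalize(s) for s in right]
    total = len(left) + len(unmatched)
    paired = 0
    for stmt in left:
        for i, candidate in enumerate(unmatched):
            if statements_match(stmt, candidate, stmt_threshold):
                del unmatched[i]  # each statement may be paired only once
                paired += 1
                break
    return 2 * paired / total if total else 1.0


def is_clone(left, right, stmt_threshold=0.8, snippet_threshold=0.7):
    """Report a clone when enough statements pair up under the chosen tolerances."""
    return snippet_similarity(left, right, stmt_threshold) >= snippet_threshold
```

With the loop bodies from the slide, a tolerant statement threshold pairs `e = a + c;` with `e = a + d;`, while an exact-match threshold leaves that pair unmatched, so the same two snippets can be reported or suppressed depending on the settings.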


Page 35: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Applicabilitybull Academia

bull Tend to focus on a solution optimized for one of many situations (likely worse for others) vs comprehensive solution

bull May not enable to tell ahead of time whether a given case would fall into applicable scope of the solution

bull Real-worldbull Need a comprehensive solution that would work generally (at least

not compromising too much other situations)

bull Examplesbull Integration of our Fitnex in Pex [Xie et al DSNrsquo09]

bull Coverity [Bessey et al CACMrsquo10] vs MSRA XIAO [Dang et al ACSACrsquo12]PatternInsight

bull Industry adoption of open source tools

httpdlacmorgcitationcfmid=1646374httpresearchmicrosoftcompubs81089dsn09-fitnex5B15Dpdf

httpresearchmicrosoftcomjump175199

Usabilitybull Academia

bull Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)

bull Tend not to spend effort on improving tool usability bull tool usability would be valued more in HCI than in SE

bull too much to include both the approachtool itself and usabilityits evaluation in a single paper

bull Real-worldbull Often has human in the loop (familiar IDE integration social effect

lack of expertisewillingness to write specshellip)

bull Examplesbull Agitar [Boshernitsan et al ISSTArsquo06] vs Daikon [Ernst et al ICSErsquo99]

bull Debugging user study [ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=302467httpdlacmorgcitationcfmid=1146258

httpdlacmorgcitationcfmid=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

ICSE 2013 SEIP 54

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

ICSE 2013 SEIP 57

Potential 0day vulnerability disclosure

ICSE 2013 SEIP 58

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

ICSE 2013 SEIP 59
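The search scenario (a code snippet as input, a pre-built index over many codebases, fast lookup) can be sketched with a toy inverted index. This is only an illustration under stated assumptions: the slides do not describe the service's index structure, and `CloneIndex`, `normalize`, the function identifiers, and the `min_overlap` threshold are all hypothetical.

```python
from collections import defaultdict


def normalize(line):
    """Collapse whitespace so formatting differences do not affect matching."""
    return " ".join(line.split())


class CloneIndex:
    """Toy inverted index: normalized statement -> functions containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, func_id, source):
        """Index every non-empty statement (one per line) of a function."""
        for line in source.splitlines():
            stmt = normalize(line)
            if stmt:
                self._postings[stmt].add(func_id)

    def search(self, snippet, min_overlap=0.5):
        """Return (func_id, overlap) pairs, best first.

        overlap is the fraction of the snippet's distinct statements
        that also occur in the indexed function.
        """
        stmts = {normalize(line) for line in snippet.splitlines()} - {""}
        hits = defaultdict(int)
        for stmt in stmts:
            for func_id in self._postings.get(stmt, ()):
                hits[func_id] += 1
        ranked = [(fid, n / len(stmts)) for fid, n in hits.items()
                  if n / len(stmts) >= min_overlap]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneIndex()
index.add("product_a/parse_font", "len = read(buf)\ncopy(dst, buf, len)\nreturn dst")
index.add("product_b/load_font", "len = read(buf)\n  copy(dst,  buf, len)\nlog(len)")
index.add("product_c/noop", "return 0")

results = index.search("len = read(buf)\ncopy(dst, buf, len)")
print(results)
```

Because the index is built once and each query only touches the postings of the snippet's own statements, lookup cost depends on the snippet rather than on the total indexed code, which is the property a near-real-time search scenario needs.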

Vulnerability investigation workflow

ICSE 2013 SEIP 60

[Figure: workflow diagram. Stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Other labels: MSRC; Team A; Team B; Team C; Manual & ad hoc investigation; Code snippet; Code clones.]

Vulnerability investigation workflow

ICSE 2013 SEIP 61

[Figure: workflow diagram. Stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Other labels: Clone search service; Automated Investigation; Code snippet; Code clones. Callouts: “Completeness is the key”; “Web service API for automation”.]

More secure Microsoft products

ICSE 2013 SEIP 62

Automated laborious manual efforts

Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034: Combined Security Update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 8, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

ICSE 2013 SEIP 63

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

ICSE 2013 SEIP 64

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

ICSE 2013 SEIP 65

Benefiting developer community

ICSE 2013 SEIP 66

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

ICSE 2013 SEIP 79


Usability

• Academia
  • Tend to leave human out of loop (involving human makes evaluations difficult to conduct or write)
  • Tend not to spend effort on improving tool usability
    • tool usability would be valued more in HCI than in SE
    • too much to include both the approach/tool itself and usability/its evaluation in a single paper
• Real-world
  • Often has human in the loop (familiar IDE integration, social effect, lack of expertise/willingness to write specs, …)
• Examples
  • Agitar [Boshernitsan et al. ISSTA’06] vs. Daikon [Ernst et al. ICSE’99]
  • Debugging user study [Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=302467
http://dl.acm.org/citation.cfm?id=1146258
http://dl.acm.org/citation.cfm?id=2001445

Are Automated Debugging [Research] Techniques Actually Helping Programmers?

• 50 years of automated debugging research
  • N papers, only 5 evaluated with actual programmers

[Parnin & Orso ISSTA’11]

http://dl.acm.org/citation.cfm?id=2001445

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision, and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
  • Killer apps, e.g.,
    • MSR SLAM: device driver verification
    • MSR SAGE: security testing of binaries [Godefroid et al. NDSS’08]
    • PatternInsight/MSRA XIAO: known-bug detection
  • Example: Google FindBugs Fixit [Ayewah & Pugh ISSTA’10]

http://research.microsoft.com/en-us/projects/slam
http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf
http://dl.acm.org/citation.cfm?id=1831738

Industry Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.
http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf


Research topics & technology pillars

Research Topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology Pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing


Connection to practice

MSR 2012 39

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al. ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al. ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

ICSE 2013 SEIP 42

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

ICSE 2013 SEIP 43

What does “getting real” mean?

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking last mile

ICSE 2013 SEIP 45

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact

Prototype development → Early adoption → Technology transfer

ICSE 2013 SEIP 46

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
• Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processing → Coarse Matching → Fine Matching → Pruning
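A minimal sketch of how a four-step pipeline with these stage names might be wired together is shown below. The slide names only the stages; what each stage does here (whitespace normalization, shared-statement candidate filtering, `difflib` scoring, threshold pruning) is an assumption for illustration, as are all function names and the `min_similarity` value. It is not XIAO's implementation.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1: turn each function body into a list of normalized statements."""
    return {name: [" ".join(line.split())
                   for line in body.splitlines() if line.strip()]
            for name, body in functions.items()}


def coarse_match(stmts_by_func):
    """Step 2: cheap filter; keep only pairs sharing at least one statement."""
    owners = defaultdict(set)
    for name, stmts in stmts_by_func.items():
        for stmt in stmts:
            owners[stmt].add(name)
    pairs = set()
    for names in owners.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def fine_match(stmts_by_func, pairs):
    """Step 3: score each candidate pair with a more expensive comparison."""
    return {(a, b): SequenceMatcher(a=stmts_by_func[a], b=stmts_by_func[b],
                                    autojunk=False).ratio()
            for a, b in pairs}


def prune(scores, min_similarity=0.6):
    """Step 4: drop pairs whose similarity is below the threshold."""
    return sorted(pair for pair, score in scores.items()
                  if score >= min_similarity)


def detect_clones(functions, min_similarity=0.6):
    stmts = preprocess(functions)
    return prune(fine_match(stmts, coarse_match(stmts)), min_similarity)


functions = {
    "f": "a++\nb++\nc = foo(a, b)",
    "g": "a++\nb++\nc = foo(a, b)\nd = 1",
    "h": "a++\nx = 9\ny = 8\nz = 7",
}
print(detect_clones(functions))
```

In this sketch the pre-processing and fine-matching steps operate on each function or pair independently, so the work could be split across source-code partitions and merged afterwards, which is one way to read the slide's parallelization point.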


Page 37: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Are Automated Debugging [Research] Techniques Actually Helping Programmers

bull 50 years of automated debugging researchbull N papers only 5 evaluated with actual programmers

ldquo

rdquo[ParninampOrso ISSTArsquo11]

httpdlacmorgcitationcfmid=2001445

Cost-Benefit Analysisbull Academia

bull Tend to focus on one or a few dimensions of measurement (eg analysis cost precision andor recall)

bull Real-worldbull Consider many dimensions of measurement

bull Cost eg human cost (inspecting false positives)

bull Benefit eg bug severity

bull Killer apps eg bull MSR SLAM Device driver verification

bull MSR SAGE Security testing of binaries [Godefroid et al NDSSrsquo08]

bull PatternInsightMSRA XIAO Known-bug detection

bull Example Google FindBugs Fixit [AyewahampPugh ISSTArsquo09]

httpresearchmicrosoftcomen-usprojectsslam

httpresearchmicrosoftcomen-usumpeoplepgpublic_psfilesndss2008pdfhttpdlacmorgcitationcfmid=1831738

Industry Academia Collaboration

bullAcademia (research recognitions eg papers) vs Industry (company revenues)

bullAcademia (research innovations) vs Industry (likely involving engineering efforts)

bullAcademia (long-termfundamental research or out of box thinking) vs Industry (short-term research or work)

bull Industry problems infrastructures data evaluation testbeds hellip

bull Academia educating students hellip

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

bull Demonstrating XIAO at TechFest

bull Posting XIAO at internal website

bull Active ldquosellingrdquo to various teams

bull What we gainedbull Opportunities to run XIAO on different codebases and

produce rich results

bull Feedback to improve both algorithm and system

bull Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

bull What did not land well internallybull Wide interest but no concrete takers

bull Why no takersbull What exactly is the valuable proposition

bull Long way to go from code clones to bugs

bull High cost for code refactoring

bull Product prioritization

bull Lessons learnedbull Killer scenarios needed for value proposition

bull Security is a big stick

ICSE 2013 SEIP 57

Potential 0day vulnerability disclosure

ICSE 2013 SEIP 58

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

bull Search scenario vs detection scenariobull Code snippet as input

bull Much larger scale of codebases

bull Near-real-time response

bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases

bull Deployed in used by and transferred to MSRC

bull Champion in MSRC worked with us all the waybull Providing feedback and update

bull Prompting within MSRC

ICSE 2013 SEIP 59

Microsoft Security Response Center

Vulnerability investigation workflow

ICSE 2013 SEIP 60

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

Team A

MSRC

Manual amp ad hoc investigation

Code snippet

Team B

Team C

Code clones

Vulnerability investigation workflow

ICSE 2013 SEIP 61

[Figure: the same workflow stages (Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix) with automated investigation: the clone search service takes a code snippet and returns code clones. Callouts: Completeness is the key; Web service API for automation.]

More secure Microsoft products

ICSE 2013 SEIP 62

Automated laborious manual efforts

Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

ICSE 2013 SEIP 63

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story, felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

ICSE 2013 SEIP 64

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

ICSE 2013 SEIP 65

Benefiting developer community

ICSE 2013 SEIP 66

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

ICSE 2013 SEIP 79

Page 38: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Cost-Benefit Analysis

• Academia
  • Tend to focus on one or a few dimensions of measurement (e.g., analysis cost, precision and/or recall)
• Real-world
  • Consider many dimensions of measurement
  • Cost, e.g., human cost (inspecting false positives)
  • Benefit, e.g., bug severity
• Killer apps, e.g.,
  • MSR SLAM: Device driver verification
  • MSR SAGE: Security testing of binaries [Godefroid et al., NDSS’08]
  • PatternInsight/MSRA XIAO: Known-bug detection
• Example: Google FindBugs Fixit [Ayewah & Pugh, ISSTA’09]

http://research.microsoft.com/en-us/projects/slam

http://research.microsoft.com/en-us/um/people/pg/public_psfiles/ndss2008.pdf

http://dl.acm.org/citation.cfm?id=1831738

Industry Academia Collaboration

• Academia (research recognitions, e.g., papers) vs. Industry (company revenues)
• Academia (research innovations) vs. Industry (likely involving engineering efforts)
• Academia (long-term/fundamental research or out-of-box thinking) vs. Industry (short-term research or work)
• Industry: problems, infrastructures, data, evaluation testbeds, …
• Academia: educating students, …

MSRA Software Analytics Group

Mission: Utilize data-driven approach to help create highly performing, user friendly, and efficiently developed and operated software and services

Founded: May 2009

Group members: 12

http://research.microsoft.com/en-us/groups/sa

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Software Analytics

Software analytics is to enable software practitioners to perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang, Yingnong Dang, Jian-Guang Lou, Shi Han, Haidong Zhang, and Tao Xie. Software Analytics as a Learning Case in Practice: Approaches and Experiences. In MALETS 2011.

http://research.microsoft.com/en-us/groups/sa/malets11-analytics.pdf

Research topics & technology pillars

Microsoft Confidential

[Figure: Research Topics: Software Development Process, Software Systems, Software Users. Technology Pillars: Information Visualization, Analysis Algorithms, Large-scale Computing. Axis labels: Vertical, Horizontal.]

Connection to practice

MSR 2012 39

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real Data
  • Real Problems
  • Real Users
  • Real Tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC’12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE’12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199

http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

ICSE 2013 SEIP 42

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More …

ICSE 2013 SEIP 43

What does “getting real” mean?

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

ICSE 2013 SEIP 45

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

ICSE 2013 SEIP 46

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition

ICSE 2013 SEIP 47
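The four named steps and the partition-based parallelism can be sketched as a small pipeline. The slide gives only the step names, so everything inside each step here is an assumption made for illustration: the set-overlap filter in `coarse_match`, the use of `difflib.SequenceMatcher` in `fine_match`, and the thresholds `min_shared` and `threshold` are not XIAO's actual algorithm. The sketch also compares functions only within a partition; how clones across partitions are handled is not stated in the source.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(body):
    """Step 1: normalize a function body into a list of whitespace-free statements."""
    return ["".join(line.split()) for line in body.splitlines() if line.strip()]


def coarse_match(a, b, min_shared=2):
    """Step 2: cheap filter that keeps pairs sharing at least min_shared statements."""
    return len(set(a) & set(b)) >= min_shared


def fine_match(a, b):
    """Step 3: more expensive, order-aware similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def prune(scored, threshold=0.6):
    """Step 4: drop candidate pairs whose similarity is below the threshold."""
    return [(x, y, s) for x, y, s in scored if s >= threshold]


def analyze_partition(functions):
    """Run the four steps on one partition: a dict of function name -> body."""
    stmts = {name: preprocess(body) for name, body in functions.items()}
    scored = []
    for x, y in combinations(sorted(stmts), 2):
        if coarse_match(stmts[x], stmts[y]):
            scored.append((x, y, fine_match(stmts[x], stmts[y])))
    return prune(scored)


def analyze(partitions, workers=4):
    """Partitions are independent, so they can be analyzed in parallel."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(analyze_partition, partitions)
    return [pair for part in results for pair in part]


# Two made-up partitions; only the first contains a near-duplicate pair.
partitions = [
    {"f": "a++\nb++\nc = foo(a, b)", "g": "a++\nb++\nc = foo(a, b)\nd = 1"},
    {"h": "x = 1", "k": "y = 2"},
]
clones = analyze(partitions)
```

The point of the coarse step is cost: the cheap filter discards most pairs so that the expensive comparison runs on few candidates.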

What you tune is what you get

MSR 2012 48

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
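To make the tuning idea concrete, here is a minimal sketch that applies two knobs to the two loop bodies shown on this slide. It is not XIAO's similarity metric: the parameter names `min_stmt_sim` and `max_diff_ratio`, the greedy matching, and the use of `difflib.SequenceMatcher` for statement similarity are all assumptions made for illustration. Matching ignores statement order entirely, so the third knob (balance between code structure and disordered statements) is not modeled.

```python
from difflib import SequenceMatcher
import re


def tokens(stmt):
    """Split a statement into identifier/number tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", stmt)


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, tokens(s1), tokens(s2)).ratio()


def is_clone(snippet_a, snippet_b, min_stmt_sim=1.0, max_diff_ratio=0.3):
    """Greedy, order-insensitive statement matching under two hypothetical knobs."""
    unmatched_b = list(snippet_b)
    matched = 0
    for stmt in snippet_a:
        best = max(unmatched_b,
                   key=lambda other: statement_similarity(stmt, other),
                   default=None)
        if best is not None and statement_similarity(stmt, best) >= min_stmt_sim:
            unmatched_b.remove(best)
            matched += 1
    # Statements left unmatched on either side count as inserted, deleted, or modified.
    diff_ratio = 1 - 2 * matched / (len(snippet_a) + len(snippet_b))
    return diff_ratio <= max_diff_ratio, diff_ratio


first = ["for (i = 0; i < n; i++)", "a++;", "b++;",
         "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
second = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
          "d = bar(a, b, c);", "e = a + d;", "e++;"]

strict = is_clone(first, second)                      # exact statements only
relaxed = is_clone(first, second, min_stmt_sim=0.8)   # accept a modified statement
tight = is_clone(first, second, max_diff_ratio=0.2)   # allow fewer differences
```

With exact statement matching, five of the statements pair up and the snippets pass the default difference budget; lowering `min_stmt_sim` lets `e = a + c;` pair with `e = a + d;`, while tightening `max_diff_ratio` rejects the pair.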

Explorability

ICSE 2013 SEIP 49

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication – getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

ICSE 2013 SEIP 52

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

ICSE 2013 SEIP 53

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

ICSE 2013 SEIP 54

Source: http://www.dagstuhl.de/12071

Page 40: Pathways to Technology Transfer and Adoption: Achievements and Challenges

MSRA Software Analytics Group

Mission

Utilize data-driven approach to help create highly

performing user friendly and efficiently developed

and operated software and services

Founded

May 2009

Group members

12

httpresearchmicrosoftcomen-usgroupssa

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Software Analytics

Software analytics is to enable software practitionersto perform data exploration and analysis in order to obtain insightful and actionable information for data-driven tasks around software and services

Dongmei Zhang Yingnong Dang Jian-Guang Lou Shi Han Haidong Zhang and Tao Xie Software Analytics as a Learning Case in Practice Approaches and Experiences In MALETS 2011httpresearchmicrosoftcomen-usgroupssamalets11-analyticspdf

Research topics & technology pillars

Research topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
http://research.microsoft.com/jump/175199

StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

Making real impact
Building real technologies
Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012. http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
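The four steps and the partition-based parallelism can be sketched roughly as below. This is only an illustration, not XIAO's implementation: the slides name the steps but not their internals, so what each function does here (splitting on semicolons, a shared-statement filter, a `difflib` score, a threshold) is an assumption, and all function names are hypothetical. The sketch also only finds clones inside one partition, whereas a real system would need to compare across partitions too.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1 (assumed): turn each function body into a list of statements."""
    return {name: [s.strip() for s in body.split(";") if s.strip()]
            for name, body in functions.items()}


def coarse_match(stmts, min_shared=2):
    """Step 2 (assumed): cheap filter keeping pairs that share identical statements."""
    return [(a, b) for a, b in combinations(sorted(stmts), 2)
            if len(set(stmts[a]) & set(stmts[b])) >= min_shared]


def fine_match(stmts, pairs):
    """Step 3 (assumed): score each candidate pair with a costlier comparison."""
    return [(a, b, SequenceMatcher(None, stmts[a], stmts[b]).ratio())
            for a, b in pairs]


def prune(scored, threshold=0.6):
    """Step 4 (assumed): drop pairs whose score is below the threshold."""
    return [(a, b, round(score, 2)) for a, b, score in scored if score >= threshold]


def analyze_partition(functions):
    """Run the four steps on one partition of the source code."""
    stmts = preprocess(functions)
    return prune(fine_match(stmts, coarse_match(stmts)))


def analyze(partitions, workers=4):
    """Analyze partitions independently, then merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(analyze_partition, partitions)
    return [clone for part in results for clone in part]


partitions = [
    {"f": "a++; b++; c = foo(a, b)", "g": "a++; b++; c = foo(a, b)"},
    {"h": "x = 1; y = 2", "k": "p = 3; q = 4"},
]
print(analyze(partitions))
```

Because each partition is processed without shared state, the thread pool could be swapped for a process pool or separate machines without changing the per-partition logic.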

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Number of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
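To make the "balance between code structure and disordered statements" idea concrete, here is a minimal sketch applied to the two loop bodies from the slide. It is not XIAO's metric, which the slides do not define; it assumes a simple blend of an order-sensitive score and an order-insensitive score, and the names `similarity` and `structure_weight` are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher


def ordered_overlap(a, b):
    """Count statements matched when statement order must be preserved."""
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    return sum(block.size for block in blocks)


def unordered_overlap(a, b):
    """Count statements matched when order is ignored (multiset intersection)."""
    return sum((Counter(a) & Counter(b)).values())


def similarity(a, b, structure_weight=0.5):
    """Blend order-sensitive and order-insensitive similarity.

    structure_weight = 1.0 penalizes reordered statements fully;
    structure_weight = 0.0 ignores statement order entirely.
    """
    total = len(a) + len(b)
    if total == 0:
        return 1.0
    ordered = 2 * ordered_overlap(a, b) / total
    unordered = 2 * unordered_overlap(a, b) / total
    return structure_weight * ordered + (1 - structure_weight) * unordered


first = ["a++", "b++", "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
second = ["c = foo(a, b)", "a++", "b++", "d = bar(a, b, c)", "e = a + d", "e++"]

print(similarity(first, second, structure_weight=1.0))  # 6/11, about 0.545
print(similarity(first, second, structure_weight=0.0))  # 8/11, about 0.727
```

The moved statement `c = foo(a, b)` counts as a match only when order is ignored, which is why lowering the weight raises the score for this pair.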


Explorability

How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
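The search scenario (a code snippet as input, a quick answer over pre-indexed codebases) can be illustrated with a small inverted index. This is a sketch under stated assumptions, not the service deployed at MSRC: the class `CloneSearchIndex`, the whitespace-only normalization, the coverage-based ranking, and the file names are all hypothetical.

```python
from collections import defaultdict


def normalize(line):
    """Collapse whitespace so formatting differences do not block a match."""
    return " ".join(line.split())


class CloneSearchIndex:
    """Toy inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, source):
        """Index every non-empty line of a source fragment under its location."""
        for line in source.splitlines():
            stmt = normalize(line)
            if stmt:
                self._postings[stmt].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best coverage first.

        Coverage is the fraction of the snippet's distinct statements
        found at that location.
        """
        stmts = {normalize(line) for line in snippet.splitlines() if normalize(line)}
        hits = defaultdict(int)
        for stmt in stmts:
            for location in self._postings.get(stmt, ()):
                hits[location] += 1
        ranked = [(loc, count / len(stmts)) for loc, count in hits.items()
                  if count / len(stmts) >= min_coverage]
        return sorted(ranked, key=lambda item: (-item[1], item[0]))


index = CloneSearchIndex()
index.add("productA/font.c", "len = read_len(buf);\nmemcpy(dst, buf, len);\nreturn dst;")
index.add("productB/glyph.c", "len  =  read_len(buf);\nmemcpy(dst, buf, len);\nlog(dst);")
index.add("productC/ui.c", "draw(window);")

print(index.search("len = read_len(buf);\nmemcpy(dst, buf, len);"))
```

Indexing once and answering each query from the postings is what keeps lookups fast as the codebase grows; exact statement matching is the simplification here, where a real clone search would tolerate edits within statements.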

Vulnerability investigation workflow

[Diagram: workflow stages Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Variants finding is a manual & ad hoc investigation. Other labels: MSRC, Team A, Team B, Team C, Code snippet, Code clones.]

Vulnerability investigation workflow

[Diagram: the same workflow stages (Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix), with variants finding now an automated investigation using the clone search service. Other labels: Code snippet, Code clones.]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching for similar snippets so that a bug is fixed once
• Finding refactoring opportunities

Summary

• Mindset change needed for the community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


Page 42: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Research topics amp technology pillars

Microsoft Confidential

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Research Topics

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)



Research topics & technology pillars

Research topics (vertical)
• Software Development Process
• Software Systems
• Software Users

Technology pillars (horizontal)
• Information Visualization
• Analysis Algorithms
• Large-scale Computing

Connection to practice

• Software Analytics is naturally tied with software development practice
• Getting real
  • Real data
  • Real problems
  • Real users
  • Real tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC’12]
• Detecting near-duplicated code
• Released with Visual Studio 2012
http://research.microsoft.com/jump/175199

StackMine [Han et al., ICSE’12]
• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance
http://dl.acm.org/citation.cfm?id=2337241

http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development → early adoption → technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing → coarse matching → fine matching → pruning
• Easily parallelizable based on source code partition
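
To make the four-step process concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not XIAO's implementation: the function names, the identifier normalization, the shared-statement filter used for coarse matching, and both thresholds are assumptions chosen only for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1 (assumed): normalize each statement so that renamed
    identifiers and whitespace differences do not affect matching."""
    statements = []
    for line in source.splitlines():
        line = re.sub(r"[A-Za-z_]\w*", "ID", line)  # abstract identifiers
        line = re.sub(r"\s+", "", line)              # drop all whitespace
        if line:
            statements.append(line)
    return statements


def coarse_match(functions, min_shared=2):
    """Step 2 (assumed): cheap filter that keeps only pairs sharing at
    least `min_shared` distinct normalized statements."""
    candidates = []
    for (name_a, stmts_a), (name_b, stmts_b) in combinations(functions.items(), 2):
        if len(set(stmts_a) & set(stmts_b)) >= min_shared:
            candidates.append((name_a, name_b))
    return candidates


def fine_match(stmts_a, stmts_b):
    """Step 3 (assumed): order-aware similarity score in [0, 1]."""
    return SequenceMatcher(None, stmts_a, stmts_b).ratio()


def prune(scored_pairs, threshold=0.6):
    """Step 4 (assumed): discard candidate pairs below the threshold."""
    return [(a, b, score) for a, b, score in scored_pairs if score >= threshold]


def detect_clones(sources, threshold=0.6):
    """Run the four steps over a mapping of function name -> source text."""
    functions = {name: preprocess(src) for name, src in sources.items()}
    scored = [
        (a, b, fine_match(functions[a], functions[b]))
        for a, b in coarse_match(functions)
    ]
    return prune(scored, threshold)
```

Because `preprocess` looks at one function at a time, that step could be mapped over source partitions independently (for example with `concurrent.futures.ProcessPoolExecutor`), which is one way to read the "parallelizable based on source code partition" point.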

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
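
The tuning knobs listed above can be illustrated with a small sketch applied to the two loop bodies on the slide. This is not XIAO's metric: `min_stmt_sim` and `max_changed` are hypothetical parameter names, statement similarity is approximated with Python's `difflib`, and statement order is ignored entirely, which is only one extreme of the structure-versus-disorder balance.

```python
from difflib import SequenceMatcher


def statement_similarity(a, b):
    """Character-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def changed_ratio(block_a, block_b, min_stmt_sim=0.8):
    """Fraction of statements without a similar-enough counterpart.

    Statements are paired greedily and order is ignored, so reordered
    statements still count as matched (an assumption of this sketch).
    """
    if not block_a and not block_b:
        return 0.0
    unmatched = list(block_b)
    matched = 0
    for stmt in block_a:
        best = max(
            unmatched,
            key=lambda other: statement_similarity(stmt, other),
            default=None,
        )
        if best is not None and statement_similarity(stmt, best) >= min_stmt_sim:
            unmatched.remove(best)
            matched += 1
    return 1.0 - matched / max(len(block_a), len(block_b))


def is_clone(block_a, block_b, min_stmt_sim=0.8, max_changed=0.2):
    """Report a clone when few enough statements were inserted, deleted, or modified."""
    return changed_ratio(block_a, block_b, min_stmt_sim) <= max_changed


# The two loop bodies from the slide, one statement per entry.
LOOP_A = ["a++;", "b++;", "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
LOOP_B = ["c = foo(a, b);", "a++;", "b++;", "d = bar(a, b, c);", "e = a + d;", "e++;"]

if __name__ == "__main__":
    print(is_clone(LOOP_A, LOOP_B))                     # tolerant statement matching
    print(is_clone(LOOP_A, LOOP_B, min_stmt_sim=0.95))  # stricter statement matching
```

Raising `min_stmt_sim` makes the modified statement (`e = a + c` versus `e = a + d`) count as changed, so the same pair can move in or out of the result set as the user tunes the thresholds.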


Explorability

How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Figure labels:
• Initial vulnerability reported in product A
• Security bulletin released
• Similar vulnerability found in product B by attackers
• Potential 0day attack
• Patch release of product B

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
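
The search scenario (a code snippet as input, answered from a pre-built index) can be sketched as a small inverted index over normalized statements. Everything here is an assumption for illustration: the `CloneIndex` class, its methods, the whitespace-only normalization, and the `min_coverage` threshold are hypothetical and do not describe the service deployed at MSRC.

```python
import re
from collections import defaultdict


def normalize(statement):
    """Remove all whitespace so formatting differences do not matter."""
    return re.sub(r"\s+", "", statement)


class CloneIndex:
    """Hypothetical inverted index: normalized statement -> source locations."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, source):
        """Index every non-empty statement (one per line) of a source unit."""
        for line in source.splitlines():
            key = normalize(line)
            if key:
                self._postings[key].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best coverage first.

        Coverage is the fraction of distinct snippet statements that
        also occur at the location.
        """
        keys = {normalize(line) for line in snippet.splitlines()}
        keys.discard("")
        hits = defaultdict(int)
        for key in keys:
            for location in self._postings.get(key, ()):
                hits[location] += 1
        results = [
            (location, count / len(keys))
            for location, count in hits.items()
            if count / len(keys) >= min_coverage
        ]
        return sorted(results, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    index = CloneIndex()
    index.add("productA/parse.c", "n = read_len(p);\nmemcpy(dst, p, n);\nreturn 0;")
    index.add("productB/parse.c", "n = read_len(p);\nmemcpy(dst, p, n);\nlog(n);\nreturn 0;")
    for location, coverage in index.search("n = read_len(p);\nmemcpy(dst, p, n);"):
        print(location, coverage)
```

Building the index once and answering each query with dictionary lookups is what makes a near-real-time response plausible at large scale; a production system would also need fuzzy statement matching and ranking, which this sketch omits.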

Vulnerability investigation workflow

Stages: issue reproducing → root cause investigation & source location → variants finding → design/implement/test fix

Manual & ad hoc investigation
• MSRC sends a code snippet to Team A, Team B, and Team C
• The teams report code clones back

Automated investigation
• The code snippet goes to the clone search service, which returns code clones
• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in a security context
• Code clone search service integrated into the vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

• Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
• Involved three publicly disclosed and seven privately reported vulnerabilities; one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration


Page 45: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Research topics amp technology pillars

Microsoft Confidential

Software Development

Process

Software Systems

Software Users

Information Visualization

Analysis Algorithms

Large-scale Computing

Research Topics Technology Pillars

Vertical

Horizontal

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

Connection to practice

MSR 2012 39

bull Software Analytics is naturally tied with software development practice

bull Getting real

RealData

RealProblems

RealUsers

RealTools

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does “getting real” mean?

• Making real impact
• Building real technologies
• Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development
  • Early adoption
  • Technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.
http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process
  1. Pre-processing
  2. Coarse matching
  3. Fine matching
  4. Pruning
• Easily parallelizable based on source code partition
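The four steps can be illustrated with a toy pipeline. This is a minimal sketch under stated assumptions, not XIAO's implementation: the function names, the statement-splitting rule, the `min_shared` and `threshold` parameters, and the use of Python's `difflib` for fine matching are all hypothetical choices made only to show how a cheap coarse step can limit the work of a costlier fine step, and how pre-processing can be mapped over independent partitions of the source.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(source):
    """Step 1: split code into statements and strip whitespace."""
    parts = (re.sub(r"\s+", "", p) for p in re.split(r"[;{}]", source))
    return [p for p in parts if p]


def coarse_match(functions, min_shared=2):
    """Step 2: cheap candidate pairs that share enough identical statements."""
    owners = defaultdict(set)
    for name, stmts in functions.items():
        for stmt in set(stmts):
            owners[stmt].add(name)
    shared = defaultdict(int)
    for names in owners.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return [pair for pair, count in shared.items() if count >= min_shared]


def fine_match(functions, candidates):
    """Step 3: costlier sequence comparison, run on candidate pairs only."""
    return {
        (a, b): SequenceMatcher(None, functions[a], functions[b]).ratio()
        for a, b in candidates
    }


def prune(scored, threshold):
    """Step 4: discard pairs that score below the threshold."""
    return {pair: s for pair, s in scored.items() if s >= threshold}


def detect_clones(sources, threshold=0.6, workers=4):
    names = list(sources)
    # Pre-processing is independent per source unit, so it can be mapped
    # over partitions of the code base in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = list(pool.map(preprocess, [sources[n] for n in names]))
    functions = dict(zip(names, parsed))
    return prune(fine_match(functions, coarse_match(functions)), threshold)


# Hypothetical inputs: two near-duplicate loops and one unrelated function.
SOURCES = {
    "first": "for (i = 0; i < n; i++) { a++; b++; c = foo(a, b); "
             "d = bar(a, b, c); e = a + c; }",
    "second": "for (i = 0; i < n; i++) { c = foo(a, b); a++; b++; "
              "d = bar(a, b, c); e = a + d; e++; }",
    "third": "x = read(); print(x);",
}
CLONES = detect_clones(SOURCES)
```

Only the pair that survives coarse matching is ever passed to the sequence comparison, which is the point of ordering the steps from cheap to expensive.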

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • % of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

    for (i = 0; i < n; i++) {
        a++;
        b++;
        c = foo(a, b);
        d = bar(a, b, c);
        e = a + c;
    }

    for (i = 0; i < n; i++) {
        c = foo(a, b);
        a++;
        b++;
        d = bar(a, b, c);
        e = a + d;
        e++;
    }
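The three tuning knobs on this slide can be made concrete with a small sketch. The formula below is an assumption made for illustration, not the metric defined in the XIAO paper: `alpha` is a hypothetical statement-similarity threshold, `beta` a hypothetical tolerated share of inserted/deleted/modified statements, and `gamma` a hypothetical weight for statements that match only when order is ignored.

```python
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, s1.split(), s2.split()).ratio()


def ordered_matches(a, b, same):
    """Longest common subsequence length under a fuzzy equality test."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if same(x, y) else max(prev[j], cur[-1]))
        prev = cur
    return prev[-1]


def unordered_matches(a, b, same):
    """Greedy one-to-one matching that ignores statement order."""
    remaining = list(b)
    count = 0
    for x in a:
        for y in remaining:
            if same(x, y):
                remaining.remove(y)
                count += 1
                break
    return count


def clone_similarity(block1, block2, alpha=0.75, gamma=0.5):
    """Score two statement lists; gamma discounts out-of-order matches."""
    total = len(block1) + len(block2)
    if total == 0:
        return 1.0

    def same(x, y):
        return statement_similarity(x, y) >= alpha

    in_order = ordered_matches(block1, block2, same)
    out_of_order = max(0, unordered_matches(block1, block2, same) - in_order)
    return 2 * (in_order + gamma * out_of_order) / total


def is_clone(block1, block2, alpha=0.75, beta=0.3, gamma=0.5):
    """Accept the pair if at most a share beta of statements differ."""
    return clone_similarity(block1, block2, alpha, gamma) >= 1 - beta


# The two loops from the slide, one statement per entry.
LEFT = ["for (i = 0; i < n; i++)", "a++", "b++",
        "c = foo(a, b)", "d = bar(a, b, c)", "e = a + c"]
RIGHT = ["for (i = 0; i < n; i++)", "c = foo(a, b)", "a++", "b++",
         "d = bar(a, b, c)", "e = a + d", "e++"]
```

With these inputs, raising `alpha` stops `e = a + c` from matching `e = a + d`, raising `gamma` gives more credit to the reordered `c = foo(a, b)`, and lowering `beta` makes the inserted `e++` count against the pair.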


Explorability

How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

[Figure: timeline with the labels “Initial vulnerability reported in product A”, “Patch release of product B”, “Potential 0day attack”, “Security bulletin released”, and “Similar vulnerability found in product B by attackers”]

Tech transfer to MSRC (Microsoft Security Response Center)

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
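The search scenario differs from batch detection in that a code snippet is the query and the answer must come back quickly from a pre-built index. The sketch below illustrates that idea with a simple inverted index from normalized statements to locations. It is an assumption-laden toy, not the service deployed at MSRC: the `CloneIndex` class, the `min_coverage` parameter, the normalization rule, and the file paths are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def normalize(code):
    """Split code into statements and strip all whitespace."""
    parts = (re.sub(r"\s+", "", p) for p in re.split(r"[;{}\n]", code))
    return [p for p in parts if p]


class CloneIndex:
    """Inverted index: normalized statement -> locations containing it."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, location, code):
        # Indexing happens ahead of time, once per code unit.
        for stmt in normalize(code):
            self._postings[stmt].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Rank locations by the share of snippet statements they contain."""
        stmts = set(normalize(snippet))
        hits = Counter()
        for stmt in stmts:
            for location in self._postings.get(stmt, ()):
                hits[location] += 1
        ranked = [(loc, n / len(stmts)) for loc, n in hits.most_common()]
        return [(loc, cov) for loc, cov in ranked if cov >= min_coverage]


# Hypothetical code units from three products.
INDEX = CloneIndex()
INDEX.add("productA/parse.c",
          "n = read_len(p); buf = alloc(n); copy(buf, p, n);")
INDEX.add("productB/render.c",
          "n = read_len(p); buf = alloc(n + 1); copy(buf, p, n); log(n);")
INDEX.add("productC/ui.c", "draw(w); n = read_len(p);")

RESULTS = INDEX.search("n = read_len(p); buf = alloc(n); copy(buf, p, n);")
```

A query touches only the posting lists of its own statements, so response time depends on the snippet size rather than on the size of the indexed code base, which is what makes a near-real-time answer plausible at large scale.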

Vulnerability investigation workflow

[Figure: workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Variants finding is a manual & ad hoc investigation: a code snippet goes from MSRC to Team A, Team B, and Team C, and code clones come back.]

Vulnerability investigation workflow

[Figure: the same workflow stages, with automated investigation: a code snippet goes to the clone search service and code clones come back. Callouts: “Completeness is the key”; “Web service API for automation”.]

More secure Microsoft products

• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed and seven privately reported vulnerabilities involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching for similar snippets so that a bug is fixed once, everywhere it occurs
• Finding refactoring opportunities

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 48: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Connection to practice

• Software Analytics is naturally tied with software development practice

• Getting real: real data, real problems, real users, real tools

Creating real impact

Code Clone Analysis [Dang et al., ACSAC'12]

• Detecting near-duplicated code
• Released with Visual Studio 2012

StackMine [Han et al., ICSE'12]

• Performance debugging in the large via mining millions of stack traces
• Helping improve Windows performance

http://research.microsoft.com/jump/175199
http://dl.acm.org/citation.cfm?id=2337241
http://research.microsoft.com/en-us/news/features/softwareanalytics-052013.aspx

Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are "impatient"…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak "different languages"…
• More…

What does "getting real" mean?

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

Example project - XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: pre-processing, coarse matching, fine matching, pruning
• Easily parallelizable based on source code partition
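The slide names the four steps but not what each one does, so the following is only a minimal sketch of how such a pipeline could be wired together, not XIAO's actual algorithm. Everything in it is an assumption made for illustration: the function names, the identifier normalization, the inverted-index candidate filter, the Jaccard score, and the thresholds. Pre-processing runs per function, which is the kind of source-partitioned work that can be spread over workers.

```python
import re
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

# Hypothetical keyword list and tokenizer, chosen only for this sketch.
KEYWORDS = {"for", "if", "else", "while", "return", "int"}
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def preprocess(function_body):
    """Step 1: turn each statement into a normalized token string."""
    statements = []
    for line in function_body.splitlines():
        tokens = TOKEN.findall(line)
        if not tokens:
            continue
        # Replace every identifier with ID so renamed variables still match.
        norm = ["ID" if (t[0].isalpha() or t[0] == "_") and t not in KEYWORDS else t
                for t in tokens]
        statements.append(" ".join(norm))
    return statements


def coarse_match(corpus, min_shared=2):
    """Step 2: cheap candidate pairs via an inverted index of statements."""
    index = defaultdict(set)
    for name, stmts in corpus.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


def fine_match(a, b):
    """Step 3: order-insensitive overlap (Jaccard) of normalized statements."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)


def prune(scored, corpus, min_similarity=0.6, min_statements=3):
    """Step 4: drop weak or trivially small clone pairs."""
    return [(pair, score) for pair, score in scored
            if score >= min_similarity
            and min(len(corpus[pair[0]]), len(corpus[pair[1]])) >= min_statements]


def detect_clones(functions, workers=4):
    """Run the four steps; pre-processing is parallel over the partition."""
    names = list(functions)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = list(pool.map(preprocess, (functions[n] for n in names)))
    corpus = dict(zip(names, processed))
    candidates = coarse_match(corpus)
    scored = [(p, fine_match(corpus[p[0]], corpus[p[1]])) for p in candidates]
    return prune(scored, corpus)
```

The coarse step exists so that the more expensive fine comparison only runs on pairs that already share some statements. For CPU-bound work a process pool would be the more natural choice than threads.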

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
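To make the tuning idea concrete, here is a small sketch that scores the two loops from this slide. It is not the metric defined in the XIAO paper. The names `statement_similarity`, `snippet_similarity`, `stmt_threshold`, and `max_shift` are hypothetical, and the character-level `difflib` ratio is a stand-in for whatever statement similarity the real tool uses. One knob sets how different two statements may be and still count as a match; the other sets how far a statement may move, which loosely mirrors the balance between code structure and disordered statements.

```python
from difflib import SequenceMatcher


def normalize(stmt):
    """Drop all whitespace so formatting differences do not count."""
    return "".join(stmt.split())


def statement_similarity(a, b):
    """Character-level similarity of two statements, in [0, 1]."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def snippet_similarity(left, right, stmt_threshold=0.8, max_shift=2):
    """Share of statements that can be paired up between two snippets.

    stmt_threshold: minimum similarity for two statements to match.
    max_shift: how many positions a statement may move and still match.
    """
    used = set()
    matched = 0
    for i, stmt in enumerate(left):
        best_j, best_score = None, 0.0
        for j, other in enumerate(right):
            if j in used or abs(i - j) > max_shift:
                continue
            score = statement_similarity(stmt, other)
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None and best_score >= stmt_threshold:
            used.add(best_j)
            matched += 1
    # Unmatched statements on either side count as inserted or deleted.
    return matched / max(len(left), len(right))


left = [
    "for (i = 0; i < n; i++)",
    "a++;",
    "b++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]
right = [
    "for (i = 0; i < n; i++)",
    "c = foo(a, b);",
    "a++;",
    "b++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e++;",
]

loose = snippet_similarity(left, right)                # tolerant settings
strict = snippet_similarity(left, right, max_shift=0)  # no reordering allowed
```

With the tolerant defaults, the reordered statements and the modified `e = a + d;` still pair up, and only the inserted `e++;` is left over. Setting `max_shift=0` or raising `stmt_threshold` makes the same pair look much less similar, which is the effect an engineer would be tuning for.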


Explorability

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models


Pull

Push

Join

Communication - getting connected

• Reaching out to practitioners
  • Understanding their business
  • Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication - forming partnership

• Finding and defining shared goals
• Setting the right expectations
• Building a roadmap
• Forming a virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-ups

Example project - XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost of code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0-day vulnerability disclosure

1. Initial vulnerability reported in product A
2. Security bulletin released
3. Similar vulnerability found in product B by attackers
4. Potential 0-day attack
5. Patch release of product B

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

MSRC: Microsoft Security Response Center
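The search scenario differs from batch detection in that the codebases are indexed once and each query is a single snippet. A minimal sketch of that shape follows, assuming a token k-gram inverted index. The class name `CloneSearchIndex`, the parameters `k` and `min_coverage`, and the file paths in the usage note are all hypothetical, and nothing here is taken from the deployed MSRC service.

```python
import re
from collections import defaultdict

# Hypothetical tokenizer: identifiers, numbers, and single punctuation marks.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=4):
    """Set of all runs of k consecutive tokens in the code."""
    tokens = TOKEN.findall(code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneSearchIndex:
    """Inverted index from token k-grams to the locations containing them."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, location, code):
        """Index one unit of code (done once, ahead of any query)."""
        for gram in shingles(code, self.k):
            self.postings[gram].add(location)

    def search(self, snippet, min_coverage=0.5):
        """Return (location, coverage) pairs, best coverage first.

        Coverage is the share of the snippet's k-grams found at a location.
        """
        grams = shingles(snippet, self.k)
        hits = defaultdict(int)
        for gram in grams:
            for location in self.postings.get(gram, ()):
                hits[location] += 1
        results = [(loc, n / len(grams)) for loc, n in hits.items()
                   if n / len(grams) >= min_coverage]
        return sorted(results, key=lambda r: (-r[1], r[0]))
```

A usage sketch: after `index.add("productA/font.c", ...)` and `index.add("productB/render.c", ...)`, calling `index.search("memcpy(dst, src, len);")` returns every indexed location that contains most of the snippet's token runs. Because a query only touches the posting lists of its own k-grams, the cost of a lookup depends on the snippet rather than on the total indexed size, which is what makes near-real-time response plausible at large scale.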

Vulnerability investigation workflow

1. Issue reproducing
2. Root cause investigation & source location
3. Variants finding
4. Design/Implement/Test fix

Variants finding by manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, and gets code clones back.

Vulnerability investigation workflow (with clone search service)

1. Issue reproducing
2. Root cause investigation & source location
3. Variants finding
4. Design/Implement/Test fix

Variants finding by automated investigation: a code snippet goes to the clone search service, which returns code clones.

• Completeness is the key
• Web service API for automation

More secure Microsoft products

Automated laborious manual efforts

Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example - MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching for similar snippets so that a bug is fixed once, everywhere

Finding refactoring opportunities

Summary

• Mindset change needed for the community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 49: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Creating real impact

Code Clone Analysis [Dang et al ACSACrsquo12]

bull Detecting near-duplicated code

bull Released with Visual Studio 2012

StackMine [Han et al ICSErsquo12]

bull Performance debugging in the large

via mining millions of stack traces

bull Helping improve Windows performance

httpresearchmicrosoftcomjump175199httpdlacmorgcitationcfmid=2337241

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

bull Demonstrating XIAO at TechFest

bull Posting XIAO at internal website

bull Active ldquosellingrdquo to various teams

bull What we gainedbull Opportunities to run XIAO on different codebases and

produce rich results

bull Feedback to improve both algorithm and system

bull Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

bull What did not land well internallybull Wide interest but no concrete takers

bull Why no takersbull What exactly is the valuable proposition

bull Long way to go from code clones to bugs

bull High cost for code refactoring

bull Product prioritization

bull Lessons learnedbull Killer scenarios needed for value proposition

bull Security is a big stick

ICSE 2013 SEIP 57

Potential 0day vulnerability disclosure

ICSE 2013 SEIP 58

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

bull Search scenario vs detection scenariobull Code snippet as input

bull Much larger scale of codebases

bull Near-real-time response

bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases

bull Deployed in used by and transferred to MSRC

bull Champion in MSRC worked with us all the waybull Providing feedback and update

bull Prompting within MSRC

ICSE 2013 SEIP 59

Microsoft Security Response Center

Vulnerability investigation workflow

ICSE 2013 SEIP 60

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

Team A

MSRC

Manual amp ad hoc investigation

Code snippet

Team B

Team C

Code clones

Vulnerability investigation workflow

ICSE 2013 SEIP 61

Clone search service

Completeness is the key Web service API for

automation

Code snippet

Code clones

Automated Investigation

Code snippet

Code clones

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

More secure Microsoft products

ICSE 2013 SEIP 62

Automated laborious manual efforts

Faster response time critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified

and addressed

Example ndash MS security bulletin MS12-034Combined security update for Microsoft Office Windows NET Framework and Silverlight published Tuesday May 08 2012

3 publicly disclosed vulnerabilities and seven privately reported involved Specifically one is exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32ksys

Cloned copy in gdiplusdll ogldll (office) Silverlight and Windows Journal viewer

Microsoft Security Research amp Defense Blog about this bulletin

ldquoHowever we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base To that end we have been working with Microsoft Research to develop a ldquoCloned Code Detectionrdquo system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034rdquo

ICSE 2013 SEIP 63httpblogstechnetcombsrdarchive20120508ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surfaceaspx

Transfer to Visual Studio (1)

bull Unsuccessful effortsbull Out-Of-Band (OOB) releasebull Power Tool

bull Two reorgs in Visual Studio

bull Lessons learnedbull No integration story felt like a ldquoseparaterdquo toolbull Not on the release path of VS

bull Accumulated assetsbull Solidified algorithm and systembull Trusted partners

bull One program manager in VSbull MSRA Innovation Engineering Group

ICSE 2013 SEIP 64

Transfer to Visual Studio (2)

bull Third timersquos the charmbull Strong support from general manager of VSUbull Concrete scenarios identifiedbull Easy sell at VS 2012 planning meeting

bull Virtual teambull Researchers (MSRA SA)bull Developers (MSRA IEG VS)bull Program manager (VS)bull Tester (VS)

bull Active planning as part of VS 2012 release

bull Weekly sync-up

bull Timely feedback from VS partners

ICSE 2013 SEIP 65

Benefiting developer community

ICSE 2013 SEIP 66

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

bull Mindset changing needed for community bull Get out of comfort zone

bull Value (and pursue) ldquorealnessrdquo

bull Aim for ultimate tasks

bull Value (and pursue) tech readiness

bull Experience sharing of successful tech-transfer on Software Analyticsbull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 79

Page 50: Pathways to Technology Transfer and Adoption: Achievements and Challenges

httpresearchmicrosoftcomen-usnewsfeaturessoftwareanalytics-052013aspx

Experience sharing

bull Getting-real mindset

bull Technical readiness

bull Collaboration

ICSE 2013 SEIP 42

Real world is not that prettyhellip

bull Data is incomplete and noisyhellipbull The scale of data is hugehellipbull We do not have all the time in the world to computehellipbull The machines are not powerful enoughhellipbull End users are ldquoimpatientrdquohellipbull Product teams are always busyhellipbull Product teams do not commit before seeing everything

workinghellipbull Product teams change plans and prioritieshellipbull Product teams speak ldquodifferent languagesrdquohellipbull More hellip

ICSE 2013 SEIP 43

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

bull Demonstrating XIAO at TechFest

bull Posting XIAO at internal website

bull Active ldquosellingrdquo to various teams

bull What we gainedbull Opportunities to run XIAO on different codebases and

produce rich results

bull Feedback to improve both algorithm and system

bull Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

bull What did not land well internallybull Wide interest but no concrete takers

bull Why no takersbull What exactly is the valuable proposition

bull Long way to go from code clones to bugs

bull High cost for code refactoring

bull Product prioritization

bull Lessons learnedbull Killer scenarios needed for value proposition

bull Security is a big stick

ICSE 2013 SEIP 57

Potential 0day vulnerability disclosure

ICSE 2013 SEIP 58

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

bull Search scenario vs detection scenariobull Code snippet as input

bull Much larger scale of codebases

bull Near-real-time response

bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases

bull Deployed in used by and transferred to MSRC

bull Champion in MSRC worked with us all the waybull Providing feedback and update

bull Prompting within MSRC

ICSE 2013 SEIP 59

Microsoft Security Response Center

Vulnerability investigation workflow

ICSE 2013 SEIP 60

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

Team A

MSRC

Manual amp ad hoc investigation

Code snippet

Team B

Team C

Code clones

Vulnerability investigation workflow

ICSE 2013 SEIP 61

Clone search service

Completeness is the key Web service API for

automation

Code snippet

Code clones

Automated Investigation

Code snippet

Code clones

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

More secure Microsoft products

ICSE 2013 SEIP 62

Automated laborious manual efforts

Faster response time critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified

and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copies in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

ICSE 2013 SEIP 63

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

ICSE 2013 SEIP 64

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

ICSE 2013 SEIP 65

Benefiting developer community

ICSE 2013 SEIP 66

Searching for similar snippets so a bug is fixed once, everywhere

Finding refactoring opportunities

Summary

• Mindset change needed in the community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

ICSE 2013 SEIP 79


Experience sharing

• Getting-real mindset
• Technical readiness
• Collaboration

ICSE 2013 SEIP 42

Real world is not that pretty…

• Data is incomplete and noisy…
• The scale of data is huge…
• We do not have all the time in the world to compute…
• The machines are not powerful enough…
• End users are “impatient”…
• Product teams are always busy…
• Product teams do not commit before seeing everything working…
• Product teams change plans and priorities…
• Product teams speak “different languages”…
• More…

ICSE 2013 SEIP 43

What does “getting real” mean?

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-benefit analysis
• Walking the last mile

ICSE 2013 SEIP 45

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact
  • Prototype development
  • Early adoption
  • Technology transfer

ICSE 2013 SEIP 46

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing → Coarse Matching → Fine Matching → Pruning
• Easily parallelizable based on source code partition

ICSE 2013 SEIP 47
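The four steps and the partition-based parallelism can be sketched as follows. This is a minimal illustration, not XIAO's implementation: the slides only name the steps, so the token-set overlap used for coarse matching, the `difflib` ratio used for fine matching, the size-based pruning rule, and all thresholds are assumptions chosen for brevity.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from itertools import combinations_with_replacement


def preprocess(functions):
    """Step 1: map each function id to its token list."""
    return {fid: re.findall(r"\w+|\S", code) for fid, code in functions.items()}


def coarse_match(a, b, threshold=0.5):
    """Step 2: cheap filter on the overlap of distinct tokens (assumed heuristic)."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return bool(union) and len(sa & sb) / len(union) >= threshold


def fine_match(a, b):
    """Step 3: costlier, order-aware similarity in [0, 1]."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def detect_in_pair(part_a, part_b, min_sim, min_tokens):
    """Run steps 2-4 on one pair of partitions; partition pairs are independent."""
    found = []
    for fa, ta in part_a.items():
        for fb, tb in part_b.items():
            if part_a is part_b and fa >= fb:
                continue  # within one partition, visit each unordered pair once
            if not coarse_match(ta, tb):
                continue
            sim = fine_match(ta, tb)
            if sim < min_sim:
                continue
            if min(len(ta), len(tb)) < min_tokens:
                continue  # Step 4: prune clones too small to matter (assumed rule)
            found.append((min(fa, fb), max(fa, fb), round(sim, 2)))
    return found


def detect_clones(functions, n_parts=2, min_sim=0.7, min_tokens=5):
    """Partition the functions, then process every pair of partitions in parallel."""
    tokens = preprocess(functions)
    ids = sorted(tokens)
    parts = [{fid: tokens[fid] for fid in ids[i::n_parts]} for i in range(n_parts)]
    jobs = combinations_with_replacement(parts, 2)
    with ThreadPoolExecutor() as pool:
        batches = pool.map(lambda job: detect_in_pair(*job, min_sim, min_tokens), jobs)
        return sorted(pair for batch in batches for pair in batch)
```

Each job only reads two partitions and shares no mutable state, so the same structure could be spread over processes or machines; a thread pool is used here only to keep the sketch self-contained.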

What you tune is what you get

MSR 2012 48

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++) {
    a++;
    b++;
    c = foo(a, b);
    d = bar(a, b, c);
    e = a + c;
}

for (i = 0; i < n; i++) {
    c = foo(a, b);
    a++;
    b++;
    d = bar(a, b, c);
    e = a + d;
    e++;
}
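One way to read the tunable parameters on this slide is as a statement-level matching score. The sketch below is an assumption-laden illustration, not XIAO's metric: `statement_similarity`, `snippet_similarity`, and the greedy order-insensitive matching are hypothetical, and the threshold `stmt_threshold` stands in for the "statement similarity" knob.

```python
import re
from difflib import SequenceMatcher


def statement_similarity(s1, s2):
    """Token-level similarity of two statements, in [0, 1]."""
    t1 = re.findall(r"\w+|\S", s1)
    t2 = re.findall(r"\w+|\S", s2)
    return SequenceMatcher(None, t1, t2, autojunk=False).ratio()


def snippet_similarity(a, b, stmt_threshold=0.8):
    """Fraction of statements in both snippets that have a counterpart.

    Statements are matched greedily and regardless of position, so reordered
    (disordered) statements still count; unmatched statements play the role of
    insertions and deletions.
    """
    if not a and not b:
        return 1.0
    unmatched = list(b)
    matched = 0
    for stmt in a:
        best = max(unmatched, key=lambda other: statement_similarity(stmt, other),
                   default=None)
        if best is not None and statement_similarity(stmt, best) >= stmt_threshold:
            unmatched.remove(best)
            matched += 1
    return 2 * matched / (len(a) + len(b))


# The two loops from the slide, one statement per list element.
left = ["for (i = 0; i < n; i++)", "a++;", "b++;",
        "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"]
right = ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
         "d = bar(a, b, c);", "e = a + d;", "e++;"]
```

With an exact-match threshold of 1.0, five statements pair up and the score is 10/13; lowering the threshold to 0.8 also pairs the modified statement (`e = a + c;` with `e = a + d;`) and the score rises to 12/13, which is the kind of effect "what you tune is what you get" describes.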


Explorability

ICSE 2013 SEIP 49

How to navigate through the large number of detected clones?

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones?

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging
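Folder-level statistics over a source tree can be computed by crediting each cloned function to every ancestor folder of its file. The following is a small sketch of that idea only; the function name `folder_clone_counts` and the example paths are hypothetical, and the slides do not say how XIAO computes its statistics.

```python
from collections import Counter
from pathlib import PurePosixPath


def folder_clone_counts(clone_function_paths):
    """Count cloned functions under every ancestor folder of each file path."""
    counts = Counter()
    for path in clone_function_paths:
        # parents yields e.g. "src/gfx", "src", "." for "src/gfx/font.c"
        for folder in PurePosixPath(path).parents:
            counts[str(folder)] += 1
    return counts


counts = folder_clone_counts(["src/gfx/font.c", "src/gfx/glyph.c", "src/net/http.c"])
# counts.most_common() lists folders from most to fewest cloned functions,
# which supports drilling down the hierarchy toward the densest folders.
```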

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

ICSE 2013 SEIP 52

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

ICSE 2013 SEIP 53

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

ICSE 2013 SEIP 54

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

ICSE 2013 SEIP 57



Page 53: Pathways to Technology Transfer and Adoption: Achievements and Challenges

What does ldquogetting realrdquo mean

ICSE 2013 SEIP 44

Making real impact

Building real technologies

Solving real problems

Software engineering is naturally tied with software development practice

Technical readiness

bull Assumptions

bull Scalability

bull Complexity

bull Usability

bull Cost-Benefit Analysis

bull Walking last mile

ICSE 2013 SEIP 45

Example project ndash XIAO

bull Token-based code clone analysis technique

bull Characteristics

bull Technology transfersbull Three-year journey fromVisual Studio 2012

bull Code clone search service within Microsoft

bull research to impact

ICSE 2013 SEIP 46

curren High tunability curren High scalability

curren High compatibility curren High explorability

Prototype development

Early adoptionTechnology

transfer

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Scalability

bull Four-step analysis process

bull Easily parallelizable based on source code partition

ICSE 2013 SEIP 47

Pre-processingCoarse

Matching

Fine MatchingPruning

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication ndash forming partnership

bull Finding and defining shared goals

bull Setting the right expectation

bull Building a roadmap

bull Forming virtual team (creating an email alias)

bull Adopting a milestone approach

bull Conducting regular sync-up

ICSE 2013 SEIP 53

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

bull Demonstrating XIAO at TechFest

bull Posting XIAO at internal website

bull Active ldquosellingrdquo to various teams

bull What we gainedbull Opportunities to run XIAO on different codebases and

produce rich results

bull Feedback to improve both algorithm and system

bull Expanded network

ICSE 2013 SEIP 56

Reaching out (2)

bull What did not land well internallybull Wide interest but no concrete takers

bull Why no takersbull What exactly is the valuable proposition

bull Long way to go from code clones to bugs

bull High cost for code refactoring

bull Product prioritization

bull Lessons learnedbull Killer scenarios needed for value proposition

bull Security is a big stick

ICSE 2013 SEIP 57

Potential 0day vulnerability disclosure

ICSE 2013 SEIP 58

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

bull Search scenario vs detection scenariobull Code snippet as input

bull Much larger scale of codebases

bull Near-real-time response

bull Code clone search servicebull Indexed ~600 million LOC across multiple codebases

bull Deployed in used by and transferred to MSRC

bull Champion in MSRC worked with us all the waybull Providing feedback and update

bull Prompting within MSRC

ICSE 2013 SEIP 59

Microsoft Security Response Center

Vulnerability investigation workflow

ICSE 2013 SEIP 60

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

Team A

MSRC

Manual amp ad hoc investigation

Code snippet

Team B

Team C

Code clones

Vulnerability investigation workflow

ICSE 2013 SEIP 61

Clone search service

Completeness is the key Web service API for

automation

Code snippet

Code clones

Automated Investigation

Code snippet

Code clones

DesignImplementTest fix

Variants finding

Root cause investigation amp

source location

Issue reproducing

More secure Microsoft products

ICSE 2013 SEIP 62

Automated laborious manual efforts

Faster response time critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified

and addressed

Example – MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

Source: http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
  • Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Technical readiness

• Assumptions
• Scalability
• Complexity
• Usability
• Cost-Benefit Analysis
• Walking the last mile

Example project – XIAO

• Token-based code clone analysis technique
• Characteristics
  • High tunability
  • High scalability
  • High compatibility
  • High explorability
• Technology transfers
  • Visual Studio 2012
  • Code clone search service within Microsoft
• Three-year journey from research to impact: prototype development, early adoption, technology transfer

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Scalability

• Four-step analysis process: Pre-processing, Coarse Matching, Fine Matching, Pruning
• Easily parallelizable based on source code partition
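The four steps named on this slide can be illustrated with a toy pipeline. This is only a sketch under stated assumptions, not XIAO's implementation: the function names, the whitespace-only normalization, the shared-statement index, the use of Python's difflib for scoring, and the 0.6 threshold are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def preprocess(functions):
    """Step 1: normalize every statement (here: just strip all whitespace)."""
    return {name: ["".join(stmt.split()) for stmt in body]
            for name, body in functions.items()}


def coarse_match(normalized, min_shared=2):
    """Step 2: cheap candidate search via an inverted index of statements."""
    index = defaultdict(set)
    for name, stmts in normalized.items():
        for stmt in set(stmts):
            index[stmt].add(name)
    shared = defaultdict(int)
    for names in index.values():
        for pair in combinations(sorted(names), 2):
            shared[pair] += 1
    return sorted(pair for pair, count in shared.items() if count >= min_shared)


def fine_match(normalized, candidates):
    """Step 3: order-aware similarity score for each candidate pair."""
    scored = []
    for left, right in candidates:
        matcher = SequenceMatcher(None, normalized[left], normalized[right],
                                  autojunk=False)
        scored.append((left, right, matcher.ratio()))
    return scored


def prune(scored, threshold=0.6):
    """Step 4: drop pairs whose score falls below the chosen threshold."""
    return [item for item in scored if item[2] >= threshold]


def detect_clones(functions, threshold=0.6):
    """Run the four steps in order on a name -> statements mapping."""
    normalized = preprocess(functions)
    return prune(fine_match(normalized, coarse_match(normalized)), threshold)


# Hypothetical input: function name -> list of statements.
FUNCTIONS = {
    "f": ["for (i = 0; i < n; i++)", "a++;", "b++;",
          "c = foo(a, b);", "d = bar(a, b, c);", "e = a + c;"],
    "g": ["for (i = 0; i < n; i++)", "c = foo(a, b);", "a++;", "b++;",
          "d = bar(a, b, c);", "e = a + d;", "e++;"],
    "h": ["x = 1;", "return x;"],
}

if __name__ == "__main__":
    for left, right, score in detect_clones(FUNCTIONS):
        print(f"{left} ~ {right}: {score:.2f}")
```

In this sketch the pre-processing step touches one function at a time, so it could in principle be run separately on each partition of the source tree before the results are merged; how XIAO actually partitions the work is not stated on the slide.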

What you tune is what you get

• Intuitive similarity metric
  • Effective control of the degree of syntactical differences between two code snippets
• Tunable at fine granularity
  • Statement similarity
  • Amount of inserted/deleted/modified statements
  • Balance between code structure and disordered statements

for (i = 0; i < n; i++)
a++;
b++;
c = foo(a, b);
d = bar(a, b, c);
e = a + c;

for (i = 0; i < n; i++)
c = foo(a, b);
a++;
b++;
d = bar(a, b, c);
e = a + d;
e++;
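To make the tuning idea concrete, the two snippets on this slide can be compared with a small, adjustable score. The sketch below is an assumption-laden illustration, not XIAO's metric: `ordered_similarity`, `unordered_similarity`, `order_weight`, and `threshold` are hypothetical names, and the scoring relies on Python's difflib rather than on anything described in the talk.

```python
from collections import Counter
from difflib import SequenceMatcher

SNIPPET_A = [
    "for (i = 0; i < n; i++)",
    "a++;",
    "b++;",
    "c = foo(a, b);",
    "d = bar(a, b, c);",
    "e = a + c;",
]
SNIPPET_B = [
    "for (i = 0; i < n; i++)",
    "c = foo(a, b);",
    "a++;",
    "b++;",
    "d = bar(a, b, c);",
    "e = a + d;",
    "e++;",
]


def normalize(stmt):
    """Ignore whitespace differences between statements."""
    return "".join(stmt.split())


def ordered_similarity(a, b):
    """Share of statements that match while keeping their relative order."""
    matcher = SequenceMatcher(None, [normalize(s) for s in a],
                              [normalize(s) for s in b], autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return 2 * matched / (len(a) + len(b))


def unordered_similarity(a, b):
    """Share of statements that match when statement order is ignored."""
    count_a = Counter(normalize(s) for s in a)
    count_b = Counter(normalize(s) for s in b)
    shared = sum((count_a & count_b).values())
    return 2 * shared / (len(a) + len(b))


def is_clone(a, b, threshold, order_weight):
    """order_weight=1.0 respects code structure; 0.0 tolerates reordering."""
    score = (order_weight * ordered_similarity(a, b)
             + (1 - order_weight) * unordered_similarity(a, b))
    return score >= threshold


if __name__ == "__main__":
    print(ordered_similarity(SNIPPET_A, SNIPPET_B))
    print(unordered_similarity(SNIPPET_A, SNIPPET_B))
    print(is_clone(SNIPPET_A, SNIPPET_B, threshold=0.7, order_weight=1.0))
    print(is_clone(SNIPPET_A, SNIPPET_B, threshold=0.7, order_weight=0.0))
```

With the same threshold, the pair is rejected when only order-preserving matches count and accepted when reordered statements are tolerated, which is the kind of trade-off the "balance between code structure and disordered statements" bullet describes.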

Explorability

How to navigate through the large number of detected clones
1. Clone navigation based on source tree hierarchy
2. Pivoting of folder level statistics
3. Folder level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones
1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners’ languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

[Timeline diagram labels: Initial vulnerability reported in product A; Security bulletin released; Similar vulnerability found in product B by attackers; Potential 0day attack; Patch release of product B.]

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
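The search scenario differs from batch detection in that the codebases are indexed once and each query is a single code snippet. The sketch below shows why a prebuilt index makes near-real-time lookup plausible. It is a hypothetical illustration only: the class `CloneSearchIndex`, its methods, the file paths, and the sample statements are invented here and do not describe the MSRC service or its web API.

```python
from collections import defaultdict


class CloneSearchIndex:
    """Toy inverted index: normalized statement -> set of (path, line number)."""

    def __init__(self):
        self._postings = defaultdict(set)

    @staticmethod
    def _normalize(line):
        # Ignore whitespace so formatting differences do not hide a match.
        return "".join(line.split())

    def add_file(self, path, lines):
        """Index one source file ahead of time (the expensive, offline part)."""
        for line_no, line in enumerate(lines, start=1):
            key = self._normalize(line)
            if key:
                self._postings[key].add((path, line_no))

    def search(self, snippet_lines, min_hits=2):
        """Return (path, shared statement count), best matches first."""
        hits = defaultdict(int)
        for key in {self._normalize(line) for line in snippet_lines}:
            if not key:
                continue
            for path in {p for p, _ in self._postings.get(key, ())}:
                hits[path] += 1
        ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
        return [(path, count) for path, count in ranked if count >= min_hits]


# Hypothetical codebases and query snippet, for illustration only.
INDEX = CloneSearchIndex()
INDEX.add_file("productA/font.c",
               ["len = read_len(buf);", "memcpy(dst, buf, len);", "return dst;"])
INDEX.add_file("productB/render.c",
               ["len = read_len(buf);", "memcpy(dst, buf, len);", "free(buf);"])
INDEX.add_file("productC/util.c", ["return 0;"])

QUERY = ["len = read_len( buf );", "memcpy(dst, buf, len);"]

if __name__ == "__main__":
    for path, count in INDEX.search(QUERY):
        print(path, count)
```

A query only touches the index entries for its own statements, so its cost depends on the snippet size and the number of matches rather than on the total size of the indexed code. A real service would still need a finer comparison step after this lookup.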

Page 57: Pathways to Technology Transfer and Adoption: Achievements and Challenges

What you tune is what you get

MSR 2012 48

bull Intuitive similarity metricbull Effective control of the degree of syntactical differences between two code

snippets

bull Tunable at fine granularitybull Statement similarity

bull of inserteddeletedmodified statements

bull Balance between code structure and disordered statements

for (i = 0 i lt n i ++)

a ++

b ++

c = foo(a b)

d = bar(a b c)

e = a + c

for (i = 0 i lt n i ++)

c = foo(a b)

a ++

b ++

d = bar(a b c)

e = a + d

e ++

Explorability

How to navigate through the large number of detected clones

1. Clone navigation based on source tree hierarchy
2. Pivoting of folder-level statistics
3. Folder-level statistics
4. Clone function list in selected folder
5. Clone function filters
6. Sorting by bug or refactoring potential
7. Tagging

How to quickly review a pair of clones

1. Block correspondence
2. Block types
3. Block navigation
4. Copying
5. Bug filing
6. Tagging

Collaboration

• Collaboration models
• Communication
• Champion in product teams
• Getting engineering support

Collaboration models

• Pull
• Push
• Join

Communication – getting connected

• Reaching out to practitioners
• Understanding their business
• Speaking practitioners' languages
• Finding out their pain points
  • Understanding their scenarios
  • Experiencing their pain
  • Articulating their problems

Communication – forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project – XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminars
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO on an internal website
• Active "selling" to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers?
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

Initial vulnerability reported in product A

Patch release of product B

Potential 0day attack

Security bulletin released

Similar vulnerability found in product B by attackers

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

MSRC: Microsoft Security Response Center
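The search scenario (a code snippet as input, a pre-built index over large codebases, a fast answer) can be illustrated with a toy inverted index in Python. This is only a sketch under stated assumptions, not the service deployed at MSRC: the class `CloneIndex`, its methods, the `min_coverage` parameter, and the file names in the usage example are all hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def _statements(code: str) -> list[str]:
    """One statement per line, with all whitespace removed."""
    return ["".join(line.split()) for line in code.splitlines() if line.strip()]


class CloneIndex:
    """Toy inverted index: normalized statement -> names of code units containing it."""

    def __init__(self) -> None:
        self._postings: dict[str, set[str]] = defaultdict(set)

    def add(self, unit: str, code: str) -> None:
        """Index one code unit (for example a function) under a hypothetical name."""
        for stmt in _statements(code):
            self._postings[stmt].add(unit)

    def search(self, snippet: str, min_coverage: float = 0.6) -> list[tuple[str, float]]:
        """Return (unit, coverage) pairs, best first.

        coverage is the fraction of distinct query statements found in the unit;
        min_coverage is an assumed tuning parameter.
        """
        query = set(_statements(snippet))
        if not query:
            return []
        hits: dict[str, int] = defaultdict(int)
        for stmt in query:
            for unit in self._postings.get(stmt, ()):
                hits[unit] += 1
        ranked = [
            (unit, count / len(query))
            for unit, count in hits.items()
            if count / len(query) >= min_coverage
        ]
        return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    index = CloneIndex()
    index.add("productA/font.c", "len = read();\nbuf[len] = 0;\nparse(buf);")
    index.add("productB/render.c", "len = read();\nbuf[len] = 0;\ndraw(buf);")
    index.add("productC/misc.c", "x = 1;")
    for unit, coverage in index.search("len = read();\nbuf[len] = 0;\nparse(buf);"):
        print(unit, round(coverage, 2))
```

Because the index is built once ahead of time, each query only touches the postings of its own statements, which is one simple way to keep responses fast as the indexed codebases grow.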

Vulnerability investigation workflow

[Diagram: workflow stages are Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Further labels: MSRC; Team A, Team B, Team C; Manual & ad hoc investigation; Code snippet; Code clones.]

Vulnerability investigation workflow

[Diagram: the same workflow stages (Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix) with Automated Investigation. Further labels: Clone search service; Code snippet; Code clones. Callouts: "Completeness is the key"; "Web service API for automation".]

More secure Microsoft products

Automated laborious manual efforts

Faster response time, critical in security context

Code clone search service integrated into vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed and seven privately reported vulnerabilities were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

"However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a "Cloned Code Detection" system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034."

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a "separate" tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

Searching similar snippets for fixing bug once

Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) "realness"
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 61: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Explorability

ICSE 2013 SEIP 49

1 Clone navigation based on source tree hierarchy

2 Pivoting of folder level statistics

3 Folder level statistics

4 Clone function list in selected folder

5 Clone function filters

6 Sorting by bug or refactoring potential

7 Tagging

1 2 3 4 5 6

7

1 Block correspondence

2 Block types

3 Block navigation

4 Copying

5 Bug filing

6 Tagging

1

2

3

4

1

6

5

How to navigate through the large number of detected clones

How to quickly review a pair of clones

Collaboration

bull Collaboration models

bull Communication

bull Champion in product teams

bull Getting engineering support

ICSE 2013 SEIP 50

Collaboration models

ICSE 2013 SEIP 51

Pull

Push

Join

Communication ndash getting connected

bull Reaching-out to practitioners

bull Understanding their business

bull Speaking practitionersrsquo languages

bull Finding out their pain pointsbull Understanding their scenarios

bull Experiencing their pain

bull Articulating their problems

ICSE 2013 SEIP 52

Communication: forming partnership

• Finding and defining shared goals
• Setting the right expectation
• Building a roadmap
• Forming virtual team (creating an email alias)
• Adopting a milestone approach
• Conducting regular sync-up

Example project: XIAO

• Tons of papers published in the past 10 years
• 6 years of International Workshop on Software Clones (IWSC) since 2006
• Dagstuhl Seminar
  • Software Clone Management towards Industrial Application (2012)
  • Duplication, Redundancy, and Similarity in Software (2006)
• No code clone analysis tools in MS
• No product offering

Source: http://www.dagstuhl.de/12071

Yingnong Dang, Dongmei Zhang, Song Ge, Chengyun Chu, Yingjun Qiu, Tao Xie. XIAO: Tuning Code Clones at Hands of Engineers in Practice. Proc. ACSAC 2012.

http://research.microsoft.com/jump/175199

Motivation

• Copy-and-paste is a common developer behavior
• A real tool widely adopted internally and externally

Reaching out (1)

• Demonstrating XIAO at TechFest
• Posting XIAO at internal website
• Active “selling” to various teams
• What we gained
  • Opportunities to run XIAO on different codebases and produce rich results
  • Feedback to improve both algorithm and system
  • Expanded network

Reaching out (2)

• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure

[Figure: timeline. Labels: Initial vulnerability reported in product A; Security bulletin released; Similar vulnerability found in product B by attackers; Potential 0day attack; Patch release of product B]

Tech transfer to MSRC

• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC

MSRC: Microsoft Security Response Center
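The search scenario described above (a code snippet as input, matching code returned from indexed codebases) can be illustrated with a small sketch. This is not XIAO's algorithm or the MSRC service: it swaps in a generic token-shingle inverted index, and the names `CloneIndex`, `shingles`, `k`, and `min_score` are hypothetical, chosen only for illustration.

```python
import re
from collections import defaultdict

# Identifiers, numbers, or single punctuation marks; whitespace is ignored,
# so reformatted copies of the same code produce the same tokens.
TOKEN = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def shingles(code, k=4):
    """Return the set of k-token windows (shingles) in a piece of code."""
    tokens = TOKEN.findall(code)
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Toy inverted index: shingle -> names of files that contain it."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, name, code):
        """Index one file under the given (hypothetical) name."""
        for s in shingles(code, self.k):
            self.postings[s].add(name)

    def search(self, snippet, min_score=0.5):
        """Return (name, score) pairs, where score is the fraction of the
        snippet's shingles found in that file, best matches first."""
        query = shingles(snippet, self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for s in query:
            for name in self.postings.get(s, ()):
                hits[name] += 1
        results = [(name, n / len(query)) for name, n in hits.items()]
        results = [r for r in results if r[1] >= min_score]
        return sorted(results, key=lambda r: (-r[1], r[0]))


index = CloneIndex()
index.add("a.c", "if (len > size) { return -1; } memcpy(dst, src, len);")
index.add("b.c", "if (len > size) {\n    return -1;\n}\nmemcpy(out, src, len);")
index.add("c.c", "for (i = 0; i < n; i++) total += values[i];")
print(index.search("if (len > size) { return -1; }"))
```

Because the index is built once and each query touches only the postings of the snippet's own shingles, this style of lookup suits a search scenario better than re-running whole-codebase detection per query. A production service would additionally need to tolerate renamed identifiers and inserted statements, which this sketch does not attempt.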

Vulnerability investigation workflow

[Figure: workflow stages: Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix. Variants finding is a manual & ad hoc investigation: MSRC sends a code snippet to Team A, Team B, and Team C, and code clones come back]

Vulnerability investigation workflow

[Figure: the same workflow stages (Issue reproducing; Root cause investigation & source location; Variants finding; Design/Implement/Test fix), with variants finding handled by automated investigation: a code snippet goes to the clone search service and code clones come back]

• Completeness is the key
• Web service API for automation

More secure Microsoft products

• Code clone search service integrated into vulnerability investigation process of MSRC
• Automated laborious manual efforts
• Faster response time, critical in security context
• Real security issues proactively identified and addressed

Example: MS security bulletin MS12-034

Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

• Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document
• Insufficient bounds check within the font parsing subsystem of win32k.sys
• Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)

• Third time's the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community

• Searching similar snippets for fixing bug once
• Finding refactoring opportunity

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration





Page 66: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Example project ndash XIAO

bull Tons of papers published in the past 10 years

bull 6 years of International Workshop on Software Clones (IWSC) since 2006

bull Dagstuhl Seminarbull Software Clone Management towards Industrial Application (2012)

bull Duplication Redundancy and Similarity in Software (2006)

bull No code clone analysis tools in MS

bull No product offering

ICSE 2013 SEIP 54

Source httpwwwdagstuhlde12071

Yingnong Dang Dongmei Zhang Song Ge Chengyun Chu Yingjun Qiu Tao Xie XIAO Tuning Code Clones at Hands of Engineers in Practice Proc ACSAC 2012

httpresearchmicrosoftcomjump175199

Motivation

bull Copy-and-paste is a common developer behavior

bull A real tool widely adopted internally and externally

ICSE 2013 SEIP 55

Reaching out (1)

bull Demonstrating XIAO at TechFest

bull Posting XIAO at internal website

bull Active ldquosellingrdquo to various teams

bull What we gainedbull Opportunities to run XIAO on different codebases and

produce rich results

bull Feedback to improve both algorithm and system

bull Expanded network

ICSE 2013 SEIP 56

Reaching out (2)
• What did not land well internally
  • Wide interest but no concrete takers
• Why no takers
  • What exactly is the value proposition?
  • Long way to go from code clones to bugs
  • High cost for code refactoring
  • Product prioritization
• Lessons learned
  • Killer scenarios needed for value proposition
  • Security is a big stick

Potential 0day vulnerability disclosure
[Figure: timeline. Labels: initial vulnerability reported in product A; security bulletin released; similar vulnerability found in product B by attackers; potential 0day attack; patch release of product B]

Tech transfer to MSRC (Microsoft Security Response Center)
• Search scenario vs. detection scenario
  • Code snippet as input
  • Much larger scale of codebases
  • Near-real-time response
• Code clone search service
  • Indexed ~600 million LOC across multiple codebases
  • Deployed in, used by, and transferred to MSRC
• Champion in MSRC worked with us all the way
  • Providing feedback and updates
  • Promoting within MSRC
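The search scenario above (code snippet in, matching locations out) can be illustrated with a minimal sketch. It assumes a token-shingle inverted index, which is a common textbook approach and not necessarily how XIAO or the deployed service works; `CloneIndex`, `tokenize`, `shingles`, and the sample codebase names are hypothetical names introduced only for this example. Because the index is built ahead of time, a query only touches the postings of its own shingles, which is one way to keep response time low as the indexed codebases grow.

```python
import re
from collections import defaultdict

# Identifiers, numbers, or any single non-space character (hypothetical tokenizer).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split source text into identifier, number, and punctuation tokens."""
    return TOKEN_RE.findall(code)


def shingles(tokens, k=4):
    """Return the set of all k-token windows in a token list."""
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


class CloneIndex:
    """Toy inverted index mapping each shingle to the locations containing it."""

    def __init__(self, k=4):
        self.k = k
        self.postings = defaultdict(set)

    def add(self, codebase, name, code):
        """Index one code fragment under a (codebase, name) location."""
        for sh in shingles(tokenize(code), self.k):
            self.postings[sh].add((codebase, name))

    def search(self, snippet, min_overlap=0.5):
        """Return (location, score) pairs sharing at least min_overlap
        of the snippet's shingles, best match first."""
        query = shingles(tokenize(snippet), self.k)
        if not query:
            return []
        hits = defaultdict(int)
        for sh in query:
            for loc in self.postings.get(sh, ()):
                hits[loc] += 1
        scored = [(loc, n / len(query)) for loc, n in hits.items()]
        return sorted((s for s in scored if s[1] >= min_overlap),
                      key=lambda s: (-s[1], s[0]))


index = CloneIndex()
index.add("product_a", "parse_font",
          "if (len > size) return; memcpy(dst, src, len);")
index.add("product_b", "read_glyph",
          "if (len > size) return; memcpy(out, src, len);")
index.add("product_c", "draw", "for (i = 0; i < n; i++) plot(i);")

for location, score in index.search(
        "if (len > size) return; memcpy(dst, src, len);"):
    print(location, round(score, 2))
```

In this toy setup the exact copy in `product_a` ranks first and the copy with one renamed variable in `product_b` ranks second, while the unrelated loop in `product_c` is not returned.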

Vulnerability investigation workflow
[Figure: manual & ad hoc investigation. Workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. Actors: MSRC, Team A, Team B, Team C. Artifacts exchanged: code snippet, code clones]

Vulnerability investigation workflow
[Figure: automated investigation. Workflow stages: issue reproducing; root cause investigation & source location; variants finding; design/implement/test fix. A clone search service takes a code snippet and returns code clones. Callouts: completeness is the key; Web service API for automation]
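The "completeness is the key" callout suggests that variant finding favors recall: missing a cloned copy of vulnerable code costs more than reviewing an extra candidate. The sketch below shows one way a tunable similarity threshold expresses that trade-off. It is an assumption-laden illustration, not the deployed service: it uses `difflib.SequenceMatcher` from the Python standard library, and `normalize`, `similarity`, `find_variants`, the keyword set, and the sample fragments are all hypothetical.

```python
import difflib
import re

# Small hypothetical keyword set; everything else that looks like a name
# is treated as an identifier and abstracted away.
KEYWORDS = {"if", "else", "for", "while", "return"}


def normalize(code):
    """Tokenize and replace identifiers with 'ID' so renamed variables still match."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    return [
        "ID" if (t[0].isalpha() or t[0] == "_") and t not in KEYWORDS else t
        for t in tokens
    ]


def similarity(a, b):
    """Similarity in [0, 1] between two fragments after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b),
                                      autojunk=False)
    return matcher.ratio()


def find_variants(snippet, candidates, threshold=0.8):
    """Return (name, score) for candidates at or above the threshold, best first."""
    scored = [(name, similarity(snippet, code))
              for name, code in candidates.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: (-s[1], s[0]))


vulnerable = "if (len > size) return; memcpy(dst, src, len);"
candidates = {
    "read_glyph": "if (n > cap) return; memcpy(out, buf, n);",  # renamed copy
    "copy_only": "memcpy(out, buf, n);",                        # partial copy
    "draw": "for (i = 0; i < n; i++) plot(i);",                 # unrelated
}

print(find_variants(vulnerable, candidates, threshold=0.8))
print(find_variants(vulnerable, candidates, threshold=0.6))
```

With the stricter threshold only the renamed copy is reported; lowering it also surfaces the partial copy, at the cost of more candidates to review.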

More secure Microsoft products
• Automated laborious manual efforts
• Faster response time, critical in security context
• Code clone search service integrated into vulnerability investigation process of MSRC
• Real security issues proactively identified and addressed

Example – MS security bulletin MS12-034
Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012
Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document.
Insufficient bounds check within the font parsing subsystem of win32k.sys
Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer
Microsoft Security Research & Defense Blog about this bulletin:
“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”
http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Transfer to Visual Studio (1)
• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Transfer to Visual Studio (2)
• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Benefiting developer community
[Figure: two screenshots. Captions: searching similar snippets for fixing bug once; finding refactoring opportunity]

Summary
• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration

Page 74: Pathways to Technology Transfer and Adoption: Achievements and Challenges

More secure Microsoft products

Automated laborious manual efforts

Faster response time, critical in a security context

Code clone search service integrated into the vulnerability investigation process of MSRC

Real security issues proactively identified and addressed

Page 75: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Example – MS security bulletin MS12-034: Combined security update for Microsoft Office, Windows, .NET Framework, and Silverlight, published Tuesday, May 08, 2012

Three publicly disclosed vulnerabilities and seven privately reported ones were involved. Specifically, one was exploited by the Duqu malware to execute arbitrary code when a user opened a malicious Office document

Insufficient bounds check within the font parsing subsystem of win32k.sys

Cloned copy in gdiplus.dll, ogl.dll (Office), Silverlight, and Windows Journal viewer

Microsoft Security Research & Defense Blog about this bulletin:

“However, we wanted to be sure to address the vulnerable code wherever it appeared across the Microsoft code base. To that end, we have been working with Microsoft Research to develop a “Cloned Code Detection” system that we can run for every MSRC case to find any instance of the vulnerable code in any shipping product. This system is the one that found several of the copies of CVE-2011-3402 that we are now addressing with MS12-034.”

http://blogs.technet.com/b/srd/archive/2012/05/08/ms12-034-duqu-ten-cve-s-and-removing-keyboard-layout-file-attack-surface.aspx

Page 76: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Transfer to Visual Studio (1)

• Unsuccessful efforts
  • Out-Of-Band (OOB) release
  • Power Tool
• Two reorgs in Visual Studio
• Lessons learned
  • No integration story; felt like a “separate” tool
  • Not on the release path of VS
• Accumulated assets
  • Solidified algorithm and system
  • Trusted partners
    • One program manager in VS
    • MSRA Innovation Engineering Group

Page 77: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Transfer to Visual Studio (2)

• Third time’s the charm
  • Strong support from general manager of VSU
  • Concrete scenarios identified
  • Easy sell at VS 2012 planning meeting
• Virtual team
  • Researchers (MSRA SA)
  • Developers (MSRA IEG, VS)
  • Program manager (VS)
  • Tester (VS)
• Active planning as part of VS 2012 release
• Weekly sync-up
• Timely feedback from VS partners

Page 78: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Benefiting developer community

Searching for similar snippets to fix a bug once

Finding refactoring opportunities
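
The refactoring scenario differs from snippet search in that no query is given: every repeated block is a candidate. A minimal sketch of that idea follows, assuming a simple line-window hashing scheme; `find_duplicate_blocks`, the keyword set, and the window size are hypothetical and do not describe the Visual Studio feature itself.

```python
import re
from collections import defaultdict

IDENT = re.compile(r"\b[A-Za-z_]\w*\b")
# Assumed keyword list for the illustration; everything else is treated as a name.
KEEP = {"def", "if", "else", "for", "in", "while", "return"}


def normalize(line):
    """Rename every non-keyword identifier to ID and drop all whitespace."""
    line = IDENT.sub(lambda m: m.group(0) if m.group(0) in KEEP else "ID", line)
    return re.sub(r"\s+", "", line)


def find_duplicate_blocks(files, window=3):
    """Group the locations of every normalized `window`-line block that repeats."""
    seen = defaultdict(list)
    for name, source in files.items():
        lines = [normalize(line) for line in source.splitlines()]
        for start in range(len(lines) - window + 1):
            block = tuple(lines[start:start + window])
            if all(block):  # skip windows that contain blank lines
                seen[block].append((name, start + 1))
    # Only blocks seen in more than one place are refactoring candidates.
    return [locations for locations in seen.values() if len(locations) > 1]
```

Each returned group lists `(file, start_line)` pairs for one repeated block, so a bug fixed at one location can be checked against the others, or the block can be extracted into a shared function.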

Page 79: Pathways to Technology Transfer and Adoption: Achievements and Challenges

Summary

• Mindset changing needed for community
  • Get out of comfort zone
  • Value (and pursue) “realness”
  • Aim for ultimate tasks
  • Value (and pursue) tech readiness
• Experience sharing of successful tech-transfer on Software Analytics
  • Getting-real mindset
  • Technical readiness
  • Collaboration