Tim Menzies, Directions in Data Science


Slide 1

Perspectives from “PROMISE”: lessons learned, issues raised

tim.menzies@gmail.com

• 2002-2004: SE research chair, NASA

• 2004
– “Let’s start a repo for SE data”
– Data mining in SE
– Upload data used in papers

• 2013
– 10 PROMISE conferences
– 100s of data sets
– 1000s of papers
– Numerous journal special issues
– Me: 1st and 3rd most cited papers in top SE journals (since 2007, 2009)

• 2014: PROMISE repo moving to NC State: tera-PROMISE

Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)

Tim Menzies, Jeremy Greenwald, Art Frank: Data Mining Static Code Attributes to Learn Defect Predictors. IEEE Trans. Software Eng. 33(1): 2-13 (2007)

http://promisedata.googlecode.com

Slide 2

Issues raised in my work on PROMISE: (note: more than “just” SE)

• Learning for novel problems
– Transfer learning from prior experience

• Locality
– Trust (and when to stop trusting a model)
– Anomaly detectors
– Incremental repair

• Privacy (by exploiting locality)

• Goal-oriented analysis (next gen multi-objective optimizers)

• Human skill amplifiers
– Data analysis patterns
– Anti-patterns in data science

Slide 3

Q: Are there general results in SE?
A: Hell, yeah (Turkish toasters to NASA rockets)

Tim Menzies with Burak Turhan, Ayse Basar Bener, Justin S. Di Stefano: On the relative value of cross-company and within-company data for defect prediction. Empirical Software Engineering 14(5): 540-578 (2009)
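The result above rests on relevancy filtering: rather than training on all foreign data, train only on the cross-company rows that resemble the local project. Below is a minimal sketch of that nearest-neighbour filtering idea; the array names, the choice of Euclidean distance, and k=10 are illustrative assumptions, not the paper’s exact code.

```python
import numpy as np

def relevancy_filter(cc_X, cc_y, wc_X, k=10):
    """Keep, for each within-company (local) row, its k nearest cross-company rows.

    cc_X, cc_y: cross-company features/labels; wc_X: local features.
    Sketch of nearest-neighbour relevancy filtering: train a defect
    predictor only on foreign data that resembles the local project.
    """
    keep = set()
    for row in wc_X:
        dists = np.linalg.norm(cc_X - row, axis=1)   # distance to every foreign row
        keep.update(np.argsort(dists)[:k].tolist())
    idx = sorted(keep)
    return cc_X[idx], cc_y[idx]

# Hypothetical usage: filter first, then train any defect predictor on the result.
# X_train, y_train = relevancy_filter(cc_X, cc_y, wc_X)
```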

Slide 4

Lesson: reflect LESS on raw dimensions and more on underlying “shape” of the data

Transfer learning:
• Across time
• Across space

Tim Menzies with Ekrem Kocaguneli, Emilia Mendes: Transfer Learning in Effort Estimation. Empirical Software Engineering (2014) 1-31

Slide 5

PROMISE, lessons learned: Need to study the analysts more

• Lesson from ten years of PROMISE:
– Different teams
– Similar data
– Similar learners
– Wildly varying results

• Variance less due to
– Choice of learner
– Choice of data set
– Choice of selected features

• Variance most due to teams

• “It doesn’t matter what you do but it does matter who does it.”

Martin Shepperd, David Bowes, Tracy Hall: Researcher Bias: The Use of Machine Learning in Software Defect Prediction. IEEE Trans. Software Eng. 40(6): 603-616 (2014)

Slide 6

Maybe there is more to data science than “I applied learner to data”.

For example: here’s the code I wrote in my latest paper

Slide 7

Data analysis patterns

• Recommender systems for data mining scripts

• Method 1:
– Text mining on documentation
– Lightweight parsing of the scripts
– Raise a red flag if the clusters of terms in the parse differ from the documentation

• Method 2: track script executions (see the sketch after this list)
– Learn “good” and “bad” combinations of tools
– “Good” = high accuracy, less CPU, less memory, less power
– Combinations = Markov chains of “this used before that”

• Case studies:
– ABB, software engineering analytics
– HPCC ECL, funded by LexisNexis
– Climate modeling in “R” (with Nagiza Samatova)
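A minimal sketch of Method 2’s Markov-chain idea, assuming each logged run is just an ordered list of tool names plus a “good”/“bad” label. All names, data, and thresholds here are hypothetical illustrations, not the actual recommender.

```python
from collections import Counter, defaultdict

def transition_counts(runs):
    """Count tool-follows-tool transitions over a set of logged script runs.

    Each run is a list of tool names, e.g. ["load", "discretize", "nb"].
    """
    counts = defaultdict(Counter)
    for run in runs:
        for a, b in zip(run, run[1:]):
            counts[a][b] += 1
    return counts

def suspicious_pairs(good_runs, bad_runs, min_support=3):
    """Flag "this used before that" pairs that recur in bad runs
    (low accuracy, high CPU/memory/power) but never appear in good ones."""
    good, bad = transition_counts(good_runs), transition_counts(bad_runs)
    flags = []
    for a, followers in bad.items():
        for b, n in followers.items():
            if n >= min_support and good[a][b] == 0:
                flags.append((a, b, n))
    return sorted(flags, key=lambda x: -x[2])

# Hypothetical usage:
# print(suspicious_pairs(good_runs=[["load", "norm", "nb"]],
#                        bad_runs=[["load", "nb"]] * 3))
```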

Slide 8

Using “underlying shape” to implement privacy

• Goal:
– Share, but do not reveal, while maintaining the signal

• CLIFF & MORPH [Menzies & Peters ’13] (rough sketch below):
– Replace N rows with M prototypes (M << N); the N − M discarded individuals are now fully private
– Mutate the prototypes up to, but not over, the mid-point between prototypes; the remaining M individuals are now obfuscated.

[Figure: the data before and after CLIFF & MORPH]

One of the few known privacy algorithms that does not harm data mining.

Tim Menzies, Fayola Peters, Liang Gong, Hongyu Zhang: Balancing Privacy and Utility in Cross-Company Defect Prediction. IEEE Trans. Software Eng. 39(8): 1054-1068 (2013)
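For concreteness, here is a rough sketch of the two steps on the slide: keep only a few prototypes, then nudge each prototype part-way (never past the mid-point) toward a neighbour of the other class. This is an illustration under assumed numeric numpy data, not the published CLIFF & MORPH code; the selection and mutation rules in the paper above are more careful.

```python
import numpy as np

rng = np.random.default_rng(1)

def cliff_morph_sketch(X, y, keep_fraction=0.2):
    """Illustration only (not the published algorithm):

    1. Keep a small sample of rows as prototypes (stand-in for CLIFF);
       the discarded N - M rows are never shared.
    2. Nudge each prototype part-way, never past the mid-point, toward
       its nearest neighbour of the *other* class (stand-in for MORPH).
    """
    n = len(X)
    keep = rng.choice(n, size=max(1, int(n * keep_fraction)), replace=False)
    Xp, yp = X[keep].astype(float), y[keep]
    out = Xp.copy()
    for i, row in enumerate(Xp):
        unlike = Xp[yp != yp[i]]
        if len(unlike) == 0:
            continue
        nearest = unlike[np.argmin(np.linalg.norm(unlike - row, axis=1))]
        # Move a random fraction (strictly less than 0.5) of the way there,
        # so the mutated row stays on its own side of the mid-point.
        out[i] = row + rng.uniform(0.05, 0.45) * (nearest - row)
    return out, yp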

Slide 9

Using “underlying shape” for (1) trust, (2) anomaly detectors, (3) incremental repair

A: Clustering

Tim Menzies, Andrew Butcher, David R. Cok, Andrian Marcus, Lucas Layman, Forrest Shull, Burak Turhan, Thomas Zimmermann: Local versus Global Lessons for Defect Prediction and Effort Estimation. IEEE Trans. Software Eng. 39(6): 822-834 (2013)
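One hedged way to operationalize “A: Clustering” for trust and anomaly detection: cluster the training data, score new rows by their distance to the nearest cluster centre, and stop trusting the model (and trigger incremental repair on the new local data) when that distance is unusually large. A sketch using scikit-learn’s KMeans; the class name, cluster count, and percentile threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

class ClusterGate:
    """Wraps training data in clusters; new rows far from every cluster
    centre are flagged as anomalies, i.e. regions where the old model
    should not be trusted and re-learning is due."""

    def __init__(self, n_clusters=8, percentile=95):
        self.km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
        self.percentile = percentile

    def fit(self, X):
        self.km.fit(X)
        # Distance of each training row to its own (nearest) cluster centre.
        d = np.min(self.km.transform(X), axis=1)
        self.threshold_ = np.percentile(d, self.percentile)
        return self

    def is_anomaly(self, X_new):
        d = np.min(self.km.transform(X_new), axis=1)
        return d > self.threshold_

# Hypothetical usage: stop trusting the defect model on rows where
# gate.is_anomaly(new_rows) is True, and queue those rows for re-learning.
```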

Slide 10

Using “underlying shape” for next gen multi-objective optimizers?

A: (1) Cluster, then (2) envy, then (3) contrast

acap = n
| sced = n
| | stor = xh: _7
| | stor = n
| | | cplx = h
| | | | pcap = h: _10
| | | | pcap = n: _13
| | | cplx = n: _7
| | stor = vh: _11
| | stor = h: _11
| sced = l
| | $kloc <= 16.3: _9
| | $kloc > 16.3: _8

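A hedged sketch of the cluster / envy / contrast loop: group the rows, let each cluster “envy” the nearest cluster with a better score, then learn a tiny tree contrasting the two (the decision tree above is the kind of output such a step prints). The scoring convention (lower is better), parameters, and use of KMeans plus a depth-3 tree are assumptions for illustration, not the published implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier, export_text

def cluster_envy_contrast(X, score, n_clusters=6):
    """(1) Cluster rows, (2) each cluster 'envies' the closest cluster with a
    better median score, (3) learn a tiny tree contrasting the two clusters."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    labels, centres = km.labels_, km.cluster_centers_
    medians = np.array([np.median(score[labels == c]) for c in range(n_clusters)])
    plans = {}
    for c in range(n_clusters):
        better = [d for d in range(n_clusters) if medians[d] < medians[c]]  # lower = better (assumption)
        if not better:
            continue  # already the best cluster; nothing to envy
        envied = min(better, key=lambda d: np.linalg.norm(centres[c] - centres[d]))
        here, there = labels == c, labels == envied
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X[here | there], there[here | there])
        plans[c] = export_text(tree)  # contrast set: what to change to look like the envied cluster
    return plans
```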

Slide 11

BTW: “Shape-based” optimization much faster than standard MOEAs

Software process modeling
• Solid = last slide; dashed = NSGA-II [Deb02]
• Also, we find tiny trees summarizing the trade space

Cockpit software design

• Red line = using trees to find optimizations
