data science highlights

Data Science Highlights

Post on 18-Oct-2014


Category:

Technology



DESCRIPTION

Highlights and summary of long-running programmatic research on data science; practices, roles, tools, skills, organization models, workflow, outlook, etc. Profiles and persona definition for data scientist model. Landscape of org models for data science and drivers for capability planning. Secondary research materials.

TRANSCRIPT

Page 1: Data Science Highlights

Data Science Highlights

Page 2: Data Science Highlights
Page 3: Data Science Highlights

Data Scientist
Square - San Francisco Bay Area

Job Description

Square is hiring a Data Scientist on our Risk team. The Risk team at Square is responsible for enabling growth while mitigating financial loss associated with transactions. We work closely with our Product and Growth teams to craft a fantastic experience for our buyers and sellers.

Desired Skills & Experience

As a Data Scientist on our Risk team, you will use machine learning and data mining techniques to assess and mitigate the risk of every entity and event in our network. You will sift through a growing stream of payments, settlements, and customer activities to identify suspicious behavior with high precision and recall. You will explore and understand our customer base deeply, become an expert in Risk, and contribute to a world-class underwriting system that helps Square provide delightful service to both buyers and sellers. To accomplish this, you are comfortable writing production code in Java and conducting exploratory data analysis in R and Python. You can take statistical and engineering ideas from prototype to production. You excel in a small team setting and you apply expert knowledge in engineering and statistics.

Responsibilities

1. Investigate, prototype and productionize features and machine learning models to identify good and bad behavior.
2. Design, build, and maintain robust production machine learning systems.
3. Create visualizations that enable rapid detection of suspicious activity in our user base.
4. Become a domain expert in Risk.
5. Participate in the engineering life-cycle.
6. Work closely with analysts and engineers.

Requirements

1. Ability to find a needle in the haystack. With data.
2. Extensive programming experience in Java and Python or R.
3. Knowledge of one or more of the following: classification techniques in machine learning, data mining, applied statistics, data visualization.
4. Concise verbal and written articulation of complex ideas.

Even Better

1. Contagious passion for Square’s mission.
2. Data mining or machine learning competition experience.

Company Description

Square is a revolutionary service that enables anyone to accept credit cards anywhere. Square offers an easy to use, free credit card reader that plugs into a phone or iPad. It's simple to sign up. There is no extra equipment, complicated contracts, monthly fees or merchant account required. Co-founded by Jim McKelvey and Jack Dorsey in 2009, the company is headquartered in San Francisco.

Page 4: Data Science Highlights

Sense Maker Segment

Sense makers need to create and/or employ insights to accomplish their business goals and satisfy their responsibilities.

These insights emerge from independent and collaborative discovery efforts that involve direct interaction with discovery applications, and participation in discovery environments.

• Insight Consumer
• Analyst
• Casual Analyst
• Data Scientist
• Analytics Manager
• Problem Solver

Page 5: Data Science Highlights

Data Scientist: Profile

Page 6: Data Science Highlights

Data Scientist

Data Scientist / Senior Research Scientist

Data Scientists work with other members of the Data Science team, using emerging methods and tools to engage with ‘Big Data’ from a variety of external and internal sources. Data Scientists aim to generate actionable insights that transform the organization; enhance existing products, services and operations; and identify, define and prototype new data-driven products, services, and offerings. They have advanced analytical skills and/or a specialized educational background, and rely on open-source and custom-created tools to address the ad-hoc and open-horizon questions the Data Science team takes on. Data Scientists collaborate with Insight Consumers, evolving and publishing insights and prototypes of new offerings.

Business Goals & Work Setting

• Create new data-driven products, services, business opportunities

• Transform the business with insights derived from Big Data

• Create effective tools and infrastructure for the data science group and other analytical groups within the organization

• Develop prototypes based on proprietary or open source tools

• Prototype new ways to visualize and understand data relationships

• May work within a business unit, providing analytical capability to that unit only, or a centralized Data Science group

Discovery Needs

• Solves complex, critical problems & significant and unique issues.

• Has numerous and dynamic ill-formed questions with unpredictable needs for data, visualization, discovery capabilities

Discovery Tools

• Open source tools and platforms for big data, ETL, visualization, analysis, statistics: Hadoop, Cassandra, Kafka, Voldemort

• Open source algorithms and languages: R, Hive, Pig

• Custom-developed analytical tools

Engagement w/ Discovery Applications

• Creates custom discovery applications to suit their own needs

• Application lifecycle involvement: rolls their own from scratch, iterates and then publishes to wider audiences / productizes

• Original author of all discovery solution elements: data / data sets, information models, discovery applications and workspaces

• Shares / publishes insights to decision-making groups & social forums in the business

Collaboration

• Works with Engineers and Software Architects to create prototypes and products

• Collaborates with Data Scientists on ill-formed questions

Skills & Expertise

• Data management, analytics modeling and business analysis

• Prototyping / software engineering

• Discovery: advanced statistics, quantitative and qualitative analysis, machine learning, data mining, natural language processing, computational linguistics, broad knowledge of applied mathematics, statistical methods and algorithms

Page 7: Data Science Highlights

Profiles & Discovery Problem Spectrum

[Figure: profiles placed along the discovery problem spectrum, from ill-formed to well-formed problems. Labels: Data Scientist; Analyst; (all); Casual Analyst; Problem Solver.]

Page 8: Data Science Highlights

The ‘Conway Model’

Page 9: Data Science Highlights
Page 10: Data Science Highlights

http://upload.wikimedia.org/wikipedia/commons/4/44/DataScienceDisciplines.png

Page 11: Data Science Highlights

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

Page 12: Data Science Highlights

What sort of animal?

They seem different from analysts:

• problem set
• relationship to discovery tools
• skills and professional profile
• discovery / analytical methods
• perspective
• workflow and collaboration

Are they? How?

Page 13: Data Science Highlights

Areas of Investigation

• Workflow
• Environment
• Organizational model
• Pain points
• Tools
• Data landscape
• Analytical practices
• Project structure
• Unmet needs

Page 14: Data Science Highlights
Page 15: Data Science Highlights

Interviews

Page 16: Data Science Highlights

Discussion Guide

Can you please walk me through a recent or current project?

a. How was the project initiated?
b. How well defined was the business problem in the beginning? Did the problem change?
c. Where, or from whom, did you obtain data sets? How did you make the decision?
d. Describe the data you used: What did the data sets look like? How big were they? Were they structured or unstructured?
e. What tools or techniques did you use to do the analyses? Did they map to the specific steps you mentioned just now?
f. How did you decide these were the tools/techniques to use? To what extent were these decisions made by yourself and to what extent were they standardized by your group/team?
g. How did you present the results of your analyses? What tools did you use? What do you like and dislike about your current tool set?
h. Which stage of this project was the most challenging? To what extent did the tools satisfy what you intended to do? What features were lacking?
i. How much collaboration was there during each stage of the project?
   i. Background and role of collaborators
   ii. Collaboration modes
   iii. Types of information shared

Thinking about the projects you have worked on, is there a common approach you take to address these problems?

How did you decide on this approach/tools?

Page 17: Data Science Highlights

Transcripts & Recordings

Page 18: Data Science Highlights

Synthesis

Page 19: Data Science Highlights

Findings

Page 20: Data Science Highlights

Data Science (now) = Business Analytics (future)

Page 21: Data Science Highlights
Page 22: Data Science Highlights

Dana Data Scientist

“I’ll do whatever it takes – wrangle, extract, manipulate, analyze, experiment, prototype – to use data to drive value & innovate”

Creates data-driven insights, offerings, and resources to transform the organization

Job Title: Senior Data Scientist
Company: LinkedIn
Work Experience: 10 Years
Education: Ph.D. Statistics, MS Bio-Informatics

Background

Dana is a Senior Data Scientist who has worked at LinkedIn for 5 years. Dana’s education includes a Ph.D. in Statistics and an MS in Bio Informatics. Dana’s previous work includes positions in academic research groups as a doctoral candidate and post-doc, as well as software engineering roles in the Internet & technology industries.

Work Context

• Dana works with several other data scientists and her Analytics Manager on a centralized team
• Dana and her colleagues aim to create data driven insights, features, resources, and offerings that deliver strategic value to LinkedIn
• Dana works with Analysts on other teams to define and create discovery tools, data sets, and methods for use by their groups at LinkedIn.
• Dana & team are visible & well established within LinkedIn, and have a voice in product strategy and operational context; they have a high degree of autonomy in defining data science projects
• Dana works with Insight Consumers to suggest and determine potential new data driven offerings to prototype and evaluate.

Key Goals

• Leverage data to support the org mission
• Enhance products & services with data-driven insights and features
• Use data to identify new opportunities and prototype/drive new customer offerings
• Create useful data sets/streams, measures, & resources (e.g., data models, algorithms, etc.)

Activities

• Mines, analyzes, & experiments with data to identify patterns, trends, outliers, causal factors, predictive models, & opportunities
• Defines and explains newly devised measurements, predictive models, & insights
• Compares effectiveness of operations at achieving company goals for engagement, growth, data quality
• Produces & explores new data sets
• Collaborates with other data scientists to capture new data streams
• Prototypes new data driven site features/offerings
• Runs data based experiments to test/evaluate models, hypotheses & prototypes
• Communicates & explains analyses to colleagues & Insight Consumers

Typical Discovery Scenarios & Problems

• How can we leverage data to increase online engagement with LinkedIn?
• How should we measure engagement & what factors drive it?
• What aspects of a personal profile are most likely to encourage / discourage new connections between people?
• How can we increase people’s activity and contributions to topical discussion groups?
• What factors drive the effectiveness of our marketing campaigns?
• Why did one of our marketing campaigns work exceptionally well?
• How can we leverage data to help recruiters identify and communicate effectively with qualified and potentially available candidates?

Sample Workflow

• Analyze & Identify causal/predictive factors: Who are the best candidates to contact for a job based on recruiter needs and profile content?
• Prototype & Experiment with data driven feature: How can we prototype/evaluate this w/out disrupting the site?
• Gather Data & Analyze Results: Use descriptive, inferential, and predictive statistics to evaluate results
• Summarize & Communicate: Review findings with colleagues; summarize, visualize, and communicate key findings to Insight Consumers/decision makers

Tools

• Open source data manipulation, mining & analysis tools including R, Pig, Hadoop, Python, etc.
• Statistical packages such as SAS, SPSS, etc.
• Custom analytical tools built using open source components and languages

Pain Points

• Defining and capturing useful measures of online attention
• Getting all the data analytic tools to work together properly
• No current workflow support or tools for data wrangling, analysis, experimentation, and prototyping

Wish List

• Effective tools to help experiment with and evaluate value / utility of features and activities for users
• Ability to rapidly prototype data-driven features w/out risk of online service disruptions

Page 23: Data Science Highlights

Empirical

Page 24: Data Science Highlights

Augmented

Page 25: Data Science Highlights

Accelerated

Page 26: Data Science Highlights

Cooperative

Page 27: Data Science Highlights

Nature of sense making activity

Business Analytics    Data Science
Intuitive             Empirical
Manual                Augmented
Gradual               Accelerated
Individual            Cooperative*

Page 28: Data Science Highlights

The Essence

• Empirical perspective
• Business imperatives drive activities
• Analytical approach
• Recipe is always the same
• Engineering always present
• Data challenges are paramount
   • consume 60% - 80% of time and effort
   • Data volumes range from huge to moderate (PB > MB)
• Domain often drives analysis
• Data scientists already have self-service
• Some new problems, many the same
• Use ‘advanced’ analytics, not conventional BA
• Innovate by applying known analyses to new data
• Current workflow fragmented across tools and data stores
• Success can be a model, product, insight, infrastructure, tool

Page 29: Data Science Highlights

State of the Discipline

A small set of formally constituted Data Science teams at major Internet and technology companies (Facebook, Google, Microsoft, Yahoo, Twitter, LinkedIn, eBay, Amazon) leads the field in most identifiable respects:

• maturity of practice - sophistication of methods, quality of infrastructure
• history and tenure as formal function / group
• business integration and impact
• internal and public visibility
• pace of innovation in methods, tools, architecture
• quality and rate of contributions to open source and other tools / infrastructure
• role in the industry and public discourse on data science: visibility in community, publication of experiments and findings, etc.

Page 30: Data Science Highlights

Tooling & Infrastructure

Leading shops have their own comprehensive and often home-built / heavily customized data science environments, tools, infrastructure.

This infrastructure is aligned to the particulars of their domain and business. Their data science environments are sometimes considerably more 'mature' than those of other shops.

The large majority of existing data science teams and practices are 'followers' of these leaders, in the sense that while they have idiosyncratic problems and varying domains to address, they rely on innovation from the DS leaders to guide the evolution of their data science practices.

Their environments reflect a mix of some purpose-built data science components, and infrastructure extended / adapted from business analytic needs such as BI.

Page 31: Data Science Highlights

Tooling & Infrastructure

Many organizations are establishing new data science capabilities. A minority of these create new data science teams / practices from scratch without building out other conventional analytical capabilities such as BI. They will need new environments to support data science activities, and may leapfrog older generations of analytic environment, following leaders by directly creating new 'stacks' oriented more specifically for data science.

The majority of organizations are creating new data science capabilities by building on existing analytical groups and functions. In terms of environments and infrastructure, these organizations have existing analytical environments aligned to BI and other business analytic functions, not specifically adapted to data science needs. Cumulative investment in these environments can be very high.

New teams will need new tools. Existing teams will need new tools to support new discovery activities.

Berkeley Data Analytics Stack is the most visible open source 'platform' at the moment. No interview participants mentioned it.

Page 32: Data Science Highlights

Organizational Model

Data science capability is provisioned via standard org models (ranging across in house, external, centralized, embedded, etc.).

The ways data science teams and practice groups are managed, and their relationship to the orgs they are part of, seem to be conventional / familiar.

We can summarize the landscape of organizational models for providing data science capability by plotting the size of data science team / pool of resources vs. the 'distance' from the problem / need.

The landscape reflects common patterns for specialized expertise.

This could shift over time as discovery maturity increases overall, first within the analytics industry, then within the general business realm.

Page 33: Data Science Highlights

Discovery Problems

Discovery efforts are set in motion by Insight Consumers, not Data Scientists. The success of efforts is gauged by Insight Consumers. Insights are used by the originating Insight Consumers, not other analysts, and rarely other Insight Consumers.

Multiple hypotheses are often explored in parallel, supported by multiple data sets / interim data products.

Useful reconstruction of analytical workflows requires a linear history of all steps / activities.

Page 34: Data Science Highlights

Discovery Problems

Data science resources - individuals, projects, and teams - are always aligned to business areas or strategic goals: e.g. the Content Insights team at LinkedIn supports analytical goals related to LinkedIn's major push to enhance its media presence and role in media.

At large scales of group, this inverts - for example within a company, communities of practice are aligned to a discipline, and will include members whose activities span the needs of all the business units.

No analytical efforts begin completely open-ended, with no idea of the nature or import of resulting insights.

There is almost always a hypothesis, or more than one. (Even in more academic / research oriented settings, there is no basic research - all investigations are purposive and grounded in defined business intent.)

Page 35: Data Science Highlights

PROBLEM NATURE

• Well-defined
• Explicit form: Why, What, and How questions
• Implicit form: which question
• Hypotheses are driven by domain knowledge or work experience
• Not very different from the problems business analysts address

Businesses address the same problems they have been working on, which are determined in the very beginning before resources are allocated. Data scientists do not necessarily contribute to initiating new problems.

Page 36: Data Science Highlights

[Figure: Outcomes of Data Science vs. Analysts. Outcome labels: Insight, Model, Data Product, Product.]

Page 37: Data Science Highlights

Skills Portfolio

Data scientists use three kinds of languages: analysis (R, Matlab), scripting (Python, Perl), data processing (SQL, Pig).

Analytical environments should allow integration of languages / capabilities they offer.

Every analyst has their preferred language / method and defaults to using their own for analytical efforts. This is true even within centralized analytical teams.
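As a rough illustration of mixing a data processing language with a scripting / analysis language in one environment, the sketch below runs a SQL aggregation and then summarizes the result in Python. The `visits` table, its columns, and the values are hypothetical and exist only for this example; the findings do not describe any specific toolchain.

```python
import sqlite3
import statistics

# Hypothetical in-memory table standing in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, minutes REAL)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [(1, 12.0), (1, 8.0), (2, 30.0), (3, 5.0), (3, 7.0), (3, 9.0)],
)

# Data processing step (SQL): aggregate raw rows to one row per user.
rows = conn.execute(
    "SELECT user_id, SUM(minutes) FROM visits GROUP BY user_id ORDER BY user_id"
).fetchall()
conn.close()

# Scripting / analysis step (Python): summarize the per-user totals.
totals = [total for _, total in rows]
summary = {
    "users": len(totals),
    "mean_minutes": statistics.mean(totals),
    "median_minutes": statistics.median(totals),
}
print(summary)
```

Keeping the aggregation in SQL and the summary in Python mirrors the split between data processing and analysis languages described above; a real environment would point the connection at an existing store rather than an in-memory table.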

Page 38: Data Science Highlights

Skills

Page 39: Data Science Highlights

Discovery Maturity

• Discovery is poorly understood and little recognized as a capability. It is rarely mentioned by any of the Data Science / Analytics professionals spoken with. When mentioned, it is seen as a small-scale activity and / or a desired outcome of particular projects, not something the organization needs to be able to do in an ongoing / comprehensive / large-scale fashion, such as understanding customers.

• Data scientists understand their own challenges in terms of what stages / aspects of a data-centric workflow require the greatest time and effort, or present the most complexity or potential for introducing uncertainty / ambiguity into the efforts. Broader framings are the need for or desire to work on data-driven products, or to transform and improve business through offering data-centered insights.

• Product-centric data scientists (who aim directly at making data-driven offerings) are a small minority of the active community. Many more are engineers with strong data skills, and many more are analysts trying to acquire data science skills / perspective.

Page 40: Data Science Highlights

Supporting Factors

• Regardless of particulars, the core ingredients remain the same: analytical skills and perspective, domain knowledge, engineering / tooling skills and perspective

• In data science practices, analysis is always enabled by engineering - either localized to the data science team, or centrally provided via IT.

• In BI practices, analysis is always enabled by IT and systems consultants / integrators (in house or external).

• Leading DS groups rely on a number of hybrid approaches to support data cleansing and the evaluation of models, insights, and results - e.g. crowd source prep of data and checking of results for prototypes and experiments.

• Data scientists rarely productionize code, analytical workflows, analytical tools. Engineers / IT convert 'prototype' artifacts created by data scientists into production code / tools.

Page 41: Data Science Highlights

Perspective

Analytical: The analytical perspective is the center of definition for all analytical roles. Contrast with engineers, who "make stuff". Analytical roles figure things out for some purpose: whether a model to inform a product prototype or provide insight.

Empirical: The empirical perspective is distinct from the analytical perspective, and marks 'true' data scientists. This revolves around framing and testing hypotheses formally and informally, often requires validation and interrogation of experimental methods and results by others, and expects a significant degree of transparency at (all) stages of the analytical effort.

Page 42: Data Science Highlights

Cooperation and Collaboration

• Discovery efforts are structured as individual efforts - insights come from individual analytical engagement with data sets.

• Collaboration between analysts is asynchronous.

• Diversity of analytical tools / languages in practice is a barrier to cooperation and collaboration.

• There is little re-use of analytical insights by analysts to further other efforts.

• When tools and/or problem domains are stable / known, analysts create individual and group assets for reuse - e.g. R script libraries, code snippets for SAS, templates for data set file formats and structures

• Intermediate work products created during analytical work (data sets / subsets, code, analytical scripts, algorithms, interim results, hypotheses) are perceived as often irrelevant or throwaway, if not outright wrong. Little investment is made to annotate / preserve intermediate work products for individual or group re-use, sharing, review.

Page 43: Data Science Highlights

THE MANY SHADES OF COLLABORATION

Independent: Have-it-all type data scientist (I know, I design & I implement)

Linear: Complementary (Analysts know, data scientists design, engineers implement)

Project-based: The missing piece (Data scientists lead or support engineers)

Consultancy: From abstract to concrete (Some data scientists know & design, some other data scientists implement)

Page 44: Data Science Highlights

Data Landscape

• The physical location of data - where stored / what environment - is a significant cost factor for almost all aspects of analytical work.

• Distributed data (managed / located in multiple stores) increases costs for many individual steps in analytical workflows.

• Distributed data costs are often a barrier to conducting insightful analysis using multiple techniques / steps. Analysts default to basic / simple analysis to avoid high effort / low probability of success.

• For analysts with low levels of db / data wrangling skill, even marginal distributed data costs are a prohibitive barrier to engaging with data.

• Most analysts reported having to migrate all of the data sets into the same data processing framework to begin analysis. [If all the data were in one place...]

Page 45: Data Science Highlights

DATA NATURE

• Messy: various forms (Web logs, web pages, genome data, sales revenues, ...)

• Scattered: Data scientists have to search in the wild (outside of enterprise databases)

• Started “Big”, ended “Lean”: Meaningful data units are small in size

• Standardization is key to all data science work: why engineers become data scientists

Data scientists are “data foragers” and “data format equalizers”. They have the ability to manipulate large data sets and gradually narrow the data sets down to the exact units needed for analysis.

Page 46: Data Science Highlights

Algorithms and Analytical Tools

• Well-known algorithms and methods are used to plan and structure experiments, discover insights, drive the creation of new models, evaluate the effectiveness of new models & products.

• The algorithm and method are often determined by domain, such as TF-IDF for IR, Smith-Waterman for bioinformatics,
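To make one of the domain-specific methods named above concrete, here is a minimal TF-IDF sketch in plain Python. It assumes whitespace tokenization, term frequency normalized by document length, and idf = log(N / df); these are illustrative choices among several common variants, and the sample documents are made up.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumes every document contains at least one token.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up documents for illustration only.
docs = [
    "data science team",
    "data engineering team",
    "protein sequence alignment",
]
scores = tf_idf(docs)
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

Terms that appear in fewer documents (here "science") receive a higher weight than terms shared across documents ("data", "team"), which is the property that makes the measure useful for retrieval.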

Page 47: Data Science Highlights

PROCESS NATURE

• Wicked: Solutions are rarely pre-defined

• Iterative three-step cycle: Data collection, data cleansing, & data analysis

• Trial-and-error: Hypothesis revision, hypothesis validation, & data recollection

• Ad-hoc analysis chance encountering

Data scientists provide new perspectives to address old problems. The path to the solution is usually exploratory. But the goal has always been clear and pre-defined.

Page 48: Data Science Highlights

Data Science Workflows

http://strata.oreilly.com/2013/09/data-analysis-just-one-component-of-the-data-science-workflow.html

Page 49: Data Science Highlights

Data Science Workflows

Page 50: Data Science Highlights

Data Science Workflow

• Frame problem / goal of effort
• Identify and extract data to be used in effort from whole corpus / totality of available data
   • Exploratory identification and selection of working data for use in experiments
• Define experiment(s): hypothesis / null hypothesis, methods, success criteria
• Derive insight(s)
   • Wrangle, process, visualize, interpret
• Codify / create new model reflecting insights & outcomes from experiments
• Validate new model(s)
• Provision training data
• Train new model
• Validation and outcome of training model
• Hand-off for implementation on production systems / as production code
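The later steps of this workflow (provision training data, train a model, validate it against a success criterion fixed up front, then hand off) can be sketched as follows. This is a minimal illustration on synthetic data with a least-squares line as a stand-in model; the data, the threshold value, and all names are assumptions, not anything reported by interview participants.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = a * x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def mean_abs_error(model, points):
    """Average absolute prediction error of the fitted line."""
    a, b = model
    return sum(abs(y - (a * x + b)) for x, y in points) / len(points)


# Synthetic working data (illustrative only): y = 2x + 1 plus small noise.
rng = random.Random(0)
data = [(x, 2 * x + 1 + rng.uniform(-0.5, 0.5)) for x in range(40)]
rng.shuffle(data)

# Provision training data and hold out the rest for validation.
train, holdout = data[:30], data[30:]

model = fit_line(train)                  # train new model
error = mean_abs_error(model, holdout)   # validate on unseen data
SUCCESS_THRESHOLD = 1.0                  # success criterion, fixed before the run
ready_for_handoff = error < SUCCESS_THRESHOLD
print(model, error, ready_for_handoff)
```

Fixing the success criterion before looking at the hold-out error keeps the experiment honest; the hand-off decision then follows from the validation result rather than from judgment after the fact.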

Page 51: Data Science Highlights

Analysis Workflow & Activities

• Empirical analysis of subsets of data
• Understand topology of data, boundaries (sets / subsets, complete corpus, totality of data)
• Outlier identification and profiling
   • How significant are outliers to overall topology
   • Comparative exclusion and profiling of resulting data subsets to understand their role, discover principal components
• Find and analyze patterns, areas of interestingness / deserving attention
• Find and analyze central actors / factors (in existing model that produced source data, in topology of working data, in patterns, etc.)
   • ID and understand their impact on local and global data topology and primary metrics if in several ways / more than one axis / at the same time
   • Discover and analyze relationships amongst central actors
• Understand cycles, trends, changes (dynamic characteristics) for core actors, topology, patterns and structure
• Understand causal factors
• Codify / create new model reflecting insights & outcomes from experiments
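The outlier identification and comparative exclusion steps listed above might look like the following sketch. It uses Tukey's interquartile-range fences, which is one common rule chosen here for illustration (the source does not name a method), and the sample values are invented.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using Tukey's IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers


# Invented working data with one obviously extreme value.
values = [10, 12, 11, 13, 12, 11, 10, 95, 12, 11]
inliers, outliers = iqr_outliers(values)

# Comparative exclusion: how much do the outliers move a summary metric?
mean_all = statistics.mean(values)
mean_inliers = statistics.mean(inliers)
shift = mean_all - mean_inliers
print(outliers, mean_all, mean_inliers, shift)
```

Comparing the metric with and without the flagged points is a simple way to judge how significant the outliers are to the overall picture before deciding whether to exclude or profile them.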

Page 52: Data Science Highlights

• dynamic working data sets & subsets
• iterative
• experimental frame

Page 53: Data Science Highlights

Key Workflows

Insight Consumer <> Data Scientist
originate, define, address discovery effort

Data Scientist > Data Engineer
create & evolve apps to address new & in-progress efforts

Analyst <> Analyst
define & address in-progress discovery efforts

Data Scientist > internal networks
create & curate archive & community

Page 54: Data Science Highlights

Needs

What are the most common and useful statistical techniques you use during discovery and analysis efforts?

What statistical capabilities or functions would be very useful if provided within discovery applications, and where would they be useful?

“(1) The most commonly used statistical techniques used to date (in our strategic planning work) are: dimensionality reduction (partition clustering, multiple correspondence analysis), factor analysis, partition clustering (k-means, k-medoids, fuzzy clustering), cluster validation techniques (silhouette, Dunn’s index, connectivity), multivariate outlier detection, linear regression, and logistic regression.

(2) Techniques that would assist with identifying outliers or invalid data. Much of this work seems to be done by hand. I believe that we are also getting to the point where we could start using linear regression and splines (for showing trends).”
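Two of the techniques named in this response, k-means partition clustering and silhouette-based cluster validation, can be sketched in a few lines of plain Python. The initialization (first k points as seeds) and the sample points are simplifying assumptions for the example; production work would normally rely on an established library.

```python
import math


def kmeans(points, k, iterations=20):
    """Lloyd's algorithm with a naive, deterministic initialization."""
    centroids = [points[i] for i in range(k)]  # assumption: first k points as seeds
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


def silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(math.dist(p, q))
        own = dists.get(labels[i])
        if not own:  # singleton cluster: silhouette is defined as 0
            scores.append(0.0)
            continue
        a = sum(own) / len(own)  # mean distance within own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(d) / len(d) for lab, d in dists.items() if lab != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


# Invented 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.2, 0.8), (0.8, 1.1), (9.1, 8.7), (8.8, 9.2)]
labels, centroids = kmeans(points, k=2)
score = silhouette(points, labels)
print(labels, score)
```

A silhouette value near 1 indicates compact, well-separated clusters, while values near 0 or below suggest that the chosen k or the clustering itself deserves a second look.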

Page 55: Data Science Highlights

Needs

For example, would system-generated descriptive statistical visualizations be useful for whole data sets - or for smaller user-selected groups of attributes?

Would it be useful for the application to analyze and suggest possible distribution models it sees in the data; for the values of individual attributes, and/or for larger sets of data?

“With regards to your last question on visualization, we have put in significant effort to use visualization in our Endeca installation. We have built visualizations such as tree maps, flow diagrams, sun burst diagrams, scatter plots showing clusters, and hierarchical edge bundling diagrams to explore our data sets.

Our data tends to be qualitative rather than quantitative so this drives much of our visualizations.

So yes, interactive descriptive statistical visualization would be helpful – on the complete data set and individual attributes.”

Page 56: Data Science Highlights

Needs

1. What are the most common statistical techniques you use at work - descriptive, inferential, or otherwise? What are the most valuable?

2. What are the most common visualizations you use to present findings or share insights? What are the most valuable?

“(1) We do a lot of chi-square tests, permutation tests, false discovery rate correction, Bonferroni correction, 2x2 Fisher exact test, logistic regression.

I also use SVM, Artificial Neural Networks (ANN), Naive-Bayes Classifiers (NBC), parts of speech taggers.

(2) ROC curves, tables with p-values or odds ratios or hazard ratio (http://en.wikipedia.org/wiki/Hazard_ratio)

Things    p-value
XYZ1      0.001
XYZ2      ...
etc.”
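The two multiple-testing corrections mentioned in this response can be compared with a short sketch. It implements the Bonferroni rule and the Benjamini-Hochberg step-up procedure for false discovery rate control; the p-values are invented for illustration and the 0.05 level is an assumed default.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0 where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected


p_values = [0.001, 0.012, 0.025, 0.041, 0.20]  # invented values
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(bonf)
print(bh)
```

On these values Bonferroni rejects only the smallest p-value while Benjamini-Hochberg rejects the three smallest, which illustrates the usual trade-off: stricter control of any false positive versus more discoveries at a bounded false discovery rate.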

Page 57: Data Science Highlights

Needs

1. What are the most common statistical techniques you use at work - descriptive, inferential, or otherwise? What are the most valuable?

2. What are the most common visualizations you use to present findings or share insights? What are the most valuable?

“Logistic Regression, Decision Trees, Markov Models, Area Under Curve”
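"Area Under Curve" in this response presumably refers to the area under the ROC curve. A minimal sketch of that metric follows, computed as the probability that a randomly chosen positive is scored above a randomly chosen negative; the labels and scores are made up, and the pairwise loop is chosen for clarity rather than speed.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up classifier outputs: 1 = positive class, 0 = negative class.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.2, 0.7]
auc = roc_auc(labels, scores)
print(auc)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it is a convenient single number for comparing models such as the logistic regressions and decision trees listed in the response.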

Page 58: Data Science Highlights

[Figure: Sense Makers: Information Management Ability. Axes: Data Skills Level (Low / none to High) and Composition Capability (Low / Use to High / Make). Capability levels: Use Models, Customize Models, Create New Models, Create Complex Models. Profiles plotted: Casual Analyst, Analytical Manager, Analyst, Problem Solver, Data Scientist.]

Page 59: Data Science Highlights

Materials

• http://www.datasciencecentral.com/
• Ben Lorica’s blog: http://strata.oreilly.com/ben
• https://blog.twitter.com/tags/twitter-data
• http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853

Page 60: Data Science Highlights
Page 61: Data Science Highlights

• Algorithms (ex: computational complexity, CS theory)
• Back-End Programming (ex: JAVA/Rails/Objective C)
• Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS)
• Big and Distributed Data (ex: Hadoop, Map/Reduce)
• Business (ex: management, business development, budgeting)
• Classical Statistics (ex: general linear model, ANOVA)
• Data Manipulation (ex: regexes, R, SAS, web scraping)
• Front-End Programming (ex: JavaScript, HTML, CSS)
• Graphical Models (ex: social networks, Bayes networks)
• Machine Learning (ex: decision trees, neural nets, SVM, clustering)
• Math (ex: linear algebra, real analysis, calculus)
• Optimization (ex: linear, integer, convex, global)
• Product Development (ex: design, project management)
• Science (ex: experimental design, technical writing/publishing)
• Simulation (ex: discrete, agent-based, continuous)
• Spatial Statistics (ex: geographic covariates, GIS)
• Structured Data (ex: SQL, JSON, XML)
• Surveys and Marketing (ex: multinomial modeling)
• Systems Administration (ex: *nix, DBA, cloud tech.)
• Temporal Statistics (ex: forecasting, time-series analysis)
• Unstructured Data (ex: noSQL, text mining)
• Visualization (ex: statistical graphics, mapping, web-based dataviz)

Page 62: Data Science Highlights

Skills

Page 63: Data Science Highlights

Figure 3-3. There were interesting partial correlations among each respondent’s primary Skills Group (rows) and primary Self-ID Group (columns). The mosaic plot illustrates the proportions of respondents who fell into each combination of groups. For example, there were few Data Researchers whose top Skill Group was Programming.

Skills