Data Science Highlights

Data Scientist

Square - San Francisco Bay Area

Job Description

Square is hiring a Data Scientist on our Risk team. The Risk team at Square is responsible for enabling growth while mitigating financial loss associated with transactions. We work closely with our Product and Growth teams to craft a fantastic experience for our buyers and sellers.

Desired Skills & Experience

As a Data Scientist on our Risk team, you will use machine learning and data mining techniques to assess and mitigate the risk of every entity and event in our network. You will sift through a growing stream of payments, settlements, and customer activities to identify suspicious behavior with high precision and recall. You will explore and understand our customer base deeply, become an expert in Risk, and contribute to a world-class underwriting system that helps Square provide delightful service to both buyers and sellers. To accomplish this, you are comfortable writing production code in Java and conducting exploratory data analysis in R and Python. You can take statistical and engineering ideas from prototype to production. You excel in a small team setting and you apply expert knowledge in engineering and statistics.

Responsibilities

1. Investigate, prototype and productionize features and machine learning models to identify good and bad behavior.

2. Design, build, and maintain robust production machine learning systems.

3. Create visualizations that enable rapid detection of suspicious activity in our user base.

4. Become a domain expert in Risk.

5. Participate in the engineering life-cycle.

6. Work closely with analysts and engineers.

Requirements

1. Ability to find a needle in the haystack. With data.

2. Extensive programming experience in Java and Python or R.

3. Knowledge of one or more of the following: classification techniques in machine learning, data mining, applied statistics, data visualization.

4. Concise verbal and written articulation of complex ideas.

Even Better

1. Contagious passion for Square’s mission.

2. Data mining or machine learning competition experience.

Company Description

Square is a revolutionary service that enables anyone to accept credit cards anywhere. Square offers an easy-to-use, free credit card reader that plugs into a phone or iPad. It's simple to sign up. There is no extra equipment, complicated contracts, monthly fees or merchant account required. Co-founded by Jim McKelvey and Jack Dorsey in 2009, the company is headquartered in San Francisco.

http://www.linkedin.com/companies/675562?dspporc=&trk=jobtocomp&goback=%2Efps_PBCK_data+scientist_*1_*1_*1_*1_*1_*1_*2_*1_Y_*1_*1_*1_false_1_R_*1_*51_*1_*51_true_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2_*2%2Efjs_data+scientist_*1_*1_I_us_02138_50_1_R_true_*1_*2_*2_*2_*2_*2_*2_*2_*2%2Evjs_5183715_*2_*2_*2_false_*2_*2%2Evjs_5131710_*2_*2_*2_false_*2_*2

Sense Maker Segment

Sense makers need to create and/or employ insights to accomplish their business goals and satisfy their responsibilities.

These insights emerge from independent and collaborative discovery efforts that involve direct interaction with discovery applications, and participation in discovery environments.

[Figure: sense maker roles shown: Insight Consumer, Analyst, Casual Analyst, Data Scientist, Analytics Manager, Problem Solver]

Data Scientist: Profile

Data Scientist Data Scientist / Senior Research Scientist

Data Scientists work with other members of the Data science team, using emerging methods and tools to engage with ‘Big Data’ from a variety of external and internal sources. Data Scientists aim to generate actionable insights that transform the organization; enhance existing products, services and operations; and identify, define and prototype new data-driven products, services, and offerings. They have advanced analytical skills and/or a specialized educational background, and rely on open-source and custom-created tools, to address the ad-hoc and open-horizon questions the Data Science team takes on. Data Scientists collaborate with Insight Consumers, evolving and publishing insights and prototypes of new offerings.

Business Goals & Work Setting

• Create new data-driven products, services, business opportunities

• Transform the business with insights derived from Big Data

• Create effective tools and infrastructure for the data science group and other analytical groups within the organization

• Develop prototypes based on proprietary or open source tools

• Prototype new ways to visualize and understand data relationships

• May work within a business unit, providing analytical capability to that unit only, or a centralized Data Science group

Discovery Needs

• Solves complex, critical problems & significant and unique issues.

• Has numerous and dynamic ill-formed questions with unpredictable needs for data, visualization, discovery capabilities

Discovery Tools

• Open source tools and platforms for big data, ETL, visualization, analysis, statistics: Hadoop, Cassandra, Kafka, Voldemort

• Open source algorithms and languages: R, Hive, Pig

• Custom-developed analytical tools

Engagement w/ Discovery Applications

• Creates custom discovery applications to suit their own needs

• Application lifecycle involvement: rolls their own from scratch, iterates and then publishes to wider audiences / productizes

• Original author of all discovery solution elements: data / data sets, information models, discovery applications and workspaces

• Shares / publishes insights to decision-making groups & social forums in the business

Collaboration

• Works with Engineers and Software Architects to create prototypes and products

• Collaborates with Data Scientists on ill-formed questions

Skills & Expertise

• Data management, analytics modeling and business analysis

• Prototyping / software engineering

• Discovery: advanced statistics, quantitative and qualitative analysis, machine learning, data mining, natural language processing, computational linguistics, broad knowledge of applied mathematics, statistical methods and algorithms

Profiles & Discovery Problem Spectrum

[Figure: sense maker profiles (Data Scientist, Analyst (all), Casual Analyst, Problem Solver) placed along a discovery problem spectrum running from ill-formed to well-formed]

The ‘Conway Model’

http://upload.wikimedia.org/wikipedia/commons/4/44/DataScienceDisciplines.png

http://nirvacana.com/thoughts/wp-content/uploads/2013/07/RoadToDataScientist1.png

What sort of animal?

They seem different from analysts:

• problem set

• relationship to discovery tools

• skills and professional profile

• discovery / analytical methods

• perspective

• workflow and collaboration

Are they? How?

Areas of Investigation

• Workflow • Environment • Organizational model • Pain points • Tools • Data landscape • Analytical practices • Project structure • Unmet needs

Interviews

Discussion Guide

Can you please walk me through a recent or current project?

a. How was the project initiated?

b. How defined was the business problem in the beginning? Did the problem change?

c. Where / from whom did you obtain data sets? How did you make the decision?

d. Describe the data you used: What did the data sets look like? How big were they? Were they structured or unstructured?

e. What tools or techniques did you use to do the analyses? Did they map to the specific steps you mentioned just now?

f. How did you decide these were the tools / techniques to use? To what extent were these decisions made by yourself and to what extent were they standardized by your group / team?

g. How did you present the results of your analyses? What tools did you use? What do you like and dislike about your current tool set?

h. Which stage of this project was the most challenging? To what extent did the tools satisfy what you intended to do? What features were lacking?

i. How much collaboration was there during each stage of the project?

   i. Background and role of collaborators

   ii. Collaboration modes

   iii. Types of information shared

Thinking about the projects you have worked on, is there a common approach you take to address these problems?

How did you decide on this approach / tools?

Transcripts & Recordings

Synthesis

Findings

Data Science (now) = Business Analytics (future)

Creates data-driven insights, offerings, and resources to transform the organization

Work Experience: 10 years. Education: Ph.D. Statistics, MS Bio-Informatics

Job Title: Senior Data Scientist. Company: LinkedIn

Summarize & Communicate: Review findings with colleagues; summarize, visualize, and communicate key findings to Insight Consumers / decision makers

Prototype & Experiment with data-driven feature: How can we prototype / evaluate this without disrupting the site?

Gather Data & Analyze Results: Use descriptive, inferential, and predictive statistics to evaluate results

Analyze & Identify causal / predictive factors: Who are the best candidates to contact for a job based on recruiter needs and profile content?

Dana, Data Scientist

• Defining and capturing useful measures of online attention

• Getting all the data analytic tools to work together properly

• No current workflow support or tools for data wrangling, analysis, experimentation, and prototyping

• Effective tools to help experiment with and evaluate value / utility of features and activities for users

• Ability to rapidly prototype data-driven features without risk of online service disruptions

• Open source data manipulation, mining & analysis tools including R, Pig, Hadoop, Python, etc.

• Statistical packages such as SAS, SPSS, etc.

• Custom analytical tools built using open source components and languages

• Leverage data to support the org mission

• Enhance products & services with data-driven insights and features

• Use data to identify new opportunities and prototype / drive new customer offerings

• Create useful data sets / streams, measures, & resources (e.g., data models, algorithms, etc.)

Key Goals

Tools

Pain Points

Wish List

Sample Workflow

Dana is a Senior Data Scientist who has worked at LinkedIn for 5 years. Dana’s education includes a Ph.D. in Statistics and an MS in Bio-Informatics. Dana’s previous work includes positions in academic research groups as a doctoral candidate and post-doc, as well as software engineering roles in the Internet & technology industries.

• Dana works with several other data scientists and her Analytics Manager on a centralized team

• Dana and her colleagues aim to create data-driven insights, features, resources, and offerings that deliver strategic value to LinkedIn

• Dana works with Analysts on other teams to define and create discovery tools, data sets, and methods for use by their groups at LinkedIn.

• Dana & team are visible & well established within LinkedIn, and have a voice in product strategy and operational context; they have a high degree of autonomy in defining data science projects

• Dana works with Insight Consumers to suggest and determine potential new data-driven offerings to prototype and evaluate.

• How can we leverage data to increase online engagement with LinkedIn?

• How should we measure engagement & what factors drive it?

• What aspects of a personal profile are most likely to encourage / discourage new connections between people?

• How can we increase people’s activity and contributions to topical discussion groups?

• What factors drive the effectiveness of our marketing campaigns?

• Why did one of our marketing campaigns work exceptionally well?

• How can we leverage data to help recruiters identify and communicate effectively with qualified and potentially available candidates?

Typical Discovery Scenarios & Problems

Background

Work Context

• Mines, analyzes, & experiments with data to identify patterns, trends, outliers, causal factors, predictive models, & opportunities

• Defines and explains newly devised measurements, predictive models, & insights

• Compares effectiveness of operations at achieving company goals for engagement, growth, data quality

• Produces & explores new data sets

• Collaborates with other data scientists to capture new data streams

• Prototypes new data-driven site features / offerings

• Runs data-based experiments to test / evaluate models, hypotheses & prototypes

• Communicates & explains analyses to colleagues & Insight Consumers

“I’ll do whatever it takes – wrangle, extract, manipulate, analyze, experiment, prototype – to use data to drive value & innovate”

Activities

Nature of sense making activity

Business Analytics: Intuitive, Manual, Gradual, Individual

Data Science: Empirical, Augmented, Accelerated, Cooperative*

The Essence

• Empirical perspective

• Business imperatives drive activities

• Analytical approach

• Recipe is always the same

• Engineering always present

• Data challenges are paramount: they consume 60% - 80% of time and effort

• Data volumes range from huge to moderate (PB to MB)

• Domain often drives analysis

• Data scientists already have self-service

• Some new problems, many the same

• Use ‘advanced’ analytics, not conventional BA

• Innovate by applying known analyses to new data

• Current workflow fragmented across tools and data stores

• Success can be a model, product, insight, infrastructure, tool

State of the Discipline

A small set of formally constituted Data Science teams at major Internet and technology companies (Facebook, Google, Microsoft, Yahoo, Twitter, LinkedIn, eBay, Amazon) lead the field in most identifiable respects:

• maturity of practice - sophistication of methods, quality of infrastructure

• history and tenure as formal function / group

• business integration and impact

• internal and public visibility

• pace of innovation in methods, tools, architecture

• quality and rate of contributions to open source and other tools / infrastructure

• role in the industry and public discourse on data science: visibility in community, publication of experiments and findings, etc.

Tooling & Infrastructure

Leading shops have their own comprehensive and often home-built / heavily customized data science environments, tools, infrastructure.

This infrastructure is aligned to the particulars of their domain and business. Their data science environments are sometimes considerably more 'mature' than those of other shops.

The large majority of existing data science teams and practices are 'followers' of these leaders, in the sense that while they have idiosyncratic problems and varying domains to address, they rely on innovation from the DS leaders to guide the evolution of their data science practices.

Their environments reflect a mix of some purpose-built data science components, and infrastructure extended / adapted from business analytic needs such as BI.

Tooling & Infrastructure

Many organizations are establishing new data science capabilities. A minority of these create new data science teams / practices from scratch without building out other conventional analytical capabilities such as BI. They will need new environments to support data science activities, and may leapfrog older generations of analytic environment, following leaders by directly creating new 'stacks' oriented more specifically for data science.

The majority of organizations are creating new data science capabilities by building on existing analytical groups and functions. In terms of environments and infrastructure, these organizations have existing analytical environments aligned to BI and other business analytic functions, not specifically adapted to data science needs. Cumulative investment in these environments can be very high.

New teams will need new tools. Existing teams will need new tools to support new discovery activities.

Berkeley Data Analytics Stack is the most visible open source 'platform' at the moment. No interview participants mentioned it.

Organizational Model

Data science capability is provisioned via standard org models (ranging across in house, external, centralized, embedded, etc.).

The ways data science teams and practice groups are managed, and their relationship to the orgs they are part of, seem to be conventional / familiar.

We can summarize the landscape of organizational models for providing data science capability by plotting the size of data science team / pool of resources vs. the 'distance' from the problem / need.

The landscape reflects common patterns for specialized expertise.

This could shift over time as discovery maturity increases overall, first within the analytics industry, then within the general business realm.

Discovery Problems

Discovery efforts are set in motion by Insight Consumers, not Data Scientists. The success of efforts is gauged by Insight Consumers. Insights are used by the originating Insight Consumers, not other analysts, and rarely other Insight Consumers.

Multiple hypotheses are often explored in parallel, supported by multiple data sets / interim data products.

Useful reconstruction of analytical workflows requires a linear history of all steps / activities.

Discovery Problems

Data science resources - individuals, projects, and teams - are always aligned to business areas or strategic goals: e.g. the Content Insights team at LinkedIn supports analytical goals related to LinkedIn's major push to enhance its media presence and role in media.

At large scales of group, this inverts - for example within a company, communities of practice are aligned to a discipline, and will include members whose activities span the needs of all the business units.

No analytical efforts begin completely open-ended, with no idea of the nature or import of resulting insights.

There is almost always a hypothesis, or more than one. (Even in more academic / research oriented settings, there is no basic research - all investigations are purposive and grounded in defined business intent.)

PROBLEM NATURE

• Well-defined

• Explicit form: Why, What, and How questions

• Implicit form: which question

• Hypotheses are driven by domain knowledge or work experience

• Not very different from the problems business analysts address

Businesses address the same problems they have been working on, which are determined at the very beginning, before resources are allocated. Data scientists do not necessarily contribute to initiating new problems.

[Figure: Outcomes, comparing Analysts and Data Science; outcome labels shown: Insight, Model, Data Product, Product]

Skills Portfolio

Data scientists use three kinds of languages: analysis (R, Matlab), scripting (Python, Perl), and data processing (SQL, Pig).

Analytical environments should allow integration of these languages and the capabilities they offer.

Every analyst has a preferred language / method and defaults to it for analytical efforts. This holds true within centralized analytical teams as well.
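As a rough illustration of the integration point above, the sketch below combines a data processing language (SQL, via Python's built-in `sqlite3` module) with scripting and analysis in Python in a single session. The `payments` table, its columns, and the function name are hypothetical and chosen only for this example; none of them come from the interviews.

```python
import sqlite3
import statistics


def mean_amount_by_segment(rows):
    """Return {segment: mean amount} for (segment, amount) rows.

    Hypothetical example: SQL handles the filtering step,
    Python handles grouping and the summary statistic.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE payments (segment TEXT, amount REAL)")
        conn.executemany("INSERT INTO payments VALUES (?, ?)", rows)
        # Data processing in SQL: drop non-positive amounts.
        cursor = conn.execute(
            "SELECT segment, amount FROM payments "
            "WHERE amount > 0 ORDER BY segment"
        )
        grouped = {}
        for segment, amount in cursor:
            grouped.setdefault(segment, []).append(amount)
    finally:
        conn.close()
    # Analysis in Python: one mean per segment.
    return {seg: statistics.mean(vals) for seg, vals in grouped.items()}
```

Keeping both steps in one process avoids the hand-off between separate tools that the finding describes, though a real environment would involve far larger data stores than an in-memory database.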

Skills

Discovery Maturity

• Discovery is poorly understood and little recognized as a capability. It is rarely mentioned by any of the Data Science / Analytics professionals spoken with. When mentioned, it is seen as a small-scale activity and / or a desired outcome of particular projects, not something the organization needs to be able to do in an ongoing / comprehensive / large-scale fashion, such as understanding customers.

• Data scientists understand their own challenges in terms of which stages / aspects of a data-centric workflow require the greatest time and effort, or present the most complexity or potential for introducing uncertainty / ambiguity into the efforts. Broader framings are the need for or desire to work on data-driven products, or to transform and improve business through offering data-centered insights.

• Product-centric data scientists (who aim directly at making data-driven offerings) are a small minority of the active community. Many more are engineers with strong data skills, and many more are analysts trying to acquire data science skills / perspective.

Supporting Factors

• Regardless of particulars, the core ingredients remain the same: analytical skills and perspective, domain knowledge, engineering / tooling skills and perspective

• In data science practices, analysis is always enabled by engineering - either localized to the data science team, or centrally provided via IT.

• In BI practices, analysis is always enabled by IT and systems consultants / integrators (in house or external).

• Leading DS groups rely on a number of hybrid approaches to support data cleansing and the evaluation of models, insights, and results - e.g. crowd source prep of data and checking of results for prototypes and experiments.

• Data scientists rarely productionize code, analytical workflows, analytical tools. Engineers / IT convert 'prototype' artifacts created by data scientists into production code / tools.

Perspective

Analytical: The analytical perspective is the center of definition for all analytical roles. Contrast with engineers, who "make stuff". Analytical roles figure things out for some purpose: whether a model to inform a product prototype or provide insight.

Empirical: The empirical perspective is distinct from the analytical perspective, and marks 'true' data scientists. It revolves around framing and testing hypotheses formally and informally, often requires validation and interrogation of experimental methods and results by others, and expects a significant degree of transparency at (all) stages of the analytical effort.

Cooperation and Collaboration

• Discovery efforts are structured as individual efforts - insights come from individual analytical engagement with data sets.

• Collaboration between analysts is asynchronous.

• Diversity of analytical tools / languages in practice = barrier to cooperation and collaboration.

• There is little re-use of analytical insights by analysts to further other efforts.

• When tools and/or problem domains are stable / known, analysts create individual and group assets for reuse - e.g. R script libraries, code snippets for SAS, templates for data set file formats and structures

• Intermediate work products created during analytical work (data sets / subsets, code, analytical scripts, algorithms, interim results, hypotheses) are perceived as often irrelevant or throwaway, if not outright wrong. Little investment is made to annotate / preserve intermediate work products for individual or group re-use, sharing, review.

THE MANY SHADES OF COLLABORATION

Independent: Have-it-all type data scientist (I know, I design & I implement)

Linear: Complementary (Analysts know, data scientists design, engineers implement)

Project-based: The missing piece (Data scientists lead or support engineers)

Consultancy: From abstract to concrete (Some data scientists know & design, some other data scientists implement)

Data Landscape

• The physical location of data - where stored / what environment - is a significant cost factor for almost all aspects of analytical work.

• Distributed data (managed / located in multiple stores) increases costs for many individual steps in analytical workflows.

• Distributed data costs often = barrier to conducting insightful analysis using multiple techniques / steps. Analysts default to basic / simple analysis to avoid high effort / low probability of success.

• For analysts with low levels of db / data wrangling skill, even marginal distributed data costs = prohibitive barrier to engaging with data.

• Most analysts reported having to migrate all of the data sets into the same data processing framework to begin analysis. [If all the data were in one place...]

DATA NATURE

• Messy: various forms (Web logs, web pages, genome data, sales revenues, ...)

• Scattered: Data scientists have to search in the wild (outside of enterprise databases)

• Started “Big”, ended “Lean”: Meaningful data units are small in size

• Standardization is key to all data science work: why engineers become data scientists

Data scientists are “data foragers” and “data format equalizers”. They have the ability to manipulate large data sets and gradually narrow the data sets down to the exact units needed for analysis.

Algorithms and Analytical Tools

• Well-known algorithms and methods are used to plan and structure experiments, discover insights, drive the creation of new models, evaluate the effectiveness of new models & products.

• The algorithm and method are often determined by domain, such as TF-IDF for IR, Smith-Waterman for bioinformatics
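To make the TF-IDF example above concrete, here is a minimal sketch of one common variant: term frequency as a share of document length, multiplied by the natural log of the inverse document frequency. The function name and input shape are assumptions for illustration, and other weighting and smoothing schemes are in wide use.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Compute TF-IDF weights for a list of tokenized documents.

    documents: list of token lists.
    Returns one {term: weight} dict per document.
    Illustrative variant only: tf = count / doc length,
    idf = ln(N / document frequency), no smoothing.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        # Count each term once per document.
        doc_freq.update(set(tokens))

    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant a term that appears in every document gets weight zero, which is the intended effect: such terms do not help distinguish one document from another.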

PROCESS NATURE

• Wicked: Solutions are often hard to define in advance

• Iterative three-step cycle: Data collection, data cleansing, & data analysis

• Trial-and-error: Hypothesis revision, hypothesis validation, & data recollection

• Ad-hoc analysis: chance encountering

Data scientists provide new perspectives to address old problems. The path to the solution is usually exploratory, but the goal has always been clear and pre-defined.

Data Science Workflows

http://strata.oreilly.com/2013/09/data-analysis-just-one-component-of-the-data-science-workflow.html

Data Science Workflow

• Frame problem / goal of effort

• Identify and extract data to be used in effort from whole corpus / totality of available data

• Exploratory identification and selection of working data for use in experiments

• Define experiment(s): hypothesis / null hypothesis, methods, success criteria

• Derive insight(s)

• Wrangle, process, visualize, interpret

• Codify / create new model reflecting insights & outcomes from experiments

• Validate new model(s)

• Provision training data

• Train new model

• Validation and outcome of training model

• Hand-off for implementation on production systems / as production code
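The later steps of the workflow above (provision training data, train a model, validate it) can be sketched in a few lines. This is a toy illustration under stated assumptions, not a method reported by participants: records are hypothetical `(value, label)` pairs with labels 0 or 1, the "model" is a single threshold halfway between the class means, and the positive class is assumed to have the larger values.

```python
import random


def split_train_validation(records, validation_fraction=0.25, seed=0):
    """Shuffle reproducibly, then hold out a validation set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_threshold(train):
    """Fit a threshold halfway between the two class means.

    Assumes both classes appear in the training records.
    """
    positives = [value for value, label in train if label == 1]
    negatives = [value for value, label in train if label == 0]
    mean_pos = sum(positives) / len(positives)
    mean_neg = sum(negatives) / len(negatives)
    return (mean_pos + mean_neg) / 2


def accuracy(threshold, records):
    """Share of records whose label matches 'value > threshold'."""
    correct = sum((value > threshold) == (label == 1)
                  for value, label in records)
    return correct / len(records)
```

The success criterion defined earlier in the workflow would be checked against the held-out accuracy, never the training accuracy, before any hand-off to production.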

Analysis Workflow & Activities

• Empirical analysis of subsets of data

• Understand topology of data, boundaries (sets / subsets, complete corpus, totality of data)

• Outlier identification and profiling

• How significant are outliers to overall topology

• Comparative exclusion and profiling of resulting data subsets to understand their role, discover principal components

• Find and analyze patterns, areas of interestingness / deserving attention

• Find and analyze central actors / factors (in existing model that produced source data, in topology of working data, in patterns, etc.)

• ID and understand their impact on local and global data topology and primary metrics if in several ways / more than one axis / at the same time

• Discover and analyze relationships amongst central actors

• Understand cycles, trends, changes (dynamic characteristics) for core actors, topology, patterns and structure

• Understand causal factors

• Codify / create new model reflecting insights & outcomes from experiments

• dynamic working data sets & subset

• iterative

• experimental frame

Key Workflows

Insight Consumer <> Data Scientist: originate, define, address discovery effort

Data Scientist > Data Engineer: create & evolve apps to address new & in-progress efforts

Analyst <> Analyst: define & address in-progress discovery efforts

Data Scientist > internal networks: create & curate archive & community

Needs

What are the most common and useful statistical techniques you use during discovery and analysis efforts?

What statistical capabilities or functions would be very useful if provided within discovery applications, and where would they be useful?

“(1) The most commonly used statistical techniques used to date (in our strategic planning work) are: dimensionality reduction (partition clustering, multiple correspondence analysis), factor analysis, partition clustering (k-means, k-medoids, fuzzy clustering), cluster validation techniques (silhouette, Dunn’s index, connectivity), multivariate outlier detection, linear regression, and logistic regression.

(2) Techniques that would assist with identifying outliers or invalid data. Much of this work seems to be done by hand. I believe that we are also getting to the point where we could start using linear regression and splines (for showing trends).”
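The respondent's request for help identifying outliers, which is currently done by hand, could be met by even a simple automated check. The sketch below uses Tukey's interquartile-range fences for a single numeric attribute. It is one standard univariate technique, not the multivariate detection the respondent also mentions, and the function name and default multiplier `k=1.5` are illustrative choices.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR].

    Univariate sketch only. Requires at least two data points,
    as statistics.quantiles does (Python 3.8+).
    """
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]
```

A discovery application could run a check like this per attribute and surface the flagged values for review, rather than deciding on its own that they are invalid.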

Needs

For example, would system-generated descriptive statistical visualizations be useful for whole data sets - or for smaller user-selected groups of attributes?

Would it be useful for the application to analyze and suggest possible distribution models it sees in the data; for the values of individual attributes, and/or for larger sets of data?

“With regards to your last question on visualization, we have put in significant effort to use visualization in our Endeca installation. We have built visualizations such as tree maps, flow diagrams, sun burst diagrams, scatter plots showing clusters, and hierarchical edge bundling diagrams to explore our data sets.

Our data tends to be qualitative rather than quantitative so this drives much of our visualizations.

So yes, interactive descriptive statistical visualization would be helpful – on the complete data set and individual attributes.”

Needs

1. What are the most common statistical techniques you use at work - descriptive, inferential, or otherwise? What are the most valuable?

2. What are the most common visualizations you use to present findings or share insights? What are the most valuable?

“(1) We do a lot of chi-square tests, permutation tests, false discovery rate correction, Bonferroni correction, 2x2 Fisher exact test, logistic regression.

I also use SVM, Artificial Neural Networks (ANN), Naive-Bayes Classifiers (NBC), parts of speech taggers.

(2) ROC curves, tables with p-values or odds ratios or hazard ratio (http://en.wikipedia.org/wiki/Hazard_ratio)

Things   p-value
XYZ1     0.001
XYZ2     ...
etc.”
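Two of the multiple-testing corrections the respondent names can be sketched side by side. The code below assumes a plain list of p-values and a significance level `alpha`. The function names are illustrative, and the false discovery rate procedure shown is Benjamini-Hochberg, which is a common choice but is not confirmed as the one the respondent uses.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is <= alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure for FDR control.

    Finds the largest rank k with p_(k) <= alpha * k / m and
    rejects the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= alpha * rank / m:
            max_rank = rank
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= max_rank:
            rejected[index] = True
    return rejected
```

Bonferroni controls the chance of any false positive and is therefore stricter; the FDR procedure tolerates a bounded share of false discoveries and typically rejects more hypotheses on the same p-values.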


Needs

1. What are the most common statistical techniques you use at work - descriptive, inferential, or otherwise? What are the most valuable?

2. What are the most common visualizations you use to present findings or share insights? What are the most valuable?

“Logistic Regression, Decision Trees, Markov Models, Area Under Curve”
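"Area Under Curve" in the answer above most likely refers to the area under the ROC curve. A minimal sketch, assuming binary labels (1 for positive, 0 for negative) and a numeric score per item, computes it as the share of positive-negative pairs in which the positive item scores higher, with ties counted as half. The function name is illustrative.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive
    scores higher than a randomly chosen negative.
    Quadratic in input size; fine for small samples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which makes the metric a convenient single number for comparing classifiers such as logistic regression and decision trees.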

Sense Makers: Information Management Ability

[Figure: sense maker roles (Casual Analyst, Analytical Manager, Analyst, Problem Solver, Data Scientist) plotted by Data Skills Level (Low / none to High) and Composition Capability (Low / Use to High / Make); capability labels shown: Use Models, Customize Models, Create New Models, Create Complex Models]

Materials

• http://www.datasciencecentral.com/

• Ben Lorica’s blog: http://strata.oreilly.com/ben

• https://blog.twitter.com/tags/twitter-data

• http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853

Algorithms (ex: computational complexity, CS theory)

Back-End Programming (ex: JAVA/Rails/Objective C)

Bayesian/Monte-Carlo Statistics (ex: MCMC, BUGS)

Big and Distributed Data (ex: Hadoop, Map/Reduce)

Business (ex: management, business development, budgeting)

Classical Statistics (ex: general linear model, ANOVA)

Data Manipulation (ex: regexes, R, SAS, web scraping)

Front-End Programming (ex: JavaScript, HTML, CSS)

Graphical Models (ex: social networks, Bayes networks)

Machine Learning (ex: decision trees, neural nets, SVM, clustering)

Math (ex: linear algebra, real analysis, calculus)

Optimization (ex: linear, integer, convex, global)

Product Development (ex: design, project management)

Science (ex: experimental design, technical writing/publishing)

Simulation (ex: discrete, agent-based, continuous)

Spatial Statistics (ex: geographic covariates, GIS)

Structured Data (ex: SQL, JSON, XML)

Surveys and Marketing (ex: multinomial modeling)

Systems Administration (ex: *nix, DBA, cloud tech.)

Temporal Statistics (ex: forecasting, time-series analysis)

Unstructured Data (ex: noSQL, text mining)

Visualization (ex: statistical graphics, mapping, web-based dataviz)

Skills

Figure 3-3. There were interesting partial correlations among each respondent’s primary Skills Group (rows) and primary Self-ID Group (columns). The mosaic plot illustrates the proportions of respondents who fell into each combination of groups. For example, there were few Data Researchers whose top Skill Group was Programming.

Skills


Technology