Big Data and Data Science Challenges (by Mona Soliman Habib)

Big Data & Data Science: A Practical View. Mona Soliman Habib, Principal Data Scientist, Microsoft Azure Machine Learning


Post on 17-Aug-2015


TRANSCRIPT

Page 1: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Big Data & Data Science: A Practical View

Mona Soliman Habib

Principal Data Scientist

Microsoft Azure Machine Learning

Page 2: Big Data and Data Science Challenges  (by Mona Soliman Habib)
Page 3: Big Data and Data Science Challenges  (by Mona Soliman Habib)

1854 London data map

Page 4: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Data gives you the WHAT, so people can uncover the WHY


Page 5: Big Data and Data Science Challenges  (by Mona Soliman Habib)

• Data is everywhere

• Small data, medium data, large data, huge data

• Data of all shapes and forms

• Internal, external, public, crowdsourced

• Numeric, free text, spatial, temporal, audio/video, ...

• Structured, semi-structured, unstructured

• Encrypted and decrypted

• Private, personal, sensitive, etc.

The Data Deluge

Page 6: Big Data and Data Science Challenges  (by Mona Soliman Habib)

• Large, complex, challenging, difficult to process, …

• Big Data is not limited to size

• Five main concerns: the 5 V’s

  • Volume: the amount of data is growing from terabytes to petabytes and beyond

  • Velocity: data is being collected at a very fast pace

  • Variety: data can be of any type, regardless of its structure or nature

  • Veracity: lack of trust in information extracted from Big Data

  • Value: is the data collection and curation worth it?

• Like it or not, data will continue to grow in all aspects.

• Data → Insight → Prediction → Impactful Action

What is Big Data?

Page 7: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Anyone can benefit from data

Page 8: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Transformational trends

• Cloud computing: 5x increase, 2011 to 2016

• Emerging data science talent: universities filling a 300,000-person US talent gap

• Data explosion: 90% of the data in the world today has been created in the last two years alone

• Connected customers: 1B+, 200M, 10.4M, 160M

Page 9: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Cultural, technological, and scholarly phenomenon

Assumptions, biases, uncertainties

Is Big Data a part of mythology? "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy" [1].

Food for serious thought [1]:

• Big Data changes the definition of knowledge

• Claims to objectivity and accuracy are misleading

• Bigger data are not always better data

• Taken out of context, Big Data loses its meaning

• Just because it’s accessible doesn’t make it ethical

• Limited access to Big Data creates new digital divides

Critical questions for Big Data

[1] Boyd, D.; Crawford, K. (2012). "Critical Questions for Big Data". Information, Communication & Society 15 (5): 662.

Page 10: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Data Scientist: The Sexiest Job of the 21st Century

HBR, October 2012

Page 11: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Data science

• is the study of the generalizable extraction of knowledge from data (Wikipedia)

• is getting predictive and/or actionable insight from data (Neil Raden)

• involves extracting, creating, and processing data to turn it into business value (Vincent Granville, Developing Analytic Talent: Becoming a Data Scientist)

What is Data Science?

Page 12: Big Data and Data Science Challenges  (by Mona Soliman Habib)

[Diagram: Real World, Machine Learning, Data Science]

Data Science is the practice of deriving information and insight from real-world data to create business value.

Data Science: Practical Definition

Page 13: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Problem Requirements

Available data

• Related to the decision

• Historical

• Outcomes

Valuable business problem involving a decision

• Existing process

• Metrics

Page 14: Big Data and Data Science Challenges  (by Mona Soliman Habib)

The Data Science Process

Page 15: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Where AA sits: Transform & Analyze

[Diagram labels: Internal & external; Dashboards; Reports; Ask; Mobile; Information management; Orchestration; Extract, transform, load; Prediction; Relational; Non-relational; Analytical; Apps; Streaming; Data]

Page 16: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Source: http://www.edureka.in/blog/core-data-scientist-skills/

Data Scientist: Essential Skills

Page 17: Big Data and Data Science Challenges  (by Mona Soliman Habib)

• 80% of the work is janitorial

  • Data movement, consolidation, curation, wrangling, etc.

• Working with large data

  • How much data is really needed for modeling?

  • Data analysis, visualization, exploration

  • Smart, representative sampling vs. large scale learning

• Customers don’t trust black-box models

  • How to interpret and react to model predictions?

  • Selecting and understanding appropriate metrics

  • Proper model integration in business workflows

• Post-deployment monitoring and updates

  • Monitoring online performance, detecting drifts, model updates, A/B testing, etc.

• Managing data science projects

  • How to manage projects involving data, machine learning, software apps, services, etc.?

Big Data & Data Science Challenges
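
The "detecting drifts" item in the challenges above can be made concrete with a small example. The sketch below is one possible approach, not something the slides prescribe: it compares a baseline sample of a feature with a live sample using the population stability index (PSI). The function name `psi`, the bin count, and the toy data are all assumptions made for illustration.

```python
import math


def psi(expected, actual, bins=10):
    """Population stability index between a baseline sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp values outside the baseline range into the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical data: a uniform baseline and a live sample shifted upward.
baseline = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]

print(psi(baseline, baseline))  # identical samples give no drift signal
print(psi(baseline, shifted))   # a shifted sample gives a clearly larger value
```

A monitoring job could compute a score like this per feature on a schedule and raise an alert when it exceeds a threshold chosen for the application.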

Page 18: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Data Science Research

Source: DSRC http://www.slideshare.net/dsrc/data-science-research-center-overview-and-mission

Page 19: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Data Science Research

Page 20: Big Data and Data Science Challenges  (by Mona Soliman Habib)

Many open areas for research

• Exploration and visualization

  • Auto-summarization

  • Visualizing big data

  • Exploring unstructured data

• Auto-processing of data

  • Data quality assessment

  • Data curation

  • Smart data consolidation

• Auto-modeling

  • Smart model selection

  • Metrics selection

  • Model transformation

• Model interpretability

  • Feature importance

  • Per-instance interpretation

  • What-if analysis

• Auto-maintenance

  • Active learning / Machine teaching

  • Smart monitoring and testing

• Data/Software Engineering

  • Data model/schema management

  • Data version control

  • Model version control

  • ML project management

  • Security and privacy
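
As an illustration of the "Feature importance" item under model interpretability, the sketch below uses permutation importance: shuffle one feature column at a time and measure how much accuracy drops. This is one common technique, chosen here as an assumption; the slides do not name a method. The helper names, the toy model, and the data are hypothetical.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(labels)


def permutation_importance(model, rows, labels, n_features, seed=0):
    """Drop in accuracy when a single feature column is shuffled across rows."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    importances = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with feature j replaced by a shuffled value.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        importances.append(baseline - accuracy(model, shuffled, labels))
    return importances


# Hypothetical toy model: predicts 1 when feature 0 is positive, ignores feature 1.
def toy_model(row):
    return int(row[0] > 0)


rows = [[x, (x * 7) % 5] for x in range(-10, 10)]
labels = [toy_model(r) for r in rows]
scores = permutation_importance(toy_model, rows, labels, n_features=2)
print(scores)  # feature 0 should matter; feature 1 should score exactly 0.0
```

Because the technique only needs predictions, it works with black-box models, which is relevant to the trust concerns raised earlier in the deck.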

Page 21: Big Data and Data Science Challenges  (by Mona Soliman Habib)

• Understand the decision process

• Establish performance metrics early

• Keep the human in the loop

• Consider availability (and timing) of data

• Bad data happens

• User interface is important

• Adopt a software version control system

• Implementing solutions takes longer

• On-going support is not negligible

• Devil is in the details

10 practical lessons learned
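
One reason to "establish performance metrics early" is that different metrics can tell very different stories about the same model. The sketch below is an illustrative assumption, not taken from the slides: on hypothetical imbalanced data, a model that always predicts the negative class scores high accuracy while finding none of the positive cases.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision, and recall for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        # Report 0.0 when a denominator is empty (a convention chosen for this sketch).
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical imbalanced data: 5 positives in 100 cases.
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100
report = evaluate(y_true, always_negative)
print(report)  # accuracy is 0.95 while recall is 0.0
```

Agreeing up front on which of these numbers reflects the business decision avoids optimizing a model toward the wrong target.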

Page 22: Big Data and Data Science Challenges  (by Mona Soliman Habib)