Big Data and Data Science Challenges (by Mona Soliman Habib)
TRANSCRIPT
![Page 1: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/1.jpg)
Big Data & Data Science: A Practical View
Mona Soliman Habib
Principal Data Scientist
Microsoft Azure Machine Learning
![Page 2: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/2.jpg)
![Page 3: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/3.jpg)
1854 London data map
![Page 4: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/4.jpg)
Data gives you the WHAT, so people can uncover the WHY
![Page 5: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/5.jpg)
• Data is everywhere
• Small data, medium data, large data, huge data
• Data of all shapes and forms
• Internal, external, public, crowdsourced
• Numeric, free text, spatial, temporal, audio/video, ...
• Structured, semi-structured, unstructured
• Encrypted and decrypted
• Private, personal, sensitive, etc.
The Data Deluge
![Page 6: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/6.jpg)
• Large, complex, challenging, difficult to process, …
• Big Data is not limited to size
• Five main concerns: The 5 V’s
  • Volume: the amount of data is growing from terabytes to petabytes and more
  • Velocity: data is being collected at a very fast pace
  • Variety: data can be of any type, regardless of its structure
  • Veracity: lack of trust in information extracted from Big Data
  • Value: is the data collection and curation worth it?
• Like it or not, data will continue to grow in all aspects.
• Data → Insight → Prediction → Impactful Action
What is Big Data?
![Page 7: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/7.jpg)
Anyone can benefit from data
![Page 8: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/8.jpg)
Transformational trends
• Cloud computing: 5x increase, 2011 to 2016
• Emerging data science talent: universities filling a 300,000 US talent gap
• Data explosion: 90% of the data in the world today has been created in the last two years alone
• Connected customers: 1B+, 200M, 10.4M, 160M
![Page 9: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/9.jpg)
Cultural, technological, and scholarly phenomenon
Assumptions, biases, uncertainties
Is Big Data part of a mythology? "large data sets offer a higher form of intelligence and knowledge [...], with the aura of truth, objectivity, and accuracy" [1].
Food for serious thought [1]:
• Big Data changes the definition of knowledge
• Claims to objectivity and accuracy are misleading
• Bigger data are not always better data
• Taken out of context, Big Data loses its meaning
• Just because it’s accessible doesn’t make it ethical
• Limited access to Big Data creates new digital divides
Critical questions for Big Data
[1] Boyd, D.; Crawford, K. (2012). "Critical Questions for Big Data". Information, Communication & Society 15 (5): 662.
![Page 10: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/10.jpg)
Data Scientist: The Sexiest Job of the 21st Century
HBR, October 2012
![Page 11: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/11.jpg)
Data science
• is the study of the generalizable extraction of knowledge from data (Wikipedia)
• is getting predictive and/or actionable insight from data (Neil Raden)
• involves extracting, creating, and processing data to turn it into business value (Vincent Granville, Developing Analytic Talent: Becoming a Data Scientist)
What is Data Science?
![Page 12: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/12.jpg)
[Diagram labels: Real World, Machine Learning, Data Science]
Data Science is the practice of deriving information and insight from real-world data to create business value.
Data Science: Practical Definition
![Page 13: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/13.jpg)
Problem Requirements
Available data
• Related to the decision
• Historical
• Outcomes
Valuable business problem involving a decision
• Existing process
• Metrics
![Page 14: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/14.jpg)
The Data Science Process
![Page 15: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/15.jpg)
Where AA sits: Transform & Analyze
[Diagram labels: Data; Internal & external; Relational, Non-relational, Analytical; Streaming; Information management; Orchestration; Extract, transform, load; Prediction; Apps; Dashboards, Reports, Ask, Mobile]
![Page 16: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/16.jpg)
Source: http://www.edureka.in/blog/core-data-scientist-skills/
Data Scientist: Essential Skills
![Page 17: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/17.jpg)
• 80% of the work is janitorial
  • Data movement, consolidation, curation, wrangling, etc.
• Working with large data
  • How much data is really needed for modeling?
  • Data analysis, visualization, exploration
  • Smart, representative sampling vs. large-scale learning
• Customers don’t trust black-box models
  • How to interpret and react to model predictions?
• Selecting and understanding appropriate metrics
• Proper model integration in business workflows
• Post-deployment monitoring and updates
  • Monitoring online performance, detecting drifts, model updates, A/B testing, etc.
• Managing data science projects
  • How to manage projects involving data, machine learning, software apps, services, etc.?
Big Data & Data Science Challenges
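The "smart, representative sampling" item above can be illustrated with a small sketch. This is only one possible reading of that bullet (stratified sampling by outcome), not a method described in the talk. The function name `stratified_sample`, the `outcome` field, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each group defined by `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        # Keep at least one row per group so rare outcomes are not lost.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data set: 90 "no" outcomes and 10 "yes" outcomes.
rows = [{"id": i, "outcome": "yes" if i % 10 == 0 else "no"} for i in range(100)]
subset = stratified_sample(rows, key=lambda r: r["outcome"], fraction=0.2)
```

Sampling within each outcome group keeps the class proportions of the full data set, which a plain random sample of the same size does not guarantee when one outcome is rare.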
![Page 18: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/18.jpg)
Data Science Research
Source: DSRC http://www.slideshare.net/dsrc/data-science-research-center-overview-and-mission
![Page 19: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/19.jpg)
Data Science Research
![Page 20: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/20.jpg)
Many open areas for research
• Exploration and visualization
  • Auto-summarization
  • Visualizing big data
  • Exploring unstructured data
• Auto-processing of data
  • Data quality assessment
  • Data curation
  • Smart data consolidation
• Auto-modeling
  • Smart model selection
  • Metrics selection
  • Model transformation
• Model interpretability
  • Feature importance
  • Per-instance interpretation
  • What-if analysis
• Auto-maintenance
  • Active learning / Machine teaching
  • Smart monitoring and testing
• Data/Software Engineering
  • Data model/schema management
  • Data version control
  • Model version control
  • ML project management
  • Security and privacy
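As a rough illustration of the "feature importance" item under model interpretability, the sketch below uses permutation importance: shuffle one feature at a time and measure how much accuracy drops. The talk does not name a specific technique, so this choice is an assumption. The `predict` function stands in for any black-box model, and the data is synthetic.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows where the model's prediction matches the label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean accuracy drop per feature when that feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only feature j replaced by shuffled values.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical black-box model that only looks at feature 0.
def predict(row):
    return int(row[0] > 0.5)

data_rng = random.Random(1)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [predict(row) for row in X]
scores = permutation_importance(predict, X, y)
```

Here `scores[0]` is clearly positive while `scores[1]` is zero, because the toy model ignores feature 1. The approach treats the model purely as a function, which is why it is often suggested for black-box models.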
![Page 21: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/21.jpg)
• Understand the decision process
• Establish performance metrics early
• Keep the human in the loop
• Consider availability (and timing) of data
• Bad data happens
• User interface is important
• Adopt a software version control system
• Implementing solutions takes longer
• Ongoing support is not negligible
• The devil is in the details
10 practical lessons learned
![Page 22: Big Data and Data Science Challenges (by Mona Soliman Habib)](https://reader034.vdocument.in/reader034/viewer/2022042608/55d14c32bb61eb34578b4821/html5/thumbnails/22.jpg)