Post on 10-Feb-2022

Data Science, Big Data and Analytics: Present and Future

Perspectives from Academia, Industry and Consulting

Zoran Bursac, PhD, MPH

Josh Callaway, MS, MPH

Fernando Lopez, MS, PhD Candidate

Data

Health care, business, technology -> data

Big data -> voluminous data sets (structured or unstructured)

Produced every day all around us

Analytics -> examining data to detect patterns

Big Data Analytics and Data Science

Different sources, different sizes

High variety, volume, velocity

Online networks, web pages, audio/video, social media, logs

Techniques -> machine learning, data mining, natural language processing, statistics

Extraction, preparation, storage/warehousing, blending, analytics
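The stages listed above (extraction, preparation, blending, analytics) can be illustrated with a minimal Python sketch. The records, field names (patient_id, age, glucose) and helper functions are hypothetical and chosen only for illustration; they do not come from the presenters' data.

```python
from collections import defaultdict
from statistics import mean

# Extraction: hypothetical raw records pulled from two separate sources.
demographics = [
    {"patient_id": "1", "age": "54"},
    {"patient_id": "2", "age": "61"},
    {"patient_id": "3", "age": ""},  # missing age
]
labs = [
    {"patient_id": "1", "glucose": "110"},
    {"patient_id": "1", "glucose": "130"},
    {"patient_id": "2", "glucose": "95"},
]


def prepare(rows, field):
    """Preparation: drop rows with a missing value and cast the field to float."""
    return [{**r, field: float(r[field])} for r in rows if r[field] != ""]


def blend(demo_rows, lab_rows):
    """Blending: attach each patient's lab values to the demographic record."""
    by_patient = defaultdict(list)
    for lab in lab_rows:
        by_patient[lab["patient_id"]].append(lab["glucose"])
    return [
        {**d, "glucose_values": by_patient[d["patient_id"]]}
        for d in demo_rows
        if d["patient_id"] in by_patient
    ]


def analyze(blended):
    """Analytics: mean glucose per patient."""
    return {row["patient_id"]: mean(row["glucose_values"]) for row in blended}


clean_demo = prepare(demographics, "age")
clean_labs = prepare(labs, "glucose")
summary = analyze(blend(clean_demo, clean_labs))
print(summary)
```

Storage/warehousing is left out of this sketch; in practice the prepared tables would be written to a database or warehouse before blending.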

Real-time Benefits

Healthcare

Banking

Energy

Tech

Consumer

Education

Current Trends and Common Data Science Tools

Process, Perform and Visualize

Data sourcing -> Hadoop HDFS

Data storing -> Oracle, MySQL

Data conversion -> Sqoop

Data transformation -> Hive

Exploratory analysis -> Elasticsearch, KNIME

Model building -> R, SAS, Python, Julia

Visualization -> Tableau, ggplot2

Model execution -> Hadoop, Java, Spark, C#

Version control -> Git, Bitbucket

IDE -> RStudio

Text for coding -> Jupyter Notebook, R Shiny

Free Open Source

Hadoop -> distributed processing of large data across clusters

Hive -> data warehouse for managing large data in distributed storage with SQL

Kafka -> real-time pipeline for streaming data

Pig -> large data analytics

R, RStudio, ggplot2 -> analytics and data visualization

Python, Julia -> high-level programming with efficient algorithms and speed for large data processing

Jupyter Notebook -> create and share documents containing code and explanatory text

RapidMiner -> data preparation, machine learning and model deployment

Do you need to know all? NO

Hadoop, R, SQL, Python, Hive, Pig, etc.

Big data analytics for personalized medicine and pharmacogenomics

• Cirillo and Valencia (2019). Machine learning algorithms for multi-view data analysis. Data from multiple sources (genomic, proteomic, metabolomic) used to identify associations within and between multiple sets of patients, and generate models for patient clustering.

The Latest Buzzwords

Data science

Artificial intelligence

Machine learning

Data mining

Big data

Data warehouse

Data lake

Cloud computing

Hadoop

Internet of things

Local Environment: diagnosis.csv, demographic.csv, lab.csv, medication.csv, procedure.csv

RStudio Local: insertDB.R

AWS RDS MySQL Tables: demographic, diagnosis, lab, procedure, medication

AWS EC2 Instance: Shiny Server

AWS EC2 Instance: Microsoft R Open

Online Published Web Apps with URLs

Model Results: results.csv, insertDB.R

www.Kaggle.com

Diabetes

Step 1. Local Environment

Step 2. Insert into Remote Database
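The slides perform this step with an R script (insertDB.R) against an AWS RDS MySQL database; that script is not reproduced in the transcript. As a rough illustration of the same idea, here is a minimal Python sketch that loads CSV rows into a SQL table. It assumes an in-memory SQLite database as a stand-in for the remote MySQL server, and the CSV columns and the insert_csv helper are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical contents of demographic.csv; the real columns are not shown in the slides.
DEMOGRAPHIC_CSV = """patient_id,age,sex
1,54,F
2,61,M
"""


def insert_csv(conn, table, csv_text):
    """Create `table` from the CSV header, insert every row, and return the row count."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = list(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# In-memory SQLite stands in for the remote MySQL database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
inserted = insert_csv(conn, "demographic", DEMOGRAPHIC_CSV)
count = conn.execute('SELECT COUNT(*) FROM "demographic"').fetchone()[0]
print(inserted, count)
```

Against a real remote database, only the connection object would change; the create-then-insert pattern with parameter placeholders stays the same, although the placeholder syntax depends on the database driver.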

Step 3. SQL Database

Step 4. R Shiny Web App

Local Storage

Text Files (.txt)

Comma Separated Value Files (.csv)

Excel Database (.xlsx)

Microsoft Access Database (.accdb)

Remote Storage

Dropbox

Google Sheets

AWS S3 Bucket

Cloud Database

• MySQL

• MongoDB

• PostgreSQL

• NoSQL

• Oracle

Cloud Platforms

Microsoft Azure

DigitalOcean

Distributed Computing Technology

Processing jobs (scripts in R, Python, Scala, etc.) are divided among all available processing units (computers, cores, threads, etc.)

Hadoop

Spark

Microsoft R Open
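The idea of dividing one job among several processing units can be sketched in a few lines of Python. This is only a single-machine illustration using worker threads from the standard library, not how Hadoop, Spark or Microsoft R Open are implemented; the split and sum_of_squares helpers are hypothetical names for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, n_chunks):
    """Split `data` into at most n_chunks contiguous pieces of near-equal size."""
    size = -(-len(data) // n_chunks)  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares(chunk):
    """The per-worker job: any independent computation on one chunk."""
    return sum(x * x for x in chunk)


data = list(range(1, 101))
chunks = split(data, n_chunks=4)

# Each chunk is handed to a worker thread; partial results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, chunks))

total = sum(partials)
print(total)
```

The split, map and combine pattern is the part that carries over to cluster frameworks. In CPython, threads do not speed up CPU-bound pure-Python code; ProcessPoolExecutor offers the same map interface when separate processes are needed.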

Microsoft R Open

Multithreaded Performance

Machine Learning

Microsoft Azure Power BI

Analytics on big data

Data warehouse

Real-time analytics

Solution Architectures at Sanitas

Population Health Management for Healthcare

Local Storage

BUSINESS INTELLIGENCE STRATEGY

Sanitas USA

Data Sources

IT Alignment

SANITAS USA – Strategic plan

Data Governance

- Standards and Policies

- Data Quality (DQ)

- Data Security and Privacy

- Architecture/Integration

- DW and Business Intelligence (BI)

- Self-service Architectures

Data Types

ETL/ELT workflows

Cloud data warehouse

Ensure Data Quality

Dashboards, KPIs and Metrics

eCW Report Delivery

Reports

Security, Roles and Audit

Visualize and Report

Operations

Information Analysis for Business Units

Clinical - Financial - Operational

Visualized Analytics

Training and Predictive Experimentation

Hadoop, Spark, Hive, LLAP, Kafka, Storm, R

Apache Spark-based analytics

ML Studio: Experiment

ML Studio: Dataset

ML Studio: Model Train

ML Studio: Model Score

ML Studio: Model Evaluation

Power BI

Rejoinder

Data -> our biggest asset

Emphasis on “good” data

Processing streams -> wrangling, carpentry

Storage -> lakes, warehouses, databases

Analytics -> mining, machine learning

Output -> new knowledge, information, inferences

Feedback to the users -> gather more data

Questions to the global audience

• What are your needs, and where are you currently with respect to:

• Data collection, quality, storage, analytical and computing power

• Where is the data coming from: single or multiple sources?

• Who is maintaining data quality and fidelity?

• Do you have adequate storage with proper security, and are you planning for the future?

• Are you investing in resources and trained personnel with data science skills?