Starting the Hadoop Journey at a Global Leader in Cancer Research

Vamshi Punugoti & Bryan Lari MD Anderson Cancer Center June 2016 HDP @ MD ANDERSON Starting the Hadoop Journey at a Global Leader in Cancer Research

Upload: hadoop-summit

Post on 09-Jan-2017


TRANSCRIPT

Page 1: Starting the Hadoop Journey at a Global Leader in Cancer Research

Vamshi Punugoti & Bryan Lari
MD Anderson Cancer Center

June 2016

HDP @ MD ANDERSON
Starting the Hadoop Journey at a Global Leader in Cancer Research

Page 2: Starting the Hadoop Journey at a Global Leader in Cancer Research

Agenda

• About MD Anderson
• Big Data Program
• Our Hadoop Implementation
• Lessons Learned
• Next Steps

Page 3: Starting the Hadoop Journey at a Global Leader in Cancer Research

About MD Anderson

• Who we are
– One of the world's largest centers devoted exclusively to cancer care
– Created by the Texas legislature in 1941
– Named one of the nation's top two hospitals for cancer care every year since the survey began in 1990

• Mission
– MD Anderson’s mission is to eliminate cancer in Texas, the nation and the world through exceptional programs that integrate patient care, research and prevention.

Page 4: Starting the Hadoop Journey at a Global Leader in Cancer Research

About MD Anderson cont.

• Patient Care
• Education
• Research

Page 5: Starting the Hadoop Journey at a Global Leader in Cancer Research

Moon Shots Program

• Launched in 2012 – to make a giant leap for patients
• Accelerating the pace of converting scientific discoveries into clinical advances that reduce cancer deaths
• Transdisciplinary team-science approach
• Transformative professional platforms

List of Moon Shots (12 total)
• B-cell Lymphoma
• Lung Cancer
• Breast Cancer
• Melanoma
• Colorectal Cancer
• Multiple Myeloma
• Glioblastoma
• Ovarian Cancer
• HPV-Related Cancers
• Pancreatic Cancer
• Leukemia (CLL, MDS, AML)
• Prostate Cancer

http://www.cancermoonshots.org

Page 6: Starting the Hadoop Journey at a Global Leader in Cancer Research
Page 7: Starting the Hadoop Journey at a Global Leader in Cancer Research

• Volume
• Variety
• Velocity
• Veracity

Page 8: Starting the Hadoop Journey at a Global Leader in Cancer Research

Gulf of Mexico Analogy

Page 9: Starting the Hadoop Journey at a Global Leader in Cancer Research

Goals of Big Data Program

• Data driven organization
• All “types” of data
• “Access” for all customers
– Clinicians
– Researchers
– Administrative / Operational
• Enable discovery of “insights”
– Improve patient care
– Increase research discoveries
– Improve operations
• Govern data like an asset
• Provide a platform / environment to enable all these things

Page 10: Starting the Hadoop Journey at a Global Leader in Cancer Research

Goal

To provide the right information to the right people at the right time with the right tools

[Diagram labels: data; insight]

Page 11: Starting the Hadoop Journey at a Global Leader in Cancer Research

Insights

Page 12: Starting the Hadoop Journey at a Global Leader in Cancer Research

Make big data additive and build upon foundation

Page 13: Starting the Hadoop Journey at a Global Leader in Cancer Research

What are we doing today?

• FIRE Enterprise Data Warehouse
• Natural Language Processing (NLP)
• Data Governance
• Hadoop NoSQL
• Cognitive Computing
• Data Visualization
• Evolving our Platform / Architecture
• Identifying big data use cases
• Training & Skills

Page 14: Starting the Hadoop Journey at a Global Leader in Cancer Research

FIRE Program

• Federated Institutional Reporting Environment
• Centralized data repository supporting analytics, decision making, and business intelligence
• Central repository for historical and operational data
• Break down data silos

[Diagram labels: Source Systems (Genomic, Radiology, Labs, Epic / Clarity, Legacy Systems); Enterprise Repository; Analytics & Reporting (Dashboards, KPIs, Analytic Reports); Discoveries; Improve Patient Care; Quality / Perf Improvements]

Page 15: Starting the Hadoop Journey at a Global Leader in Cancer Research

NLP Pipeline - Overview

• Vast amounts of unstructured data are stored on MDACC servers.
• Conventional ETL tools are not designed to mine unstructured data.
• A suite of tools makes up the NLP Pipeline
• Dictionaries were created to help Epic go-live (Provider Friendly Terminology)
• Other examples:
– Diagnosis from the pathology reports
– Comorbidities
– Family Cancer History
– Cytogenetics
– Obituary text
– ICD10 Coding
– Structured results feeding Moonshot TRA and OEA
– Etc.

[Diagram labels: Unstructured Data Sources; IBM ECM NLP Engine; Post NLP Database; HDWF (FIRE)]

Page 16: Starting the Hadoop Journey at a Global Leader in Cancer Research

Big Data for Analytics & Cognitive Computing

[Diagram labels, grouped as far as the extraction allows]
Section labels: Enterprise Business; Clinical Big Data; Research
Tiers: Systems of Record; Systems of Reporting; Systems of Insights
Systems and sources: Peoplesoft, Kronos, Point of Sale, Volunteer Services, Rotary House, MyHR, UTPD, Facilities, Clinic Station, EpicLab, GE IDX, Cerner, CARE, EPM, Hyperion, Oracle Business Intelligence, Smart View, Web Analytics, FIRE, EIW, Business Objects, Crystal, Hyperion Interactive Reporting, Facebook, Twitter, UPS, Center for Disease Control, The Weather Channel, LinkedIN, Youtube, oracle.com, Yelp!, Reuters, Google, U.S. Census, Medical Devices, Medical Equipment, Building Controls, Campus Video, Real-time Location Service, Wayfinding, Parking Garages, Pharmacy, LCDR, Melcore, Gemini, IPCT
Other labels: Data Visualization; Ad Hoc; Cognitive Computing; Presentation; Cohort Explorer

Page 17: Starting the Hadoop Journey at a Global Leader in Cancer Research

Data Governance

• Data Stewardship
• Data Portal
• Data Profiling and Quality
• Data Standardization
• Compliance
• Metadata and Business Glossary
• Master Data Management

Page 18: Starting the Hadoop Journey at a Global Leader in Cancer Research

[Diagram labels: Source systems (Genomic, Radiology, Labs, Epic / Clarity, Legacy Systems); Data Repository; Data Lake; Big Data (Structured and NoSQL); Data Mgt & Operations (Data Discovery, Profiling, Standards / Quality); Analytics & Informatics (Dashboards, KPIs, Analytic Reports); Insight Apps; Discoveries; Improve Patient Care; Quality / Perf Improvements]

Page 19: Starting the Hadoop Journey at a Global Leader in Cancer Research

Big Data – High Level

Page 20: Starting the Hadoop Journey at a Global Leader in Cancer Research

Big Data Technical Architecture

Page 21: Starting the Hadoop Journey at a Global Leader in Cancer Research

Our Hadoop Implementation

Page 22: Starting the Hadoop Journey at a Global Leader in Cancer Research

Our Hadoop Implementation cont.

Page 23: Starting the Hadoop Journey at a Global Leader in Cancer Research

Our Hadoop Implementation cont.

• Average number of messages per day: 1,556,688
• Estimated amount of storage increase per day: 5.7 GB
• Number of channels currently being used: 24
• Estimated daily message processing capacity: 4,320,000
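
The four figures on this slide can be put in perspective with some quick arithmetic. The sketch below is only an illustration, not something shown in the deck: it assumes that "GB" means decimal gigabytes and that capacity is spread evenly across the 24 channels, and the function name `summarize` is made up for this example. Under those assumptions the average load works out to roughly 36% of the estimated daily capacity, about 18 messages per second against a capacity of 50 per second.

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Figures quoted on the slide.
avg_messages_per_day = 1_556_688
storage_growth_bytes_per_day = 5.7e9   # assumption: 5.7 GB read as decimal gigabytes
channels_in_use = 24
daily_capacity = 4_320_000


def summarize():
    """Derive per-second, per-channel and per-message figures from the slide's numbers."""
    return {
        "avg_messages_per_second": avg_messages_per_day / SECONDS_PER_DAY,
        "capacity_messages_per_second": daily_capacity / SECONDS_PER_DAY,
        # assumption: capacity is divided evenly across the channels in use
        "capacity_per_channel_per_day": daily_capacity / channels_in_use,
        "utilization": avg_messages_per_day / daily_capacity,
        "avg_bytes_per_message": storage_growth_bytes_per_day / avg_messages_per_day,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.2f}")
```

The average bytes per message is a rough figure only, since the storage increase is itself an estimate and may include overhead beyond the message payloads.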

Page 24: Starting the Hadoop Journey at a Global Leader in Cancer Research

Our Hadoop Implementation cont.
Medical Device Data Flow

[Diagram labels]
Stages: Data Source; Data Capture; Integration HUB; Data ingestion; MDA Big Data Data Lake; Access Portals (Analytics/Visualization)
Components: Medical Device; Capsule; Capsule DB; Cloverleaf Engine; Epic; TCP-based Data Listener - Flume; Processing Channels; Data Loader; HBase; HIVE; PIG; HUNK; Sqoop; FIRE/Big Data; End-Users
Message flow: Raw HL7 (from Capsule); Cleanse & Transform; Validated HL7 with Patient ID (from Epic)
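
The difference between "Raw HL7" and "Validated HL7 with Patient ID" in this data flow can be illustrated with a small check on an HL7 v2 message. The sketch below is not the implementation described in the talk (the diagram shows validation in the integration layer, with the patient ID coming from Epic). It only shows the general idea under stated assumptions: segments end with a carriage return, fields are separated by `|`, and a message counts as validated once its PID-3 field carries an identifier. The function names and the sample message are hypothetical.

```python
from typing import Dict, List, Optional

SEGMENT_SEP = "\r"  # HL7 v2 segments are terminated by a carriage return
FIELD_SEP = "|"     # default HL7 v2 field separator


def parse_segments(raw: str) -> Dict[str, List[List[str]]]:
    """Group each segment's fields under its segment name (MSH, PID, OBX, ...)."""
    segments: Dict[str, List[List[str]]] = {}
    for line in raw.strip().split(SEGMENT_SEP):
        if not line:
            continue
        fields = line.split(FIELD_SEP)
        segments.setdefault(fields[0], []).append(fields)
    return segments


def patient_id(segments: Dict[str, List[List[str]]]) -> Optional[str]:
    """Return the first component of PID-3, or None when it is absent or empty."""
    pid_rows = segments.get("PID", [])
    if not pid_rows or len(pid_rows[0]) <= 3:
        return None
    value = pid_rows[0][3].split("^")[0].strip()
    return value or None


def is_validated(raw: str) -> bool:
    """Treat a message as validated when it has a header and a patient identifier."""
    segments = parse_segments(raw)
    return "MSH" in segments and patient_id(segments) is not None


# Hypothetical sample message, invented purely for illustration.
SAMPLE = (
    "MSH|^~\\&|CAPSULE|ICU|||20160601120000||ORU^R01|0001|P|2.3\r"
    "PID|1||HYPOTHETICAL123^^^MDA||\r"
    "OBX|1|NM|HR^Heart Rate||72|bpm\r"
)

if __name__ == "__main__":
    print(is_validated(SAMPLE))
```

A production pipeline would normally rely on a dedicated HL7 library and handle escape sequences, repetitions and custom delimiters, which this sketch ignores.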

Page 25: Starting the Hadoop Journey at a Global Leader in Cancer Research

Our Hadoop Implementation cont.

[Diagram labels]
Environments: Developer Workstation/Sandbox; SVN (source control server); Bamboo (build server); HDP Dev Cluster; HDP QA Cluster; HDP Prod Cluster
Cycles: Development Cycle; Deployment Cycle
Steps:
• Daily Checkin/Checkout
• Smoke Test Before Updating Task status
• Periodic Integration & Validation: Build, Unit Test & Notify On Error
• On Dev Lead Approval: Build, Unit Test, Deploy & Tag
• On Successful UAT & Release Approval: Deploy Per Last Successful Build Tag

Page 26: Starting the Hadoop Journey at a Global Leader in Cancer Research

Lessons Learned – what went well

1. It’s complex
2. It’s a journey
3. Leverage existing strengths
4. Collaborate openly
5. Learn from experts
6. One cluster – multiple use cases
7. Follow best practices

[Diagram labels: people; process]

Page 27: Starting the Hadoop Journey at a Global Leader in Cancer Research

Next Steps

1. Continue to expand/evolve our platform
2. Ingest more data and data types
3. Identify high value use cases
4. Develop/Train people with new skills

Page 28: Starting the Hadoop Journey at a Global Leader in Cancer Research

Train People with new Skills

• Accessing data
• Computing data
• Visualizing data

Insights & Cognitive Computing

Page 29: Starting the Hadoop Journey at a Global Leader in Cancer Research