big data talking stories in healthcare

Azure Data Platform Services

Mostafa Elzoghbi

http://mostafa.rocks

Twitter: @MostafaElzoghbi

Session takeaways and objectives

• Azure Data Platform Services• HDInsight Clusters in Azure

• Data Storage: Apache Hive, Apache Hbase, Azure Data Catalog

• Data Transformations: Apache Storm, Apache Spark, Azure Data Factory

• Healthcare / Life Sciences Use Cases

Azure Data Platform Services

HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis.

It includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and so on.

HDInsight also integrates with business intelligence (BI) tools such as Power BI, Excel, SQL Server Analysis Services, and SQL Server Reporting Services.

HDInsight is available on Windows and Linux

HDInsight on Linux: A Hadoop cluster on Ubuntu

HDInsight on Windows: A Hadoop cluster on Win Server 2012 R2

What is HDInsight

HDInsight provides cluster Types & custom configurations for:

• Hadoop (HDFS)

• HBase

• Storm

• Spark

• R Server

• Hive Interactive, Kafka (Preview)

HDInsight has powerful programming extensions for languages including C#, Java, and .NET.

Business Value Proposition: Skip maintaining and purchasing hardware

HDInsight clusters on Azure

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-develop-deploy-streaming-jobs/

https://azure.microsoft.com/en-us/documentation/articles/hdinsight-develop-deploy-java-mapreduce/

http://go.microsoft.com/fwlink/?linkid=619287&clcid=0x409

HDInsight clusters on Azure

Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable.

HBase provides random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families

Data is stored in the rows of a table, and data within a row is grouped by column family.

The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.

What is HBase

Order No Customer

Name

Customer

Phone

Company Name Company

Address

12012015 Mostafa 101-232-2345 Microsoft Redmond, WA

Customer Company

Order No Customer

Name

Customer

Phone

Company Name Company

Address

12012015 Mostafa 101-232-2345 Microsoft Redmond, WA

HBase Commands:

create Equivalent to Create table in T-SQL

get Equivalent to Select statements in T-SQL

put Equivalent to Update, Insert statement in T-SQL

scan Equivalent to Select (no where condition) in T-SQL

HBase shell is your query tool to execute in CRUD commands to a HBase cluster.

Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.

An HBase database can also be queried by using Hive using SQLHive.

What is HBase

Apache Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and analysis of data by using HiveQL (a query language similar to SQL).

Hive understands how to work with structured and semi-structured data, such as text files where the fields are delimited by specific characters.

Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.

Hive can also be extended through user-defined functions (UDF).

A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL.

Support for multiple execution engine: WebHCat, HiveServer2 (faster) or Tez.

What is Hive

Apache Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real-time with Hadoop.

Apache Storm on HDInsight allows you to create distributed, real-time analytics solutions in the Azure environment by using Apache Hadoop.

Ability to write Storm components in C#, JAVA and Python.

Azure Scale up or Scale down without an impact for running Storm topologies.

Ease of provision and use in Azure portal & development templates in Visual Studio.

What is Apache Storm

Apache Storm apps are submitted as Topologies.

A topology is a graph of computation that processes streams

Stream: An unbound collection of tuples. Streams are produced by spouts and bolts, and they are consumed by bolts.

Tuple: A named list of dynamically typed values.

Spout: Consumes data from a data source and emits one or more streams.

Bolt: Consumes streams, performs processing on tuples, and may emit streams. Bolts are also responsible for writing data to external storage, such as a queue, HDInsight, HBase, a blob, or other data store.

Nimbus: JobTracker in Hadoop that distribute jobs, monitoring failures.

Apache Storm Components

Apache Spark™ is a fast and general engine for large-scale data processing.

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Write applications quickly in Java, Scala, Python, R.

Combine SQL, streaming, and complex analytics.

Spark's in-memory computation capabilities

make it a good choice for iterative algorithms in

ML and graph computations.

Support for R Server & Azure Data Lake.

What is Apache Spark

Part 3: Single Slide

DEMO –Apache Spark

Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.

It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics.

Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance.

Data Lake Analytics—a no-limits analytics job service to power intelligent action

Azure Data Lake (ADL)

Azure Data Lake (ADL)

Azure Data Lake Analytics is a new service, built to make big data analytics easy.

Key capabilities:

Dynamic scaling: do analytics on terabytes or even exabytes of data

Develop faster, debug, and optimize smarter using familiar tools: U-SQL Jobs.

U-SQL: simple and familiar, powerful, and extensible: SQL language with the power of expressive C#, process & analyze data with skills you already have.

Affordable and cost effective.

Works with all your Azure Data: Data Lake Analytics is optimized to work with Azure Data Lake - providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics can also work with Azure Blob storage and Azure SQL database.

Azure Data Lake Analytics

Part 3: Single Slide

DEMO –Azure Data Lake

Healthcare &Pharmaceuticals Use Cases

Provider of enzymes for Food, Pharma Industry

PharmaceuticalsGlobal provider of natural ingredients such as cultures and enzymes for the food, pharmaceutical, nutritional, and agricultural industries.

Part 1: What They Did | Gene Analysis

ChallengeCollect clinical trial data from electronic laboratory notebooks (ELNs)

Collect structured and unstructured data from sources like automated equipment (robots, temperature sensors, and other devices)

Need to do analysis and look for patterns on this data

SolutionChose Azure HDInsight, SQL Server on-premises

Examine patterns in chemical composition and physical properties of cultures and enzymes e.g. examine the gene composition of our bacteria and how it actually behaves in yogurt

Gene Analysis

BK1

Provider of enzymes for Food, Pharma Industry

Part 2: How They Did It | Gene Analysis

How They Did ItCollect data in Azure Blobs• Extract data from all the files (GBs in size)

• Transpose into JSON using .NET

HDInsight processes data for insights• Hive is used to run queries

• Mainly Select statements (joins/unions)

• Maximum 20 lines of code

Use SQL Server for reporting

Use Hive ODBC connector

Gene Analysis

SQL ServerOn-premises

Lab Information Management

System

Automated equipment (robots, temperature sensors, etc)

Healthcare Application Provider

HealthcareLeading provider of healthcare software applications for clinical, financial, pharmacy, etc.

Part 1: What They Did | Data Lake to deliver better analytics

ChallengeMultiple healthcare applications operating in product silos

Want to implement a data lake and provide distributed processing for better analytics and predictions for their end customers:

• Population, risk, and Care management

• Clinical decision support using predictions

• Real time quality measures to assist providers reach their regulatory requirements

• Capacity management predictions

• Enrich clinical data with NLP on unstructured physicians notes to reduce over/under treatment, readmissions, and faster claims processing

SolutionChose Azure HDInsight, HBase, and custom .NET application

Begin by storing all medical records in Azure

Acquire, clean data and insert into data lake

Data LakeAnalytics

BK1

Healthcare Application Provider

Part 2: How They Did It | Data Lake to deliver better analytics

How They Did ItCollect data in Azure Blobs• Have medical records in JSON format

HDInsight processes data for insights• MapReduce preprocesses data

• used to insert data into data lake

• Developed .NET application to interface with HBase

Data Lake Analytics

Azure Blobs

Azure HDInsight

Medical Records in JSON

.NET Application

Insert into Blobs

References

• HDInsight Documentation (R, Storm, Spark, Kafka, ADL,..etc)

https://azure.microsoft.com/en-us/services/hdinsight/

• Spark Programming Guide

http://spark.apache.org/docs/latest/programming-guide.html

• edx.org: Free Apache Spark courses

• Get started with Data Science VMs in Azure

https://blogs.technet.microsoft.com/machinelearning/tag/dsvm/

https://blogs.technet.microsoft.com/machinelearning/tag/dsvm/

big data talking stories in healthcare

Technology