big data talking stories in healthcare
TRANSCRIPT
Session takeaways and objectives
• Azure Data Platform Services• HDInsight Clusters in Azure
• Data Storage: Apache Hive, Apache Hbase, Azure Data Catalog
• Data Transformations: Apache Storm, Apache Spark, Azure Data Factory
• Healthcare / Life Sciences Use Cases
HDInsight is a cloud implementation on Microsoft Azure of the rapidly expanding Apache Hadoop technology stack that is the go-to solution for big data analysis.
It includes implementations of Apache Spark, HBase, Storm, Pig, Hive, Sqoop, Oozie, Ambari, and so on.
HDInsight also integrates with business intelligence (BI) tools such as Power BI, Excel, SQL Server Analysis Services, and SQL Server Reporting Services.
HDInsight is available on Windows and Linux
HDInsight on Linux: A Hadoop cluster on Ubuntu
HDInsight on Windows: A Hadoop cluster on Win Server 2012 R2
What is HDInsight
HDInsight provides cluster Types & custom configurations for:
• Hadoop (HDFS)
• HBase
• Storm
• Spark
• R Server
• Hive Interactive, Kafka (Preview)
HDInsight has powerful programming extensions for languages including C#, Java, and .NET.
Business Value Proposition: Skip maintaining and purchasing hardware
HDInsight clusters on Azure
Apache HBase is an open-source, NoSQL database that is built on Hadoop and modeled after Google BigTable.
HBase provides random access and strong consistency for large amounts of unstructured and semistructured data in a schemaless database organized by column families
Data is stored in the rows of a table, and data within a row is grouped by column family.
The open-source code scales linearly to handle petabytes of data on thousands of nodes. It can rely on data redundancy, batch processing, and other features that are provided by distributed applications in the Hadoop ecosystem.
What is HBase
Order No Customer
Name
Customer
Phone
Company Name Company
Address
12012015 Mostafa 101-232-2345 Microsoft Redmond, WA
Customer Company
Order No Customer
Name
Customer
Phone
Company Name Company
Address
12012015 Mostafa 101-232-2345 Microsoft Redmond, WA
HBase Commands:
create Equivalent to Create table in T-SQL
get Equivalent to Select statements in T-SQL
put Equivalent to Update, Insert statement in T-SQL
scan Equivalent to Select (no where condition) in T-SQL
HBase shell is your query tool to execute in CRUD commands to a HBase cluster.
Data can also be managed using the HBase C# API, which provides a client library on top of the HBase REST API.
An HBase database can also be queried by using Hive using SQLHive.
What is HBase
Apache Hive is a data warehouse system for Hadoop, which enables data summarization, querying, and analysis of data by using HiveQL (a query language similar to SQL).
Hive understands how to work with structured and semi-structured data, such as text files where the fields are delimited by specific characters.
Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.
Hive can also be extended through user-defined functions (UDF).
A UDF allows you to implement functionality or logic that isn't easily modeled in HiveQL.
Support for multiple execution engine: WebHCat, HiveServer2 (faster) or Tez.
What is Hive
Apache Storm is a distributed, fault-tolerant, open-source computation system that allows you to process data in real-time with Hadoop.
Apache Storm on HDInsight allows you to create distributed, real-time analytics solutions in the Azure environment by using Apache Hadoop.
Ability to write Storm components in C#, JAVA and Python.
Azure Scale up or Scale down without an impact for running Storm topologies.
Ease of provision and use in Azure portal & development templates in Visual Studio.
What is Apache Storm
Apache Storm apps are submitted as Topologies.
A topology is a graph of computation that processes streams
Stream: An unbound collection of tuples. Streams are produced by spouts and bolts, and they are consumed by bolts.
Tuple: A named list of dynamically typed values.
Spout: Consumes data from a data source and emits one or more streams.
Bolt: Consumes streams, performs processing on tuples, and may emit streams. Bolts are also responsible for writing data to external storage, such as a queue, HDInsight, HBase, a blob, or other data store.
Nimbus: JobTracker in Hadoop that distribute jobs, monitoring failures.
Apache Storm Components
Apache Spark™ is a fast and general engine for large-scale data processing.
Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
Write applications quickly in Java, Scala, Python, R.
Combine SQL, streaming, and complex analytics.
Spark's in-memory computation capabilities
make it a good choice for iterative algorithms in
ML and graph computations.
Support for R Server & Azure Data Lake.
What is Apache Spark
Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape, and speed, and do all types of processing and analytics across platforms and languages.
It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics.
Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance.
Data Lake Analytics—a no-limits analytics job service to power intelligent action
Azure Data Lake (ADL)
Azure Data Lake Analytics is a new service, built to make big data analytics easy.
Key capabilities:
Dynamic scaling: do analytics on terabytes or even exabytes of data
Develop faster, debug, and optimize smarter using familiar tools: U-SQL Jobs.
U-SQL: simple and familiar, powerful, and extensible: SQL language with the power of expressive C#, process & analyze data with skills you already have.
Affordable and cost effective.
Works with all your Azure Data: Data Lake Analytics is optimized to work with Azure Data Lake - providing the highest level of performance, throughput, and parallelization for your big data workloads. Data Lake Analytics can also work with Azure Blob storage and Azure SQL database.
Azure Data Lake Analytics
Provider of enzymes for Food, Pharma Industry
PharmaceuticalsGlobal provider of natural ingredients such as cultures and enzymes for the food, pharmaceutical, nutritional, and agricultural industries.
Part 1: What They Did | Gene Analysis
ChallengeCollect clinical trial data from electronic laboratory notebooks (ELNs)
Collect structured and unstructured data from sources like automated equipment (robots, temperature sensors, and other devices)
Need to do analysis and look for patterns on this data
SolutionChose Azure HDInsight, SQL Server on-premises
Examine patterns in chemical composition and physical properties of cultures and enzymes e.g. examine the gene composition of our bacteria and how it actually behaves in yogurt
Gene Analysis
BK1
Provider of enzymes for Food, Pharma Industry
Part 2: How They Did It | Gene Analysis
How They Did ItCollect data in Azure Blobs• Extract data from all the files (GBs in size)
• Transpose into JSON using .NET
HDInsight processes data for insights• Hive is used to run queries
• Mainly Select statements (joins/unions)
• Maximum 20 lines of code
Use SQL Server for reporting
Use Hive ODBC connector
Gene Analysis
SQL ServerOn-premises
Lab Information Management
System
Automated equipment (robots, temperature sensors, etc)
Healthcare Application Provider
HealthcareLeading provider of healthcare software applications for clinical, financial, pharmacy, etc.
Part 1: What They Did | Data Lake to deliver better analytics
ChallengeMultiple healthcare applications operating in product silos
Want to implement a data lake and provide distributed processing for better analytics and predictions for their end customers:
• Population, risk, and Care management
• Clinical decision support using predictions
• Real time quality measures to assist providers reach their regulatory requirements
• Capacity management predictions
• Enrich clinical data with NLP on unstructured physicians notes to reduce over/under treatment, readmissions, and faster claims processing
SolutionChose Azure HDInsight, HBase, and custom .NET application
Begin by storing all medical records in Azure
Acquire, clean data and insert into data lake
Data LakeAnalytics
BK1
Healthcare Application Provider
Part 2: How They Did It | Data Lake to deliver better analytics
How They Did ItCollect data in Azure Blobs• Have medical records in JSON format
HDInsight processes data for insights• MapReduce preprocesses data
• used to insert data into data lake
• Developed .NET application to interface with HBase
Data Lake Analytics
Azure Blobs
Azure HDInsight
Medical Records in JSON
.NET Application
Insert into Blobs
References
• HDInsight Documentation (R, Storm, Spark, Kafka, ADL,..etc)
https://azure.microsoft.com/en-us/services/hdinsight/
• Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html
• edx.org: Free Apache Spark courses
• Get started with Data Science VMs in Azure
https://blogs.technet.microsoft.com/machinelearning/tag/dsvm/