Fighting Fraud in Medicare with Apache Spark
Miklos Christine, Solutions Architect ([email protected], @Miklos_C)

Upload: miklos-christine

Post on 13-Apr-2017

TRANSCRIPT

Page 1: Fighting Fraud with Apache Spark

Fighting Fraud in Medicare with Apache Spark

Miklos Christine, Solutions Architect
[email protected], @Miklos_C

Page 2: Fighting Fraud with Apache Spark

About Me: Miklos Christine
Solutions Architect @ Databricks

- Assist customers in architecting big data platforms
- Help customers understand big data best practices

Previously:

- Systems Engineer @ Cloudera
  - Supported customers running a few of the largest clusters in the world
- Software Engineer @ Cisco

Page 3: Fighting Fraud with Apache Spark

Databricks, the company behind Apache Spark

Founded by the creators of Apache Spark in 2013

75%: share of Spark code contributed by Databricks in 2014

Created Databricks on top of Spark to make big data simple.

Page 4: Fighting Fraud with Apache Spark
Page 5: Fighting Fraud with Apache Spark
Page 6: Fighting Fraud with Apache Spark
Page 7: Fighting Fraud with Apache Spark
Page 8: Fighting Fraud with Apache Spark

Next Generation Big Data Processing Engine

Page 9: Fighting Fraud with Apache Spark

• Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Last release: Spark 1.6 (December 2015)
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1000+ developers from 200+ companies


Page 10: Fighting Fraud with Apache Spark

Apache Spark Engine

Spark Core

Spark SQL | Spark Streaming | SparkML / MLlib | GraphFrames / GraphX

Unified engine across diverse workloads & environments

Scale-out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

Page 11: Fighting Fraud with Apache Spark

History of Spark APIs

RDD (2011)
• Distributed collection of JVM objects
• Functional operators (map, filter, etc.)

DataFrame (2013)
• Distributed collection of Row objects
• Expression-based operations and UDFs
• Logical plans and optimizer
• Fast/efficient internal representations

DataSet (2015)
• Internally rows, externally JVM objects
• Almost the "best of both worlds": type safe + fast
• But slower than DataFrames; not as good for interactive analysis, especially in Python

Page 12: Fighting Fraud with Apache Spark

Apache Spark 2.0 API

DataSet (2016)

DataFrame (untyped API)
• DataFrame = Dataset[Row]
• Convenient for interactive analysis
• Faster

DataSet (typed API)
• Optimized for data engineering
• Fast

Page 13: Fighting Fraud with Apache Spark

Benefit of Logical Plan: Performance Parity Across Languages

(Benchmark chart: DataFrame performance is uniform across languages, while RDD performance varies by language)

Page 14: Fighting Fraud with Apache Spark

Machine Learning with Apache Spark

Page 15: Fighting Fraud with Apache Spark

Why do Machine Learning?

• Machine learning is using computers and algorithms to recognize patterns in data

• Businesses have to adapt faster to change

• Data-driven decisions need to be made quickly and accurately

• Customers expect faster responses


Page 16: Fighting Fraud with Apache Spark

From Descriptive to Predictive to Prescriptive


Page 18: Fighting Fraud with Apache Spark

Iterate on Your Models


Page 19: Fighting Fraud with Apache Spark

Spark ML

Page 20: Fighting Fraud with Apache Spark

Why Spark ML
Provide general-purpose ML algorithms on top of Spark

• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib's design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

Page 21: Fighting Fraud with Apache Spark

SparkML
ML Pipelines provide:

• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning

Page 22: Fighting Fraud with Apache Spark

Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)

• Extract, Transform and Load Data using an elastic cluster

• Create the model using all of the data

• Iterate many times on the model

• Deploy the same model to production using the same code

• Repeat

Page 23: Fighting Fraud with Apache Spark

Advantages for Spark ML
• Data can be directly accessed using the Spark Data Sources API (no more endless hours copying data between systems)

• Data scientists can use all of the data rather than subsamples, and take advantage of the Law of Large Numbers to improve model accuracy

• Data scientists can scale compute with the data size and model complexity

• Data scientists can iterate more, giving them the opportunity to create better models and to test and release more frequently

Page 24: Fighting Fraud with Apache Spark

SparkML - Tips
• Understand Spark partitions
  • Parquet file format and compact files
  • coalesce() / repartition()
• Leverage existing functions / UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms
  • More cores for faster processing

Page 25: Fighting Fraud with Apache Spark

What's New in Spark 2.0

Page 26: Fighting Fraud with Apache Spark

Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode; the DataFrame-based SparkML API is the primary API going forward

• New algorithm support
  • Bisecting k-means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer

• PySpark updates
  • LDA, Gaussian Mixture Model, Generalized Linear Regression

• Model persistence across languages

Page 27: Fighting Fraud with Apache Spark

Spark Demo

Page 28: Fighting Fraud with Apache Spark

Thanks!

Sign Up For Databricks Community Edition! https://databricks.com/try-databricks

Page 29: Fighting Fraud with Apache Spark

Learning More About MLlib
Guides & examples
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above 2 links are part of the Databricks Guide, which contains many more examples and references.

References
• Apache Spark MLlib User Guide
  • The MLlib User Guide contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. "MLlib: Machine Learning in Apache Spark." 2015. http://arxiv.org/abs/1505.06807 (academic paper)