Posted 13-Apr-2017

Fighting Fraud in Medicare with Apache Spark

Miklos Christine, Solutions Architect
mwc@databricks.com, @Miklos_C

About Me: Miklos Christine
Solutions Architect @ Databricks

- Help customers architect big data platforms
- Help customers understand big data best practices

Previously:
- Systems Engineer @ Cloudera

- Supported customers running a few of the largest clusters in the world

- Software Engineer @ Cisco

Databricks, the company behind Apache Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%


Created Databricks on top of Spark to make big data simple.

Next Generation Big Data Processing Engine

• Started as a research project at UC Berkeley in 2009
• 600,000 lines of code (75% Scala)
• Last release: Spark 1.6, December 2015
• Next release: Spark 2.0
• Open source license (Apache 2.0)
• Built by 1,000+ developers from 200+ companies


Apache Spark Engine

Spark Core
Libraries: Spark SQL, Spark Streaming, SparkML / MLlib, GraphFrames / GraphX

Unified engine across diverse workloads & environments

Scale-out, fault-tolerant execution

Python, Java, Scala, and R APIs

Standard libraries

History of Spark APIs

RDD (2011)
- Distributed collection of JVM objects
- Functional operators (map, filter, etc.)

DataFrame (2013)
- Distributed collection of Row objects
- Expression-based operations and UDFs
- Logical plans and optimizer
- Fast/efficient internal representations

Dataset (2015)
- Internally rows, externally JVM objects
- Almost the “best of both worlds”: type safe + fast
- But slower than DataFrames, and not as good for interactive analysis, especially in Python

Apache Spark 2.0 API

Dataset (2016): DataFrame = Dataset[Row]
- DataFrame (untyped API): convenient for interactive analysis; faster
- Dataset (typed API): optimized for data engineering; fast

Benefit of the Logical Plan: Performance Parity Across Languages

[Chart: DataFrame vs. RDD performance compared across languages]

Machine Learning with Apache Spark

Why do Machine Learning?

• Machine learning uses computers and algorithms to recognize patterns in data
• Businesses have to adapt faster to change
• Data-driven decisions need to be made quickly and accurately
• Customers expect faster responses


From Descriptive to Predictive to Prescriptive


Iterate on Your Models


Spark ML

Why Spark ML: provide general-purpose ML algorithms on top of Spark
• Let Spark handle the distribution of data and queries; scalability
• Leverage its improvements (e.g. DataFrames, Datasets, Tungsten)

Advantages of MLlib’s design:
• Simplicity
• Scalability
• Streamlined end-to-end
• Compatibility

SparkML Pipelines provide:
• Integration with DataFrames
• Familiar API based on scikit-learn
• Easy workflow inspection
• Simple parameter tuning


Databricks & SparkML
• Use DataFrames to directly access data (SQL, raw files)
• Extract, transform, and load data using an elastic cluster
• Create the model using all of the data
• Iterate many times on the model
• Deploy the same model to production using the same code
• Repeat

Advantages of Spark ML
• Data can be directly accessed using the Spark Data Sources API (no more endless hours copying data between systems)
• Data scientists can use all of the data rather than subsamples, and take advantage of the law of large numbers to improve model accuracy
• Data scientists can scale compute with data size and model complexity
• Data scientists can iterate more, giving them the opportunity to create better models and to test and release more frequently

SparkML - Tips
• Understand Spark partitions
• Use the Parquet file format and compact files
• coalesce() / repartition()
• Leverage existing functions / UDFs
• Leverage DataFrames and SparkML
• Iterative algorithms: use more cores for faster processing


What’s New in Spark 2.0

Spark 2.0 - SparkML
• The RDD-based MLlib API is deprecated and in maintenance mode
• New algorithm support: Bisecting K-Means clustering, Gaussian Mixture Model, MaxAbsScaler feature transformer
• PySpark updates: LDA, Gaussian Mixture Model, Generalized Linear Regression
• Model persistence across languages


Spark Demo

Thanks!

Sign Up For Databricks Community Edition! https://databricks.com/try-databricks

Learning More About MLlib

Guides & examples:
• Example workflow using ML Pipelines (Python)
• Power plant data analysis workflow (Scala)
• The above two links are part of the Databricks Guide, which contains many more examples and references.

References:
• Apache Spark MLlib User Guide: contains code snippets for almost all algorithms, as well as links to API documentation.
• Meng et al. “MLlib: Machine Learning in Apache Spark.” 2015. http://arxiv.org/abs/1505.06807 (academic paper)
