From SAS to PySpark: Road to Analytics
TRANSCRIPT
Contents
1. SAS vs Spark
2. SAS Proc SQL vs Spark SQL
3. Advantage Analytics
1. SAS vs Spark
OVERVIEW
SAS
○ The largest independent vendor in “advanced analytics”
○ SAS Institute founded in 1976 in Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for large-scale data processing
○ Started in 2009 as a research project at UC Berkeley's AMPLab
○ Open source
CODE
SAS
Basic programming model consists of code blocks:
○ SAS DATA Step
■ generation of data
■ concatenation of data
○ SAS PROCedures
■ special functionalities
SPARK
“Line based” programming. Native language is Scala, but the programming model is flexible:
○ Scala
○ Java
○ Python
○ R
DATA
SAS: DATASET
○ Computed in memory (RAM)
○ A data set contains:
● observations: organized in rows
● variables: organized in columns
SPARK: DATAFRAME
○ A distributed collection of data organized into named columns.
○ It is conceptually equivalent to a table in a relational database or a dataframe in R/Python
○ It is a programming abstraction
IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
Transformations like map, filter, union, join, groupBy… result in another DataFrame
SAS:
data sasData;
  set sasData;
  Fare2 = Fare + 2;
run;

Python Pandas:
pandasDF['Fare2'] = pandasDF['Fare'] + 2

Spark:
sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)
NOTEBOOK
READ SAS DATASETS
The SAS file (sas7bdat) is a binary file with a special structure created by SAS
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
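A minimal Python sketch of the slide above. Besides the sas7bdat package, pandas ships its own reader, `read_sas`; the file path below is hypothetical, so the actual call is shown commented out:

```python
import pandas as pd

# pandas can parse the SAS binary format directly.
# With a real file the pattern is (path is hypothetical):
#   pandasDF = pd.read_sas("titanic.sas7bdat")
#   sparkDF = spark.createDataFrame(pandasDF)  # hand it to Spark

# The reader is part of the pandas API:
reader = pd.read_sas
```

Converting through pandas works for files that fit in one machine's memory; for very large SAS files a distributed reader is preferable.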
2. SAS Proc SQL vs Spark SQL
SQL sentences
SAS PROC SQL
SAS Procedure that combines the functionality of DATA and PROC steps. It can sort, summarize, subset, join, concatenate datasets, create new variables...
Spark SQL
○ Spark’s interface for working with structured and semi-structured data, query using SQL
○ Load data from JSON, Hive, Parquet
○ Evaluated “lazily”
SQL sentences
SAS PROC SQL
PROC SQL;
  CREATE TABLE newTable AS
  SELECT Columns
  FROM Table
  WHERE Column > Value
  GROUP BY Columns;
QUIT;
Spark SQL
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
newTable = sqlContext.sql("SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns")
NOTEBOOK
AGGREGATE FUNCTIONS IN SPARK SQL
sum, avg, mean, count, max, min, first, last, stddev, variance, skewness, kurtosis…
After aggregation: they act on each group of data and return a single value as a result
WINDOW FUNCTIONS IN SPARK SQL
Ranking: rank, dense_rank, percent_rank, ntile, row_number
Analytics: cume_dist, lag, lead, first_value, last_value
Aggregate: aggregate functions
They calculate a return value over a set of rows, called a window, that are somehow related to the current row
NOTEBOOK
EXTEND SPARK SQL
Over 100 standard functions are built in (pyspark):
from pyspark.sql.functions import *
BUILT-IN FUNCTIONS, UDFs
“User Defined Functions” define new column-based functions that extend the vocabulary of Spark SQL. They act on a single row as input and return a single value for every input row
NOTEBOOK
TIPS
○ Don't think in terms of sorted data: in a parallel process we can't access the data row by row
○ Cache tables/DataFrames when they are used more than once
○ Merge doesn't need ordered data, unlike in SAS
○ Use functions already defined instead of creating your own UDF
○ Save data in a columnar format such as Parquet
○ Avoid collecting data when you are working with Big Data; take a sample instead
3. Advantage Analytics
ADVANTAGE ANALYTICS
SAS Stats
Traditional Add-on package to SAS for Statistics
○ Analysis of variance
○ Bayesian analysis
○ Categorical data analysis
○ Distribution analysis
○ Mixed models
○ Predictive modeling...
Spark MLlib
Scalable machine learning library
○ Basic statistics
○ Classification and regression
○ Collaborative filtering
○ Clustering
○ Dimensionality reduction
○ Feature extraction and transformation...
BIBLIOGRAPHY
SPARK DOCUMENTATION: https://spark.apache.org/docs/2.0.0/
PYSPARK API: https://spark.apache.org/docs/2.0.0/api/python/index.html
PYSPARK FUNCTIONS: https://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/functions.html