From SAS to PySpark: Road to Analytics
TRANSCRIPT
Contents
1. SAS vs Spark
2. SAS Proc SQL vs Spark SQL
3. Advantage Analytics
1. SAS vs Spark
OVERVIEW
SAS
○ The largest independent vendor in “advanced analytics”
○ SAS Institute founded in 1976 in Cary, North Carolina
○ Commercial software product
SPARK
○ A fast and general engine for large-scale data processing
○ Started in 2009 as a research project at UC Berkeley's AMPLab
○ Open source
CODE
SAS
Basic programming model consists of code blocks:
○ SAS DATA Step
■ generation of data
■ concatenation of data
○ SAS PROCedures
■ special functionalities
SPARK
“Line based” programming. Native language is Scala, but the programming model is flexible:
○ Scala
○ Java
○ Python
○ R
DATA
SAS: DATASET
○ Computed in memory (RAM)
○ A data set contains:
● observations: organized in rows
● variables: organized in columns
SPARK: DATAFRAME
○ A distributed collection of data organized into named columns.
○ It is conceptually equivalent to a table in a relational database or a dataframe in R/Python
○ It is a programming abstraction
IMMUTABLE, PARTITIONED, DISTRIBUTED DATA STRUCTURE
Transformations like map, filter, union, join, groupBy… result in another DataFrame
SAS:
data sasData;
  set sasData;
  Fare2 = Fare + 2;
run;

Python Pandas:
pandasDF['Fare2'] = pandasDF['Fare'] + 2

Spark:
sparkDF = sparkDF.withColumn('Fare2', sparkDF['Fare'] + 2)
NOTEBOOK
READ SAS DATASETS
The SAS file (sas7bdat) is a binary file with a special structure created by SAS
● PYTHON: SAS7BDAT PACKAGE
● R: HAVEN LIBRARY
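A minimal Python sketch of the slide above. Besides the sas7bdat package, pandas ships its own reader, `read_sas`; the file path below is hypothetical, so the actual call is shown commented out:

```python
import pandas as pd

# pandas can parse the SAS binary format directly.
# With a real file the pattern is (path is hypothetical):
#   pandasDF = pd.read_sas("titanic.sas7bdat")
#   sparkDF = spark.createDataFrame(pandasDF)  # hand it to Spark

# The reader is part of the pandas API:
reader = pd.read_sas
```

Converting through pandas works for files that fit in one machine's memory; for very large SAS files a distributed reader is preferable.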
2. SAS Proc SQL vs Spark SQL
SQL sentences
SAS PROC SQL
SAS Procedure that combines the functionality of DATA and PROC steps. It can sort, summarize, subset, join, concatenate datasets, create new variables...
Spark SQL
○ Spark’s interface for working with structured and semi-structured data, query using SQL
○ Load data from JSON, Hive, Parquet
○ Evaluated “lazily”
SQL sentences
SAS PROC SQL
PROC SQL;
  CREATE TABLE newTable AS
  SELECT Columns
  FROM Table
  WHERE Column > Value
  GROUP BY Columns;
QUIT;
Spark SQL
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
newTable = sqlContext.sql("SELECT Columns FROM Table WHERE Column > Value GROUP BY Columns")
NOTEBOOK
AGGREGATE FUNCTIONS IN SPARK SQL
sum, avg, mean, count, max, min, first, last, stddev, variance, skewness, kurtosis…
After aggregation: they act on each group of data and return a single value as a result
WINDOW FUNCTIONS IN SPARK SQL
Ranking: rank, dense_rank, percent_rank, ntile, row_number
Analytics: cume_dist, lag, lead, first_value, last_value
Aggregate: aggregate functions
They calculate a return value over a set of rows, called a window, that are somehow related to the current row
NOTEBOOK
EXTEND SPARK SQL
Over 100 standard functions are built in (pyspark):
from pyspark.sql.functions import *
BUILT-IN FUNCTIONS, UDFs
“User Defined Functions” define new column-based functions that extend the vocabulary of Spark SQL. They act on a single row as input and return a single value for every input row
NOTEBOOK
TIPS
○ Don't think in terms of sorted data: in a parallel process we can't access the data row by row
○ Cache tables/DataFrames when they are used more than once
○ Merge doesn't need ordered data, unlike in SAS
○ Use functions already defined instead of creating your own UDF
○ Save data in a columnar format such as Parquet
○ Avoid collecting data when you are working with Big Data; take a sample instead
3. Advantage Analytics
ADVANTAGE ANALYTICS
SAS Stats
Traditional Add-on package to SAS for Statistics
○ Analysis of variance
○ Bayesian analysis
○ Categorical data analysis
○ Distribution analysis
○ Mixed models
○ Predictive modeling...
Spark MLlib
Scalable machine learning library
○ Basic statistics
○ Classification and regression
○ Collaborative filtering
○ Clustering
○ Dimensionality reduction
○ Feature extraction and transformation...
BIBLIOGRAPHY
SPARK DOCUMENTATION: https://spark.apache.org/docs/2.0.0/
PYSPARK API: https://spark.apache.org/docs/2.0.0/api/python/index.html
PYSPARK FUNCTIONS: https://spark.apache.org/docs/2.0.0/api/python/_modules/pyspark/sql/functions.html