big data analytics with r - uibk.ac.at data... · data processing with r using statistical...

62
Data Processing with R Using statistical computing and graphics Software for Business Intelligence and Analytics on a Data Warehouse Environment

Upload: dinhnhu

Post on 11-Aug-2018

223 views

Category:

Documents


0 download

TRANSCRIPT

Data Processing with RUsing statistical computing and graphics Software for Business Intelligence and Analytics on a Data Warehouse Environment

Get these slides here

About this Course

This course is tailored towards IS Master students and focuses on the handling and processing of data in an Enterprise environment where data are often found in relational structures (e.g. SQL Server, Oracle or MySQL) and analysis is supported by (self-service) business intelligence tools (e.g. Microsoft Excel, PowerPivot or R).

This course takes the case of using the statistical computing and graphics software R on a relational Data Warehouse which is located on a Microsoft SQL Server. It particularly aims at (1) extracting analysis-oriented data from the storage, (2) processing data using statistical procedures and (3) generating graphical output material.

Course Objectives

• Practice data processing in an enterprise-like setting

• Use typical technology that is used in enterprises

• Reflect on the limits and potentials of data processing techniques

Value Chain of Data Analysis

Course Focus

D(Big) Data

PProcessing

OOutput

KKnowledge

AAction

IImpact

extract generate discover initiate realize

Course Assignment Part 1: ReaderPapers:Chaudhuri, S., Dayal, U., & Narasayya, V. (2011). An overview of business intelligence technology. Communications of the ACM, 54(8), 88–98.Smith, D. (2014). R Is Still Hot – and Getting Hotter. Revolution Analytics White Paper. Retrieved from http://www.revolutionanalytics.com/r-is-still-hotAli-Hassan, H. et al. (2015). Linking dimensions of social media use to job performance: The role of social capital. Journal of Strategic Information Systems 24 (2015) 65–89Benthaus, J. et al. (2016). Social media management strategies for organizational impression management and their effect on public perception. Journal of Strategic Information Systems. In Press.

Questions for papers 1 and 2:• How can Business Intelligence and Analytics

contribute to business impact?• How can technology help to assist Business

Intelligence and Analytics?• What limitations can you see in the application of

Business Intelligence and Analytics?• Give a short example of where Business

Intelligence and Analytics help management decisions.

Questions for papers 3 and 4:• What are the research questions and contributions

of the papers?• What methods of data collection were used?• What methods of data analysis were used?• How are the results visualized and discussed?• What are the limitations of the studies?• Give a summarizing evaluation about the overall

quality of the papers.

Course Assignment Part 2: Practical ExampleExercise:

Explore the data set InternetSalesAmount.csv with an Analysis Software of your choice (e.g. Microsoft Excel), create a visual summary and interpret your results.

Questions:• Explore the Distribution of SalesAmountsUSD over

CustomerAge and SalesTerritoryGroup.• Explore the development of SalesAmountUSD over

time.• Explore the distribution of SalesAmountUSD over

other Customer attributes of your choice.

Agenda

• Introduction to R• Getting Started with R Console and R Commander• Data types and Structures• Connecting to a Data Mart / Cube• Exploring and Visualizing Data• Programming Structures, Functions, and Data Relationships

Introduction to R

Introduction to R• R is an integrated software suit for

• data manipulation• calculation• graphical display

• R was initially created in 1993• R is developed as a GNU project • R is supported by a global community• R is extendable through packages• R can be used in multiple way

• Windows CMSD• R GUI• R Studio

Windows CMD

R GUI

R Studio

(Ihaka et al. 1996, R Project 2016)

CODE EXAMPLE

R Packages

• Packages only have to be installed once

• From CRAN• From file

• Packages need to be loaded before they can be used

># 2 Options to install packages

>install.packages("RODBC")

>install.packages("rodbc.tar.gz")

>

># 2 Options to use packages

>library("RODBC")

>require("RODBC")

CODE EXAMPLE

Basic R Syntax• Working directory can be

set and retrieved• R supports all kinds of

mathematical operations• <- is the preferred

assignment operator (= is not recommended)

• Outputs are marked with a line-number in (e.g. [1])

• R chooses a suitable data type itself

>setwd("C:/Test/")

>getwd()

[1] "C:/Test/"

>a <- 4 + 1

>b <- 4

>a + b

[1] 9

>ls()

[1] "a" "b"

>typeof(a)

[1] "double"

CODE EXAMPLE

Get Help in R

• R Help is available as html page on your computer

• You can browse help yourself

• You can ask for help with a specific concept

• You search across all installed packages

># Open Help

>help.start()

>

># Search for specific content

>help(mean)

>?mean

>

># Search across all installed projects

>??mean

CODE EXAMPLE

Further Operations in R• R can work with sequences

and vectors as well• Vectors can be input for

further operations (e.g. multiplication with a scalar)

># Create a sequence of numbers

>digits <- 0:10

>digits

[1] 0 1 2 3 4 5 6 7 8 9 10

>sequence <- seq(from=0, to=10, by=1)

>sequence

[1] 0 1 2 3 4 5 6 7 8 9 10

># Create a vector

>vector <- c(1,3,5,7,9)

>vector [1] 1 3 5 7 9

>vector * 1.5

[1] 1.5 4.5 7.5 10.5 13.5

An Enterprises Electronic MemoryThe Data Warehouse

StagingETL

OLTP Database

OLTP Database

OLTP Database

Cube

… Relational Database

… OLAP Cube

Data Structure we are working with

Stage 1 Stage … Data Mart

Adventure Works Case Example

Adventure Works is a fictional company that is selling bicycles and biking accessories. Besides several OLTP databases, Adventure Works is using an OLAP Cube for data analysis which is tailored towards:• Finance Analysis• Product Inventory Analysis• Resales Analysis• Internet Sales AnalysisWe are going to use the Data Warehouse that backs these analysis scenarios.

Basic Structure of the Data Warehouse

Depa

rtmen

t Gro

upDi

men

sion

Fact Table

Data Model: Finance

Data Model: Product Inventory

Data Model: Resales

Data Model: Internet Sales

CODE EXAMPLE

Example Query: Finance

SELECTfact.Amount,fact.Date,acct.AccountType,dat.CalendarQuarter,dat.CalendarSemester,dat.CalendarYear,dat.DayNumberOfYear,org.OrganizationName,org.PercentageOfOwnership

FROM [dbo].[FactFinance] AS factJOIN [dbo].[DimOrganization] AS orgON fact.OrganizationKey = org.OrganizationKeyJOIN [dbo].[DimDate] AS datON dat.DateKey = fact.DateKeyJOIN [dbo].[DimAccount] AS acctON acct.AccountKey = fact.AccountKey

http://tinyurl.com/zqau98u

CODE EXAMPLE

Example Query: Product Inventory

SELECTfact.[MovementDate],fact.[UnitCost],fact.[UnitsIn],fact.[UnitsOut],fact.[UnitsBalance],dat.CalendarQuarter,dat.CalendarSemester,dat.CalendarYear,dat.DayNumberOfYear,prod.DealerPrice AS ProductDealerPrice,prod.EnglishProductName AS ProductName,prod.EnglishDescription AS ProductDescription,prod.[Weight] AS ProductWitght,prod.Size AS ProductSize,prod.SafetyStockLevel AS ProductSafetyStockLevel

FROM [AdventureWorksDW2012].[dbo].[FactProductInventory] AS factJOIN [AdventureWorksDW2012].[dbo].[DimDate] AS datON fact.DateKey = dat.DateKeyJOIN [AdventureWorksDW2012].[dbo].[DimProduct] AS prodON fact.ProductKey = prod.ProductKey

http://tinyurl.com/z9kkmph

CODE EXAMPLE

Example Query: Resales

SELECTfact.[OrderQuantity],fact.[UnitPrice],fact.[ExtendedAmount],fact.[DiscountAmount],fact.[ProductStandardCost],fact.[TotalProductCost],fact.[SalesAmount] AS SalesAmountLCY,ROUND(crat.AverageRate*fact.SalesAmount,2) AS SalesAmountUSD,fact.[TaxAmt],fact.[Freight],fact.[OrderDate],cter.City AS ResellerCity,cter.EnglishCountryRegionName AS ResellerCountry,res.AnnualSales AS ResellerAnnualSale,res.AnnualRevenue AS ResellerAnnualRevenue

,res.NumberEmployees AS ResellerNumberOfEmployees,res.ResellerName,res.OrderFrequency AS ResellerOrderFrequency,res.YearOpened AS ResellerYearOpened,res.FirstOrderYear AS ResellerFirstOrderYear,res.LastOrderYear AS ResellerLastOrderYear

FROM [AdventureWorksDW2012].[dbo].[FactResellerSales] AS factJOIN [AdventureWorksDW2012].[dbo].[DimReseller] AS resON res.ResellerKey = fact.ResellerKeyJOIN [AdventureWorksDW2012].[dbo].DimSalesTerritory AS sterON ster.SalesTerritoryKey = fact.SalesTerritoryKeyJOIN [AdventureWorksDW2012].[dbo].DimGeography AS cterON res.GeographyKey = cter.GeographyKeyJOIN AdventureWorksDW2012.dbo.FactCurrencyRate AS cratON crat.DateKey = fact.OrderDateKeyAND crat.CurrencyKey = fact.CurrencyKey

http://tinyurl.com/hsqcort

CODE EXAMPLE

Example Query: Internet Sales

SELECTfact.[OrderQuantity],fact.[UnitPrice],fact.[ExtendedAmount],fact.[DiscountAmount],fact.[ProductStandardCost],fact.[TotalProductCost],fact.[SalesAmount] AS SalesAmountLCY,ROUND(crat.AverageRate*fact.SalesAmount,2) AS SalesAmountUSD,fact.[TaxAmt],fact.[Freight],fact.[OrderDate],YEAR(CURRENT_TIMESTAMP)-YEAR(cust.BirthDate) AS CustomerAge,cter.City AS CustomerCity,cter.EnglishCountryRegionName AS CustomerCountry,cust.Gender AS CustomerGender,cust.HouseOwnerFlag AS CustomerIsHouseOwner,cust.MaritalStatus AS CustomerIsMarried

,cust.TotalChildren AS CustomerNumberOfChildren,cust.YearlyIncome AS CustomerYearlyIncome,cust.CommuteDistance AS CustomerCommuteDistance,cust.NumberCarsOwned AS CustomerNumberOfCars,cust.NumberChildrenAtHome AS CustomerNumberOfChildrenAtHome

FROM [AdventureWorksDW2012].[dbo].[FactInternetSales] AS factJOIN [AdventureWorksDW2012].[dbo].DimCustomer AS custON cust.CustomerKey = fact.CustomerKeyJOIN [AdventureWorksDW2012].[dbo].DimSalesTerritory AS sterON ster.SalesTerritoryKey = fact.SalesTerritoryKeyJOIN [AdventureWorksDW2012].[dbo].DimGeography AS cterON cust.GeographyKey = cter.GeographyKeyJOIN AdventureWorksDW2012.dbo.FactCurrencyRate AS cratON crat.DateKey = fact.OrderDateKeyAND crat.CurrencyKey = fact.CurrencyKey

http://tinyurl.com/j4edktw

Connect R to the Adventure Works DWH

Optional: Create a local Data Source Reference

CODE EXAMPLE

Connecting to an ODBC Data Source

• ODBC = Open Database Connectivity

• Create a DSN• Control panel• Administrative tools• ODBC data sources• Add

• Use RODBC package• Use sqlFetch method

library(RODBC)

odcbChannel <- odbcConnect("Local Host")

strSQLQuery <- "SELECT * FROM ..."

datMyData <- sqlQuery(odcbChannel, strSQLQuery)

# This is where you can put your code

odbcClose(odcbChannel)

# Alternatively: Directly connect with connection string

odcbChannel <- odbcDriverConnect("driver={SQL Server};server=c437-sql-12;database=AdventureWorksDW2012;trusted_connection=true")

https://andersspur.wordpress.com/2013/11/26/connect-r-to-sql-server-2012-and-14/

Visualization with

ggplot2

ggplot Package

• Package for Visualization of various sorts of data

• CRAN Documentation: https://cran.r-project.org/web/packages/ggplot2/ggplot2.pdf

Download ggplot2

Sample Scripthttp://tinyurl.com/hatjdjo

https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf

Visualisations that might be helpful

• Bar Chart• One variable plot• One variable plot with category

• Scatter Plot• Two variables scatter• Three variables scatter

• Treemap Plot• Corrplot• Map Plot

CODE EXAMPLE

Scatterplot (2 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat, y=df$price)) + geom_point()

CODE EXAMPLE

Scatterplot (2 Variables, Transformed)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=log(df$carat), y=log(df$price))) +geom_point()

CODE EXAMPLE

Scatterplot (3 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut))+ geom_point()

CODE EXAMPLE

Scatterplot (4 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color)) + geom_point()

CODE EXAMPLE

Scatterplot (5 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color, size=df$depth)) + geom_point()

CODE EXAMPLE

Scatterplot (6 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color, size=df$depth)) + geom_point() +facet_grid(. ~ clarity)

# Attention: Do not use the dataframe prefix in facet_grid

CODE EXAMPLE

Scatterplot (7 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat, y=df$price, shape=df$cut, color=df$color, size=df$depth)) + geom_point() +facet_grid(clarity ~ clarity)

# Attention: Do not use the dataframe prefix in facet_grid

Summary: ggplot2 Aestetics Parameters

• x … Distance of a point on the X-Axis• y … Distance of a point on the Y-Axis• shape … Plot character that is used for a point• color … Plot colour that is used for a point• size … Plot size that is used for a point

CODE EXAMPLE

Smoothed Curve Plot

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=data$carat, y=data$price)) +geom_smooth()

# Info: Can also be used with additional aesparameters

CODE EXAMPLE

Smoothed Curve Plot with Points

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=data$carat, y=data$price)) +geom_point() + geom_smooth()

# Info: Can also be used with additional aesparameters

# Attention: Order of Geoms makes a difference

CODE EXAMPLE

Histogram Plot

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat)) + geom_histogram()

# Info: Can also be used with additional aesparameters

CODE EXAMPLE

Density Plot

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$carat)) + geom_density()

# Info: Can also be used with additional aesparameters

CODE EXAMPLE

Bar Chart (1 Variable)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$color)) + geom_bar()

CODE EXAMPLE

Bar Chart (2 Variables)

library(diamonds, package = "ggplot2")

df <- diamonds

ggplot(df, aes(x=df$color, color=df$cut)) + geom_bar()

CODE EXAMPLE

Line Plot

library(diamonds, package = "ggplot2")

df <- diamond

ggplot(df, aes(x=df$carat, y=df$price)) +geom_line(color="blue")

CODE EXAMPLE

Correlation Matrix (Corrplot)

library(diamonds, package = "ggplot2")

df <- diamond

# Correlate all numeric variables

corMatrix <- cor(df[,unlist(lapply(df, is.numeric))])

# Transform cross table

CorrList <- melt(corMatrix)

CorrList$value <- round(CorrList$value,digits=2)

# Plot Correlation Matrix

ggplot(CorrList, aes(X1, X2, fill=value)) + geom_tile() +geom_text(aes(label=CorrList$value), colour="white")

# Alternatively, you can use the corrplot package with a lot more additional features

Visualization with

ggmap

ggmap Package

• Package for Visualization of spatial data

• CRAN Documentation: https://cran.r-project.org/web/packages/ggmap/ggmap.pdf

Download ggmap

Sample Scripthttp://tinyurl.com/hz5fopr

CODE EXAMPLE

Plot of Map Object

library(ggmap)

dfSpatial <-read.csv("http://www.innsbruck.gv.at/data.cfm?vpath=diverse/ogd/gis/universitaetcsv", sep=";", header = TRUE, dec =",")

map <- get_map("Innsbruck", zoom = 14, maptype ="roadmap")

ggmap(map)

CODE EXAMPLE

Spatial Plot of Locations

library(ggmap)

dfSpatial <-read.csv("http://www.innsbruck.gv.at/data.cfm?vpath=diverse/ogd/gis/universitaetcsv", sep=";", header = TRUE, dec =",")

map <- get_map("Innsbruck", zoom = 12, maptype="roadmap")

ggmap(map) + geom_point(data=dfSpatial, aes(x=Lon, y=Lat), size=5, shape=10)

CODE EXAMPLE

Spatial Plot with Size Variable

library(ggmap)

dfCity <-read.csv("https://dl.dropboxusercontent.com/u/6141049/Data%20Processing%20with%20R/CityLongLat.txt", sep=";", header=TRUE, stringsAsFactors=FALSE)

dfCitySmall <- subset(dfCity, ISO.2=="AT")

dfCitySmall$Size <-dfCitySmall$Population/max(dfCitySmall$Population)*10

map <- get_map("Austria", zoom = 6, maptype="roadmap“, color=“bw")

ggmap(map) + geom_point(data=dfCitySmall, aes(x=Longitude, y=Latitude), size = dfCitySmall$Size, color="red")

Visualization with

treemap

treemap Package

• Visualize multi-category data with quantities

• CRAN Documentation: https://cran.r-project.org/web/packages/treemap/treemap.pdf

Download treemap

Sample Scripthttp://tinyurl.com/jcb96aw

CODE EXAMPLE

Treemap Plot

library(treemap)

dfCity <-read.csv("https://dl.dropboxusercontent.com/u/6141049/Data%20Processing%20with%20R/CityLongLat.txt", sep=";", header=TRUE, stringsAsFactors=FALSE)

dfCitySmall <- subset(dfCity, ISO.2=="AT")

treemap(dfCitySmall, index=c("Province"), vSize="Population", type="index")

http://www.datavizcatalogue.com/

Thank You!