IS5126 - HowBA Lecture 2 – Data, Databases, SQL, Behavioral Analytics; Jan 18, 2017 Dr. Tuan Q Phan NUS IS5126


Admin

•  Pick up syllabus and schedule, also available on my website: http://www.tuanqphan.us
•  Purchase HBS Case from http://hbsp.harvard.edu
   –  Data.gov, #9-610-075
•  Sign up team of 4 on IVLE by Jan. 30
   –  Use IVLE forums to find teammates

Dr. Tuan Q PHAN, NUS IS5126, (c) 2017


Learning Objectives
•  Data.gov Case Discussion and Presentations
•  Data Manipulation, ETL
•  SQL
   –  Database Design
   –  Best Practices
   –  Normalization Guidelines
•  Marketing and Behavioral Analytics
•  Mini-case


Learning Objectives
•  Products
   –  Product Life Cycle
   –  Supply/Demand
   –  Market Basket
   –  Marketing Strategy
•  People
   –  CRM
   –  Utility Modeling
•  Organizations/Companies
   –  Competition
   –  Strategy
•  Correlation and Causalities
•  Resource:
   –  The Ten Day MBA, Steven Silbiger
   –  50 Social/Psychology books: http://www.sparringmind.com/psychology-books/


Databases and Manipulation

[Diagram: DATA FLOW. Stages: Real World, Raw data, Data warehouse. Steps: Collection, Import, Transform, Analyze, Report.]


Data Manipulation
•  Raw data is large, unstructured, noisy
•  Extract, Transform, Load (ETL): process to "clean up" the data for processing and storage
•  Extract: parsing, collection from multiple sources/formats, web scraping
•  Transform: convert to appropriate format, apply set of rules, noise reduction, error handling, translate codes, validation
   –  Python, SQL, awk, sed, ...
•  Load: loads into the data warehouse (database)
•  Staging environment
•  Resource: The Data Warehouse ETL Toolkit, Ralph Kimball & Joe Caserta
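The ETL steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the course's pipeline: the raw CSV text, the `clean_row` and `run_etl` helpers, and the `books` table layout are all made up for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: noisy CSV text standing in for a scraped or exported file.
RAW = """id,title,year,price
1, Practical SQL ,1998,14.00
2,Data Mining,2011,26.85
x,Broken row,not-a-year,??
"""

def clean_row(row):
    """Transform step: trim text, convert types, return None for invalid rows."""
    try:
        return (int(row["id"]), row["title"].strip(), int(row["year"]), float(row["price"]))
    except ValueError:
        return None  # error handling: drop rows that fail validation

def run_etl(raw_text, conn):
    # Extract: parse the raw text into dictionaries.
    rows = csv.DictReader(io.StringIO(raw_text))
    # Transform: clean and validate each row.
    cleaned = [r for r in (clean_row(row) for row in rows) if r is not None]
    # Load: write the cleaned rows into the database.
    conn.execute(
        "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, "
        "published_year INTEGER, price REAL)"
    )
    conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?)", cleaned)
    conn.commit()
    return len(cleaned)

conn = sqlite3.connect(":memory:")
loaded = run_etl(RAW, conn)
titles = [t for (t,) in conn.execute("SELECT title FROM books ORDER BY id")]
```

In a real pipeline the cleaned rows would usually land in a staging table first, so that bad loads can be inspected before they reach the warehouse.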


Data Storage Technology
•  Data is large, need to store, organize, and manipulate
•  Approaches:
   –  File system: tape drive, hard disks, RAID, solid state, NAS


SQL – Introduction
•  SQL: "Structured Query Language" (aka "sequel")
   –  Language for data manipulation
   –  Independent of storage medium
•  Many variants, standardized by ANSI
•  Relational model for database management, proposed by Edgar Codd, IBM Research Laboratory, in 1970
•  SQL itself developed at IBM in the 1970s by Donald Chamberlin and Raymond Boyce
•  Heavily used in BA
•  Highly popular 1980s, 1990s, 2000s, ?
•  Solutions:
   –  Commercial products: Oracle, Microsoft Access, IBM DB2
   –  Open-source: MySQL (Oracle), PostgreSQL, SQLite
   –  Big Data: Hive/Hadoop, Netezza (IBM)


SQL – Introduction
•  Data in tables, rows, and columns (aka relation, tuple, attributes)
•  Value at particular (row, column)
•  Row as "unit of analysis"
•  Primary key: column with unique identifier for row
•  Few commands:
   –  Table manipulation: CREATE, ALTER, DROP, (GRANT)
   –  Data modification: INSERT, DELETE, UPDATE
   –  Query data: SELECT
•  Resource: http://www.sqlite.org


SQL – CREATE
•  Create a named table with named columns and types, "schema"

CREATE TABLE books(
    id int not null primary key,
    title text,
    published_year int,
    price double
);


SQL – CREATE Datatypes
•  Columns must be of a type
   –  Fixed-width: fast access, efficient
   –  Variable-width: flexible
•  Numbers: fixed-width
   –  int: tinyint, smallint, mediumint, bigint, unsigned
   –  double
•  Text: variable-width
•  Date: no type in sqlite3, int, datetime, timestamp
   –  string, e.g. "Aug. 28, 2012"
   –  "Unix time", number of seconds since Jan 1, 1970 UTC
   –  Time zones
•  Binary data (e.g. image): blob
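Since sqlite3 has no dedicated date type, one common workaround is to store dates as Unix time in an int column. The sketch below shows the conversion in Python; the helper names `to_unix_time` and `from_unix_time` and the choice to treat a bare date as midnight UTC are assumptions for illustration. Parsing month abbreviations with `%b` assumes an English locale.

```python
from datetime import datetime, timezone

def to_unix_time(text, fmt="%b. %d, %Y"):
    """Parse a date string such as 'Aug. 28, 2012' as midnight UTC and return Unix time."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def from_unix_time(seconds):
    """Convert Unix time back to a YYYY-MM-DD date string in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime("%Y-%m-%d")

stamp = to_unix_time("Aug. 28, 2012")
round_trip = from_unix_time(stamp)
```

Fixing the time zone explicitly matters: the same string parsed as local time would give a different integer on machines in different zones.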


SQL – ALTER & DROP
•  Modifies an existing table schema
alter table books add column author text;

•  Removes a table schema (and its data)
drop table books;


SQL – INSERT
•  Adds data to table
insert into books values (1, 'Practical SQL', 1998, 14.00, 'Bowman');
insert into books values (2, 'Data Mining', 2011, 26.85, 'Linoff');

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            14.0        Bowman
2           Data Mining    2011            26.85       Linoff


SQL – Loading data
•  Load data from a csv file: software-specific

books.csv
3,"Scoring Points",2008,22.00,"Humby"
4,"Business Intelligence",2009,57.85,"Vercellis"

.separator ","
.import books.csv books

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            14.0        Bowman
2           Data Mining    2011            26.85       Linoff
3           Scoring Point  2008            22.0        Humby
4           Business Inte  2009            57.85       Vercellis


SQL – DELETE & UPDATE
•  Deletes a row
delete from books where id=4;

•  Modifies value(s)
update books set price=5.00;

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis


SQL – SELECT
•  Query database
select * from books;

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis

•  Sort results
select * from books order by published_year desc;

id          title        published_year  price       author
----------  -----------  --------------  ----------  ----------
2           Data Mining  2011            5.0         Linoff
4           Business In  2009            5.0         Vercellis
3           Scoring Poi  2008            5.0         Humby
1           Practical S  1998            5.0         Bowman


SQL – SELECT … WHERE
•  Where clause subsets results
select title, author, published_year from books where published_year > 2000;

title        author      published_year
-----------  ----------  --------------
Data Mining  Linoff      2011
Scoring Poi  Humby       2008
Business In  Vercellis   2009

•  Combining conditions
select * from books where published_year > 2000 and author='Linoff';

id          title        published_year  price       author
----------  -----------  --------------  ----------  ----------
2           Data Mining  2011            5.0         Linoff


SQL – SELECT … FUZZY
•  Allows for wildcard string matching
select * from books where title like '%ness%';

id          title                  published_year  price       author
----------  ---------------------  --------------  ----------  ----------
4           Business Intelligence  2009            5.0         Vercellis


SQL – Group by
•  Aggregate by a column:
insert into books values (5, '2008 book', 2008, 25.00, 'Phan');

id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis
5           2008 book      2008            25.0        Phan

select published_year, count(*), avg(price), sum(price) from books group by published_year;

published_year  count(*)    avg(price)  sum(price)
--------------  ----------  ----------  ----------
1998            1           5.0         5.0
2008            2           15.0        30.0
2009            1           5.0         5.0
2011            1           5.0         5.0


SQL – Embedded queries
select avg(sub.num_books)
from (select published_year, count(*) as num_books
      from books
      group by published_year) sub;

Inner query result:
published_year  num_books
--------------  ----------
1998            1
2008            2
2009            1
2011            1

Outer query result:
avg(sub.num_books)
------------------
1.25
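The group-by and embedded-query slides can be reproduced from Python with the standard-library sqlite3 module. This is a sketch, assuming an in-memory database and the five-row `books` table as it stands after the earlier update and insert; it is not part of the original lecture code.

```python
import sqlite3

# The books table as shown on the group-by slide (all prices 5.0 except the new 2008 book).
BOOKS = [
    (1, "Practical SQL", 1998, 5.0, "Bowman"),
    (2, "Data Mining", 2011, 5.0, "Linoff"),
    (3, "Scoring Points", 2008, 5.0, "Humby"),
    (4, "Business Intelligence", 2009, 5.0, "Vercellis"),
    (5, "2008 book", 2008, 25.0, "Phan"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, "
    "published_year INTEGER, price REAL, author TEXT)"
)
conn.executemany("INSERT INTO books VALUES (?, ?, ?, ?, ?)", BOOKS)

# Inner query: one row per year with the number of books published that year.
per_year = conn.execute(
    "SELECT published_year, COUNT(*) AS num_books FROM books "
    "GROUP BY published_year ORDER BY published_year"
).fetchall()

# Outer query: average of the per-year counts, using the inner query as a derived table.
(avg_books,) = conn.execute(
    "SELECT AVG(sub.num_books) FROM "
    "(SELECT published_year, COUNT(*) AS num_books FROM books GROUP BY published_year) sub"
).fetchone()
```

The derived table needs an alias (`sub` here) in most SQL dialects, which is why the slide's query ends with `sub`.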


SQL – JOIN
•  Ability to combine from two or more tables by columns, "JOIN"
select * from books b, publish_year p where b.published_year=p.year;

publish_year:
year        num_books
----------  ----------
2008        100
2009        120
2010        90
2011        104

books:
id          title          published_year  price       author
----------  -------------  --------------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff
3           Scoring Point  2008            5.0         Humby
4           Business Inte  2009            5.0         Vercellis
5           2008 book      2008            25.0        Phan

Result:
id          title        published_year  price       author      year        num_books
----------  -----------  --------------  ----------  ----------  ----------  ----------
2           Data Mining  2011            5.0         Linoff      2011        104
3           Scoring Poi  2008            5.0         Humby       2008        100
4           Business In  2009            5.0         Vercellis   2009        120
5           2008 book    2008            25.0        Phan        2008        100

Where is 1998?


SQL – Sets

[Venn diagram: Table A and Table B overlap. Y = rows only in A, X = rows in both, Z = rows only in B.]

1.  Inner join: X = A ∩ B
select * from books b inner join publish_year p on b.published_year=p.year;

2.  Left join: A = Y ∪ X
select * from books b left join publish_year p on b.published_year=p.year;

id          title          published_year  price       author      year        num_books
----------  -------------  --------------  ----------  ----------  ----------  ----------
1           Practical SQL  1998            5.0         Bowman
2           Data Mining    2011            5.0         Linoff      2011        104
3           Scoring Point  2008            5.0         Humby       2008        100
4           Business Inte  2009            5.0         Vercellis   2009        120
5           2008 book      2008            25.0        Phan        2008        100
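The difference between the two joins, and the answer to "Where is 1998?", can be checked with a short Python sketch. It assumes an in-memory SQLite database with trimmed-down versions of the `books` and `publish_year` tables; the `QUERY` template is a convenience invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, published_year INTEGER);
CREATE TABLE publish_year (year INTEGER PRIMARY KEY, num_books INTEGER);
INSERT INTO books VALUES (1, 'Practical SQL', 1998), (2, 'Data Mining', 2011),
                         (3, 'Scoring Points', 2008), (4, 'Business Intelligence', 2009),
                         (5, '2008 book', 2008);
INSERT INTO publish_year VALUES (2008, 100), (2009, 120), (2010, 90), (2011, 104);
""")

# Same query for both joins; only the join keyword changes.
QUERY = ("SELECT b.id, p.num_books FROM books b {kind} JOIN publish_year p "
         "ON b.published_year = p.year ORDER BY b.id")

# Inner join keeps only books whose year exists in publish_year.
inner_rows = conn.execute(QUERY.format(kind="INNER")).fetchall()
# Left join keeps every book; unmatched years get NULL (None in Python).
left_rows = conn.execute(QUERY.format(kind="LEFT")).fetchall()
```

Book 1 (published 1998) is missing from `inner_rows` because `publish_year` has no 1998 row; in `left_rows` it appears with `None` in place of `num_books`.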


Long vs. wide
•  Long tables vs. wide tables
•  pivot table, cross tabulation, report

Long:
trans_id    book_id  year  num_books
----------  -------  ----  ----------
1           1        2008  5
2           1        2008  1
3           1        2009  1
4           2        2011  3
5           3        2009  4
6           3        2009  1
7           4        2010  1
8           4        2010  5
9           4        2011  2
10          5        2010  1

Wide:
book_id     y2008  y2009  y2010  y2011
----------  -----  -----  -----  ----------
1           6      1      0      0
2           0      0      0      3
3           0      5      0      0
4           0      0      6      2
5           0      0      1      0


Database Design
•  How to design table schema?
•  Which columns go where?
•  Good design characteristics:
   –  Makes interactions with database easy to understand
   –  Consistency of values and database
   –  High performance
•  Bad design characteristics:
   –  Misunderstanding of query
   –  Increased risk of inconsistencies
   –  Redundant data entry
   –  Difficult to change structure of the tables


Database Design
•  Normalization: reduce duplicates, protect data integrity
•  Non-loss decomposition: splitting tables with redundant values into two or more tables
   –  Join to "put back together"
•  Clear, easy to read table and column names:
   –  E.g. books_prices, author_firstname, books, authors
•  Entity-relationship (ER) modeling
•  Define relationship types: 1-1, 1-N, N-N
•  No magic bullet, iterate and experience


General Guidelines
1.  What kind of questions are we trying to answer?
2.  What are the sources of data?
3.  Which are the focal entities or subjects?
    •  Row is one thing in the entity, columns are attributes
    •  Independent Existence
4.  Group common columns, use E-R diagrams to help
5.  Determine unique identifier – primary key
6.  What are the relationships between entities: 1-1, 1-N, N-N
7.  Normalize and verify
8.  Test and reiterate


Normalization Guidelines
•  First normal form:
   –  each row-column intersection must be one and only one value
   –  must be atomic
   –  no repeating groups
   –  "rectangular" tables

Bad:
Order_id  Book_id1  Transact_date1  Book_id2  Transact_date2
1         1         19/10/2010
2         1         01/10/2010      2         01/10/2010

Better:
Record_id  Order_id  Book_id  Transact_date
1          1         1        19/10/2010
2          2         1        01/10/2010
3          2         2        01/10/2010


Normalization Guidelines
•  Second normal form
   –  "Every non-key column must depend on the entire primary key"
   –  Composite primary key
•  Third normal form
   –  No non-key column depends on another non-key column
•  Fourth normal form
   –  No independent 1-N relationships between primary key columns and non-key columns: too many blanks
•  Fifth normal form
   –  Break tables into smallest possible pieces in order to eliminate all redundancy within a table.


Combine SQL & Python
•  Python loops to create SQL code
•  Used for aggregation or "pivot tables"
•  Simple scripting
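A minimal sketch of the idea above: a Python loop writes one `SUM(CASE ...)` column per year, turning the long transaction table from the "Long vs. wide" slide into the wide one. The table name `sales` and the helper `build_pivot_sql` are assumptions, since the slides do not name them.

```python
import sqlite3

# Long table from the "Long vs. wide" slide: (trans_id, book_id, year, num_books).
SALES = [(1, 1, 2008, 5), (2, 1, 2008, 1), (3, 1, 2009, 1), (4, 2, 2011, 3),
         (5, 3, 2009, 4), (6, 3, 2009, 1), (7, 4, 2010, 1), (8, 4, 2010, 5),
         (9, 4, 2011, 2), (10, 5, 2010, 1)]

def build_pivot_sql(years):
    """Generate one SUM(CASE ...) column per year with a Python loop."""
    columns = [
        f"SUM(CASE WHEN year = {int(y)} THEN num_books ELSE 0 END) AS y{int(y)}"
        for y in years
    ]
    return ("SELECT book_id, " + ", ".join(columns) +
            " FROM sales GROUP BY book_id ORDER BY book_id")

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (trans_id INTEGER PRIMARY KEY, book_id INTEGER, "
    "year INTEGER, num_books INTEGER)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", SALES)

# One row per book, one column per year from 2008 to 2011.
wide = conn.execute(build_pivot_sql(range(2008, 2012))).fetchall()
```

The years are forced through `int()` before being pasted into the SQL text, so only numbers can end up in the generated query; values that come from users should go through `?` parameters instead.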


When to use Python, SQL, R?
•  Similar tools for all languages
•  Excel: filters, sort, pivot table, ...
   –  Pro: easy GUI, "intuitive," easy for prototyping
   –  Cons: slow, cannot handle large datasets, requires highly structured data, limited tools, $$$
•  Python: dictionaries, loops, NumPy, etc.
   –  Pro: flexible, fast, good for big datasets, rich/multimedia data
   –  Cons: slow file systems, limited tools, complicated for simple tasks
•  SQL: select, group by, ...
   –  Pro: many commercial and open source solutions, fast (when structured properly)
   –  Cons: requires structured data, limited binary data support, $$$
•  R: indices, aggregate, ddply, data.table, ...
   –  Pro: single language/framework, many packages for fast ETL
   –  Cons: memory inefficient, slow, single processor (except Revolution R), inconsistent notation across packages


Best Practice Guidelines
•  Time-space (best practices)
•  Big raw data best in file systems (hard drive)
   –  Python for raw data collection, binary data
   –  Input: raw data
   –  Output: semi-structured, non-normalized (e.g. csv)
•  ETL and manipulation in data warehouse (e.g. SQL)
   –  SQLite: easy to use, standard ANSI
   –  MySQL: free, open source, fast reads
   –  Oracle: transaction data (writes)
   –  Hadoop: big and slow, Hive provides SQL-like notation
   –  Input: semi-structured
   –  Output: highly structured, transformed data ready for analysis, unit of analysis on each row
•  Analysis in statistical tools (R, Stata, SPSS, Matlab, etc.):
   –  Commercial and open source available
   –  Commercial faster, higher performance, better memory management
   –  Input: highly structured
   –  Output: reports, analysis, insights, visualizations


Misc.

•  Other database design paradigms
•  Dimensional Modeling
•  Resource: The Data Warehouse Toolkit, The Complete Guide to Dimensional Modeling; Ralph Kimball & Margy Ross


Break


Marketing and Behavioral Analytics
•  What is the unit of analysis?
   –  Country
   –  Firms
   –  Products
   –  Consumers/individuals
•  Aggregation vs. Sparsity
•  "Big Data" makes sparsity less of a problem


Product Life Cycle (PLC)
•  Stages of product adoption and sales
•  Introduction, Growth, Maturity, Decline


PLC – Bass Diffusion Model
•  A New Product Growth Model for Consumer Durables, Bass, F. M., Management Science 1969
•  Adoption model of consumer durables
•  Pr(t): probability of purchase at time t
•  m: total market size (number of people)
•  Y(t): number of previous buyers
•  p: innovation (probability)
•  q: imitation (probability)

\[ Pr(t) = p + \frac{q}{m} Y(t) \]
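To see how the formula produces an adoption curve, the sketch below steps the model forward in discrete periods. It assumes the standard reading of the Bass model, where Pr(t) applies to the people who have not yet bought, and it uses illustrative parameter values that are not taken from the lecture. The function name `simulate_bass` is hypothetical.

```python
def simulate_bass(p, q, m, periods):
    """Discrete-time Bass diffusion: returns (new_adopters, cumulative) lists per period."""
    cumulative = 0.0  # Y(t): number of previous buyers
    new_per_period, cumulative_per_period = [], []
    for _ in range(periods):
        hazard = p + (q / m) * cumulative   # Pr(t) from the slide
        new = hazard * (m - cumulative)     # applied to those who have not bought yet
        cumulative += new
        new_per_period.append(new)
        cumulative_per_period.append(cumulative)
    return new_per_period, cumulative_per_period

# Illustrative parameter values (assumptions for the example only).
new, total = simulate_bass(p=0.03, q=0.38, m=1_000_000, periods=20)
peak_period = new.index(max(new)) + 1  # period with the most new adopters
```

In the first period nobody has bought yet, so all sales come from the innovation term p; as Y(t) grows, the imitation term q takes over until the remaining market runs out.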


PLC – Innovators & Imitators

[Figure: two charts. First: Cumulative No. of Adopters (in millions) by Year. Second: Non-cumulative Adopters (in millions) by Year, split into Innovators and Imitators.]


PLC – Crossing the Chasm

Resource: Crossing the Chasm, Geoffrey A. Moore, 1991


Products – Supply/Demand
•  Laws of supply & demand
•  High demand, high prices
   –  Demand is not static
   –  Promotion can change demand
•  Surplus supply, low prices
   –  Efficient stock allocation
   –  Stockout problems


Products – Supply/Demand
•  Profit (margins) = Price – Cost
•  Cost = fixed cost + marginal cost
•  Perfect market competition => efficiency
•  Advertising and promotions can increase demand
•  R.O.I.: Return on Investment = Profit / investment


Marketing Strategy


Product – Market Basket Analysis
•  Look at what products are purchased together
•  Associative rules: correlation between A & B
   –  Prob(A|B), Prob(B|A)
   –  Beer & Diapers
•  Feature analysis: e.g. size, color, specifications
•  Cross-sell
   –  Upsell: sell more expensive/higher margin product
   –  Substitutes
   –  Recommendation engines
•  Bundling: package two similar products
   –  Low cost of bundling
   –  (WordPerfect & Lotus) vs. Microsoft Office
   –  Converged devices: (PDA & phone) vs. smartphone
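The conditional probabilities Prob(A|B) and Prob(B|A) mentioned above can be estimated directly from transaction baskets. The sketch below uses a handful of invented baskets; the data and the helper name `conditional_prob` are assumptions for illustration only.

```python
def conditional_prob(baskets, a, b):
    """Estimate Prob(a | b): share of baskets containing b that also contain a."""
    with_b = [basket for basket in baskets if b in basket]
    if not with_b:
        return None  # undefined when b never appears
    return sum(1 for basket in with_b if a in basket) / len(with_b)

# Hypothetical transactions, invented for illustration only.
BASKETS = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "bread"},
]

p_beer_given_diapers = conditional_prob(BASKETS, "beer", "diapers")
p_diapers_given_beer = conditional_prob(BASKETS, "diapers", "beer")
```

The two numbers differ because the denominators differ: three baskets contain diapers while four contain beer, so the rule is not symmetric.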


Products

•  Isn't product-level analysis perfect?
•  What is missing?
•  Why should we care about individual/consumer analysis?


People (Consumers)
•  5-step buying process, "marketing funnel":
   –  Awareness: "I might need soap"
      •  triggers include advertising
   –  Information search: "Dove soap sounds good, let me find out more about it"
      •  Targeting and segmentation to get best information to customers
   –  Evaluate alternatives: Which is best for me? Within and outside category
      •  Influencers can play key role
   –  Purchase: distribution channel
   –  Evaluate (post purchase): "Did I make a mistake?"
      •  Repeat purchase?
      •  Produce positive word-of-mouth (WOM)


People – CRM


People – CRM Acquisition
•  Acquisition:
   –  Acquisition rate (%) = (Number of prospects acquired / Number of prospects targeted) x 100
   –  Acquisition is defined as the first purchase or purchasing in the first predefined period
   –  Denotes average probability of acquiring a customer
   –  Always calculated for a group of customers
   –  Usually computed on a campaign-by-campaign basis
•  Acquisition cost per prospect
   –  Acquisition cost ($) = Acquisition spending ($) / Number of prospects acquired
   –  Measured in monetary terms
   –  Precise values for companies targeting prospects through direct mail
   –  Less precise for broadcasted communications
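The two acquisition formulas above are simple ratios; a worked example makes the units explicit. The campaign figures below are hypothetical, and the function names are chosen for this sketch.

```python
def acquisition_rate(acquired, targeted):
    """Acquisition rate in percent: prospects acquired per prospect targeted."""
    return acquired / targeted * 100

def acquisition_cost(spending, acquired):
    """Acquisition cost in dollars per acquired prospect."""
    return spending / acquired

# Hypothetical direct-mail campaign: 5,000 prospects targeted, 150 acquired, $9,000 spent.
rate = acquisition_rate(acquired=150, targeted=5000)
cost = acquisition_cost(spending=9000.0, acquired=150)
```

Under these made-up numbers the campaign converts 3% of targeted prospects at $60 per acquired customer.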


People – CRM Activity Measurements
•  Track customers loyalty program
•  Observe transactions per customer over time
•  RFM:
   –  Recency: when was the last purchase
   –  Frequency: how often purchase in a period
   –  Monetary: total value of sales
•  Easy to calculate
•  Helpful for segmentation
•  Cons:
   –  Not good for forecasting
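RFM values can be computed in one pass over a transaction log. The sketch below is a minimal version, assuming each transaction is a (customer, date, amount) tuple and that recency is measured in days before a chosen reference date; the sample data and the `rfm` helper are invented for illustration.

```python
from datetime import date

# Hypothetical transactions: (customer_id, purchase_date, amount).
TRANSACTIONS = [
    ("c1", date(2016, 11, 1), 20.0),
    ("c1", date(2016, 12, 1), 35.0),
    ("c2", date(2016, 6, 15), 80.0),
    ("c1", date(2016, 12, 20), 10.0),
]

def rfm(transactions, as_of):
    """Return {customer: (recency_days, frequency, monetary)} for the given transactions."""
    summary = {}
    for customer, day, amount in transactions:
        # Track the latest purchase date, the purchase count, and the running total.
        last, count, total = summary.get(customer, (day, 0, 0.0))
        summary[customer] = (max(last, day), count + 1, total + amount)
    return {c: ((as_of - last).days, count, total)
            for c, (last, count, total) in summary.items()}

scores = rfm(TRANSACTIONS, as_of=date(2016, 12, 31))
```

Here `c1` is a recent, frequent, low-value buyer and `c2` a lapsed, one-time, higher-value buyer, which is the kind of contrast segmentation on RFM is meant to surface.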


People – CRM Activity Measurements
•  Average inter-purchase time = 1 / Number of purchase incidences from first purchase till current time period
   –  Measured in time periods
   –  Evaluation of metric
   –  Easy to calculate
   –  Useful for industries with frequent customer purchases
   –  Marketing intervention might be warranted anytime customers fall considerably below their AIT


People – CRM Retention/Defection rates
•  Retention rate
   –  Average likelihood that a customer purchases in period t, given that he/she has purchased in the last period t-1
   –  Retention rate (%) = [(Number of customers in cohort buying in period t | buying in period t-1) / Number of customers in cohort buying in period t-1] x 100
   –  Retention rate (%) = 1 - (1 / Average lifetime duration)
•  Defection rate
   –  Average likelihood that a customer defects in period t, given that he/she has purchased in the last period t-1
   –  Defection rate (%) = 1 - Retention rate
   –  Average lifetime duration = 1 / (1 - Average retention rate)


People – CRM Retention/Defection rates
•  Number of retained customers in any period (t+n) = (Number of acquired customers in period t) x (Retention rate)^(t+n)
   –  Assuming a constant retention rate among acquired customers
•  Example
   –  Assume a constant retention rate of 0.75, or defection rate of 0.25
   –  Average lifetime duration = 4 (1 / [1 - 0.75])
   –  Customers starting at beginning of year 1 = 100
   –  Customers remaining at end of year 1 = 75.00 (100 x 0.75^1)
   –  Customers remaining at end of year 2 = 56.25 (100 x 0.75^2)
   –  Customers remaining at end of year 3 = 42.19 (100 x 0.75^3)
   –  Customers remaining at end of year 4 = 31.64 (100 x 0.75^4)
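The worked example above can be reproduced with two one-line functions. This sketch assumes the cohort starts at the beginning of year 1 and that the exponent counts completed years; the function names are chosen for illustration.

```python
def retained_customers(acquired, retention_rate, periods):
    """Expected customers remaining at the end of each period under a constant retention rate."""
    return [acquired * retention_rate ** n for n in range(1, periods + 1)]

def average_lifetime(retention_rate):
    """Average lifetime duration implied by a constant retention rate."""
    return 1 / (1 - retention_rate)

remaining = retained_customers(acquired=100, retention_rate=0.75, periods=4)
rounded = [round(x, 2) for x in remaining]
lifetime = average_lifetime(0.75)
```

Because the rate compounds, roughly two thirds of the cohort is gone after four years even though three quarters of customers stay from one year to the next.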


People – CRM Defection Rate vs. Customer Tenure
•  Variation (or heterogeneity) around average lifetime duration of 4 years

[Figure: # of Customers Defecting (y-axis, 0 to 30) by Customer Tenure (Periods) (x-axis, 1 to 19)]


People – CRM Lifetime Duration
•  Less precise metric
   –  Average lifetime duration = 1 / (1 - Average retention rate)
•  More precise metric
   –  Average lifetime duration =

\[ \frac{\sum_{t=1}^{T} t \times \text{Number of customers retained}_t}{N} \]

   –  where N = cohort size, t = time period
•  Complete or incomplete information on customer
   –  Complete: customer’s time of first and last purchases are known
   –  Incomplete: either only time of first purchase, or only time of last purchase, or both time of first and last purchases are unknown


People – CRM Probability(Active)
•  Probability of a customer being active in time t in a non-contractual setting
   –  Probability(Active) = T^n
   –  where n = number of purchases in a given period, T = time of the last purchase (given as a fraction of the observation period)
   –  Simple approximation of probability(active)
   –  More advanced computation methods exist


People – CRM Probability(Active)
•  Customer 1: T = (8/12) = 0.667 and n = 4
   –  Probability(Active) = (0.667)^4 = 0.198
•  Customer 2: T = (8/12) = 0.667 and n = 2
   –  Probability(Active) = (0.667)^2 = 0.444

[Figure: purchase timelines for Customer 1 and Customer 2 across an Observation Period (Month 1 to Month 12, with Month 8 marked) and a Holdout Period (to Month 18). X indicates that a purchase was made by a customer in that month.]
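The two customer examples above follow directly from the T^n approximation. A minimal sketch, with the function name and argument names chosen for illustration:

```python
def probability_active(n_purchases, last_purchase_period, observation_periods):
    """Simple approximation: Probability(Active) = T ** n."""
    # T is the time of the last purchase as a fraction of the observation period.
    t = last_purchase_period / observation_periods
    return t ** n_purchases

customer_1 = probability_active(n_purchases=4, last_purchase_period=8, observation_periods=12)
customer_2 = probability_active(n_purchases=2, last_purchase_period=8, observation_periods=12)
```

Both customers last bought in month 8, yet the one who used to buy more often gets the lower probability: for a frequent buyer, four months of silence is more unusual.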


Behavioral Analytics
•  Build understanding of consumer life cycle
•  Segment different behavior/motivations
•  Separate types of loyalty:
   –  Behavioral: observed many transactions
   –  Attitudinal: emotional loyalty
•  Provides guidance to different marketing effort
•  How to measure and capture data on different customer types?


Mini-case: Taobao (e-commerce)


Mini-case: Buy, Search, Browse (Taobao)
•  E-commerce increasingly popular
•  How can analytics build insight and take action?
•  How is online different than offline shopping?
•  What additional data is available?
•  What kind of behavior can we observe?

Moe, Wendy W. "Buying, Searching, or Browsing: Differentiating between Online Shoppers Using In-Store Navigational Clickstream." Journal of Consumer Psychology 13, no. 1 (2003): 29–39.


Data
•  Product information/pricing

ProductID  Description  Size  Attributes  Price  Date
12345      Cat T-shirt  L     Red         15.00  Winter 2016

•  Transactions

Timestamp     TransactionID  ProductID  UserID  Quantity  Price  Shipping
Dec. 1, 2016  1              12345      tphan   2         30.00  SingPost
Dec. 1, 2016  1              34567      tphan   1         15.00  SingPost


Data
•  Clickstream
   –  Web server (Apache) logs

Timestamp               URL                           Client   IP           SessionID  UserID
Dec. 1, 2016, 00:00:01  http://qoo10.sg/              Firefox  192.168.1.1  12345ABCD  tphan
Dec. 1, 2016, 00:00:10  http://qoo10.sg/Mens_Shirts/  Firefox  192.168.1.1  12345ABCD  tphan
…                       …                             …        …            …          …


Approach
•  Categorize pages:
   •  Home Page
   •  Category Pages
   •  Brand Pages
   •  Product Pages
   •  Search Pages


Metrics
•  Avg. time spent per page
•  % search pages
•  # category pages
•  # product pages
•  Diff # Cat
•  # Brand
•  # Prod
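Several of these metrics can be computed per session once each click has been assigned a page category. The sketch below assumes the clicks are already categorized and sorted by time; the sample session and the `session_metrics` helper are invented for illustration, and only a subset of the listed metrics is covered.

```python
from datetime import datetime

# Hypothetical clickstream for one session: (timestamp, page_category).
SESSION = [
    (datetime(2016, 12, 1, 0, 0, 1), "home"),
    (datetime(2016, 12, 1, 0, 0, 10), "category"),
    (datetime(2016, 12, 1, 0, 0, 40), "search"),
    (datetime(2016, 12, 1, 0, 1, 40), "product"),
]

def session_metrics(clicks):
    """Compute a few per-session metrics from time-ordered (timestamp, category) clicks."""
    times = [t for t, _ in clicks]
    categories = [c for _, c in clicks]
    # Time on a page is the gap to the next click; the last page has no observable duration.
    gaps = [(later - earlier).total_seconds() for earlier, later in zip(times, times[1:])]
    return {
        "avg_seconds_per_page": sum(gaps) / len(gaps) if gaps else None,
        "pct_search_pages": 100 * categories.count("search") / len(categories),
        "n_category_pages": categories.count("category"),
        "n_product_pages": categories.count("product"),
    }

metrics = session_metrics(SESSION)
```

A server log records only page requests, so the time spent on the final page of a session cannot be observed; the average here is taken over the pages that have a following click.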


Behaviors
•  Knowledge Building
•  Hedonic Browsing
•  Directed Buying
•  Search/Deliberation
•  Shallow sessions
