Tapping the Data Deluge with R

Finding and using supplemental data to add context to your analysis

by Jeffrey Breen, Principal, Think Big Academy
email: [email protected]
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen
Code & Data on github: http://bit.ly/pawdata

Upload: jeffrey-breen

Post on 09-May-2015


DESCRIPTION

Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012. Full code and data are available on github: http://bit.ly/pawdata

TRANSCRIPT

Page 1: Tapping the Data Deluge with R

Tapping the Data Deluge with R

Finding and using supplemental data to add context to your analysis

by Jeffrey Breen
Principal, Think Big Academy

email: [email protected]
blog: http://jeffreybreen.wordpress.com

Twitter: @JeffreyBreen

Code & Data on github: http://bit.ly/pawdata

Page 2: Tapping the Data Deluge with R

Data data everywhere!

This may be what you picture the data deluge looking like if you work for The Economist.

But those of us who wrangle data for a living know that it’s usually not so prosaic or buttoned-down, proper or quaint.

Page 3: Tapping the Data Deluge with R

Real data hits us in the face...

Real data can hit you in the face.

Yet we keep coming back for more.

Page 4: Tapping the Data Deluge with R

...and then there’s Big Data.

And I’m not even going to talk about Big Data tonight. (For a change!)

Page 5: Tapping the Data Deluge with R

Finding the right data makes all the difference

Tonight we’re going to look at a few different places to find those data sets which can make a difference, and a few techniques to access them so you can incorporate them into your analysis.

Page 6: Tapping the Data Deluge with R

The two types of data

Data you have
Data you don’t have... yet

Perhaps you’ve heard the joke: There are two kinds of people: People who think there are two kinds of people and people who don’t.

I like to think that there are two kinds of data.

Page 7: Tapping the Data Deluge with R

The two types of data

• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages

• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs

Page 8: Tapping the Data Deluge with R

Reading CSV files is easy

$ head -5 data/mpg-3-13-2012.csv | cut -c 1-60
"Model Yr","Mfr Name","Division","Carline","Verify Mfr Cd","
2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",

data = read.csv('data/mpg-3-13-2012.csv')

View(data)

see R/01-read.csv-mpg.R
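A couple of read.csv() arguments matter for a file like this one, whose headers contain spaces. The sketch below is not from the talk's repository: it parses a trimmed, inline copy of the rows shown above through the `text=` argument, so it runs without the data file. The four-column layout is a cut-down assumption for illustration.

```r
# Inline stand-in for the first rows of the EPA file (columns trimmed for brevity)
csv.text = '"Model Yr","Mfr Name","Division","Carline"
2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage"'

# Default: column names are made syntactically valid ("Model Yr" becomes "Model.Yr")
data = read.csv(text=csv.text, stringsAsFactors=FALSE)
names(data)

# check.names=FALSE keeps the original headers; access them with [[ ]] or backticks
raw = read.csv(text=csv.text, stringsAsFactors=FALSE, check.names=FALSE)
raw[['Model Yr']]
```

Keeping the original headers is handy when the column names need to match documentation from the data provider; the default is usually more convenient for typing `data$Model.Yr`.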

Page 9: Tapping the Data Deluge with R

But so is reading Excel files directly

library(XLConnect)

wb = loadWorkbook("data/mpg.xlsx", create=F)

data = readWorksheet(wb, sheet='3-7-2012')

see R/02-XLConnect-mpg.R

Page 10: Tapping the Data Deluge with R

“foreign” file formats

library(foreign)

sav.file = file.path(system.file(package='foreign'), 'tests', 'sample100.sav')
spss.data = read.spss(sav.file)

xpt.file = file.path(system.file(package='foreign'), 'tests', 'test.xpt')
sas.data = read.xport(xpt.file)

dta.file = file.path(system.file(package='foreign'), 'tests', 'auto8.dta')
stata.data = read.dta(dta.file)

dbf.file = file.path(system.file(package='foreign'), 'files', 'sids.dbf')
dbf.data = read.dbf(dbf.file)

see R/03-foreign.R
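To try one of these readers without depending on the sample files shipped with the package, one option is to write a small data frame out in Stata format and read it back. This is only a sketch: the `sample.df` data frame and the temporary file are invented for illustration and are not part of the talk's code.

```r
library(foreign)

# Hypothetical example data, not from the talk
sample.df = data.frame(mpg=c(21.0, 22.8, 18.7), cyl=c(6, 4, 8))

# Write a Stata .dta file to a temporary location, then read it back
dta.file = tempfile(fileext='.dta')
write.dta(sample.df, dta.file)
stata.data = read.dta(dta.file)

str(stata.data)
```

The round trip is a quick way to check how column types survive the conversion before relying on a file someone else produced.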

Page 11: Tapping the Data Deluge with R

Relational databases

library(RMySQL)

con = dbConnect(MySQL(), user="root", dbname="test")

data = dbGetQuery(con, "select * from airport")

dbDisconnect(con)

View(data)

  airport_code airport_name                         location                     state_code country_name time_zone_code
1 ATL          WILLIAM B. HARTSFIELD                ATLANTA,GEORGIA              GA         USA          EST
2 BOS          LOGAN INTERNATIONAL                  BOSTON,MASSACHUSETTS         MA         USA          EST
3 BWI          BALTIMORE/WASHINGTON INTERNATIONAL   BALTIMORE,MARYLAND           MD         USA          EST
4 DEN          STAPLETON INTERNATIONAL              DENVER,COLORADO              CO         USA          MST
5 DFW          DALLAS/FORT WORTH INTERNATIONAL      DALLAS/FT. WORTH,TEXAS       TX         USA          CST
6 OAK          METROPOLITAN OAKLAND INTERNATIONAL   OAKLAND,CALIFORNIA           CA         USA          PST
7 PHL          PHILADELPHIA INTERNATIONAL           PHILADELPHIA PA/WILM'TON,DE  PA         USA          EST
8 PIT          GREATER PITTSBURGH                   PITTSBURGH,PENNSYLVANIA      PA         USA          EST
9 SFO          SAN FRANCISCO INTERNATIONAL          SAN FRANCISCO,CALIFORNIA     CA         USA          PST

see R/04-RMySQL-airport.R
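The same connect, query, disconnect pattern works across the DBI-based drivers listed in the resources. As a sketch that runs without a MySQL server, the example below swaps in an in-memory SQLite database through the RSQLite package; the two-row `airport` table is a made-up subset of the output above, and the swap itself is an assumption, not what the talk's script does.

```r
library(DBI)

# In-memory SQLite database standing in for the MySQL server (assumption)
con = dbConnect(RSQLite::SQLite(), ':memory:')

# Hypothetical two-row airport table, loosely based on the output above
airport = data.frame(airport_code=c('BOS', 'SFO'),
                     state_code=c('MA', 'CA'),
                     stringsAsFactors=FALSE)
dbWriteTable(con, 'airport', airport)

# Parameterized query: the value is bound rather than pasted into the SQL string
data = dbGetQuery(con, 'select * from airport where state_code = ?',
                  params=list('MA'))

dbDisconnect(con)
data
```

Binding values through `params` avoids quoting problems when the filter comes from user input or another data set.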

Page 12: Tapping the Data Deluge with R

Non-relational databases too

> library(rhbase)
> hb.init(serialize='raw')
> x = hb.get(tablename='tweets', rows='221325531868692480')
> str(x)
List of 1
 $ :List of 3
  ..$ : chr "221325531868692480"
  ..$ : chr [1:10] "created:" "favorited:" "id:" "replyToSID:" ...
  ..$ :List of 10
  .. ..$ : chr "2012-07-06 19:31:33"
  .. ..$ : chr "FALSE"
  .. ..$ : chr "221325531868692480"
  .. ..$ : chr "NA"
  .. ..$ : chr "NA"
  .. ..$ : chr "NA"
  .. ..$ : chr "arnicas"
  .. ..$ : chr "<a href="http://www.tweetdeck.com" rel="nofollow">TweetDeck</a>"
  .. ..$ : chr "RT @bycoffe: From @DrewLinzer, an #Rstats function for querying the HuffPost Pollster API. http://t.co/fXnG32JX cc @thewhyaxis"
  .. ..$ : chr "FALSE"

Page 13: Tapping the Data Deluge with R

weird emails from the boss

con = textConnection('# Hi:
#
# Please invite these paid volunteers to the spontaneous rally at 3PM today:
#
Name Department "Hourly Rate" email
Alice Operations 32 [email protected]
Billy Logistics 5 [email protected]
Winston Records 20 [email protected]
#
#Thanks,
#Your Boss
#
')

data = read.table(con, header=T, comment.char='#')
close.connection(con)

View(data)

     Name Department Hourly.Rate             email
1   Alice Operations          32 [email protected]
2   Billy  Logistics           5 [email protected]
3 Winston    Records          20 [email protected]

see R/05-textConnection-email.R
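The trick on this slide is that read.table() skips comment lines, so prefixing the chatter with "#" leaves only the table. A minimal variation, assuming a made-up message and placeholder example.com addresses, uses the `text=` argument instead of an explicit textConnection():

```r
# Made-up message body; the addresses use the reserved example.com domain
email.text = '# Hi:
#
# Please invite these people to the meeting:
#
Name Department "Hourly Rate" email
Alice Operations 32 alice@example.com
Billy Logistics 5 billy@example.com
#
#Thanks,
#Your Boss
'

# Lines starting with "#" are skipped; the quoted header survives as one column
data = read.table(text=email.text, header=TRUE, comment.char='#',
                  stringsAsFactors=FALSE)
data
```

With `text=` there is no connection to close afterwards, which makes it convenient for quick copy-and-paste jobs.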

Page 14: Tapping the Data Deluge with R

> data()

Data sets in package ‘datasets’:

AirPassengers           Monthly Airline Passenger Numbers 1949-1960
BJsales                 Sales Data with Leading Indicator
BJsales.lead (BJsales)  Sales Data with Leading Indicator
BOD                     Biochemical Oxygen Demand
CO2                     Carbon Dioxide Uptake in Grass Plants
ChickWeight             Weight versus age of chicks on different diets
DNase                   Elisa assay of DNase
EuStockMarkets          Daily Closing Prices of Major European Stock Indices, 1991-1998
Formaldehyde            Determination of Formaldehyde
HairEyeColor            Hair and Eye Color of Statistics Students
Harman23.cor            Harman Example 2.3
Harman74.cor            Harman Example 7.4
Indometh                Pharmacokinetics of Indomethacin
InsectSprays            Effectiveness of Insect Sprays
JohnsonJohnson          Quarterly Earnings per Johnson & Johnson Share
LakeHuron               Level of Lake Huron 1875-1972
LifeCycleSavings        Intercountry Life-Cycle Savings Data
Loblolly                Growth of Loblolly pine trees
Nile                    Flow of the River Nile
Orange                  Growth of Orange Trees
OrchardSprays           Potency of Orchard Sprays
PlantGrowth             Results from an Experiment on Plant Growth
Puromycin               Reaction Velocity of an Enzymatic Reaction
Seatbelts               Road Casualties in Great Britain 1969-84
Theoph                  Pharmacokinetics of Theophylline
Titanic                 Survival of passengers on the Titanic
ToothGrowth             The Effect of Vitamin C on Tooth Growth in Guinea Pigs
UCBAdmissions           Student Admissions at UC Berkeley
UKDriverDeaths          Road Casualties in Great Britain 1969-84
UKgas                   UK Quarterly Gas Consumption
USAccDeaths             Accidental Deaths in the US 1973-1978
USArrests               Violent Crime Rates by US State
USJudgeRatings          Lawyers' Ratings of State Judges in the US Superior Court
USPersonalExpenditure   Personal Expenditure Data
VADeaths                Death Rates in Virginia (1940)
WWWusage                Internet Usage per Minute
WorldPhones             The World's Telephones
ability.cov             Ability and Intelligence Tests
airmiles                Passenger Miles on Commercial US Airlines, 1937-1960
airquality              New York Air Quality Measurements
[...]

Page 15: Tapping the Data Deluge with R

> library(zipcode)
> data(zipcode)
> str(zipcode)
'data.frame': 44336 obs. of 5 variables:
 $ zip      : chr "00210" "00211" "00212" "00213" ...
 $ city     : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
 $ state    : chr "NH" "NH" "NH" "NH" ...
 $ latitude : num 43 43 43 43 43 ...
 $ longitude: num -71 -71 -71 -71 -71 ...
> subset(zipcode, city=='Boston' & state=='MA')
      zip   city state latitude longitude
664 02101 Boston    MA 42.37057 -71.02696
665 02102 Boston    MA 42.33895 -70.91963
666 02103 Boston    MA 42.33895 -70.91963
667 02104 Boston    MA 42.33895 -70.91963
668 02105 Boston    MA 42.33895 -70.91963
669 02106 Boston    MA 42.35432 -71.07345
670 02107 Boston    MA 42.33895 -70.91963
671 02108 Boston    MA 42.35790 -71.06408
672 02109 Boston    MA 42.36148 -71.05417
673 02110 Boston    MA 42.35653 -71.05365
674 02111 Boston    MA 42.34984 -71.06101
675 02112 Boston    MA 42.33895 -70.91963
676 02113 Boston    MA 42.36503 -71.05636
677 02114 Boston    MA 42.36179 -71.06774
678 02115 Boston    MA 42.34308 -71.09268
679 02116 Boston    MA 42.34962 -71.07372
680 02117 Boston    MA 42.33895 -70.91963
681 02118 Boston    MA 42.33872 -71.07276
682 02119 Boston    MA 42.32451 -71.08455
683 02120 Boston    MA 42.33210 -71.09651
684 02121 Boston    MA 42.30745 -71.08127
685 02122 Boston    MA 42.29630 -71.05454
686 02123 Boston    MA 42.33895 -70.91963
687 02124 Boston    MA 42.28713 -71.07156
688 02125 Boston    MA 42.31685 -71.05811
690 02127 Boston    MA 42.33499 -71.04562
691 02128 Boston    MA 42.37830 -71.02550
696 02133 Boston    MA 42.33895 -70.91963
726 02163 Boston    MA 42.36795 -71.12056
757 02196 Boston    MA 42.33895 -70.91963
[...]
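A lookup table like this adds context once it is joined to your own records. The sketch below assumes a hypothetical `customers` data frame whose zip codes were stored as numbers (and so lost their leading zero); the three lookup rows are copied from the output above so the example runs without the zipcode package installed.

```r
# Three rows copied from the zipcode output above, as a stand-in lookup table
zip.lookup = data.frame(zip=c('02108', '02116', '02128'),
                        latitude=c(42.35790, 42.34962, 42.37830),
                        longitude=c(-71.06408, -71.07372, -71.02550),
                        stringsAsFactors=FALSE)

# Hypothetical customer data whose zip codes lost their leading zero
customers = data.frame(id=1:3, zip=c(2116, 2128, 2108))

# Restore the five-character form before joining
customers$zip = sprintf('%05d', as.integer(customers$zip))

# Left join: keep every customer, add coordinates where the zip is known
located = merge(customers, zip.lookup, by='zip', all.x=TRUE)
located
```

Storing zip codes as character strings, as the zipcode data frame does, is what keeps New England codes such as "02108" intact.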

Page 16: Tapping the Data Deluge with R

image credit: http://njarb.com/2012/08/untangle-this-mess-of-wires/

Now let’s turn our attention to tapping into the internet for other data sources

Page 17: Tapping the Data Deluge with R

The two types of data

• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages

• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs

Page 18: Tapping the Data Deluge with R
Page 19: Tapping the Data Deluge with R
Page 21: Tapping the Data Deluge with R
Page 22: Tapping the Data Deluge with R

download.file() if URLs aren’t supported

library(XLConnect)

url = "http://www.fueleconomy.gov/feg/EPAGreenGuide/xls/all_alpha_12.xls"
local.xls.file = 'data/all_alpha_12.xls'

download.file(url, local.xls.file)

wb = loadWorkbook(local.xls.file, create=F)
data = readWorksheet(wb, sheet='all_alpha_12')

View(data)

see R/07-download.file-XLConnect-green.R

Page 23: Tapping the Data Deluge with R

image credit: http://groovynoms.com/2011/07/25/beer-of-the-week-2/

Now, I don’t mean to oversell this next one, but if you’ve spent as much time as I have finding -- and trying to deal with -- interesting data sets on web pages, you might agree that this next function alone is worth the price of admission.

Page 24: Tapping the Data Deluge with R

not even HTML tables are safe

library(XML)
url = 'http://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
state.capitals.df = readHTMLTable(url, which=2)

see R/08-readHTMLTable.R

   State       Abr. Date of statehood Capital     Capital since Land area (mi²) Most populous city? Municipal population
1  Alabama     AL   1819              Montgomery  1846          155.4           No                  205,764
2  Alaska      AK   1959              Juneau      1906          2716.7          No                  31,275
3  Arizona     AZ   1912              Phoenix     1889          474.9           Yes                 1,445,632
4  Arkansas    AR   1836              Little Rock 1821          116.2           Yes                 193,524
5  California  CA   1850              Sacramento  1854          97.2            No                  466,488
6  Colorado    CO   1876              Denver      1867          153.4           Yes                 600,158
7  Connecticut CT   1788              Hartford    1875          17.3            No                  124,512
8  Delaware    DE   1787              Dover       1777          22.4            No                  36,047
9  Florida     FL   1845              Tallahassee 1824          95.7            No                  181,412
10 Georgia     GA   1788              Atlanta     1868          131.7           Yes                 420,003

As you’d expect from a package called “XML”, it parses well-formed XML files.

But I didn’t expect it would do such a good job with HTML.

And I certainly didn’t expect to find a function as handy as readHTMLTable()!
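Scraped tables generally arrive as text, so numbers formatted for human readers (such as "205,764") need a cleanup pass before any arithmetic. The sketch below is not from the talk's script: it uses two rows copied from the output above, with shortened column names chosen here for illustration, and shows one way to do the conversion in base R.

```r
# Two rows copied from the output above, held as character columns (as scraped)
capitals = data.frame(State=c('Alabama', 'Alaska'),
                      Area=c('155.4', '2716.7'),
                      Population=c('205,764', '31,275'),
                      stringsAsFactors=FALSE)

# Strip thousands separators, then convert to numeric
capitals$Population = as.numeric(gsub(',', '', capitals$Population))
capitals$Area = as.numeric(capitals$Area)

# Now arithmetic works, e.g. people per square mile
capitals$Density = capitals$Population / capitals$Area
str(capitals)
```

If the columns come back as factors rather than character vectors, wrap them in as.character() before calling gsub() or as.numeric().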

Page 25: Tapping the Data Deluge with R

image credit: http://www.ebaypartnernetworkblog.com/en/files/2011/05/api1.gif

Page 26: Tapping the Data Deluge with R

The DataMarket Is Open...

Page 27: Tapping the Data Deluge with R

...and couldn’t be easier to access.

library(rdatamarket)

oil.prod = dmseries("http://data.is/nyFeP9")

plot(oil.prod)

see R/09-rdatamarket.R

DataMarket includes its own URL shortener -- like bit.ly, but just for their data.

Long or short, just give dmseries() the URL, and it will download the data set for you.

Page 28: Tapping the Data Deluge with R

Make a withdrawal from the World Bank

> library(WDI)
> WDIsearch('population, total')
          indicator                name
      "SP.POP.TOTL" "Population, total"

> WDIsearch('fertility .*total')
          indicator                                       name
   "SP.DYN.TFRT.IN" "Fertility rate, total (births per woman)"

> WDIsearch('life expectancy .*birth.*total')
          indicator                                      name
   "SP.DYN.LE00.IN" "Life expectancy at birth, total (years)"

> WDIsearch('GDP per capita .*constant')
     indicator        name
[1,] "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)"
[2,] "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)"

see R/10-WDI.R

Page 29: Tapping the Data Deluge with R

Swedish Accent Not Included

data = WDI(country=c('BR', 'CN', 'GB', 'JP', 'IN', 'SE', 'US'),
           indicator=c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
                       'NY.GDP.PCAP.KD'),
           start=1900, end=2010)

library(googleVis)
g = gvisMotionChart(data, idvar='country', timevar='year')
plot(g)

see R/10-WDI.R

Page 30: Tapping the Data Deluge with R

quantmod: the king of symbols

• getSymbols() downloads time series data from source specified by “src” parameter:
– yahoo = Yahoo! Finance
– google = Google Finance
– FRED = St. Louis Fed’s Federal Reserve Economic Data
– oanda = OANDA Forex Trading & Exchange Rates
– csv
– MySQL
– RData

Page 31: Tapping the Data Deluge with R

Hello, FRED

55,000 economic time series from 45 sources:

• Automatic Data Processing, Inc.
• Banca d'Italia
• Banco de Mexico
• Bank of Japan
• Bankrate, Inc.
• Board of Governors of the Federal Reserve System
• BofA Merrill Lynch
• British Bankers' Association
• Central Bank of the Republic of Turkey
• Chicago Board Options Exchange
• CredAbility Nonprofit Credit Counseling & Education
• Deutsche Bundesbank
• Dow Jones & Company
• Eurostat
• Federal Financial Institutions Examination Council
• Federal Housing Finance Agency
• Federal Reserve Bank of Chicago
• Federal Reserve Bank of Kansas City
• Federal Reserve Bank of Philadelphia
• Federal Reserve Bank of St. Louis
• Freddie Mac
• Haver Analytics
• Institute for Supply Management
• International Monetary Fund
• London Bullion Market Association
• National Association of Realtors
• National Bureau of Economic Research
• Organisation for Economic Co-operation and Development
• Reserve Bank of Australia
• Standard and Poor's
• Swiss National Bank
• The White House: Council of Economic Advisors
• The White House: Office of Management and Budget
• Thomson Reuters/University of Michigan
• U.S. Congress: Congressional Budget Office
• U.S. Department of Commerce: Bureau of Economic Analysis
• U.S. Department of Commerce: Census Bureau
• U.S. Department of Energy: Energy Information Administration
• U.S. Department of Housing and Urban Development
• U.S. Department of Labor: Bureau of Labor Statistics
• U.S. Department of Labor: Employment and Training Administration
• U.S. Department of the Treasury: Financial Management Service
• U.S. Department of Transportation: Federal Highway Administration
• Wilshire Associates Incorporated
• World Bank

Page 32: Tapping the Data Deluge with R

BLS Jobless data (FRED) + S&P (Yahoo!)

library(quantmod)

initial.claims = getSymbols('ICSA', src='FRED', auto.assign=F)

sp500 = getSymbols('^GSPC', src='yahoo', auto.assign=F)

# Convert quotes to weekly and fetch Cl() closing price
sp500.weekly = Cl(to.weekly(sp500))

see R/11-quantmod.R
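Converting the daily S&P quotes to weekly closes puts them on the same footing as the weekly jobless claims series. The idea behind Cl(to.weekly(...)) can be sketched in base R as "keep the last price in each week". The prices below are made up, and this is an illustration of the concept rather than quantmod's implementation, whose week boundaries and index labels may differ.

```r
# Made-up daily closing prices for two weeks (Mon-Fri), not real S&P data
dates = as.Date('2012-09-17') + c(0:4, 7:11)
close = c(100, 101, 102, 103, 104, 105, 106, 107, 108, 109)

# Label each day with the Monday that starts its week
week = cut(dates, breaks='week')

# Keep the last observation in each week: the weekly close
weekly.close = tapply(close, week, function(x) x[length(x)])
weekly.close
```

Once both series are weekly, they can be aligned by date and compared or plotted together.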

Page 33: Tapping the Data Deluge with R

Resources

• Expanded code snippets and all data for this talk
– http://bit.ly/pawdata

• R Data Import/Export manual
– http://cran.r-project.org/doc/manuals/R-data.html

• CRAN: Comprehensive R Archive Network
– package lists: http://cran.r-project.org/web/packages/
– Featured: XLConnect, foreign, RMySQL, XML, quantmod, rdatamarket, WDI, quantmod
– Database: RODBC, DBI, RJDBC, ROracle, RPostgreSQL, RSQLite, RMongo, RCassandra
– Data sets: zipcode, agridat, GANPAdata
– Data access: crn, rgbif, RISmed, govdat, myepisodes, msProstate, corpora

• rhbase from the RHadoop project
– https://github.com/RevolutionAnalytics/RHadoop

Page 34: Tapping the Data Deluge with R

When I first said that R is my “Swiss Army Knife” for data, you might have pictured this:

Page 35: Tapping the Data Deluge with R

but now you know I was really thinking this:

Page 36: Tapping the Data Deluge with R

Thank you!

by Jeffrey Breen
Principal, Think Big Academy

email: [email protected]
blog: http://jeffreybreen.wordpress.com

Twitter: @JeffreyBreen

Code & Data on github: http://bit.ly/pawdata