Session 3 Wharton Summer Tech Camp

Posted on 14-Jan-2016


Data Acquisition: Companies & Wharton Data

Basic web scraping
Using APIs

Set-up problems

• Mac – mostly no problems, due to the Linux-like environment and great support
• Windows on MobaXterm – you can use apt-cyg to install everything:

– apt-cyg install python
– apt-cyg install idle
– apt-cyg install idlex

REGEX CHALLENGE!

• 3 regex challenges
• 1 is from a well-known t-shirt joke (if you know this, don’t say anything)
• 2 are song lyrics (tried to find well-known songs)
• Raise your hand to say the answer

Challenge 1

A t-shirt people wear

r"(bb|[^b]{2})"

Difficulty: *
Hint: Phrase

Answer: “To be or not to be”

Challenge 2

Difficulty: *****
Hint: This is literally the entire lyric for the song
Hint 2: It’s a song by the music duo who created the latest Record of the Year

r"(\w+ [a-z]{3} w..ld ){144}"
(partially revealed: r"(ar\w{3} [a-z]{3} w..ld ){144}")

Answer: “Around the World” – by Daft Punk

Challenge 3

Difficulty: **
Hint: Lyric of an old song

r"ah, ((ba ){4} (bar){2}a an{2} \s)+"

Answer: “Ah, Ba ba ba ba Barbara Ann~ Ah, Ba ba ba ba Barbara Ann~”

Song Phrases

Ever since I learned regex, I have been thinking that many Daft Punk songs are optimized for regex. Each of these captures a song’s lyrics in their entirety with one simple regex:

• r"(Around the world ){144}" – Around the World
• r"((buy|use|break|fix|trash|change) it )+ now upgrade it" – Technologic
• r"(((work|make|do|makes|more) (it|us|than) (harder|better|faster|stronger|ever))+ hour after our work is never over. \s)+" – Harder, Better, Faster, Stronger

THE BIGGEST concern for doctoral students doing empirical work (years 2-4):
“WHERE AND HOW DO I GET THE DATA?!”

Mr. Data: “I believe what you are experiencing is frustration.”

Data sources
1. Companies
2. Wharton organizations
3. Scraping the web
4. APIs: application programming interface

DATA SOURCES

1. Companies – HARD, UNIQUE
– Hardest, but once you get a good company, you are set for a paper or two or more…
2. Wharton organizations
– WRDS (EASY, COMMON – great for auxiliary data). Other people can also easily access this data, and the data have probably been used already.
– WCAI (EASY, UNIQUE). The data is actually pretty great, and only a few select teams get it after a proposal review process.
3. Scraping the web (WGET/regex/tools) – MEDIUM, MEDIUM
– Relatively easy, but painful for big projects and sometimes not allowed, depending on the website.
4. APIs: application programming interface – EASY, COMMON
– Easy, but restricted to what the company has made available.

Resources for Public Data

• There are many lists of lists for public data.
• Find a link to a list of lists for data on the course website under “resources for learning”.
• If you have a good source, please email me so I can link it on the web.

Companies

Quick tips
• Don’t be afraid to contact random companies.
• Attend conferences and network like an MBA – think of it like a game.
• Send a short 2-3 page proposal suggesting a research collaboration.
• Read about the company you are contacting and make sure to offer something that interests the company.
• Low success probability – among the many proposals I’ve sent (about 30+ if you count emails):
– Mostly no response.
– 1 company I was working with for 10 months just dropped the ball after the CTO changed twice.
– 4 easy-to-get datasets – not useful or suitable for research.
– 2 very useful datasets I am currently using/working with.
– 1 company disputing an NDA.
• NDAs: you can request help from the UPenn legal team here:
– https://medley05.isc-seo.upenn.edu/researchInventory/jsp/fast2.do?bhcp=1

NDAs are super important
• A horror story I heard:
– A student worked with a company for 1+ year, and then the company decided the result was too good to publish; they wanted it to be a trade secret/IP.
– The NDA that was signed was bad.
– No publication.
– Most NDAs are OK, but some are not. If yours is bad, get help from that link and negotiate.
– Look out for “work for hire” types of NDAs.

Wharton Specific

You probably heard about these organizations at the Wharton doctoral orientation.
• WRDS: Wharton Research Data Services
– https://wrds-web.wharton.upenn.edu/wrds/
• WCAI: Wharton Customer Analytics Initiative
– http://www.wharton.upenn.edu/wcai/
• Other organizations exist, but mostly for conferences and not for data.
– http://www.wharton.upenn.edu/faculty/research-centers-and-initiativ.cfm

Basic Web Scraping

Caveats

• I spent time writing and testing scraping code for this course: you input a list of music artists in CSV format, and the script queries allmusic.com to obtain information such as the genres associated with the artists.
• Written in March 2013.
• In July, it broke because allmusic.com had updated their website…
• This is one problem with scraping: you never know when it will stop working, and then you have to rewrite.

Outline of basic scraping

1. CRAWLING: Instead of using a web browser, use scripts to access the HTML (XML, etc.), or crawl through the website recursively and download all the HTML or text files or whatever. (WGET, Python, or any language such as PHP)

2. PATTERN SEARCHING: The researcher looks at the raw HTTP output, finds where the required data is, and figures out what the pattern is. (Developer’s Toolbox for Firefox)

3. EXTRACTION: Use a text-extraction tool to extract the information and store it! (If it’s a structured format such as XML, use the appropriate tools for that format.) (regex, Apache Lucene, SED, AWK, etc.)

4. Go publish papers with the data.
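Steps 2 and 3 can be sketched in a few lines of Python. Here the crawl step is replaced by an inline HTML snippet so the sketch is self-contained; the tags and class names are made up for illustration, not from any real site.

```python
import re

# Pretend this is raw HTML downloaded in step 1 (a made-up snippet).
html = """
<div class="artist"><a href="/a/1">Daft Punk</a><span>Electronic</span></div>
<div class="artist"><a href="/a/2">Beach Boys</a><span>Rock</span></div>
"""

# Step 2: after inspecting the raw HTML, we notice each artist name sits
# in an <a> tag, immediately followed by a genre in a <span> tag.
pattern = r'<a href="[^"]+">([^<]+)</a><span>([^<]+)</span>'

# Step 3: extract (artist, genre) pairs with regex and store them.
rows = re.findall(pattern, html)
print(rows)  # [('Daft Punk', 'Electronic'), ('Beach Boys', 'Rock')]
```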

Alternatives

• Want something easier, or with a GUI?
– MOZENDA: Wharton has a license, and it’s cheap.
• More advanced scraping:
– We will cover this next week with Scrapy.
• There are many other tools and packages for this:
– http://en.wikipedia.org/wiki/Web_crawler
– http://stackoverflow.com/questions/419235/anyone-know-of-a-good-python-based-web-crawler-that-i-could-use

Tools used in our examples

• WGET + Python• REGEX• HTML/DOM inspector –Firefox has Web Developer's Toolbox

which is an add-on you can download. –This is useful for looking for pattern of

data you want to extract

Scraping Example 1

• Facebook SEC filing exploration
– Purpose: exploration before research
– What this toy example does: get SEC filings for Facebook and extract certain parts.
– I am interested in reading a few words before and after wherever “shares” is mentioned.

DOWNLOAD HTML/TXT/JPG/ETC.

• WGET: “GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive command-line tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.”

Fire up edgarFBarchive.sh and extractPhrase.py

WGET FB’s SEC filings

wget -r -l1 -H -t1 -nd -N -np -A.txt -e robots=off http://www.sec.gov/Archives/edgar/data/1326801/

-r -H -l1 -np    Tell wget to download recursively.
-nd              No directories: keep all downloaded files in one folder.
-A.txt           Only download .txt files.
-e robots=off    Ignore robots.txt. (Avoid this option if wget works without it; if you do use it, make sure to also use --wait, or your IP may get banned.)

Caveats
• WGET only works well for certain websites. You can use it to download all the photos, etc., but if your script makes too many requests, they may ban your IP. You can specify delayed requests.
• Once a website gets fancy, you have to use other tools such as PHP or Python packages:
– ASP
– POST (as opposed to the GET method in HTTP)
– JavaScript-produced sites
– AJAX sites
• This is a toy example for learning. You can still use this method for simple scraping, but consider learning pro tools (we’ll cover the basics of one such tool next week).
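The delayed-requests idea can be sketched in Python. The fetch function below is a stub standing in for a real download call (e.g. urllib), so this shows only the pacing logic, not a working scraper:

```python
import time

def fetch(url):
    # Stub standing in for a real request, e.g. urllib.request.urlopen(url).read()
    return "<html>...</html>"

def polite_fetch_all(urls, delay=1.0):
    """Fetch each URL, sleeping between requests so we don't hammer
    the server (and get our IP banned) -- like wget's --wait option."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between consecutive requests
        pages.append(fetch(url))
    return pages

pages = polite_fetch_all(["http://example.com/a", "http://example.com/b"], delay=0.1)
print(len(pages))  # 2
```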

Scraping Example 2

• Jambase.com concert venues
– This example takes a list of artists and queries jambase.com to get concert venue information.
– Another toy example.

Fire up getConcertVenue.py

API (Application Programming Interface)

Programmable Web

• programmableweb.com
– Search engine for freely available APIs online
– http://blog.programmableweb.com/2012/02/15/40-real-estate-apis-zillow-trulia-walk-score/
– Usage examples

• Usually, you have to apply for API keys from the website or the company offering the data.

• Mostly free (limited queries).

Idea behind an API

1. You obtain a key from the company offering the data.
2. You make requests for data (in many different ways, depending on the API).
3. The company’s server grants you the data.
4. Data analysis.

Commonly Used Protocol in APIs
• REST (REpresentational State Transfer)
– Guidelines for client-server interaction for exchanging data, as opposed to the alternative, SOAP.
• I recommend this funny explanation of REST vs. SOAP (a diagram involving Martin Lawrence):
– http://stackoverflow.com/questions/209905/representational-state-transfer-rest-and-simple-object-access-protocol-soap

• Based on HTTP.
• You request data via the HTTP GET method (http://www.w3schools.com/tags/ref_httpmethods.asp), and the server gives you the data.
– HTTP-URL?QueryStrings
– QueryStrings: Field=Value pairs separated by &
– E.g. http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s
– v stands for video = some value
– t stands for start time = some value

• Usual data formats:
– XML (eXtensible Markup Language): http://www.w3schools.com/xml/
– JSON (JavaScript Object Notation): http://www.w3schools.com/json/
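Query strings like the one above can be built and taken apart with Python’s standard library (Python 3 shown here, using the YouTube URL from the slide):

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Build a query string from Field=Value pairs.
qs = urlencode({"v": "5pidokakU4I", "t": "0m38s"})
url = "http://www.youtube.com/watch?" + qs
print(url)  # http://www.youtube.com/watch?v=5pidokakU4I&t=0m38s

# Parse it back into a dict of fields.
fields = parse_qs(urlparse(url).query)
print(fields)  # {'v': ['5pidokakU4I'], 't': ['0m38s']}
```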

XML Example

<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON>
    <BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE>
    <AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON>
    <BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE>
    <LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE>
    <AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>

Many XML-related packages: http://wiki.python.org/moin/PythonXml
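For instance, Python’s built-in xml.etree.ElementTree can pull values out of the catalog above:

```python
import xml.etree.ElementTree as ET

xml_data = """<CATALOG>
  <PLANT>
    <COMMON>Bloodroot</COMMON><BOTANICAL>Sanguinaria canadensis</BOTANICAL>
    <ZONE>4</ZONE><LIGHT>Mostly Shady</LIGHT>
    <PRICE>$2.44</PRICE><AVAILABILITY>031599</AVAILABILITY>
  </PLANT>
  <PLANT>
    <COMMON>Columbine</COMMON><BOTANICAL>Aquilegia canadensis</BOTANICAL>
    <ZONE>3</ZONE><LIGHT>Mostly Shady</LIGHT>
    <PRICE>$9.37</PRICE><AVAILABILITY>030699</AVAILABILITY>
  </PLANT>
</CATALOG>"""

root = ET.fromstring(xml_data)
# Extract the common name and price of each plant.
plants = [(p.findtext("COMMON"), p.findtext("PRICE")) for p in root.findall("PLANT")]
print(plants)  # [('Bloodroot', '$2.44'), ('Columbine', '$9.37')]
```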

JSON Example (just like Python)

newObject = {
    "first": "Ted",
    "last": "Logan",
    "age": 17,
    "sex": "M",
    "salary": 0,
    "registered": false,
    "interests": ["Van Halen", "Being Excellent", "Partying"]
}

Main Python module: import json
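The json module turns JSON text like the example above into ordinary Python objects. One small caveat to “just like Python”: JSON spells its literals false/true/null, which become False/True/None.

```python
import json

# JSON text from the slide (note JSON's lowercase false).
text = """{
  "first": "Ted", "last": "Logan", "age": 17, "sex": "M",
  "salary": 0, "registered": false,
  "interests": ["Van Halen", "Being Excellent", "Partying"]
}"""

obj = json.loads(text)      # JSON string -> Python dict
print(obj["registered"])    # False
print(obj["interests"][0])  # Van Halen
```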

Yahoo Finance Data Example

Python Package Wrapper

• Yahoo provides a simple web interface for anyone to download stock information via URL:
– http://finance.yahoo.com/d/quotes.csv?s=%s&f=%s
– s: symbol, e.g. “GOOG”
– f: stat (e.g. l1 means last trade price)

• http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=l1

• More info here:
– http://www.gummy-stuff.org/Yahoo-data.htm (ordered to take down)
– http://web.archive.org/web/20140325063520/http://www.gummy-stuff.org/Yahoo-data.htm
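Filling in the %s placeholders gives the concrete URL on the slide. (Yahoo has since retired this endpoint, so this only illustrates the string construction, not a live request.)

```python
# The URL template from the slide, with s = symbol and f = stat.
template = "http://finance.yahoo.com/d/quotes.csv?s=%s&f=%s"

url = template % ("GOOG", "l1")  # l1 = last trade price
print(url)  # http://finance.yahoo.com/d/quotes.csv?s=GOOG&f=l1
```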

This Wrapper Package Does It for You

• ystockquote
– https://pypi.python.org/pypi/ystockquote/0.2.3
– https://github.com/cgoldberg/ystockquote
• See the simple source code to learn.
• Open up ystock.py.

Example: YQL

• http://developer.yahoo.com/yql/
• APIs are written by individual companies and support different I/O and usually different languages.
• Yahoo Query Language is a simple interface that Yahoo has made available to developers, combining several APIs.

• “Yahoo! Query Language (YQL) enables you to access Internet data with SQL-like commands.”

• Apply for your API key:
– http://developer.yahoo.com/yql/

Our example: BBYOPEN

• https://bbyopen.com/bbyopen-apis-overview
• Retail information (Best Buy is providing this API):
– Archive query: returns a single file containing all attributes for all items exposed by the given API.
– Basic query: returns information about a single item.
– Advanced query: returns information about one or more items according to your specifications.
– Store availability query: returns information about products available at specific stores.

• API overview:
– https://developer.bestbuy.com/get-started

Basic Query

Basic query structure:
http://api.remix.bestbuy.com/API/Item.Format?show=&apiKey=Key

API    – One of {products, stores, reviews, categories}
Item   – The value of the fundamental attribute for the selected API:
         o products – sku
         o stores – storeId
         o reviews – id
         o categories – id
Format – One of {xml, json}
show=  – (optional) The item attributes you want displayed
Key    – Your API key
Note: show= and Key can be specified in either order.
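A small helper can fill in the query structure above. The SKU and API key below are placeholders for illustration, not real values.

```python
# Build a basic query URL following the structure on the slide:
# http://api.remix.bestbuy.com/API/Item.Format?show=&apiKey=Key
def basic_query(api, item, fmt, key, show=None):
    url = "http://api.remix.bestbuy.com/%s/%s.%s?apiKey=%s" % (api, item, fmt, key)
    if show:
        url += "&show=" + ",".join(show)  # show= and apiKey can go in either order
    return url

# Placeholder SKU and key, just to show the construction.
url = basic_query("products", "1234567", "json", "YOUR_API_KEY",
                  show=["name", "salePrice"])
print(url)
# http://api.remix.bestbuy.com/products/1234567.json?apiKey=YOUR_API_KEY&show=name,salePrice
```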

Basic Query Examples

API example

• Open up bestbuyAPI.py

Lab session

• For the next 10-15 minutes, choose your favorite website and try to scrape a few items

• We’ll do this again with Scrapy

Data isn’t impossibly hard to get after all. There are many routes, but it could take a LONG time

(especially if you are going the company route). START EARLY and you’ll get that data.

DATA!

Next Session

• Hugh will be speaking about HPCC

• After that, we will learn the basics of Scrapy

• Brush up on your HTML and look into XPATH
– W3school.com is the best

• Intro into Big Data and Empirical Business Research
