Python and Web Data Extraction: Introduction (2016/07/02)
Alvin Zuyin Zheng
http://community.mis.temple.edu/zuyinzheng/
Outline
• Overview
• Steps in Web Scraping
  – Fetching a webpage
  – Downloading the webpage
  – Extracting information from the webpage
  – Storing information in a file
• Tutorial 2: Extracting Textual Data from 10-K
Web scraping typically consists of:
Step 1. Fetching a webpage
Step 2. Downloading the webpage (optional)
Step 3. Extracting information from the webpage
Step 4. Storing information in a file
Example: 10-K
URL: https://www.sec.gov/Archives/edgar/data/1288776/000165204416000012/goog10-k2015.htm
Example: Table with Links
URL: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100
Company search page: https://www.sec.gov/edgar/searchedgar/companysearch.html
Fetching a Webpage
• Use the urllib2 package to open a webpage
  – No manual installation is needed (urllib2 ships with Python 2)
>>> import urllib2
>>> urlLink = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100"
>>> pageRequest = urllib2.Request(urlLink)
>>> pageOpen = urllib2.urlopen(pageRequest)
>>> pageRead = pageOpen.read()
>>>
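The session above uses Python 2. In Python 3, urllib2 no longer exists; its functions live in urllib.request. A minimal sketch of the same fetch step is shown below. The User-Agent string and the function names build_request and fetch_page are hypothetical choices for illustration, and the assumption that the server expects a descriptive User-Agent is not taken from the slides.

```python
import urllib.request

# Hypothetical identifier; replace it with your own name and contact details.
DEFAULT_USER_AGENT = "ExampleResearchBot/0.1 (contact: name@example.edu)"


def build_request(url, user_agent=DEFAULT_USER_AGENT):
    """Create a Request object without contacting the server yet."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch_page(url, timeout=30):
    """Open the URL and return the raw page content as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()


# Building the request is safe offline; fetch_page(url) performs the download.
url = ("https://www.sec.gov/cgi-bin/browse-edgar"
       "?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100")
request = build_request(url)
```

Calling fetch_page(url) would return bytes, which play the same role as pageRead in the Python 2 session.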
Downloading a Webpage
• We often want to download the webpages because
  – We want to limit web requests
  – Websites may change over time
  – We want to replicate research
>>> import os
>>> os.chdir('/Users/alvinzuyinzheng/Dropbox/PythonWorkshop/scripts/')  # Change your working directory
>>> htmlname = "goog10-k2015.htm"
>>> htmlfile = open(htmlname, "wb")
>>> htmlfile.write(pageRead)
>>> htmlfile.close()
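One way to act on the "limit web requests" motivation is to reuse the saved copy whenever it exists. The sketch below is written for Python 3; the helper name load_or_download and the stand-in fake_fetch function are hypothetical, and a temporary folder is used so the example runs without touching the network.

```python
import os
import tempfile


def load_or_download(path, fetch):
    """Return the bytes stored at path; call fetch() only if the file is missing."""
    if os.path.exists(path):
        with open(path, "rb") as saved:
            return saved.read()
    data = fetch()
    with open(path, "wb") as saved:  # "wb" because the page content is bytes
        saved.write(data)
    return data


# Demonstration with a stand-in for the real download step.
fetch_calls = []


def fake_fetch():
    fetch_calls.append(1)
    return b"<html><body>demo page</body></html>"


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "goog10-k2015.htm")
    first = load_or_download(path, fake_fetch)   # file missing: fetches and saves
    second = load_or_download(path, fake_fetch)  # file present: reads the local copy
```

In real use, fetch would be a function that downloads the page, so repeated runs of a script hit the website only once per file.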
What is HTML
• When performing web data extraction, we deal with HTML files
  – HyperText Markup Language
• HTML specifies a set of tags that identify the structure and content type of web pages
  – Tags: surrounded by angle brackets < >
  – Most tags come in pairs, marking a beginning and an ending
    • E.g., <title> and </title> enclose the title of a page
HTML Layout
<html>
<head><title>HTML examples</title></head>
<body><h1>HTML demonstration</h1><p>This is a sample HTML page.</p></body>
</html>
File: html_example.html (open it in a browser to see the rendered page)
More on HTML tags: check the HTML tutorial from W3Schools.
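To see how the paired tags in this sample page nest, the sketch below walks through the same HTML with Python 3's standard-library html.parser module (used here instead of BeautifulSoup only because it needs no installation). The class name TagLogger is a hypothetical choice for illustration.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every opening tag, closing tag, and piece of text in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # ignore whitespace between tags
            self.events.append(("text", data.strip()))


html_doc = """<html>
<head><title>HTML examples</title></head>
<body><h1>HTML demonstration</h1><p>This is a sample HTML page.</p></body>
</html>"""

parser = TagLogger()
parser.feed(html_doc)
parser.close()

for kind, value in parser.events:
    print(kind, value)
```

Each "start" event is eventually followed by a matching "end" event, which is the pairing described on the previous slide.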
View HTML Source Code
• To inspect the HTML page in detail, you can do one of the following:
  – In Firefox/Chrome: Right click > View Page Source
  – Open the HTML file in a text editor (e.g., Notepad++)
Example: 10-K
[Screenshots: the 10-K webpage viewed in a browser, and its HTML source code]
Inspect Elements
• To inspect a specific element on the HTML page, you can do one of the following:
  – In Chrome: Right click on the element > Inspect
  – In Firefox: Right click on the element > Inspect Element
Example: Table with Links
[Screenshots: the table of filings viewed in a browser, and its HTML source code]
Ways to Extract Data from HTML
• The bs4 (BeautifulSoup) package
  – Used for pulling data out of HTML and XML files
• The re (regular expression) package
  – Can be used for both HTML and plain text files
The bs4 (BeautifulSoup) Package
• Install the package from your command-line interface:
  pip install beautifulsoup4
• Import the package in Python:
  from bs4 import BeautifulSoup
Visit here to learn more about BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/
Example: Extracting Links from a Table
Fetching the Webpage with urllib2
Page to fetch: https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100
In Python:
>>> import urllib2
>>> urlLink = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=GOOG&type=10-K&dateb=&owner=exclude&count=100"
>>> pageRequest = urllib2.Request(urlLink)
>>> pageOpen = urllib2.urlopen(pageRequest)
>>> pageRead = pageOpen.read()
Extracting the Links with BeautifulSoup
In Python:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(pageRead, "html.parser")
>>> table = soup.find("table", {"class": "tableFile2"})  # the table that lists the filings
>>> links = []
>>> for row in table.find_all("tr"):
...     cells = row.find_all("td")
...     if len(cells) == 5:  # skip the header row, which has no <td> cells
...         link = cells[1].find("a", {"id": "documentsbutton"})
...         docLink = "https://www.sec.gov" + link['href']
...         links.append(docLink)
...
Extracting the Links with BeautifulSoup
What we will get:
https://www.sec.gov/Archives/edgar/data/1288776/000119312516520367/0001193125-16-520367-index.htm
https://www.sec.gov/Archives/edgar/data/1288776/000165204416000012/0001652044-16-000012-index.htm
https://www.sec.gov/Archives/edgar/data/1288776/000128877615000008/0001288776-15-000008-index.htm
https://www.sec.gov/Archives/edgar/data/1288776/000128877614000020/0001288776-14-000020-index.htm
……
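The link-extraction loop can be tried offline on a small, made-up table before running it against the live page. In the Python 3 sketch below, SAMPLE_HTML and its href value are invented for illustration and only imitate the layout assumed by the slides (a table with class "tableFile2", five cells per row, and a link with id "documentsbutton" in the second cell). The function name extract_index_links is hypothetical. Unlike the slide code, it skips rows that have no link instead of raising an error.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Invented sample that mimics the assumed table layout; not real EDGAR output.
SAMPLE_HTML = """
<table class="tableFile2">
  <tr><th>Filings</th><th>Format</th><th>Description</th><th>Filing Date</th><th>File Number</th></tr>
  <tr>
    <td>10-K</td>
    <td><a href="/Archives/example/0000000000-16-000001-index.htm" id="documentsbutton">Documents</a></td>
    <td>Annual report</td><td>2016-02-11</td><td>001-00001</td>
  </tr>
  <tr>
    <td>10-K</td>
    <td>No link in this row</td>
    <td>Annual report</td><td>2015-02-09</td><td>001-00001</td>
  </tr>
</table>
"""


def extract_index_links(html, base_url="https://www.sec.gov"):
    """Return absolute URLs of the 'documentsbutton' links in the filings table."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "tableFile2"})
    if table is None:  # page layout differs from what we expect
        return []
    links = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) != 5:  # header rows contain <th>, not <td>
            continue
        anchor = cells[1].find("a", {"id": "documentsbutton"})
        if anchor is None or not anchor.get("href"):
            continue  # row without a usable link
        links.append(urljoin(base_url, anchor["href"]))
    return links


links = extract_index_links(SAMPLE_HTML)
print(links)
```

Using urljoin rather than string concatenation is a small design choice: it also handles href values that are already absolute.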
Extracting Textual Data Using re
• We talked about regular expressions
  – A powerful text manipulation tool for searching, replacing, and parsing text patterns
• In Python, you need to load the "re" package
>>> import re
Example: Item 1 of 10-K
Inspect the element "ITEM 1."
HTML source code:
<div style="text-align:left;font-size:10pt;"><font style="font-family:Arial;font-size:10pt;font-weight:bold;">ITEM 1.</font>
</div>
Matching "ITEM 1." together with its bold styling ensures that we find the subtitle, not a mention of "Item 1." in the main text
Extracting Textual Data Using re
# assume we have pre-processed the webpage
# and the page content is stored in a variable "page"
>>> import re
>>> regex = r"bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\."  # raw string keeps the backslashes intact
>>> match = re.search(regex, page, flags=re.IGNORECASE)  # add re.DOTALL if page still contains line breaks
# returns everything between "Item 1." and "Item 1A."
>>> match.group(1)
Anatomy of the RE Pattern
• We used the following pattern:
'bold;\">\s*Item 1\.(.+?)bold;\">\s*Item 1A\.'
  – bold;\">\s*Item 1\. matches the "Item 1." subtitle
  – bold;\">\s*Item 1A\. matches the "Item 1A." subtitle
  – (.+?) represents everything else in between, which will be extracted
• What does each element mean?
Regular Expression    Corresponding Text
\"                    "
\s                    Whitespace (such as space, tab, newline)
*                     Repeats the preceding character zero or more times
\.                    . (dot)
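The pattern can be tested on a short, made-up string before applying it to a full 10-K. In the Python 3 sketch below, the page text is invented for illustration and is much simpler than a real filing. The pattern is written as a raw string in single quotes, so the double quote needs no backslash; re.DOTALL is added because this sample still contains line breaks. The tag-stripping step at the end is an extra cleanup that is not part of the slides.

```python
import re

# Invented miniature "filing"; a real 10-K page is far longer and messier.
page = (
    '<div><font style="font-weight:bold;">ITEM 1.</font></div>\n'
    '<div>Business overview\ntext.</div>\n'
    '<div><font style="font-weight:bold;">ITEM 1A.</font></div>\n'
    '<div>Risk factors text.</div>'
)

pattern = re.compile(
    r'bold;">\s*Item 1\.(.+?)bold;">\s*Item 1A\.',
    flags=re.IGNORECASE | re.DOTALL,  # DOTALL lets "." match line breaks too
)

match = pattern.search(page)
item1_html = match.group(1) if match else ""

# Cleanup: drop the unfinished tag at the end, then the remaining tags.
item1_html = re.sub(r"<[^>]*$", "", item1_html)
item1_text = re.sub(r"<[^>]+>", " ", item1_html)
item1_text = " ".join(item1_text.split())  # collapse runs of whitespace

print(item1_text)
```

Checking whether match is None before calling group(1) avoids an error on filings where the subtitle is formatted differently.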
More Basic Patterns
Symbol     Meaning
^          Matches the beginning of a line
$          Matches the end of a line
. (dot)    Matches any character except a newline
\s         Matches a single whitespace character
*          Repeats the preceding character zero or more times
+          Repeats the preceding character one or more times
?          Repeats the preceding character zero or one time
[xyz]      Matches any one of x, y, z
\w         Matches a letter, digit, or underscore
\d         Matches a digit [0-9]
()         Indicates where string extraction is to start and end
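A few of these symbols can be tried directly with Python 3's re module. The sample sentence below is made up for illustration.

```python
import re

text = "Form 10-K filed 2016-02-11; Form 10-Q filed 2015-10-29"

# \d and +: one or more digits, three groups separated by hyphens
dates = re.findall(r"\d+-\d+-\d+", text)

# () and \w: extract only the part inside the parentheses
forms = re.findall(r"Form (\w+-\w)", text)

# [xyz]: either K or Q after "10-"
form_types = re.findall(r"10-[KQ]", text)

# . (dot): any single character between "f" and "led"
dot_matches = re.findall(r"f.led", text)

# ?: the "s" is optional
optional_s = re.findall(r"Forms?", "Form and Forms")

# ^ and $: anchor the match to the start or end
starts_with_form = re.search(r"^Form", text) is not None
ends_with_digit = re.search(r"\d$", text) is not None

print(dates, forms, form_types, dot_matches, optional_s)
```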
Storing Information to a CSV File
# Previously we have a list of links extracted
# and stored in a variable "links"
>>> import csv
>>> csvOutput = open("IndexLinks.csv", "wb")  # binary mode is the Python 2 convention for csv files
>>> csvWriter = csv.writer(csvOutput, quoting=csv.QUOTE_NONNUMERIC)
>>> for link in links:
...     csvWriter.writerow([link])
...
>>> csvOutput.close()
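In Python 3, the csv module expects a text-mode file opened with newline="" rather than "wb". A minimal sketch of the same storing step follows; the function names write_links and read_links and the example URLs are hypothetical, and a temporary folder is used so the example leaves no files behind.

```python
import csv
import os
import tempfile


def write_links(links, path):
    """Write one link per row, quoting every non-numeric field."""
    with open(path, "w", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file, quoting=csv.QUOTE_NONNUMERIC)
        for link in links:
            writer.writerow([link])


def read_links(path):
    """Read the links back as a list of strings."""
    with open(path, newline="", encoding="utf-8") as csv_file:
        return [row[0] for row in csv.reader(csv_file) if row]


# Placeholder links for the demonstration.
example_links = [
    "https://www.example.com/filing-1-index.htm",
    "https://www.example.com/filing-2-index.htm",
]

with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "IndexLinks.csv")
    write_links(example_links, csv_path)
    restored = read_links(csv_path)
    with open(csv_path, encoding="utf-8") as raw_file:
        first_line = raw_file.readline().strip()
```

The with statement closes the file automatically, so no explicit close() call is needed.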
Tutorial 2: Extracting Textual Data from 10-K
• Install the BeautifulSoup package:
  pip install beautifulsoup4
• Download the following files from our website, and put them into the same folder:
  – 1GetIndexLinks.py
  – 2Get10kLinks.py
  – 3DownLoadHTML.py
  – 4ReadHTML.py
  – CompanyList.csv
Tutorial 2: Extracting Textual Data from 10-K
• Changing the working directory
  – For each of the four scripts, change the working directory to where you put the company list (CompanyList.csv) by changing the following line:
    os.chdir('/Users/alvinzuyinzheng/Dropbox/PythonWorkshop/scripts/')
• Run each Python script one by one
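A mistyped folder path is a common reason the scripts fail to find CompanyList.csv. The Python 3 sketch below checks for the file before changing directory. The helper name enter_workdir is hypothetical and is not part of the tutorial scripts, and the one-line CSV content written in the demonstration is a placeholder.

```python
import os
import tempfile
from pathlib import Path


def enter_workdir(folder, required_file="CompanyList.csv"):
    """Change into folder, but only if the required file is really there."""
    folder = Path(folder).expanduser().resolve()
    if not (folder / required_file).is_file():
        raise FileNotFoundError(f"{required_file} not found in {folder}")
    os.chdir(folder)
    return folder


# Demonstration in a temporary folder with a placeholder company list.
original_dir = Path.cwd()
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "CompanyList.csv").write_text("GOOG\n")
    entered = enter_workdir(tmp)
    now_in = Path.cwd().resolve()
    os.chdir(original_dir)  # leave the folder before it is deleted
```

In the tutorial scripts, the call would take the path to your own folder in place of tmp.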
Other Resources
• Books:
  – Web Scraping with Python: Collecting Data from the Modern Web (by Ryan Mitchell)
  – Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More (by Matthew A. Russell)
• BeautifulSoup Documentation:
  – https://www.crummy.com/software/BeautifulSoup/bs4/doc/