Introduction to programming using Python
Session 9, Matthieu Choplin
http://moodle.city.ac.uk/
Objectives
- Quick review of what HTML is
- The find() string method
- Regular expressions
- Installing external libraries
- Using a web parser: BeautifulSoup
- Submitting data to a form using MechanicalSoup
- Fetching data in real time
The HTML language
The primary language of information on the internet is HTML. Every web page is written in HTML. To see the source code of the web page you are currently viewing, either right click and select "View Page Source", or, from the top menu of your browser, click on View and "View Source".
Example Profile_Aphrodite.htm

<html>
<head>
<meta http-equiv="Content-Type" content="text/html;charset=windows-1252">
<title>Profile: Aphrodite</title>
<link rel="stylesheet" type="text/css">
</head>
<body bgcolor="yellow">
<center>
<br><br>
<img src="./Profile_Aphrodite_files/aphrodite.gif">
<h2>Name: Aphrodite</h2>
<br><br>Favorite animal: Dove
<br><br>Favorite color: Red
<br><br>Hometown: Mount Olympus
</center>
</body>
</html>
Grab all HTML from a web page
What is the type of object that is returned?

from urllib.request import urlopen
my_address = "http://mattchoplin.com/python_city/practice/Profile_Aphrodite.htm"
html_page = urlopen(my_address)
html_text = html_page.read().decode('utf-8')
print(html_text)
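To explore the question above without hitting the network, here is a sketch using a data: URL, which urlopen also accepts; the HTML in it is made up. For an http:// address like the one above, urlopen returns an http.client.HTTPResponse; in both cases the object exposes a read() method that returns bytes, which is why we decode them.

```python
from urllib.request import urlopen

# A data: URL lets us exercise urlopen offline; this HTML is made up
page = urlopen("data:text/html;charset=utf-8,<title>Hello</title>")
print(type(page))  # the response object wrapping the raw bytes

# read() returns bytes, so we decode them into a string
html_text = page.read().decode("utf-8")
print(html_text)  # <title>Hello</title>
```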
Parsing a web page with a string's method
You can use the find() method. Example:

this_is_my_string = 'Programming in python'
string_to_find = input('Enter a string to find in \'%s\': ' % this_is_my_string)
index_found = this_is_my_string.find(string_to_find)
print(index_found)
print(this_is_my_string[index_found])
Find a word between 2 other words

my_string = 'some text with a special word ' \
            '<strong>Equanimity</strong>'
start_tag = "<strong>"
end_tag = "</strong>"
start_index = my_string.find(start_tag) + len(start_tag)
end_index = my_string.find(end_tag)
# We extract the text between
# the last index of the first tag '>'
# and the first index of the second tag '<'
print(my_string[start_index:end_index])
Parsing the title with the find() method

from urllib.request import urlopen
my_address = "http://mattchoplin.com/python_city/" \
             "practice/Profile_Aphrodite.htm"
html_page = urlopen(my_address)
html_text = html_page.read().decode('utf-8')
start_tag = "<title>"
end_tag = "</title>"
start_index = html_text.find(start_tag) + len(start_tag)
end_index = html_text.find(end_tag)
print(html_text[start_index:end_index])
Limitation of the find() method
Try to use the same script for extracting the title of Profile_Poseidon.htm:

from urllib.request import urlopen
my_address = "http://mattchoplin.com/python_city/" \
             "practice/Profile_Poseidon.htm"
html_page = urlopen(my_address)
html_text = html_page.read().decode('utf-8')
start_tag = "<title>"
end_tag = "</title>"
start_index = html_text.find(start_tag) + len(start_tag)
end_index = html_text.find(end_tag)
print(html_text[start_index:end_index])
Limitation of the find() method
Do you see the difference? We are not getting what we want now.
This is because of the extra space before the closing ">" in "<title >". The HTML is still rendered by the browser, but we cannot rely on it completely if we want to parse a web page.

<head><meta http-equiv="Content-Type" content="text/html;charset=windows-1252"><title >Profile: Poseidon
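The failure can be reproduced offline on a made-up snippet containing the odd "<title >" tag: find() returns -1 for the exact tag it cannot locate, so the slice starts in the wrong place.

```python
# Hypothetical snippet mimicking Profile_Poseidon.htm's "<title >" tag
html_text = '<head><title >Profile: Poseidon</title></head>'
start_tag = "<title>"  # note: no space, so find() will not match
end_tag = "</title>"
start_index = html_text.find(start_tag) + len(start_tag)  # find() returned -1
end_index = html_text.find(end_tag)
print(html_text[start_index:end_index])  # '<title >Profile: Poseidon'
```

Because find() silently returned -1, the slice begins 7 characters after index -1 and drags the broken tag into the result.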
Regular expressions
They are used to determine whether or not a text matches a particular pattern. We can use them thanks to the re module in Python. They use special characters to represent patterns: ^, $, *, +, ., etc.
re.findall() using *
The asterisk character * stands for "zero or more" of whatever came just before the asterisk.
re.findall():
- finds any text within a string that matches a given pattern (i.e. regex)
- takes 2 arguments: the 1st is the regex, the 2nd is the string to test
- returns a list of all matches

# re.findall(<regular_expression>, <string_to_test>)
Interactive example

import re
print(re.findall("ab*c", "ac"))
print(re.findall("ab*c", "abcd"))
print(re.findall("ab*c", "acc"))
print(re.findall("ab*c", "abcac"))  # 2 found
print(re.findall("ab*c", "abdc"))   # nothing found
re.findall() case insensitive
Note that re.findall() is case sensitive.
We can use a 3rd argument, re.IGNORECASE, to ignore the case:

re.findall('ab*c', 'ABC')  # nothing found
re.findall('ab*c', 'ABC', re.IGNORECASE)  # ABC found
re.findall() using . (period)
The period . stands for any single character in a regular expression. For instance, we could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

import re
print(re.findall('a.c', 'abc'))
print(re.findall('a.c', 'abbc'))
print(re.findall('a.c', 'ac'))
print(re.findall('a.c', 'acC', re.IGNORECASE))
re.findall() using .* (period asterisk)
The term .* stands for any character being repeated any number of times. For instance, we could find all the strings that start with "a" and end with "c", regardless of what is in between, with:

import re
print(re.findall('a.*c', 'abc'))
print(re.findall('a.*c', 'abbc'))
print(re.findall('a.*c', 'ac'))
print(re.findall('a.*c', 'acc'))
re.search()
re.search():
- searches for a particular pattern inside a string
- returns a MatchObject that stores different "groups" of data
- when we call the group() method on a MatchObject, we get the first and most inclusive result

import re
match_results = re.search('ab*c', 'ABC', re.IGNORECASE)
print(match_results.group())  # returns ABC
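Beyond group(), a match object also records where the match sits in the string, and parentheses in the pattern create the numbered "groups" mentioned above. A short sketch (the strings are made up):

```python
import re

# start() and end() give the matched slice's position
match = re.search('b*c', 'abcd')
print(match.group())  # 'bc' - the whole match
print(match.start())  # 1   - index where the match begins
print(match.end())    # 3   - index just past the match

# Parentheses create capture groups, numbered from 1
match = re.search('<(title)>(.*)</title>', '<title>Profile: Aphrodite</title>')
print(match.group(0))  # the whole match, same as group()
print(match.group(1))  # 'title'
print(match.group(2))  # 'Profile: Aphrodite'
```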
re.sub()
re.sub():
- replaces text in a string that matches a pattern with a substitute (like the replace() string method)
- takes 3 arguments:
  1. regex
  2. replacement text
  3. string to parse

my_string = "This is very boring"
print(my_string.replace('boring', 'funny'))
import re
print(re.sub('boring', 'WHAT?', my_string))
Greedy regex (*)
Greedy expressions try to find the longest possible match when characters like * are used. For instance, in this example the regex finds everything between '<' and '>', which is actually the whole '<replaced> if it is in <tags>':

my_string = 'Everything is <replaced> if it is in <tags>'
my_string = re.sub('<.*>', 'BAR', my_string)
print(my_string)  # 'Everything is BAR'
Non-greedy regex (*?)
*? works the same as * BUT matches the shortest possible string of text:

my_string = 'Everything is <replaced> if it is in <tags>'
my_string = re.sub('<.*?>', 'BAR', my_string)
print(my_string)  # 'Everything is BAR if it is in BAR'
Use case: using regex to parse a web page
We want to extract the title. We will use a regular expression for this case.

Profile_Dionysus.htm
<TITLE >Profile: Dionysus</title />
Use case: solution

import re
from urllib.request import urlopen
my_address = "http://mattchoplin.com/python_city/practice/Profile_Dionysus.htm"
html_page = urlopen(my_address)
html_text = html_page.read().decode('utf-8')
match_results = re.search("<title.*?>.*</title.*?>", html_text, re.IGNORECASE)
title = match_results.group()
title = re.sub("<.*?>", "", title)
print(title)
Use case: explanation
- <title.*?> finds the opening tag, where there may be a space after the word "title" and the tag must be closed, but any characters can appear in the rest of the tag. We use the non-greedy *? because we want the first closing ">" to match the tag's end.
- .* : any character can appear in between the <title> tags
- </title.*?> : same expression as the first part, but with the forward slash representing a closing HTML tag
More on regex: https://docs.python.org/3.5/howto/regex.html
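The steps above can be checked offline on a made-up string with the same oddities (uppercase tag name, stray spaces inside the tags):

```python
import re

# Hypothetical HTML with an odd-case, oddly-spaced title tag
html_text = '<head><TITLE >Profile: Dionysus</title />'

# re.IGNORECASE lets 'title' match 'TITLE'; *? keeps each tag match short
match_results = re.search("<title.*?>.*</title.*?>", html_text, re.IGNORECASE)
title = match_results.group()       # '<TITLE >Profile: Dionysus</title />'
title = re.sub("<.*?>", "", title)  # strip both tags, however they are spelled
print(title)  # Profile: Dionysus
```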
Installing an external library
Sometimes what you need is not included in the Python standard library and you have to install an external library. You are going to use a Python package manager: pip. The packages (libraries) that you can install with pip are listed on https://pypi.python.org/pypi. If you do not have pip, you can use the command "python setup.py install" from the package you would have downloaded and uncompressed from PyPI.
Installing with PyCharm (1)

Installing with PyCharm (2)

Installing with PyCharm (3)
Using BeautifulSoup

from bs4 import BeautifulSoup
from urllib.request import urlopen
my_address = "http://mattchoplin.com/python_city/" \
             "practice/Profile_Dionysus.htm"
html_page = urlopen(my_address)
html_text = html_page.read().decode('utf-8')
my_soup = BeautifulSoup(html_text, "html.parser")
BeautifulSoup: get_text()
get_text() extracts only the text from an HTML document.
There are lots of blank lines left, but we can remove them with the replace() method.
Using BeautifulSoup to extract the text first and then using the find() method is sometimes easier than using regular expressions.

print(my_soup.get_text())
print(my_soup.get_text().replace("\n\n\n", ""))
BeautifulSoup: find_all()
find_all() returns a list of all elements of the particular tag given in argument.
What if the HTML page is broken?

print(my_soup.find_all("img"))
BeautifulSoup: Tags
This is not what we were looking for. The <img> is not properly closed, therefore BeautifulSoup ends up adding a fair amount of HTML after the image tag before inserting a </img> tag on its own. This can happen with real cases.
NB: BeautifulSoup stores HTML tags as Tag objects, and we can extract information from each Tag.

[<img src="dionysus.jpg"/>, <img src="grapes.png"><br><br>Hometown: Mount Olympus<br><br>Favorite animal: Leopard<br><br>Favorite Color: Wine</br></br></br></br></br></br></img>]
BeautifulSoup: extracting information from Tags
Tags:
- have a name
- have attributes, accessible using keys, like when we access the values of a dictionary through its keys

for tag in my_soup.find_all("img"):
    print(tag.name)
    print(tag['src'])
BeautifulSoup: accessing a Tag through its name
The HTML is cleaned up. We can use the string attribute stored by the title.

print(my_soup.title)
print(my_soup.title.string)
The select method (1)
...will return a list of Tag objects, which is how BeautifulSoup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object's HTML.
The select method (2)

Selector passed to the select method    Will match...
soup.select('div')                      All elements named <div>
soup.select('#author')                  The element with an id attribute of author
soup.select('.notice')                  All elements that use a CSS class attribute named notice
soup.select('div span')                 All elements named <span> that are within an element named <div>
soup.select('div > span')               All elements named <span> that are directly within an element named <div>, with no other elements in between
soup.select('input[name]')              All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')     All elements named <input> that have an attribute named type with value button
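A few of these selectors can be tried on a small made-up HTML string (this sketch requires the bs4 package to be installed):

```python
from bs4 import BeautifulSoup

# Made-up HTML to exercise a few of the selectors in the table
html = '''
<div id="author">Matthieu</div>
<div><span class="notice">inner</span></div>
<input type="button" name="go">
'''
soup = BeautifulSoup(html, "html.parser")

print(soup.select('#author')[0].text)            # the element with id "author"
print(soup.select('.notice')[0].text)            # elements with CSS class "notice"
print(soup.select('div span')[0].text)           # <span> inside a <div>
print(len(soup.select('input[type="button"]')))  # <input> tags with type="button"
```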
Emulating a web browser
Sometimes we need to submit information to a web page, like a login page. We need a web browser for that.
MechanicalSoup is an alternative to urllib that can do all the same things but has added functionality that will allow us to talk back to web pages without using a standalone browser. It is perfect for fetching web pages, clicking on buttons and links, and filling out and submitting forms.
Installing MechanicalSoup
You can install it with pip (pip install MechanicalSoup) or within PyCharm (like what we did earlier with BeautifulSoup). You might need to restart your IDE for MechanicalSoup to load and be recognised.
MechanicalSoup: opening a web page
- Create a browser
- Get a web page, which is a Response object
- Access the HTML content with the soup attribute

import mechanicalsoup
my_browser = mechanicalsoup.Browser(soup_config={'features': 'html.parser'})
page = my_browser.get("http://mattchoplin.com/python_city/"
                      "practice/Profile_Aphrodite.htm")
print(page.soup)
MechanicalSoup: submitting values to a form
Have a look at this login page. The important section is the login form. We can see that there is a submission <form> named "login" that includes two <input> tags, one named username and the other one named password. The third <input> is the actual "Submit" button.
MechanicalSoup: script to log in

import mechanicalsoup
my_browser = mechanicalsoup.Browser(soup_config={'features': 'html.parser'})
login_page = my_browser.get("https://whispering-reef-69172.herokuapp.com/login")
login_html = login_page.soup

form = login_html.select("form")[0]
form.select("input")[0]["value"] = "admin"
form.select("input")[1]["value"] = "default"

profiles_page = my_browser.submit(form, login_page.url)
print(profiles_page.url)
print(profiles_page.soup)
Methods in MechanicalSoup
- We created a Browser object
- We called the get method on the Browser object to get a web page
- We used the select() method to grab the form and the input values in it
Interacting with the Web in real time
We want to get data from a website that is constantly updated. We actually want to simulate clicking on the "refresh" button. We can do that with the get method of MechanicalSoup.
Use case: fetching the stock quote from Yahoo finance (1)
Let us identify what is needed.
- What is the source of the data? https://www.bloomberg.com/quote/YHOO:US
- What do we want to extract from this source? The stock price
Use case: fetching the stock quote from Yahoo finance (2)
If we look at the source code, we can see what the tag is for the stock and how to retrieve it:

<div class="price">40.08</div>

We check that <div class="price"> only appears once in the web page, since it will be a way to identify the location of the current price.
MechanicalSoup: script to find the Yahoo current price

import mechanicalsoup
my_browser = mechanicalsoup.Browser()
page = my_browser.get("https://www.bloomberg.com/quote/YHOO:US")
html_text = page.soup
# return a list of all the tags where
# the css class is 'price'
my_tags = html_text.select(".price")
# take the BeautifulSoup string out of the
# first (and only) matching tag
my_price = my_tags[0].text
print("The current price of "
      "YHOO is: {}".format(my_price))
Repeatedly get the Yahoo current price
Now that we know how to get the price of a stock from the Bloomberg web page, we can create a for loop to stay up to date. Note that we should not overload the Bloomberg website with more requests than we need. Also, we should have a look at their robots.txt file to be sure that what we do is allowed.
Introduction to the time.sleep() function
The sleep() function of the time module takes a number of seconds as argument and waits for this number of seconds. It enables you to delay the execution of a statement in the program.

from time import sleep
print("I'm about to wait for five seconds...")
sleep(5)
print("Done waiting!")
Repeatedly get the Yahoo current price: script

from time import sleep
import mechanicalsoup

my_browser = mechanicalsoup.Browser()
# obtain 1 stock quote per minute for the next 3 minutes
for i in range(0, 3):
    page = my_browser.get("https://www.bloomberg.com/quote/YHOO:US")
    html_text = page.soup
    # return a list of all the tags where the class is 'price'
    my_tags = html_text.select(".price")
    # take the BeautifulSoup string out of the first tag
    my_price = my_tags[0].text
    print("The current price of YHOO is: {}".format(my_price))
    if i < 2:  # wait a minute if this isn't the last request
        sleep(60)
Exercise: putting it all together
- Install a new library called requests
- Using the select method of BeautifulSoup, parse (that is, analyze and identify the parts of) the image of the day of http://xkcd.com/
- Using the get method of the requests library, download the image
- Complete the following program: xkcd_incomplete.py
Using requests
You first have to import it:

import requests

If you want to download a web page, use the get() method with a URL as parameter, such as:

res = requests.get(url)

Stop your program if there is an error, with the raise_for_status() method:

res.raise_for_status()
Next? Web crawling!
From Wikipedia: A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.
How do you navigate a website? For example, for the http://xkcd.com/ website, how could you retrieve all of its images?
- Write down how you would design your program
- Write the program
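As a starting point for the design, here is a sketch using only the standard library's html.parser to collect every <img> src from an HTML string; a real crawler would feed it each downloaded page and follow the links it finds in the same way. The HTML below is made up.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""
    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes
        if tag == "img":
            attrs = dict(attrs)
            if "src" in attrs:
                self.images.append(attrs["src"])

collector = ImageCollector()
collector.feed('<html><body><img src="a.png"><p>hi</p>'
               '<img src="b.gif"></body></html>')
print(collector.images)  # ['a.png', 'b.gif']
```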
Solution for Web Crawling