Session 9 - Standing on the shoulders of giants

Introduction to programming using Python

Session 9
Matthieu Choplin

http://moodle.city.ac.uk/
http://mattchoplin.com/python_city/index.html

Objectives

- Quick review of what HTML is
- The find() string method
- Regular expressions
- Installing external libraries
- Using a web parser: BeautifulSoup
- Submitting data to a form using MechanicalSoup
- Fetching data in real time



The HTML language

- HTML is the primary language of information on the internet.
- Every web page is written in HTML.
- To see the source code of the web page you are currently viewing, either right-click and select "View Page Source", or, from the top menu of your browser, click on View and then "View Source".



Example: Profile_Aphrodite.htm (http://mattchoplin.com/python_city/practice/Profile_Aphrodite.htm)

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <title>Profile: Aphrodite</title>
    <link rel="stylesheet" type="text/css">
    </head>
    <body bgcolor="yellow">
    <center>
    <br><br>
    <img src="./Profile_Aphrodite_files/aphrodite.gif">
    <h2>Name: Aphrodite</h2>
    <br><br>
    Favorite animal: Dove
    <br><br>
    Favorite color: Red
    <br><br>
    Hometown: Mount Olympus
    </center>
    </body>
    </html>


Grab all HTML from a web page

    from urllib.request import urlopen

    my_address = "http://mattchoplin.com/python_city/practice/Profile_Aphrodite.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    print(html_text)

What is the type of object that is returned?
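
The read() call hands back raw bytes, which is why the script decodes them before printing. A minimal offline sketch of that step, where the bytes literal is a made-up stand-in for a downloaded page:

```python
# Made-up stand-in for what html_page.read() would return: bytes, not text
raw = b"<title>Profile: Aphrodite</title>"

# decode() turns the bytes into a str that string methods can work on
text = raw.decode("utf-8")

print(type(raw))
print(type(text))
print(text)
```

Printing the two types side by side is a quick way to check which kind of object each step of the script is handling.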



Parsing a web page with a string's method

- You can use the find() method
- Example:

    this_is_my_string = 'Programming in python'
    string_to_find = input('Enter a string to find in \'%s\': ' % this_is_my_string)
    index_found = this_is_my_string.find(string_to_find)
    print(index_found)
    print(this_is_my_string[index_found])
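
One detail worth keeping in mind: find() returns the index of the first occurrence, and -1 when the text is absent. The sketch below checks for that case instead of indexing blindly; the search words are arbitrary examples.

```python
text = 'Programming in python'

for needle in ('python', 'in', 'java'):
    index = text.find(needle)
    if index == -1:
        # find() signals "not found" with -1 rather than raising an error
        print(repr(needle), 'was not found')
    else:
        print(repr(needle), 'first appears at index', index)
```

Without the check, an index of -1 would silently select the last character of the string.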



Find a word between 2 other words

    my_string = 'some text with a special word ' \
                '<strong>Equanimity</strong>'
    start_tag = "<strong>"
    end_tag = "</strong>"
    start_index = my_string.find(start_tag) + len(start_tag)
    end_index = my_string.find(end_tag)
    # We extract the text between
    # the last index of the first tag '>'
    # and the first index of the second tag '<'
    print(my_string[start_index:end_index])



Parsing the title with the find() method

    from urllib.request import urlopen

    my_address = "http://mattchoplin.com/python_city/" \
                 "practice/Profile_Aphrodite.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    start_tag = "<title>"
    end_tag = "</title>"
    start_index = html_text.find(start_tag) + len(start_tag)
    end_index = html_text.find(end_tag)
    print(html_text[start_index:end_index])



Limitation of the find() method

Try to use the same script for extracting the title of Profile_Poseidon.htm (http://mattchoplin.com/python_city/practice/Profile_Poseidon.htm)

    from urllib.request import urlopen

    my_address = "http://mattchoplin.com/python_city/" \
                 "practice/Profile_Poseidon.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    start_tag = "<title>"
    end_tag = "</title>"
    start_index = html_text.find(start_tag) + len(start_tag)
    end_index = html_text.find(end_tag)
    print(html_text[start_index:end_index])


Limitation of the find() method

- Do you see the difference? We are not getting what we want now:

    <head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <title >Profile: Poseidon

- This is because of the extra space before the closing ">" in <title >, so the exact text "<title>" is never found.
- The HTML is still rendered by the browser, but we cannot rely on it being perfectly regular if we want to parse a web page.
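
The failure can be reproduced without a network connection. In this sketch the HTML string is a shortened, made-up stand-in for the Poseidon page, keeping only the extra space inside the title tag:

```python
# Shortened, hypothetical stand-in for the downloaded page
html_text = '<html>\n<head>\n<title >Profile: Poseidon</title>\n</head>\n</html>'

start_tag = "<title>"
end_tag = "</title>"

found = html_text.find(start_tag)
print(found)  # the exact text "<title>" is absent, so find() gives -1

# -1 + len("<title>") is 6, so the slice starts just after "<html>"
start_index = found + len(start_tag)
end_index = html_text.find(end_tag)
print(repr(html_text[start_index:end_index]))
```

The slice is valid Python, so no error is raised; the script simply prints the wrong part of the page.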



Regular expressions

- They are used to determine whether or not a text matches a particular pattern
- We can use them in Python thanks to the re module
- They use special characters to represent patterns: ^, $, *, +, ., etc.
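
As a quick tour of those special characters, the sketch below runs each one through re.findall() on small made-up strings. The variable names are only there to label each pattern.

```python
import re

# ^ anchors the pattern at the start of the string
at_start = re.findall("^ab", "abcab")
# $ anchors the pattern at the end of the string
at_end = re.findall("ab$", "abcab")
# * means zero or more of the preceding character
zero_or_more = re.findall("ab*", "a ab abb")
# + means one or more of the preceding character
one_or_more = re.findall("ab+", "a ab abb")
# . means any single character
any_char = re.findall("a.c", "abc a-c ac")

for result in (at_start, at_end, zero_or_more, one_or_more, any_char):
    print(result)
```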



re.findall() using *

- The asterisk character * stands for "zero or more" of whatever came just before the asterisk
- re.findall():
    - finds any text within a string that matches a given pattern
    - takes 2 arguments: the 1st is the regex, the 2nd is the string to test
    - returns a list of all matches

    # re.findall(<regular_expression>, <string_to_test>)



Interactive example

    import re

    print(re.findall("ab*c", "ac"))
    print(re.findall("ab*c", "abcd"))
    print(re.findall("ab*c", "acc"))
    print(re.findall("ab*c", "abcac"))  # 2 found
    print(re.findall("ab*c", "abdc"))  # nothing found



re.findall() case insensitive

- Note that re.findall() is case sensitive:

    re.findall('ab*c', 'ABC')  # nothing found

- We can use a 3rd argument, re.IGNORECASE, to ignore the case:

    re.findall('ab*c', 'ABC', re.IGNORECASE)  # ABC found



re.findall() using . (period)

- The period . stands for any single character in a regular expression
- For instance, we could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

    import re

    print(re.findall('a.c', 'abc'))
    print(re.findall('a.c', 'abbc'))
    print(re.findall('a.c', 'ac'))
    print(re.findall('a.c', 'acC', re.IGNORECASE))



re.findall() using .* (period asterisk)

- The term .* stands for any character being repeated any number of times
- For instance, we could find all the strings that start with "a" and end with "c", regardless of what is in between, with:

    import re

    print(re.findall('a.*c', 'abc'))
    print(re.findall('a.*c', 'abbc'))
    print(re.findall('a.*c', 'ac'))
    print(re.findall('a.*c', 'acc'))



re.search()

re.search():

- searches for a particular pattern inside a string
- returns a MatchObject that stores different "groups" of data
- when we call the group() method on a MatchObject, we get the first and most inclusive result

    import re

    match_results = re.search('ab*c', 'ABC', re.IGNORECASE)
    print(match_results.group())  # returns ABC
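
When nothing matches, re.search() returns None instead of a match object, so calling group() directly would raise an AttributeError. A small sketch of the usual guard, reusing the same pattern:

```python
import re

# Without re.IGNORECASE the lowercase pattern does not match 'ABC'
strict_result = re.search('ab*c', 'ABC')
relaxed_result = re.search('ab*c', 'ABC', re.IGNORECASE)

for result in (strict_result, relaxed_result):
    if result is None:
        print('no match')
    else:
        print(result.group())
```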



re.sub()

re.sub():

- replaces the text in a string that matches a pattern with a substitute (like the replace() string method)
- takes 3 arguments:
    1. regex
    2. replacement text
    3. string to parse

    import re

    my_string = "This is very boring"
    print(my_string.replace('boring', 'funny'))
    print(re.sub('boring', 'WHAT?', my_string))



Greedy regex (*)

- Greedy expressions try to find the longest possible match when characters like * are used
- For instance, in this example the regex finds everything between the first '<' and the last '>', which is actually the whole '<replaced> if it is in <tags>'

    import re

    my_string = 'Everything is <replaced> if it is in <tags>'
    my_string = re.sub('<.*>', 'BAR', my_string)
    print(my_string)  # 'Everything is BAR'



Non-greedy regex (*?)

- *? works the same as * BUT matches the shortest possible string of text

    import re

    my_string = 'Everything is <replaced> if it is in <tags>'
    my_string = re.sub('<.*?>', 'BAR', my_string)
    print(my_string)  # 'Everything is BAR if it is in BAR'



Use case: Using regex to parse a web page

- We want to extract the title of Profile_Dionysus.htm (http://mattchoplin.com/python_city/practice/Profile_Dionysus.htm):

    <TITLE>Profile: Dionysus</title/>

- We will use regular expressions for this case


Use case: solution

    import re
    from urllib.request import urlopen

    my_address = "http://mattchoplin.com/python_city/practice/Profile_Dionysus.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    match_results = re.search("<title.*?>.*</title.*?>", html_text, re.IGNORECASE)
    title = match_results.group()
    title = re.sub("<.*?>", "", title)
    print(title)



Use case: explanation

- <title.*?> finds the opening tag: the word "title" may be followed by any characters (for example an extra space) and the tag must be closed with ">". We use the non-greedy *? because we want the first closing ">" to match the tag's end
- .* matches any characters that appear between the opening and closing title tags
- </title.*?> is the same expression as the first part, but with the forward slash to represent a closing HTML tag
- More on regex: https://docs.python.org/3.5/howto/regex.html
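
To see how tolerant the pattern is, it can be tried offline on a few title tags. The three sample strings below are made up for illustration (they imitate clean, slightly broken and badly broken markup) and are not copied from the practice pages.

```python
import re

pattern = "<title.*?>.*</title.*?>"

# Hypothetical samples: clean tag, extra space, mixed case with stray characters
samples = [
    "<title>Profile: Aphrodite</title>",
    "<title >Profile: Poseidon</title>",
    "<TITLE >Profile: Dionysus</title / >",
]

titles = []
for html_text in samples:
    match_results = re.search(pattern, html_text, re.IGNORECASE)
    # Strip anything that looks like a tag, leaving only the title text
    titles.append(re.sub("<.*?>", "", match_results.group()))

print(titles)
```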


Installing an external library

- Sometimes what you need is not included in the Python standard library and you have to install an external library
- You are going to use a Python package manager: pip (https://pip.pypa.io/en/latest/installing/)
- The packages (libraries) that you can install with pip are listed on PyPI (https://pypi.python.org/pypi)
- If you do not have pip, you can use the command "python setup.py install" from the package you would have downloaded and uncompressed from PyPI
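
Before installing anything, it can help to check whether a library is already available. The sketch below uses importlib from the standard library; the helper name is_installed is made up for this example, and bs4 and mechanicalsoup are the import names of the two libraries used later in this session.

```python
from importlib.util import find_spec


def is_installed(module_name):
    """Return True if Python can locate the top-level module."""
    return find_spec(module_name) is not None


for name in ("re", "bs4", "mechanicalsoup"):
    if is_installed(name):
        print(name, "is available")
    else:
        print(name, "is missing, install it with pip first")
```

The import name and the name given to pip can differ: bs4 is installed with "pip install beautifulsoup4".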




Installing with PyCharm

[Screenshots: installing a package from within PyCharm]



Using BeautifulSoup

    from bs4 import BeautifulSoup
    from urllib.request import urlopen

    my_address = "http://mattchoplin.com/python_city/" \
                 "practice/Profile_Dionysus.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    my_soup = BeautifulSoup(html_text, "html.parser")



BeautifulSoup: get_text()

- get_text() extracts only the text from an HTML document:

    print(my_soup.get_text())

- There are a lot of blank lines left, but we can remove them with the replace() method:

    print(my_soup.get_text().replace("\n\n\n", ""))

- Using BeautifulSoup to extract the text first and then using the find() method is sometimes easier than using regular expressions



BeautifulSoup: find_all()

- find_all() returns a list of all elements of a particular tag given as argument:

    print(my_soup.find_all("img"))

- What if the HTML page is broken?



BeautifulSoup: Tags

    [<img src="dionysus.jpg"/>, <img src="grapes.png"><br><br>Hometown: Mount Olympus<br><br>Favorite animal: Leopard<br><br>Favorite Color: Wine</br></br></br></br></br></br></img>]

- This is not what we were looking for. The <img> is not properly closed, therefore BeautifulSoup ends up adding a fair amount of HTML after the image tag before inserting a </img> tag on its own. This can happen with real pages.
- NB: BeautifulSoup stores HTML tags as Tag objects and we can extract information from each Tag.



BeautifulSoup: Extracting information from Tags

Tags:

- have a name
- have attributes, accessible using keys, like when we access values of a dictionary through its keys

    for tag in my_soup.find_all("img"):
        print(tag.name)
        print(tag['src'])



BeautifulSoup: accessing a Tag through its name

    print(my_soup.title)

- The HTML is cleaned up
- We can use the string attribute stored by the title:

    print(my_soup.title.string)



The select method (1)

- ... will return a list of Tag objects, which is how BeautifulSoup represents an HTML element
- The list will contain one Tag object for every match in the BeautifulSoup object's HTML



The select method (2)

    Selector passed to the select method    Will match...
    soup.select('div')                      All elements named <div>
    soup.select('#author')                  The element with an id attribute of author
    soup.select('.notice')                  All elements that use a CSS class attribute named notice
    soup.select('div span')                 All elements named <span> that are within an element named <div>
    soup.select('div > span')               All elements named <span> that are directly within an element named <div>, with no other elements in between
    soup.select('input[name]')              All elements named <input> that have a name attribute with any value
    soup.select('input[type="button"]')     All elements named <input> that have an attribute named type with value button
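
The selectors in the table can be tried on a small document. The HTML below is made up for illustration, and the sketch assumes BeautifulSoup (bs4) is installed; it counts how many Tag objects each selector returns.

```python
from bs4 import BeautifulSoup

# Made-up HTML, for illustration only
html = """
<div id="author"><span>Matthieu</span></div>
<div><p class="notice"><span>Read me</span></p></div>
<form><input name="username"><input type="button" value="Go"></form>
"""
soup = BeautifulSoup(html, "html.parser")

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# Map each selector to the number of Tag objects it matches
counts = {selector: len(soup.select(selector)) for selector in selectors}
for selector, count in counts.items():
    print("{:<24} {} match(es)".format(selector, count))
```

Comparing 'div span' with 'div > span' shows the difference between "anywhere inside" and "directly inside": the second <span> sits inside a <p>, so only the first selector reaches it.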



Emulating a web browser

- Sometimes we need to submit information to a web page, like a login page
- We need a web browser for that
- MechanicalSoup (https://github.com/hickford/MechanicalSoup) is an alternative to urllib that can do all the same things but has more added functionality that will allow us to talk back to web pages without using a standalone browser, perfect for fetching web pages, clicking on buttons and links, and filling out and submitting forms


Installing MechanicalSoup

- You can install it with pip: pip install MechanicalSoup
- or within PyCharm (like what we did earlier with BeautifulSoup)
- You might need to restart your IDE for MechanicalSoup to load and be recognised



MechanicalSoup: Opening a web page

- Create a browser
- Get a web page, which is a Response object
- Access the HTML content with the soup attribute

    import mechanicalsoup

    my_browser = mechanicalsoup.Browser(soup_config={'features': 'html.parser'})
    page = my_browser.get("http://mattchoplin.com/python_city/"
                          "practice/Profile_Aphrodite.htm")
    print(page.soup)



MechanicalSoup: Submitting values to a form

- Have a look at this login page (https://whispering-reef-69172.herokuapp.com/login)
- The important section is the login form
- We can see that there is a submission <form> named "login" that includes two <input> tags, one named username and the other one named password
- The third <input> is the actual "Submit" button


MechanicalSoup: script to login

    import mechanicalsoup

    my_browser = mechanicalsoup.Browser(soup_config={'features': 'html.parser'})
    login_page = my_browser.get("https://whispering-reef-69172.herokuapp.com/login")
    login_html = login_page.soup

    form = login_html.select("form")[0]
    form.select("input")[0]["value"] = "admin"
    form.select("input")[1]["value"] = "default"

    profiles_page = my_browser.submit(form, login_page.url)
    print(profiles_page.url)
    print(profiles_page.soup)



Methods in MechanicalSoup

- We created a Browser object
- We called the get method on the Browser object to get a web page
- We used the select() method to grab the form and set the values of its inputs



Interacting with the Web in Real Time

- We want to get data from a website that is constantly updated
- We actually want to simulate clicking on the "refresh" button
- We can do that with the get method of MechanicalSoup



Use case: fetching the Yahoo (YHOO) stock quote (1)

- Let us identify what is needed
- What is the source of the data? https://www.bloomberg.com/quote/YHOO:US
- What do we want to extract from this source? The stock price


Use case: fetching the Yahoo (YHOO) stock quote (2)

- If we look at the source code, we can see what the tag is for the stock price and how to retrieve it:

    <div class="price">40.08</div>

- We check that <div class="price"> only appears once in the web page, since it will be a way to identify the location of the current price



MechanicalSoup: script to find the Yahoo current price

    import mechanicalsoup

    my_browser = mechanicalsoup.Browser()
    page = my_browser.get("https://www.bloomberg.com/quote/YHOO:US")
    html_text = page.soup
    # return a list of all the tags where
    # the css class is 'price'
    my_tags = html_text.select(".price")
    # take the BeautifulSoup string out of the
    # first (and only) <div> tag
    my_price = my_tags[0].text
    print("The current price of "
          "YHOO is: {}".format(my_price))



Repeatedly get the Yahoo current price

- Now that we know how to get the price of a stock from the Bloomberg web page, we can create a for loop to stay up to date
- Note that we should not overload the Bloomberg website with more requests than we need
- We should also have a look at their robots.txt file (https://www.bloomberg.com/robots.txt) to be sure that what we do is allowed
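
Reading a robots.txt file can be automated with urllib.robotparser from the standard library. The sketch below parses a made-up robots.txt held in a string, so no request is sent; the rules and the example.com URLs are hypothetical and do not describe Bloomberg's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
sample_robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 60
"""

parser = RobotFileParser()
parser.parse(sample_robots_txt.splitlines())

# can_fetch() answers: may this user agent request this URL?
quote_allowed = parser.can_fetch("*", "https://www.example.com/quote/YHOO:US")
private_allowed = parser.can_fetch("*", "https://www.example.com/private/data")
# crawl_delay() reports the requested pause between requests, if any
delay = parser.crawl_delay("*")

print(quote_allowed, private_allowed, delay)
```

For a live site, set_url() followed by read() would download the real file instead of parsing a string.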


Introduction to the time.sleep() function

The sleep() function of the time module takes a number of seconds as argument and waits for that number of seconds. It lets us delay the execution of a statement in the program.

    from time import sleep

    print("I'm about to wait for five seconds...")
    sleep(5)
    print("Done waiting!")



Repeatedly get the Yahoo current price: script

    from time import sleep
    import mechanicalsoup

    my_browser = mechanicalsoup.Browser()
    # obtain 1 stock quote per minute for the next 3 minutes
    for i in range(0, 3):
        page = my_browser.get("https://www.bloomberg.com/quote/YHOO:US")
        html_text = page.soup
        # return a list of all the tags where the class is 'price'
        my_tags = html_text.select(".price")
        # take the BeautifulSoup string out of the first tag
        my_price = my_tags[0].text
        print("The current price of YHOO is: {}".format(my_price))
        if i < 2:  # wait a minute if this isn't the last request
            sleep(60)
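
The same "fetch, then wait unless it was the last request" structure can be tried offline by swapping the page request for a stand-in function. In this sketch, poll and fake_prices are hypothetical names introduced for the example, and the delay is shortened so that it runs quickly.

```python
from time import sleep


def poll(fetch, times, delay_seconds, wait=sleep):
    """Call fetch() `times` times, pausing between calls but not after the last."""
    results = []
    for i in range(times):
        results.append(fetch())
        if i < times - 1:
            wait(delay_seconds)
    return results


# Hypothetical stand-in for the real page request
fake_prices = iter(["40.08", "40.10", "40.05"])
prices = poll(lambda: next(fake_prices), times=3, delay_seconds=0.01)
print(prices)
```

Passing the wait function as a parameter keeps the loop easy to test: a list's append method can record the requested delays instead of actually sleeping.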



Exercise: putting it all together

- Install a new library called requests
- Using the select method (http://mattchoplin.com/python_city/session900.html#/beautifulsoup-select) of BeautifulSoup, parse (that is, analyze and identify the parts of) the image of the day of http://xkcd.com/
- Using the get method of the requests library, download the image
- Complete the following program: xkcd_incomplete.py (http://mattchoplin.com/python_city/exercises/xkcd_incomplete.py)


Using requests

- You first have to import it:

    import requests

- If you want to download the web page, use the get() method with a URL as parameter, such as:

    res = requests.get(url)

- Stop your program if there is an error with the raise_for_status() method:

    res.raise_for_status()



Next? Web crawling!

- From Wikipedia: A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.
- How do you navigate a website?
- For example, for the http://xkcd.com/ website, how could you retrieve all of its images?
- Write down how you would design your program
- Write the program
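
One possible design, sketched under stated assumptions rather than given as the course solution: keep a list of pages to visit, collect the images on each page, and follow its links until none are left. The class and function names are made up for this example, the site is a hypothetical three-page dictionary held in memory, and the standard library's html.parser stands in for BeautifulSoup so that nothing needs to be installed or downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndImageParser(HTMLParser):
    """Collect the href of every <a> and the src of every <img>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])


def crawl(start_url, fetch, max_pages=10):
    """Visit pages starting at start_url and return every image URL found."""
    to_visit = [start_url]
    visited = set()
    images = []
    while to_visit and len(visited) < max_pages:
        url = to_visit.pop(0)
        if url in visited:
            continue  # never request the same page twice
        visited.add(url)
        parser = LinkAndImageParser()
        parser.feed(fetch(url))
        # Turn relative addresses into absolute ones before storing them
        images.extend(urljoin(url, src) for src in parser.images)
        to_visit.extend(urljoin(url, href) for href in parser.links)
    return images


# Hypothetical three-page site held in memory, so no request is sent
fake_site = {
    "http://example.com/3/": '<img src="/comics/c.png"><a href="/2/">Prev</a>',
    "http://example.com/2/": '<img src="/comics/b.png"><a href="/1/">Prev</a>',
    "http://example.com/1/": '<img src="/comics/a.png"><a href="/2/">Next</a>',
}
found = crawl("http://example.com/3/", fetch=fake_site.__getitem__)
print(found)
```

On a real site, fetch would download each page (for example with requests), ideally with a pause between requests and a check of the site's robots.txt.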


Solution for Web Crawling


