session 9 - standing on the shoulders of giantsthe html language the primary language of information...

52
Introduction to programming using Python Session 9 Matthieu Choplin [email protected] http://moodle.city.ac.uk/ 1

Upload: others

Post on 30-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 2: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

ObjectivesQuickreviewofwhatHTMLisThefind()stringmethodRegularexpressionsInstallingexternallibrariesUsingawebparser:BeautifulSoupSubmittingdatatoaformusingMechanicalSoupFetchingdatainrealtime

2

Page 3: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

TheHTMLlanguageTheprimarylanguageofinformationontheinternetistheHTMLEverywebpagesarewritteninHTMLToseethesourcecodeofthewebpageyouarecurrentlyseeing,doeitherrightclickandselect"ViewpageSource".Orfromthetopmenuofyourbrowser,clickonViewand"ViewSource".

3

Page 4: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

ExampleProfile_Aphrodite.htm

<html><head><metahttp-equiv="Content-Type"content="text/html;charset=windows-1252"><title>Profile:Aphrodite</title><linkrel="stylesheet"type="text/css"></head><bodybgcolor="yellow"><center><br><br><imgsrc="./Profile_Aphrodite_files/aphrodite.gif"><h2>Name:Aphrodite</h2><br><br>Favoriteanimal:Dove<br><br>Favoritecolor:Red<br><br>Hometown:MountOlympus</center></body></html>

4

Page 5: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Graballhtmlfromawebpage

Whatisthetypeofobjectthatisreturned?

fromurllib.requestimporturlopenmy_address="http://mattchoplin.com/python_city/practice/Profile_Aphrodite.htm"html_page=urlopen(my_address)html_text=html_page.read().decode('utf-8')print(html_text)

5

Page 6: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

ParsingawebpagewithaString'smethodYoucanusethefind()methodExample:

Remix

main.py

this_is_my_string = 'Programming in python'string_to_find = input('Enter a string to find in \'%s\': ' % this_is_my_string)index_found = this_is_my_string.find(string_to_find)print(index_found)print(this_is_my_string[index_found])

12345

6

Page 7: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Findawordbetween2otherwords Remix

main.py

my_string = 'some text with a special word ' \ '<strong>Equanimity</strong>'start_tag = "<strong>"end_tag = "</strong>"start_index = my_string.find(start_tag) + len(start_tag)end_index = my_string.find(end_tag)# We extract the text between # the last index of the first tag '>'# and the first index of the second tag '<'print(my_string[start_index:end_index])

123456789

1011

7

Page 8: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Parsingthetitlewiththefind()methodfromurllib.requestimporturlopenmy_address="http://mattchoplin.com/python_city/"\"practice/Profile_Aphrodite.htm"html_page=urlopen(my_address)html_text=html_page.read().decode('utf-8')start_tag="<title>"end_tag="</title>"start_index=html_text.find(start_tag)+len(start_tag)end_index=html_text.find(end_tag)print(html_text[start_index:end_index])

8

Page 9: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Limitationofthefind()methodTrytousethesamescriptforextractingthetitleofProfile_Poseidon.htm

fromurllib.requestimporturlopenmy_address="http://mattchoplin.com/python_city/"\"practice/Profile_Poseidon.htm"html_page=urlopen(my_address)html_text=html_page.read().decode('utf-8')start_tag="<title>"end_tag="</title>"start_index=html_text.find(start_tag)+len(start_tag)end_index=html_text.find(end_tag)print(html_text[start_index:end_index])

9

Page 10: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Limitationofthefind()methodDoyouseethedifference?Wearenotgettingwhatwewantnow:

Thisisbecauseoftheextraspacebeforetheclosing">"in<title>Thehtmlisstillrenderedbythebrowser,butwecannotrelyonitcompletelyifwewanttoparseawebpage

<head><metahttp-equiv="Content-Type"content="text/html;charset=windows-1252"><title>Profile:Poseidon

10

Page 11: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

RegularexpressionsTheyareusedtodeterminewhetherornotatextmatchesaparticularpatternWecanusethemthankstotheremoduleinpythonTheyusespecialcharacterstorepresentpatterns:^,$,*,+,.,etc...

11

Page 12: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

re.findall()using*Theasteriskcharacter*standsfor"zeroormore"ofwhatevercamejustbeforetheasteriskre.findall():

findsanytextwithinastringthatmatchesagivenpatterni.e.regextakes2arguments,the1stistheregex,the2ndisthestringtotestreturnsalistofallmatches

#re.findall(<regular_expression>,<string_to_test>)

12

Page 13: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Interactiveexample Remix

main.py

import re

print(re.findall("ab*c", "ac"))print(re.findall("ab*c", "abcd"))print(re.findall("ab*c", "acc"))print(re.findall("ab*c", "abcac")) # 2 foundprint(re.findall("ab*c", "abdc")) # nothing found

1234567

13

Page 14: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

re.findall()caseinsensitiveNotethatre.findall()iscasesensitive

Wecanusea3rdargumentre.IGNORECASEtoignorethecase

re.findall('ab*c','ABC')#nothingfound

re.findall('ab*c','ABC',re.IGNORECASE)#ABCfound

14

Page 15: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

re.findall()using.(period)theperiod.standsforanysinglecharacterinaregularexpressionforinstancewecouldfindallthestringsthatcontainstheletters"a"and"c"separatedbyasinglecharacterasfollows:

Remix

main.py

import reprint(re.findall('a.c', 'abc'))print(re.findall('a.c', 'abbc'))print(re.findall('a.c', 'ac'))print(re.findall('a.c', 'acC', re.IGNORECASE))

123456

15

Page 16: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

re.findall()using.*(periodasterisk)theterm.*standsforanycharacterbeingrepeatedanynumberoftimesforinstancewecouldfindallthestringthatstartswith"a"andendswith"c",regardlessofwhatisinbetweenwith:

Remix

main.py

import reprint(re.findall('a.*c', 'abc'))print(re.findall('a.*c', 'abbc'))print(re.findall('a.*c', 'ac'))print(re.findall('a.*c', 'acc'))

12345

16

Page 17: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

re.search()re.search():

searchesforaparticularpatterninsideastringreturnsaMatchObjectthatstoresdifferent"groups"ofdatawhenwecallthegroup()methodonaMatchObject,wegetthefirstandmostinclusiveresult

importrematch_results=re.search('ab*c','ABC',re.IGNORECASE)print(match_results.group())#returnsABC

17

Page 18: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

re.sub()re.sub()

allowstoreplaceatextinastringthatmatchesapatternwithasubstitute(likethereplace()stringmethod)takes3arguments:1. regex2. replacementtext3. stringtoparse

my_string="Thisisveryboring"print(my_string.replace('boring','funny'))importreprint(re.sub('boring','WHAT?',my_string))

18

Page 19: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

greedyregex(*)greedyexpressionstrytofindthelongestpossiblematchwhencharacterlike*areusedforinstance,inthisexampletheregexfindseverythingbetween'<'and'>'whichisactuallythewhole'<replaced>ifitisin<tags>'

my_string='Everythingis<replaced>ifitisin<tags>'my_string=re.sub('<.*>','BAR',my_string)print(my_string)#'EverythingisBAR'

19

Page 20: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

non-greedyregex(*?)*?

worksthesameas*BUTmatchestheshortestpossiblestringoftext

my_string='Everythingis<replaced>ifitisin<tags>'my_string=re.sub('<.*?>','BAR',my_string)print(my_string)#'EverythingisBARifitisinBAR'

20

Page 21: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Usecase:UsingregextoparseawebpageWewanttoextractthetitle:

Wewillusetheregularexpressionforthiscase

Profile_Dionysus.htm

<TITLE>Profile:Dionysus</title/>

21

Page 22: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Usecase:solutionimportrefromurllib.requestimporturlopenmy_address="http://mattchoplin.com/python_city/practice/Profile_Dionysus.htm"html_page=urlopen(my_address)html_text=html_page.read().decode('utf-8')match_results=re.search("<title.*?>.*</title.*?>",html_text,re.IGNORECASE)title=match_results.group()title=re.sub("<.*?>","",title)print(title)

22

Page 23: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Usecase:explanation<title.*?>findstheopeningtagwheretheremustbeaspaceaftertheword"title"andthetagmustbeclosed,butanycharacterscanappearintherestofthetag.Weusethenon-greedy*?,becausewewantthefirstclosing">"tomatchthetag'send.*anycharactercanappearinbetweenthe<title>tag<\title.*?>sameexpressionasthefirstpartbutwiththeforwardslashtorepresentaclosingHTMLtagMoreonregex:https://docs.python.org/3.5/howto/regex.html

23

Page 24: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

InstallinganexternallibrarySometimeswhatyouneedisnotincludedinthepythonstandardlibraryandyouhavetoinstallanexternallibraryYouaregoingtouseapythonpackagemanager:Thepackages(libraries)thatyoucaninstallwithpiparelistedonIfyoudonothavepip,youcanusethecommand"pythonsetup.pyinstall"fromthepackageyouwouldhavedownloadedanduncompressedfrom

pip

https://pypi.python.org/pypi

pypi

24

Page 28: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

UsingBeautifulSoupfrombs4importBeautifulSoupfromurllib.requestimporturlopenmy_address="http://mattchoplin.com/python_city/"\"practice/Profile_Dionysus.htm"html_page=urlopen(my_address)html_text=html_page.read().decode('utf-8')my_soup=BeautifulSoup(html_text,"html.parser")

28

Page 29: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

BeautifulSoup:get_text()get_text()

isextractingonlythetextfromanhtmldocument

therearelotofblanklinesleftbutwecanremovethemwiththemethodreplace()

UsingBeautifulSouptoextractthetextfirstandusethefind()methodissometimeseasierthantouseregularexpressions

print(my_soup.get_text())

print(my_soup.get_text().replace("\n\n\n",""))

29

Page 30: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

BeautifulSoup:find_all()find_all()

returnsalistofallelementsofaparticulartaggiveninargument

WhatiftheHTMLpageisbroken?

print(my_soup.find_all("img"))

30

Page 31: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

BeautifulSoup:Tags

Thisisnotwhatwewerelookingfor.The<img>isnotproperlyclosedthereforeBeautifulSoupendsupaddingafairamountofHTMLaftertheimagetagbeforeinsertinga</img>tagonitsown.Thiscanhappenwithrealcase.NB:BeautifulSoupisstoringHTMLtagsasTagobjectsandwecanextractinformationfromeachTag.

[<imgsrc="dionysus.jpg"/>,<imgsrc="grapes.png"><br><br>Hometown:MountOlympus<br><br>Favoriteanimal:Leopard<br><br>FavoriteColor:Wine</br></br></br></br></br></br></img>]

31

Page 32: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

BeautifulSoup:ExtractinginformationfromTags

Tags:haveanamehaveattributes,accessibleusingkeys,likewhenweaccessvaluesofadictionarythroughitskeys

fortaginmy_soup.find_all("img"):print(tag.name)print(tag['src'])

32

Page 33: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

BeautifulSoup:accessingaTagthroughitsname

TheHTMLiscleanedupWecanusethestringattributesstoredbythetitle

print(my_soup.title)

print(my_soup.title.string)

33

Page 34: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Theselectmethod(1)...willreturnalistofTagobjects,whichishowBeautifulSouprepresentsanHTMLelement.ThelistwillcontainoneTagobjectforeverymatchintheBeautifulSoupobject'sHTML

34

Page 35: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Theselectmethod(2)Selectorpassedtotheselectmethod

Willmatch...

soup.select('div') Allelementsnamed<div>

soup.select('#author') Theelementwithanidattributeofauthor

soup.select('.notice') AllelementsthatuseaCSS

soup.select('divspan') Allelementsnamed<span>thatarewithinanelementnamed<div>

soup.select('div>span') Allelementsnamed<span>thataredirectlywithinanelementnamed<div>,withnootherelementsinbetween

soup.select('input[name]') Allelementsnamed<input>thathaveanameattributewithanyvalue

soup.select('input[type="button"]') Allelementsnamed<input>thathaveanattributenametypewithvaluebutton

35

Page 36: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

EmulatingawebbrowserSometimesweneedtosubmitinformationtoawebpage,likealoginpageWeneedawebbrowserforthat

isanalternativetourllibthatcandoallthesamethingsbuthasmoreaddedfunctionalitythatwillallowustotalkbacktowebpageswithoutusingastandalonebrowser,perfectforfetchingwebpages,clickingonbuttonsandlinks,andfillingoutandsubmittingforms

MechanicalSoup

36

Page 37: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

InstallingMechanicalSoupYoucaninstallitwithpip:pipinstallMechanicalSouporwithinPycharm(likewhatwedidearlierwithBeautifulSoup)YoumightneedtorestartyourIDEforMechanicalSouptoloadandberecognised

37

Page 38: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

MechanicalSoup:OpeningawebpageCreateabrowserGetawebpagewhichisaResponseobjectAccesstheHTMLcontentwiththesoupattribute

importmechanicalsoup

my_browser=mechanicalsoup.Browser(soup_config={'features':'html.parser'})page=my_browser.get("http://mattchoplin.com/python_city/"\"practice/Profile_Aphrodite.htm")print(page.soup)

38

Page 39: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

MechanicalSoup:Submittingvaluestoaform

HavealookatthisTheimportantsectionistheloginformWecanseethatthereisasubmission<form>named"login"thatincludestwo<input>tags,onenamedusernameandtheotheronenamedpassword.Thethird<input>istheactual"Submit"button

loginpage

39

Page 40: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

MechanicalSoup:scripttologinimportmechanicalsoup

my_browser=mechanicalsoup.Browser(soup_config={'features':'html.parser'})login_page=my_browser.get("https://whispering-reef-69172.herokuapp.com/login")login_html=login_page.soup

form=login_html.select("form")[0]form.select("input")[0]["value"]="admin"form.select("input")[1]["value"]="default"

profiles_page=my_browser.submit(form,login_page.url)print(profiles_page.url)print(profiles_page.soup)

40

Page 41: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

MethodsinMechanicalSoupWecreatedaBrowserobjectWecalledthemethodgetontheBrowserobjecttogetawebpageWeusedtheselect()methodtograbtheformandinputvaluesinit

41

Page 42: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

InteractingwiththeWebinRealTimeWewanttogetdatafromawebsitethatisconstantlyupdatedWeactuallywanttosimulateclickingonthe"refresh"buttonWecandothatwiththegetmethodofMechanicalSoup

42

Page 43: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Usecase:fetchingthestockquotefromYahoofinance(1)

LetusidentifywhatisneededWhatisthesourceofthedata?

Whatdowewanttoextractfromthissource?Thestockprice

https://www.bloomberg.com/quote/YHOO:US

43

Page 44: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Usecase:fetchingthestockquotefromYahoofinance(2)

Ifwelookatthesourcecode,wecanseewhatthetagisforthestockandhowtoretrieveit:

Wecheckthat<divclass="price">onlyappearsonceinthewebpagesinceitwillbeawaytoidentifythelocationofthecurrentprice

<divclass="price">40.08</div>

44

Page 45: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

MechanicalSoup:scripttofindYahoocurrentprice

importmechanicalsoup

my_browser=mechanicalsoup.Browser()page=my_browser.get("https://www.bloomberg.com/quote/YHOO:US")html_text=page.soup#returnalistofallthetagswhere#thecssclassis'price'my_tags=html_text.select(".price")#taketheBeautifulSoupstringoutofthe#first(andonly)<span>tagmy_price=my_tags[0].textprint("Thecurrentpriceof""YHOOis:{}".format(my_price))

45

Page 46: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

RepeatedlygettheYahoocurrentpriceNowthatweknowhowtogetthepriceofastockfromtheBloombergwebpage,wecancreateaforlooptostayuptodateNotethatweshouldnotoverloadtheBloombergwebsitewithmorerequeststhanweneed.Andalso,weshouldalsohavealookattheir filetobesurethatwhatwedoisallowed

robots.txt

46

Page 47: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Introductiontothetime.sleep()methodThesleep()methodofthemoduletimetakesanumberofsecondsasargumentandwaitsforthisnumberofseconds,itenablestodelaytheexecutionofastatementintheprogram

fromtimeimportsleepprint"I'mabouttowaitforfiveseconds..."sleep(5)print"Donewaiting!"

47

Page 48: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

RepeatedlygettheYahoocurrentprice:script

fromtimeimportsleepimportmechanicalsoupmy_browser=mechanicalsoup.Browser()#obtain1stockquoteperminuteforthenext3minutesforiinrange(0,3):page=my_browser.get("https://www.bloomberg.com/quote/YHOO:US")html_text=page.soup#returnalistofallthetagswheretheclassis'price'my_tags=html_text.select(".price")#taketheBeautifulSoupstringoutofthefirsttagmy_price=my_tags[0].textprint("ThecurrentpriceofYHOOis:{}".format(my_price))ifi<2:#waitaminuteifthisisn'tthelastrequestsleep(60)

48

Page 49: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Exercise:puttingitalltogetherInstallanewlibrarycalledrequestsUsing ofBeautifulSoup,parse(thatis,analyzeandidentifythepartsof)theimageofthedayof

Usingthegetmethodoftherequestslibrary,downloadtheimageCompletethefollowingprogram

theselectmethod

http://xkcd.com/

xkcd_incomplete.py

49

Page 50: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

UsingrequestYoufirsthavetoimportit

Ifyouwanttodownloadthewebpage,usetheget()methodwithaurlinparameter,suchas:

Stopyourprogramifthereisanerrorwiththeraise_for_status()method

importrequests

res=requests.get(url)

res.raise_for_status()

50

Page 51: Session 9 - Standing on the shoulders of giantsThe HTML language The primary language of information on the internet is the HTML Every webpages are written in HTML To see the source

Next?Webcrawling!FromWikipedia:AWebcrawlerisanInternetbotwhichsystematicallybrowsestheWorldWideWeb,typicallyforthepurposeofWebindexing.Howdoyounavigateawebsite?Forexample,forthe

website,howcouldyouretrieveallofitsimages?WritedownhowyouwoulddesignyourprogramWritetheprogram

http://xkcd.com/

51