Session 9 - Standing on the shoulders of giants

Introduction to programming using Python

Session 9
Matthieu Choplin

matthieu.choplin@city.ac.uk

http://moodle.city.ac.uk/

Objectives
- Quick review of what HTML is
- The find() string method
- Regular expressions
- Installing external libraries
- Using a web parser: BeautifulSoup
- Submitting data to a form using MechanicalSoup
- Fetching data in real time

The HTML language
- The primary language of information on the internet is HTML
- Every web page is written in HTML
- To see the source code of the web page you are currently viewing, either right-click and select "View Page Source", or, from the top menu of your browser, click on View and "View Source".

Example: Profile_Aphrodite.htm

    <html>
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
    <title>Profile: Aphrodite</title>
    <link rel="stylesheet" type="text/css">
    </head>
    <body bgcolor="yellow">
    <center>
    <br><br>
    <img src="./Profile_Aphrodite_files/aphrodite.gif">
    <h2>Name: Aphrodite</h2>
    <br><br>
    Favorite animal: Dove
    <br><br>
    Favorite color: Red
    <br><br>
    Hometown: Mount Olympus
    </center>
    </body>
    </html>

Grab all HTML from a web page

    from urllib.request import urlopen
    my_address = "http://mattchoplin.com/python_city/practice/Profile_Aphrodite.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    print(html_text)

What is the type of object that is returned?

Parsing a web page with a string method
You can use the find() method. Example:

main.py

    this_is_my_string = 'Programming in python'
    string_to_find = input('Enter a string to find in \'%s\': ' % this_is_my_string)
    index_found = this_is_my_string.find(string_to_find)
    print(index_found)
    print(this_is_my_string[index_found])
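One detail of the example above is worth a closer look: find() returns -1 when the substring is absent, and indexing a string with -1 quietly returns its last character. The sketch below shows one possible way to guard against that; the helper name describe_find is hypothetical and only used for illustration.

```python
def describe_find(text, target):
    """Return a short message saying where target occurs in text."""
    index = text.find(target)
    if index == -1:
        # find() signals "not found" with -1 instead of raising an error
        return "'{}' not found".format(target)
    return "'{}' found at index {}".format(target, index)


print(describe_find("Programming in python", "python"))
print(describe_find("Programming in python", "java"))
```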

Find a word between 2 other words

main.py

    my_string = 'some text with a special word ' \
                '<strong>Equanimity</strong>'
    start_tag = "<strong>"
    end_tag = "</strong>"
    start_index = my_string.find(start_tag) + len(start_tag)
    end_index = my_string.find(end_tag)
    # We extract the text between
    # the last index of the first tag '>'
    # and the first index of the second tag '<'
    print(my_string[start_index:end_index])
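The same steps can be wrapped in a small reusable function. This is only a sketch: the name extract_between is an assumption, and returning None when a tag is missing is one design choice among several.

```python
def extract_between(text, start_tag, end_tag):
    """Return the text between start_tag and end_tag, or None if either is missing."""
    start = text.find(start_tag)
    if start == -1:
        return None
    start += len(start_tag)
    # Look for the end tag only after the start tag
    end = text.find(end_tag, start)
    if end == -1:
        return None
    return text[start:end]


print(extract_between("a <strong>Equanimity</strong> b", "<strong>", "</strong>"))
print(extract_between("no tags here", "<strong>", "</strong>"))
```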

Parsing the title with the find() method

    from urllib.request import urlopen
    my_address = "http://mattchoplin.com/python_city/" \
                 "practice/Profile_Aphrodite.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    start_tag = "<title>"
    end_tag = "</title>"
    start_index = html_text.find(start_tag) + len(start_tag)
    end_index = html_text.find(end_tag)
    print(html_text[start_index:end_index])

Limitation of the find() method
Try to use the same script for extracting the title of Profile_Poseidon.htm

    from urllib.request import urlopen
    my_address = "http://mattchoplin.com/python_city/" \
                 "practice/Profile_Poseidon.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    start_tag = "<title>"
    end_tag = "</title>"
    start_index = html_text.find(start_tag) + len(start_tag)
    end_index = html_text.find(end_tag)
    print(html_text[start_index:end_index])

Limitation of the find() method
Do you see the difference? We are not getting what we want now:

    <head><meta http-equiv="Content-Type" content="text/html; charset=windows-1252"><title >Profile: Poseidon

- This is because of the extra space before the closing ">" in <title >
- The HTML is still rendered by the browser, but we cannot rely on it completely if we want to parse a web page
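The effect can be reproduced without a network connection. In the sketch below the two HTML strings are simplified stand-ins written for this illustration, not the real pages: when "<title>" is not present verbatim, find() returns -1, so the slice starts at index -1 + 7 = 6 and picks up unrelated markup.

```python
good_html = "<head><title>Profile: Aphrodite</title></head>"
odd_html = "<head><title >Profile: Poseidon</title></head>"

start_tag = "<title>"
end_tag = "</title>"

results = []
for html_text in (good_html, odd_html):
    # tag_index is -1 for odd_html because of the extra space in "<title >"
    tag_index = html_text.find(start_tag)
    start_index = tag_index + len(start_tag)
    end_index = html_text.find(end_tag)
    results.append(html_text[start_index:end_index])
    print(tag_index, repr(results[-1]))
```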

Regular expressions
- They are used to determine whether or not a text matches a particular pattern
- We can use them thanks to the re module in Python
- They use special characters to represent patterns: ^, $, *, +, ., etc.

re.findall() using *
- The asterisk character * stands for "zero or more" of whatever came just before the asterisk
- re.findall():
  - finds any text within a string that matches a given pattern, i.e. a regex
  - takes 2 arguments: the 1st is the regex, the 2nd is the string to test
  - returns a list of all matches

    # re.findall(<regular_expression>, <string_to_test>)

Interactive example

main.py

    import re

    print(re.findall("ab*c", "ac"))
    print(re.findall("ab*c", "abcd"))
    print(re.findall("ab*c", "acc"))
    print(re.findall("ab*c", "abcac"))  # 2 found
    print(re.findall("ab*c", "abdc"))  # nothing found

re.findall() case insensitive
- Note that re.findall() is case sensitive

    re.findall('ab*c', 'ABC')  # nothing found

- We can use a 3rd argument re.IGNORECASE to ignore the case

    re.findall('ab*c', 'ABC', re.IGNORECASE)  # ABC found

re.findall() using . (period)
- the period . stands for any single character in a regular expression
- for instance we could find all the strings that contain the letters "a" and "c" separated by a single character as follows:

main.py

    import re
    print(re.findall('a.c', 'abc'))
    print(re.findall('a.c', 'abbc'))
    print(re.findall('a.c', 'ac'))
    print(re.findall('a.c', 'acC', re.IGNORECASE))

re.findall() using .* (period asterisk)
- the term .* stands for any character being repeated any number of times
- for instance we could find all the strings that start with "a" and end with "c", regardless of what is in between, with:

main.py

    import re
    print(re.findall('a.*c', 'abc'))
    print(re.findall('a.*c', 'abbc'))
    print(re.findall('a.*c', 'ac'))
    print(re.findall('a.*c', 'acc'))

re.search()
re.search():
- searches for a particular pattern inside a string
- returns a match object that stores different "groups" of data
- when we call the group() method on a match object, we get the first and most inclusive result

    import re
    match_results = re.search('ab*c', 'ABC', re.IGNORECASE)
    print(match_results.group())  # returns ABC
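When the pattern does not occur at all, re.search() returns None, and calling group() on None raises an AttributeError. A minimal sketch of handling that case follows; the helper name first_match is hypothetical.

```python
import re


def first_match(pattern, text):
    """Return the first case-insensitive match of pattern in text, or None."""
    match_results = re.search(pattern, text, re.IGNORECASE)
    if match_results is None:
        # No match: there is no match object to call group() on
        return None
    return match_results.group()


print(first_match("ab*c", "xxABCxx"))
print(first_match("ab*c", "xyz"))
```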

re.sub()
re.sub()
- allows us to replace text in a string that matches a pattern with a substitute (like the replace() string method)
- takes 3 arguments:
  1. regex
  2. replacement text
  3. string to parse

    import re
    my_string = "This is very boring"
    print(my_string.replace('boring', 'funny'))
    print(re.sub('boring', 'WHAT?', my_string))

greedy regex (*)
- greedy expressions try to find the longest possible match when characters like * are used
- for instance, in this example the regex finds everything between '<' and '>', which is actually the whole '<replaced> if it is in <tags>'

    import re
    my_string = 'Everything is <replaced> if it is in <tags>'
    my_string = re.sub('<.*>', 'BAR', my_string)
    print(my_string)  # 'Everything is BAR'

non-greedy regex (*?)
*?
- works the same as * BUT matches the shortest possible string of text

    import re
    my_string = 'Everything is <replaced> if it is in <tags>'
    my_string = re.sub('<.*?>', 'BAR', my_string)
    print(my_string)  # 'Everything is BAR if it is in BAR'
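To see exactly what each pattern matches before anything is replaced, the two patterns can be compared with re.findall() on the same string. This is a small side-by-side sketch using the string from the slides.

```python
import re

my_string = 'Everything is <replaced> if it is in <tags>'

# Greedy: one long match from the first '<' to the last '>'
greedy = re.findall('<.*>', my_string)
# Non-greedy: each tag is matched separately
non_greedy = re.findall('<.*?>', my_string)

print(greedy)
print(non_greedy)
```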

Use case: Using regex to parse a web page
- We want to extract the title of Profile_Dionysus.htm:

    <TITLE>Profile: Dionysus</title/>

- We will use regular expressions for this case

Use case: solution

    import re
    from urllib.request import urlopen
    my_address = "http://mattchoplin.com/python_city/practice/Profile_Dionysus.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    match_results = re.search("<title.*?>.*</title.*?>", html_text, re.IGNORECASE)
    title = match_results.group()
    title = re.sub("<.*?>", "", title)
    print(title)

Use case: explanation
- <title.*?> finds the opening tag: any characters (for instance a space) may appear after the word "title", and the tag must be closed with ">". We use the non-greedy *?, because we want the first closing ">" to match the tag's end
- .* any characters can appear in between the <title> tags
- </title.*?> same expression as the first part but with the forward slash to represent a closing HTML tag

More on regex: https://docs.python.org/3.5/howto/regex.html
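A variation on the solution, offered as a sketch and not as the course's method: a capture group, written with parentheses, lets re.search() return the title text directly, so the second re.sub() step is not needed. The HTML string below is a made-up stand-in with deliberately sloppy tags, not the content of the real page.

```python
import re

# Hypothetical markup with untidy opening and closing title tags
html_text = "<html><head><TITLE >Profile: Dionysus</title  / ></head></html>"

# (.*?) captures only what sits between the two tags
pattern = r"<title.*?>(.*?)</title.*?>"
match_results = re.search(pattern, html_text, re.IGNORECASE | re.DOTALL)

title = match_results.group(1).strip() if match_results else None
print(title)
```

group(1) refers to the first pair of parentheses in the pattern, and re.DOTALL lets the period also match newlines in case a title spans several lines.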

Installing an external library
- Sometimes what you need is not included in the Python standard library and you have to install an external library
- You are going to use a Python package manager: pip
- The packages (libraries) that you can install with pip are listed on https://pypi.python.org/pypi
- If you do not have pip, you can use the command "python setup.py install" from the package you would have downloaded and uncompressed from https://pypi.python.org/pypi

Installing with PyCharm (1)

Using BeautifulSoup

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    my_address = "http://mattchoplin.com/python_city/" \
                 "practice/Profile_Dionysus.htm"
    html_page = urlopen(my_address)
    html_text = html_page.read().decode('utf-8')
    my_soup = BeautifulSoup(html_text, "html.parser")

BeautifulSoup: get_text()
get_text()
- extracts only the text from an HTML document

    print(my_soup.get_text())

- there are a lot of blank lines left, but we can remove them with the replace() method

    print(my_soup.get_text().replace("\n\n\n", ""))

- Using BeautifulSoup to extract the text first and then using the find() method is sometimes easier than using regular expressions
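Replacing exactly three newlines only works when the blank runs happen to have that length. One alternative, sketched here on a small made-up HTML string (not the real page), is to split the text into lines and keep only the non-empty ones.

```python
from bs4 import BeautifulSoup

# Made-up document with irregular runs of blank lines
html_text = """<html><head><title>Profile: Dionysus</title></head>
<body>

<h2>Name: Dionysus</h2>


Hometown: Mount Olympus

</body></html>"""

my_soup = BeautifulSoup(html_text, "html.parser")

# Strip each line and drop the ones that end up empty
lines = [line.strip() for line in my_soup.get_text().splitlines()]
clean_text = "\n".join(line for line in lines if line)
print(clean_text)
```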

BeautifulSoup: find_all()
find_all()
- returns a list of all elements of a particular tag given as argument

    print(my_soup.find_all("img"))

- What if the HTML page is broken?

BeautifulSoup: Tags

    [<img src="dionysus.jpg"/>, <img src="grapes.png"><br><br>Hometown: Mount Olympus<br><br>Favorite animal: Leopard<br><br>Favorite Color: Wine</br></br></br></br></br></br></img>]

- This is not what we were looking for. The <img> is not properly closed, therefore BeautifulSoup ends up adding a fair amount of HTML after the image tag before inserting a </img> tag on its own. This can happen with real cases.
- NB: BeautifulSoup stores HTML tags as Tag objects and we can extract information from each Tag.

BeautifulSoup: Extracting information from Tags
Tags:
- have a name
- have attributes, accessible using keys, like when we access values of a dictionary through its keys

    for tag in my_soup.find_all("img"):
        print(tag.name)
        print(tag['src'])
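Like a dictionary, a Tag raises KeyError when the requested attribute is missing, and like a dictionary it offers get() as a forgiving alternative. The sketch below uses a made-up snippet in which the second image has no src attribute.

```python
from bs4 import BeautifulSoup

# Made-up snippet: the second <img> deliberately lacks a src attribute
html_text = '<p><img src="dionysus.jpg" alt="Dionysus"><img alt="no source here"></p>'
my_soup = BeautifulSoup(html_text, "html.parser")

sources = []
for tag in my_soup.find_all("img"):
    # tag["src"] would raise KeyError for the second image;
    # tag.get("src") returns None instead
    sources.append(tag.get("src"))
    print(tag.name, tag.attrs)

print(sources)
```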

BeautifulSoup: accessing a Tag through its name

    print(my_soup.title)

- The HTML is cleaned up
- We can use the string attribute stored by the title

    print(my_soup.title.string)

The select method (1)
... will return a list of Tag objects, which is how BeautifulSoup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object's HTML

The select method (2)

Selector passed to the select method    Will match...
soup.select('div')                      All elements named <div>
soup.select('#author')                  The element with an id attribute of author
soup.select('.notice')                  All elements that use a CSS class attribute named notice
soup.select('div span')                 All elements named <span> that are within an element named <div>
soup.select('div > span')               All elements named <span> that are directly within an element named <div>, with no other elements in between
soup.select('input[name]')              All elements named <input> that have a name attribute with any value
soup.select('input[type="button"]')     All elements named <input> that have an attribute named type with value button
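The selectors in the table can be tried on a small document. The HTML below is invented for this sketch so that every selector has something to match; the code counts how many Tag objects each call to select() returns.

```python
from bs4 import BeautifulSoup

# Invented markup covering every selector from the table
html_text = """
<div id="author">
  <span>Direct child</span>
  <p class="notice"><span>Nested deeper</span></p>
</div>
<form>
  <input name="username" type="text">
  <input type="button" value="Go">
</form>
"""
soup = BeautifulSoup(html_text, "html.parser")

selectors = ["div", "#author", ".notice", "div span", "div > span",
             "input[name]", 'input[type="button"]']
# Map each selector to the number of elements it matches
counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print("{:<24}{}".format(selector, count))
```

Note the difference between 'div span', which matches both spans, and 'div > span', which matches only the span that is a direct child of the div.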

Emulating a web browser
- Sometimes we need to submit information to a web page, like a login page
- We need a web browser for that
- MechanicalSoup is an alternative to urllib that can do all the same things but has more added functionality that will allow us to talk back to web pages without using a standalone browser, perfect for fetching web pages, clicking on buttons and links, and filling out and submitting forms

Installing MechanicalSoup
- You can install it with pip: pip install MechanicalSoup
- or within PyCharm (like what we did earlier with BeautifulSoup)
- You might need to restart your IDE for MechanicalSoup to load and be recognised

MechanicalSoup: Opening a web page
- Create a browser
- Get a web page, which is a Response object
- Access the HTML content with the soup attribute

    import mechanicalsoup

    my_browser = mechanicalsoup.Browser(
        soup_config={'features': 'html.parser'}
    )
    page = my_browser.get("http://mattchoplin.com/python_city/"
                          "practice/Profile_Aphrodite.htm")
    print(page.soup)

MechanicalSoup: Submitting values to a form
- Have a look at this login page
- The important section is the login form
- We can see that there is a submission <form> named "login" that includes two <input> tags, one named username and the other one named password
- The third <input> is the actual "Submit" button

MechanicalSoup: script to login

    import mechanicalsoup

    my_browser = mechanicalsoup.Browser(
        soup_config={'features': 'html.parser'}
    )
    login_page = my_browser.get(
        "https://whispering-reef-69172.herokuapp.com/login"
    )
    login_html = login_page.soup

    form = login_html.select("form")[0]
    form.select("input")[0]["value"] = "admin"
    form.select("input")[1]["value"] = "default"

    profiles_page = my_browser.submit(form, login_page.url)
    print(profiles_page.url)
    print(profiles_page.soup)
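Picking the inputs by position works as long as the form never changes order. Since the inputs are described as being named username and password, an attribute selector can target them by name instead. The sketch below uses BeautifulSoup alone on a made-up form, so it runs without a network connection; the exact markup of the real login page is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up form modelled on the description of the login page
login_html = BeautifulSoup("""
<form name="login" action="/login" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <input type="submit" value="Submit">
</form>
""", "html.parser")

form = login_html.select("form")[0]
# Select each field by its name attribute instead of its position
form.select('input[name="username"]')[0]["value"] = "admin"
form.select('input[name="password"]')[0]["value"] = "default"

# Collect the values of all named inputs to check what would be sent
filled = {tag["name"]: tag["value"] for tag in form.select("input[name]")}
print(filled)
```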

Methods in MechanicalSoup
- We created a Browser object
- We called the method get on the Browser object to get a web page
- We used the select() method to grab the form and input values in it

Interacting with the Web in Real Time
- We want to get data from a website that is constantly updated
- We actually want to simulate clicking on the "refresh" button
- We can do that with the get method of MechanicalSoup

Use case: fetching the stock quote from Yahoo finance (1)
- Let us identify what is needed
- What is the source of the data? https://www.bloomberg.com/quote/YHOO:US
- What do we want to extract from this source? The stock price

Use case: fetching the stock quote from Yahoo finance (2)
- If we look at the source code, we can see what the tag is for the stock and how to retrieve it:

    <div class="price">40.08</div>

- We check that <div class="price"> only appears once in the web page since it will be a way to identify the location of the current price

MechanicalSoup: script to find Yahoo current price

    import mechanicalsoup

    my_browser = mechanicalsoup.Browser()
    page = my_browser.get("https://www.bloomberg.com/quote/YHOO:US")
    html_text = page.soup
    # return a list of all the tags where
    # the css class is 'price'
    my_tags = html_text.select(".price")
    # take the BeautifulSoup string out of the
    # first (and only) <div class="price"> tag
    my_price = my_tags[0].text
    print("The current price of "
          "YHOO is: {}".format(my_price))

Repeatedly get the Yahoo current price
- Now that we know how to get the price of a stock from the Bloomberg web page, we can create a for loop to stay up to date
- Note that we should not overload the Bloomberg website with more requests than we need
- We should also have a look at their robots.txt file to be sure that what we do is allowed
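The standard library can read robots.txt rules for us through urllib.robotparser. The sketch below parses two made-up rules given as a list of lines, so it needs no network access; the rules, the example.com addresses and the "my-script" user agent are all assumptions for illustration and say nothing about what any real site allows.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules; a real site's robots.txt will differ
robots_lines = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(robots_lines)

# Ask whether a given user agent may fetch a given URL
allowed_quote = parser.can_fetch("my-script", "https://www.example.com/quote/YHOO:US")
allowed_private = parser.can_fetch("my-script", "https://www.example.com/private/data")
print(allowed_quote, allowed_private)
```

Against a live site, set_url() followed by read() downloads and parses the file instead of calling parse() on a list.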

Introduction to the time.sleep() function
The sleep() function of the time module takes a number of seconds as argument and waits for this number of seconds; it lets us delay the execution of a statement in the program

    from time import sleep
    print("I'm about to wait for five seconds...")
    sleep(5)
    print("Done waiting!")

Repeatedly get the Yahoo current price: script

    from time import sleep
    import mechanicalsoup

    my_browser = mechanicalsoup.Browser()
    # obtain 1 stock quote per minute for the next 3 minutes
    for i in range(0, 3):
        page = my_browser.get("https://www.bloomberg.com/quote/YHOO:US")
        html_text = page.soup
        # return a list of all the tags where the class is 'price'
        my_tags = html_text.select(".price")
        # take the BeautifulSoup string out of the first tag
        my_price = my_tags[0].text
        print("The current price of YHOO is: {}".format(my_price))
        if i < 2:  # wait a minute if this isn't the last request
            sleep(60)
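The fetch-then-wait pattern in this script can be separated from the web request itself, which makes it possible to try the loop without a network connection or a real delay. In the sketch below, poll, fetch and wait are hypothetical names, and the prices are made-up stand-in values.

```python
from time import sleep


def poll(fetch, times, delay_seconds, wait=sleep):
    """Call fetch() `times` times, waiting delay_seconds between calls."""
    results = []
    for i in range(times):
        results.append(fetch())
        if i < times - 1:
            # no need to wait after the last request
            wait(delay_seconds)
    return results


# Demo with a stand-in fetch function and no real waiting:
# waits.append records each requested delay instead of sleeping
fake_prices = iter(["40.08", "40.10", "40.05"])
waits = []
prices = poll(lambda: next(fake_prices), times=3, delay_seconds=60,
              wait=waits.append)
print(prices, waits)
```

In a real run, fetch would wrap the MechanicalSoup request and the default wait=sleep would pause for a full minute between requests.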

Exercise: putting it all together
- Install a new library called requests
- Using the select method of BeautifulSoup, parse (that is, analyze and identify the parts of) the image of the day of http://xkcd.com/
- Using the get method of the requests library, download the image
- Complete the following program: xkcd_incomplete.py

Using requests
- You first have to import it

    import requests

- If you want to download the web page, use the get() method with a URL as parameter, such as:

    res = requests.get(url)

- Stop your program if there is an error with the raise_for_status() method

    res.raise_for_status()

Next? Web crawling!
- From Wikipedia: A Web crawler is an Internet bot which systematically browses the World Wide Web, typically for the purpose of Web indexing.
- How do you navigate a website?
- For example, for the http://xkcd.com/ website, how could you retrieve all of its images?
- Write down how you would design your program
- Write the program
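As a starting point for the design, here is one possible skeleton for the navigation part only, not the course's solution: visit a page, collect its links, and queue the ones not seen before. The function name crawl, the get_links parameter and the tiny in-memory "website" are all hypothetical; in a real crawler get_links would download the page and extract its links with BeautifulSoup.

```python
from collections import deque


def crawl(start_url, get_links, max_pages=50):
    """Visit pages breadth-first from start_url; return URLs in visit order."""
    visited = []
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                # remember the link so it is queued only once
                seen.add(link)
                queue.append(link)
    return visited


# A tiny in-memory "website" standing in for real pages (hypothetical URLs)
site = {
    "/3/": ["/2/"],
    "/2/": ["/1/", "/3/"],
    "/1/": [],
}
order = crawl("/3/", lambda url: site.get(url, []))
print(order)
```

The seen set prevents the crawler from looping forever between pages that link to each other, and max_pages keeps the number of requests bounded.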

Solution for web crawling
