R and Collecting Internet Data - University of California, Berkeley
Motivation Tools we need to learn
R and Collecting Internet Data
Luis F. Campos
Department of Statistics, University of California, Berkeley
February 11, 2011 - 4-5 PM - 1011 Evans Hall
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
Outline

1 Motivation
- Internet Movie Database (IMDb)
- Music Databases
- Yahoo! Sports
- Main example: Superbowl - Play by Play Data

2 Tools we need to learn
- XML/HTML
- XML Path Query Language (XPath QL)
- R Programming Language
- R Package: XML
- Regular Expressions (regex)
- Demonstration/Resources
Internet Movie Database (IMDb)
Music Databases
Yahoo! Sports
Main example: Superbowl - Play by Play Data
What is actually going on in your browser?
XML/HTML
What is XML?
XML stands for eXtensible Markup Language
- markup languages: provide annotations for structure (marks) that are distinguishable from text (examples: LaTeX, HTML, PostScript)
- extensible: you can create your own marks

In XML, the structure markers are angle brackets: < >
- HTML can be considered a close relative of XML: both descend from SGML, and XHTML is HTML rewritten to follow XML rules
- HTML (HyperText Markup Language) is currently the predominant markup language
What is HTML? Simply, a set of predetermined structural markers.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
      <head>
        <title>The document title</title>
      </head>
      <body>
        <h1>Main heading</h1>
        <p>A paragraph.</p>
        <a href = "www.stat.berkeley.edu">Statistics Website</a>
      </body>
    </html>
Another useful way to view an HTML document is as a tree.
We call the boxes nodes and the arrows edges.
An edge goes from a to b if:

    <a>
      <b></b>
    </a>

Note: there is a unique path from the root node to any given node.
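
As a rough illustration of the tree view, the sketch below parses a small HTML page and lists its parent-to-child edges. It assumes the XML package is installed; the page is modeled on the earlier HTML example, and `list_edges` is a hypothetical helper name, not a function from the package.

```r
library(XML)  # assumption: the XML package is installed

# A small page, modeled on the HTML example from the earlier slide
html <- '<html>
  <head><title>The document title</title></head>
  <body>
    <h1>Main heading</h1>
    <p>A paragraph.</p>
    <a href="www.stat.berkeley.edu">Statistics Website</a>
  </body>
</html>'

doc  <- htmlParse(html, asText = TRUE)
root <- xmlRoot(doc)

# Hypothetical helper: walk the tree and record one "parent -> child"
# string per edge, skipping text nodes so only element nodes are listed
list_edges <- function(node) {
  kids <- xmlChildren(node)
  is_element <- vapply(kids,
                       function(k) inherits(k, "XMLInternalElementNode"),
                       logical(1))
  edges <- character(0)
  for (kid in kids[is_element]) {
    edges <- c(edges,
               paste(xmlName(node), "->", xmlName(kid)),
               list_edges(kid))
  }
  edges
}

edges <- list_edges(root)
print(edges)
```

Each string in `edges` corresponds to one arrow in the tree picture, and following the arrows back from any node leads to `html`, the root.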
Each node can have the following components associated with it:
- a required name for the node, not necessarily unique
- any number of attributes
- optional text

<a href = "www.stat.berkeley.edu">Statistics Website</a>

- nodename: a
- attribute: href with value "www.stat.berkeley.edu"
- text: "Statistics Website"
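
The three parts of a node can be read off in R. This is a minimal sketch assuming the XML package is installed; the variable names are illustrative only.

```r
library(XML)  # assumption: the XML package is installed

# The anchor from the slide is well-formed XML on its own, so it can be
# parsed directly; its single element becomes the root node
doc    <- xmlParse('<a href = "www.stat.berkeley.edu">Statistics Website</a>',
                   asText = TRUE)
anchor <- xmlRoot(doc)

node_name <- xmlName(anchor)             # name of the node
node_attr <- xmlGetAttr(anchor, "href")  # value of one attribute
node_text <- xmlValue(anchor)            # text inside the node

print(c(name = node_name, href = node_attr, text = node_text))
```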
XML Path Query Language (XPath QL)
XPath is a query language that allows you to retrieve specific nodes in an XML/HTML tree. Some of the symbols we can use to represent a path on the tree:
- / finds the root node
- // selects from anywhere in the tree
- . selects current node
- .. selects parent of current node
- @ selects attributes
- nodename finds location(s) of the named node
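
To make the symbols concrete, here is a small sketch that tries each one from R. It assumes the XML package is installed, and the toy document (html > body > div > a) is made up for illustration.

```r
library(XML)  # assumption: the XML package is installed

# Assumed toy document: html > body > div > a
html <- '<html><body><div><a href="stat.berkeley.edu">stats</a></div></body></html>'
doc  <- htmlParse(html, asText = TRUE)

# "/"  : start at the root of the tree
root_name <- xpathSApply(doc, "/html", xmlName)

# "//" : match the named node anywhere in the tree
link_text <- xpathSApply(doc, "//a", xmlValue)

# ".." : step from a node up to its parent
parent_name <- xpathSApply(doc, "//a/..", xmlName)

# "@"  : select an attribute rather than a node
link_href <- as.character(xpathSApply(doc, "//a/@href"))

# "."  : the current node, useful when the query starts from a node
div_node   <- getNodeSet(doc, "//div")[[1]]
child_text <- xpathSApply(div_node, "./a", xmlValue)

print(c(root_name, link_text, parent_name, link_href, child_text))
```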
To traverse to the only anchor node below, we can use:
"/html/body/div/a" or "//a"
If we had <a href = "stat.berkeley.edu">stats</a>:
"//a[@href = 'stat.berkeley.edu']" or "//a[text() = 'stats']"
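
A quick way to check that these queries reach the same node is to run them against a small document. The sketch assumes the XML package is installed and uses a made-up page with a single anchor; the full path from the root is spelled out as "/html/body/div/a".

```r
library(XML)  # assumption: the XML package is installed

# Assumed toy document with exactly one anchor node
html <- '<html><body><div><a href="stat.berkeley.edu">stats</a></div></body></html>'
doc  <- htmlParse(html, asText = TRUE)

queries <- c("/html/body/div/a",                  # full path from the root
             "//a",                               # any anchor in the tree
             "//a[@href = 'stat.berkeley.edu']",  # filter on an attribute
             "//a[text() = 'stats']")             # filter on the text

# Number of nodes each query matches, and the text of the matched node
n_matches <- sapply(queries, function(q) length(getNodeSet(doc, q)))
texts     <- sapply(queries, function(q) xpathSApply(doc, q, xmlValue))

print(n_matches)
```

On a real page with many links, the two bracketed forms are the ones that narrow the result down to a single anchor.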
R Programming Language
Why R, in general?
- Plenty of packages! So you can extend R to do whatever you want.
- It’s free! So you can spend money on other things.

Why R, for collecting internet data?
- Collecting data for statistical purposes <-> R is made for statistics
- Implementations of XPath, Regular Expressions are fairly intuitive
- There are other ways, but they follow similar procedures
R: briefly
R is a command line interpreter language (interpretscommands you type)We’ll be using functions provided by packages (XML) andthe base package. Example of calling a function below:It’s usually a good idea to name all named arguments whencalling a function (otherwise R will just use them in order)Notice that if fun has any output it will be stored in x
x <- fun(unnamed, arg = named)
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
![Page 61: R and Collecting Internet Data - University of California, Berkeley](https://reader036.vdocument.in/reader036/viewer/2022071600/613d1cc2736caf36b759751c/html5/thumbnails/61.jpg)
Motivation Tools we need to learn
R Package: XML
Outline1 Motivation
Internet Movie Database (IMDb)Music DatabasesYahoo! SportsMain example: Superbowl - Play by Play Data
2 Tools we need to learnXML/HTMLXML Path Query Language (XPath QL)R Programming LanguageR Package: XMLRegular Expressions (regex)Demonstration/Resources
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
![Page 62: R and Collecting Internet Data - University of California, Berkeley](https://reader036.vdocument.in/reader036/viewer/2022071600/613d1cc2736caf36b759751c/html5/thumbnails/62.jpg)
Motivation Tools we need to learn
R Package: XML
Created by Prof. Duncan Temple Lang. This package is a collection of functions that will enable us to work with XML/HTML tree structures. Some of the functions we will use are:
- htmlTreeParse: gets the html page specified and creates an internal tree structure. We can then use XPath to traverse the tree
    > library(XML)
    > url = "http://en.wikipedia.org/wiki/Main_Page"
    > doc = htmlTreeParse(url, useInternalNodes = TRUE)
- getNodeSet: takes the parsed document and an XPath expression describing the nodes we’re looking for:
    > x1 = getNodeSet(doc, "//div[@class = 'bd']")
    > x2 = getNodeSet(doc, "//a")
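The same two calls can be tried without a network connection by parsing a small page from a string. This is only a sketch: the HTML below is invented for illustration, and `asText = TRUE` tells htmlTreeParse that its first argument is the page content rather than a URL or file name.

```r
library(XML)

# A tiny made-up page standing in for a real web page
html <- '<html><body><div class="bd"><a href="/wiki/R">R</a> is a language.</div><div class="ft"><a href="/about">About</a></div></body></html>'

doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# All div nodes whose class attribute is "bd"
bd_divs <- getNodeSet(doc, "//div[@class = 'bd']")

# All anchor nodes anywhere in the document
links <- getNodeSet(doc, "//a")

# Pull the href attribute out of each anchor
hrefs <- sapply(links, xmlGetAttr, "href")

print(length(bd_divs))  # 1
print(hrefs)            # "/wiki/R" "/about"
```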
- x1 will contain the location of all the nodes named "div" that have a class attribute with value "bd"
- x2 will contain locations of ALL the anchor nodes (which contain links)! Which may or may not be useful (unless you’re Google).
- xmlParent: gets the location of the parent of the specified node:
    > xmlParent(x1[[1]])
- xmlChildren, xmlAncestors: get all children or ancestors of a given node.
- xmlValue: gets the text of a given node.
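A short sketch of moving around the tree with these functions, again on an invented page parsed from a string. The helper `xmlName` (which returns a node's tag name) is also from the XML package but was not mentioned on the slide.

```r
library(XML)

# Made-up page: one div with two paragraph children
html <- '<html><body><div class="bd"><p>First <b>bold</b> play</p><p>Second play</p></div></body></html>'
doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

div_node <- getNodeSet(doc, "//div[@class = 'bd']")[[1]]

# One step up the tree
parent_name <- xmlName(xmlParent(div_node))

# One step down: the two <p> nodes, and the text inside each
kids <- xmlChildren(div_node)
kid_text <- unname(sapply(kids, xmlValue))

# Tag names along the path up to the root
ancestor_names <- unlist(xmlAncestors(div_node, xmlName))

print(parent_name)  # "body"
print(kid_text)     # "First bold play" "Second play"
print(ancestor_names)
```

Note that xmlValue collects the text of nested nodes too, which is why the bold word shows up inside the first string.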
One shortcut:
- If you know you want to get a table element from an html file
    - a table is a very specific html element!
- readHTMLTable will retrieve all table nodes and transform them into a workable form for you!
- So, we’re done, right?!
- Correct! Well, no: not everything is stored in a table!
- This is why we learned all this!
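Here is a minimal sketch of the shortcut on a made-up table parsed from a string. The arguments `header = TRUE` (use the first row as column names) and `stringsAsFactors = FALSE` are choices made for this example, not something the slide prescribes.

```r
library(XML)

# Invented two-row score table, for illustration only
html <- '<html><body><table>
  <tr><th>Team</th><th>Score</th></tr>
  <tr><td>Packers</td><td>31</td></tr>
  <tr><td>Steelers</td><td>25</td></tr>
</table></body></html>'

doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)

# One data frame per <table> node found in the document
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
scores <- tables[[1]]

# Cells come back as text, so convert numeric columns explicitly
scores$Score <- as.numeric(scores$Score)

print(scores)
```

Anything that is not inside a table node (play descriptions in a div, for instance) still needs getNodeSet and XPath.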
Regular Expressions (regex)
Provide a framework for finding specific text in a body of text or in a list of possible texts.
We can specify the string we are looking for with the following operators:
- OR operator - | - ex: "gray|grey" finds all instances of either "grey" or "gray"
- Grouping (parentheses) - ( ) - ex: "gr(a|e)y". Same statement as above
- Quantifiers (number of), given after the character or group:
    - question mark - ? - zero or one - ex: "colou?r" finds "color" or "colour"
    - asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc", "abbc", ...
    - plus sign - + - one or more - ex: "ab+c" finds all from above except "ac"
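The operators above can be checked directly in R with grepl, which returns TRUE or FALSE for each candidate. This is a small sketch; the word list is made up, and the anchors `^` and `$` (start and end of string) are added here so that each pattern has to match the whole word.

```r
words <- c("gray", "grey", "groy", "color", "colour", "ac", "abc", "abbc")

# OR and grouping describe the same set of strings
alt   <- grepl("gray|grey", words)
group <- grepl("gr(a|e)y", words)

# ? : the "u" may appear zero or one time
opt_u <- words[grepl("^colou?r$", words)]

# * : zero or more "b"
star <- words[grepl("^ab*c$", words)]

# + : one or more "b", so "ac" drops out
plus <- words[grepl("^ab+c$", words)]

print(identical(alt, group))  # TRUE
print(opt_u)                  # "color" "colour"
print(star)                   # "ac" "abc" "abbc"
print(plus)                   # "abc" "abbc"
```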
POSIX Expressions:
- . - matches ANY single character - ex: "a.c" finds "aac", "abc", "acc", ...
- [ ] - matches a single character from the brackets - ex: "gr[ea]y" finds "grey" or "gray"
- These are different from parentheses since ranges can be specified in square brackets: a-z means any lowercase letter.
- There are a ton more. See reference.
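A quick sketch of the dot and the square brackets in R. The candidate strings are invented, and the `^`/`$` anchors and the `[0-9]` digit range are additions for this example.

```r
candidates <- c("aac", "abc", "a1c", "ac", "grey", "gray", "gr3y")

# . matches any single character, digits included, but not "nothing"
dot <- candidates[grepl("^a.c$", candidates)]

# [ea] matches exactly one of the listed characters
set <- candidates[grepl("^gr[ea]y$", candidates)]

# [a-z] is a range: one lowercase letter
lower <- candidates[grepl("^a[a-z]c$", candidates)]

# [0-9] is a range: one digit, anywhere in the string
digit <- candidates[grepl("[0-9]", candidates)]

print(dot)    # "aac" "abc" "a1c"
print(set)    # "grey" "gray"
print(lower)  # "aac" "abc"
print(digit)  # "a1c" "gr3y"
```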
Regular Expressions in R are used in the following functions:
- grep finds a pattern in a vector of candidates and returns the indices of the elements that match:
    > grep("abc", c("abcdsf", "fabcasda", "cba"))
    [1] 1 2
- gsub replaces a pattern with another in a vector of candidates:
    > gsub("abc", "CANDY", c("abcdsf", "fabca", "cba"))
    [1] "CANDYdsf" "fCANDYa"  "cba"
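A few close relatives of grep and gsub from base R are worth knowing: grepl, the `value = TRUE` option, and sub. The sketch below uses made-up play-by-play style strings, not the actual data from the talk.

```r
# Hypothetical play descriptions
plays <- c("1st and 10 at GB 20", "2nd and 7 at GB 23", "Timeout Pittsburgh")

# Indices of the elements that start with a digit
down_idx <- grep("^[0-9]", plays)

# The matching elements themselves, instead of their indices
down_txt <- grep("^[0-9]", plays, value = TRUE)

# One TRUE/FALSE per element, handy for subsetting
is_down <- grepl("^[0-9]", plays)

# sub replaces only the first match, gsub replaces every match
first_only <- sub("[0-9]+", "#", plays[1])
all_nums   <- gsub("[0-9]+", "#", plays[1])

print(down_idx)    # 1 2
print(is_down)     # TRUE TRUE FALSE
print(first_only)  # "#st and 10 at GB 20"
print(all_nums)    # "#st and # at GB #"
```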
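- strsplit splits each candidate at every match of a pattern and returns a list with one element per candidate:
    > strsplit(c("abcdsf", "fabcasda", "cba"), "abc")
    [[1]]
    [1] ""    "dsf"

    [[2]]
    [1] "f"    "asda"

    [[3]]
    [1] "cba"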
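Since the split pattern is itself a regular expression, strsplit can break a scraped string into fields. A rough sketch, assuming a made-up play description in a fixed "down and distance at team yard" layout:

```r
# Hypothetical play description
play <- "2nd and 7 at GB 23"

# Split on one or more spaces; [[1]] takes the result for the first string
parts <- strsplit(play, " +")[[1]]

# Strip the letters from "2nd" to get the down as a number
down  <- as.integer(gsub("[a-z]+", "", parts[1]))
to_go <- as.integer(parts[3])
team  <- parts[5]
yard  <- as.integer(parts[6])

# With several strings we get one list element per string
fields <- strsplit(c("a, b", "c,d"), ", *")

print(parts)  # "2nd" "and" "7" "at" "GB" "23"
print(c(down, to_go, yard))  # 2 7 23
```

Real pages are rarely this regular, so in practice the patterns need to be adjusted after looking at the actual text.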
Demonstration/Resources
We’ll go through a quick Demo! (time permitting)
Resources:
- R: http://cran.r-project.org/
- R::XML: http://cran.r-project.org/web/packages/XML/index.html
- Duncan Temple Lang: http://www.stat.ucdavis.edu/~duncan/
- XPath Tutorial: http://www.w3schools.com/xpath/default.asp
- RegEx: http://www.regular-expressions.info/reference.html, Wikipedia
- This Presentation: http://www.stat.berkeley.edu/~luis/