Motivation Tools we need to learn
R and Collecting Internet Data
Luis F. Campos
Department of Statistics, University of California, Berkeley
February 11, 2011 - 4-5 PM - 1011 Evans Hall
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
Outline
1 Motivation
    Internet Movie Database (IMDb)
    Music Databases
    Yahoo! Sports
    Main example: Superbowl - Play by Play Data
2 Tools we need to learn
    XML/HTML
    XML Path Query Language (XPath QL)
    R Programming Language
    R Package: XML
    Regular Expressions (regex)
    Demonstration/Resources
Internet Movie Database (IMDb)
Music Databases
Yahoo! Sports
Main example: Superbowl - Play by Play Data
What is actually going on in your browser?
XML/HTML
What is XML?
XML stands for eXtensible Markup Language.
- Markup languages provide annotations for structure (marks) that are distinguishable from the text (examples: LaTeX, HTML, PostScript).
- Extensible: you can create your own marks.
In XML, the structure markers are angle brackets: < >
HTML could be considered a close relative of XML (both descend from SGML).
HTML (HyperText Markup Language) is currently the predominant markup language on the web.
What is HTML? Simply, a set of predetermined structural markers.

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
    <html>
      <head>
        <title>The document title</title>
      </head>
      <body>
        <h1>Main heading</h1>
        <p>A paragraph.</p>
        <a href = "www.stat.berkeley.edu">Statistics Website</a>
      </body>
    </html>
Another useful way to view an HTML document is as a tree.
We call the boxes nodes and the arrows edges.
An edge goes from a to b if:

    <a>
      <b></b>
    </a>

Note: there is a unique path from the root node to any given node.
Each node can have the following elements associated with it:
- a required name for the node, not necessarily unique
- any number of attributes
- optional text

    <a href = "www.stat.berkeley.edu">Statistics Website</a>

- node name: a
- attribute: href with value "www.stat.berkeley.edu"
- text: "Statistics Website"
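The three pieces of a node can be pulled apart programmatically. The following is a minimal sketch, assuming the XML package is installed; it parses the anchor from a string rather than from a live page, and the variable names (html, node, node_name, and so on) are only illustrative.

```r
library(XML)  # assumption: the XML package is installed

# Parse a one-node HTML fragment held in a string (no network access needed).
html <- '<a href="www.stat.berkeley.edu">Statistics Website</a>'
doc  <- htmlParse(html, asText = TRUE)

# Take the first (and only) anchor node in the parsed tree.
node <- getNodeSet(doc, "//a")[[1]]

node_name <- xmlName(node)             # the node name
node_href <- xmlGetAttr(node, "href")  # the value of the href attribute
node_text <- xmlValue(node)            # the text inside the node

print(c(name = node_name, href = node_href, text = node_text))
```

The same three accessors work on any node returned by an XPath query, not only on anchors.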
XML Path Query Language (XPath QL)
XPath is a query language that allows you to retrieve specific nodes in an XML/HTML tree. Some of the symbols we can use to represent a path on the tree:
- /  finds the root node
- //  selects from anywhere in the tree
- .  selects the current node
- ..  selects the parent of the current node
- @  selects attributes
- nodename  finds location(s) of the named node
To traverse to the only anchor node below, we can use:
    "/html/body/div/a" or "//a"
If we had <a href = "stat.berkeley.edu">stats</a>:
    "//a[@href = 'stat.berkeley.edu']" or "//a[text() = 'stats']"
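To see these queries in action, here is a small sketch that assumes the XML package is installed and uses a made-up document whose tree is html > body > div > a. The absolute path in it spells out every step from the root; the other queries find the same node by searching anywhere in the tree.

```r
library(XML)  # assumption: the XML package is installed

# A hypothetical page with a single anchor node inside a div.
html <- '<html><body><div><a href="stat.berkeley.edu">stats</a></div></body></html>'
doc  <- htmlParse(html, asText = TRUE)

queries <- c(
  absolute = "/html/body/div/a",                  # every step from the root
  anywhere = "//a",                               # any anchor in the tree
  by_attr  = "//a[@href = 'stat.berkeley.edu']",  # filter on an attribute
  by_text  = "//a[text() = 'stats']"              # filter on the node text
)

# Count how many nodes each query matches.
n_matches <- sapply(queries, function(q) length(getNodeSet(doc, q)))
print(n_matches)

# ".." steps up to the parent of the matched node.
parent_name <- xmlName(getNodeSet(doc, "//a/..")[[1]])

# Apply a function to every match: here, read the href attribute.
hrefs <- xpathSApply(doc, "//a", xmlGetAttr, "href")
```

All four queries should match the same single node here; on a real page "//a" would typically match many more nodes than the filtered versions.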
R Programming Language
Why R, in general?
- Plenty of packages! So you can extend R to do whatever you want.
- It’s free! So you can spend money on other things.
Why R, for collecting internet data?
- Collecting data for statistical purposes <-> R is made for statistics
- Implementations of XPath and regular expressions are fairly intuitive
- There are other ways, but they follow similar procedures
R: briefly
- R is an interpreted language with a command-line interface (it evaluates the commands you type)
- We’ll be using functions provided by packages (XML) and the base package. Example of calling a function below.
- It’s usually a good idea to name all named arguments when calling a function (otherwise R will just match them in order)
- Notice that if fun returns a value, it will be stored in x

    x <- fun(unnamed, arg = named)
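A short sketch of why naming arguments matters. The function fun and its arguments scale and shift are hypothetical, defined here only to illustrate the calling convention.

```r
# A hypothetical function with one unnamed-style argument and two named ones.
fun <- function(data, scale = 1, shift = 0) {
  data * scale + shift
}

# Named arguments can be given in any order.
x <- fun(10, scale = 2, shift = 1)   # 10 * 2 + 1
y <- fun(10, shift = 1, scale = 2)   # same call, different order

# Without names, R matches by position: scale = 1, shift = 2.
z <- fun(10, 1, 2)                   # 10 * 1 + 2

print(c(x = x, y = y, z = z))
```

The first two calls agree, while the positional call silently gives a different answer, which is the usual argument for naming.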
R Package: XML
Created by Prof. Duncan Temple Lang. This package is a collection of functions that will enable us to work with XML/HTML tree structures. Some of the functions we will use are:
- htmlTreeParse: gets the specified HTML page and creates an internal tree structure. We can then use XPath to traverse the tree:

      > url <- "http://en.wikipedia.org/wiki/Main_Page"
      > doc <- htmlTreeParse(url, useInternalNodes = TRUE)

- getNodeSet: takes the parsed document and XPath instructions about the node we’re looking for:

      > x1 <- getNodeSet(doc, "//div[@class = 'bd']")
      > x2 <- getNodeSet(doc, "//a")
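The same two calls can be tried without a network connection. This is a sketch under the assumption that the XML package is installed; the page content is invented for illustration and is passed as text with asText = TRUE instead of as a URL.

```r
library(XML)  # assumption: the XML package is installed

# A made-up page with two divs, only one of which has class "bd".
page <- '<html><body>
  <div class="bd"><a href="a.html">first</a></div>
  <div class="ft"><a href="b.html">second</a></div>
</body></html>'

# asText = TRUE tells the parser that `page` is HTML content, not a file or URL.
doc <- htmlTreeParse(page, asText = TRUE, useInternalNodes = TRUE)

x1 <- getNodeSet(doc, "//div[@class = 'bd']")  # divs with class "bd"
x2 <- getNodeSet(doc, "//a")                   # every anchor node

n_bd      <- length(x1)
n_anchors <- length(x2)
print(c(bd_divs = n_bd, anchors = n_anchors))
```

Each result is a list of nodes, so individual matches are accessed with double brackets, for example x2[[1]].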
- x1 will contain the locations of all the nodes named "div" that have a class attribute with value "bd"
- x2 will contain the locations of ALL the anchor nodes (which contain links)! Which may or may not be useful (unless you’re Google).
- xmlParent: gets the location of the parent of the specified node:

      > xmlParent(x1[[1]])

- xmlChildren, xmlAncestors: get all children or ancestors of a given node.
- xmlValue: gets the text of a given node.
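A brief sketch of moving around the tree from a matched node, assuming the XML package is installed. The page is a made-up example, and the result variables are illustrative names only.

```r
library(XML)  # assumption: the XML package is installed

# A hypothetical page: one div of class "bd" holding two paragraphs.
page <- '<html><body><div class="bd"><p>Hello</p><p>World</p></div></body></html>'
doc  <- htmlParse(page, asText = TRUE)

x1 <- getNodeSet(doc, "//div[@class = 'bd']")
div_node <- x1[[1]]

# One step up: the node that contains the div.
parent_name <- xmlName(xmlParent(div_node))

# One step down: the text of each child of the div.
child_text <- unname(sapply(xmlChildren(div_node), xmlValue))

# All the way up: the names of the nodes on the path to the top of the tree.
ancestor_names <- sapply(xmlAncestors(div_node), xmlName)

print(parent_name)
print(child_text)
print(ancestor_names)
```

Note that xmlValue on a node with children returns the concatenated text of everything beneath it, so applying it to the div itself would give the text of both paragraphs together.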
One shortcut:
- If you know you want to get a table element from an HTML file
    - a table is a very specific HTML element!
- readHTMLTable will retrieve all table nodes and transform them into a workable form for you!
- So, we’re done, right?!
    - Correct! No, not everything is stored in a table!
    - This is why we learned all this!
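As an illustration of the shortcut, the sketch below reads a small invented table (the column names and numbers are made up, not real game data). It assumes the XML package is installed and sets header = TRUE explicitly so that the first row is used for column names.

```r
library(XML)  # assumption: the XML package is installed

# A hypothetical page containing one table; the values are invented.
page <- '<html><body><table>
  <tr><th>Quarter</th><th>Points</th></tr>
  <tr><td>1</td><td>7</td></tr>
  <tr><td>2</td><td>14</td></tr>
</table></body></html>'

doc <- htmlParse(page, asText = TRUE)

# readHTMLTable returns a list with one data frame per table node.
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)
scores <- tables[[1]]

# Cell contents arrive as text, so convert before doing arithmetic.
scores$Points <- as.numeric(scores$Points)
total_points  <- sum(scores$Points)

print(scores)
print(total_points)
```

When a page has several tables, the returned list has one entry per table, and picking the right one usually takes a quick look at each.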
Regular Expressions (regex)
Provide a framework for finding specific text in a body oftext for a list of possible texts.We can specify the string we are looking for with thefollowing quantifiers:
OR operator - | - ex: "gray|grey" finds all instances of either"grey" or "gray"Grouping (parenthesis)- ( ) - ex: "gr(a|e)y". Same statementas aboveQuantifiers (number of), given after the character or group:
question mark - ? - zero or one - ex: "colou?r" finds "color"or "colour"
asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc","abbc", ...plus sign - + - one or more - ex: "ab+c" finds all from aboveexcept "ac"
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
Motivation Tools we need to learn
Regular Expressions (regex)
Provide a framework for finding specific text in a body oftext for a list of possible texts.We can specify the string we are looking for with thefollowing quantifiers:
OR operator - | - ex: "gray|grey" finds all instances of either"grey" or "gray"Grouping (parenthesis)- ( ) - ex: "gr(a|e)y". Same statementas aboveQuantifiers (number of), given after the character or group:
question mark - ? - zero or one - ex: "colou?r" finds "color"or "colour"asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc","abbc", ...
plus sign - + - one or more - ex: "ab+c" finds all from aboveexcept "ac"
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
Motivation Tools we need to learn
Regular Expressions (regex)
Provide a framework for finding specific text in a body oftext for a list of possible texts.We can specify the string we are looking for with thefollowing quantifiers:
OR operator - | - ex: "gray|grey" finds all instances of either"grey" or "gray"Grouping (parenthesis)- ( ) - ex: "gr(a|e)y". Same statementas aboveQuantifiers (number of), given after the character or group:
question mark - ? - zero or one - ex: "colou?r" finds "color"or "colour"asterisk - * - zero or more - ex: "ab*c" finds "ac", "abc","abbc", ...plus sign - + - one or more - ex: "ab+c" finds all from aboveexcept "ac"
Luis F. Campos UC, Berkeley
R and Collecting Internet Data
Motivation Tools we need to learn
Regular Expressions (regex)
POSIX Expressions:
. - matches ANY single character - ex: "a.c" finds "aac", "abc", "acc", ...
[ ] - matches a single character from the brackets - ex: "gr[ea]y" finds "grey" or "gray"
These are different from parentheses since ranges can be specified in square brackets: a-z means any lowercase letter.
There are a ton more. See the RegEx reference.
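A short R sketch of the dot and bracket expressions follows. The vector `candidates` is invented for illustration, and the `^`/`$` anchors are again an addition to force whole-string matches. Note that how a range such as a-z is interpreted can depend on the locale.

```r
# Hypothetical candidates for illustration.
candidates <- c("aac", "abc", "a-c", "ac", "grey", "gray", "groy")

# The dot stands for exactly one character, so "ac" does not match.
dot_hits <- grep("^a.c$", candidates, value = TRUE)
dot_hits      # "aac" "abc" "a-c"

# A bracket expression matches one character from the listed set.
set_hits <- grep("gr[ea]y", candidates, value = TRUE)
set_hits      # "grey" "gray"

# A range inside brackets: [a-z] is one lowercase letter, so "a-c" drops out.
range_hits <- grep("^a[a-z]c$", candidates, value = TRUE)
range_hits    # "aac" "abc"
```

With `value = TRUE`, `grep` returns the matching strings themselves instead of their positions.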
Regular expressions in R are used in the following functions:
grep finds a pattern in a list of candidates and returns the indices of the candidates that match:
> grep("abc", c("abcdsf", "fabcasda", "cba"))
[1] 1 2
gsub replaces a pattern with another in a list of candidates:
> gsub("abc", "CANDY", c("abcdsf", "fabca", "cba"))
[1] "CANDYdsf" "fCANDYa"  "cba"
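A few close relatives of these calls are worth knowing. The sketch below reuses the candidate vector from the slide; `grepl`, the `value = TRUE` argument, and `sub` are base R features that the slides do not mention, and the string "abbcabbc" is made up for the example.

```r
x <- c("abcdsf", "fabcasda", "cba")

# value = TRUE returns the matching strings instead of their indices.
hits <- grep("abc", x, value = TRUE)
hits                                # "abcdsf" "fabcasda"

# grepl returns one logical per candidate.
flags <- grepl("abc", x)
flags                               # TRUE TRUE FALSE

# sub replaces only the first match in each string; gsub replaces all of them.
first_only <- sub("b+", "B", "abbcabbc")
all_matches <- gsub("b+", "B", "abbcabbc")
first_only                          # "aBcabbc"
all_matches                         # "aBcaBc"
```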
strsplit splits each candidate at every match of a pattern:
> strsplit(c("abcdsf", "fabcasda", "cba"), "abc")
[[1]]
[1] ""    "dsf"

[[2]]
[1] "f"    "asda"

[[3]]
[1] "cba"
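The split pattern in strsplit is itself a regular expression, so the quantifiers from earlier apply. This is a sketch under assumptions: the `plays` strings are invented and only loosely imitate play-by-play text; they are not real data from the talk.

```r
# Hypothetical, irregularly spaced records.
plays <- c("Q1 12:05 pass", "Q2  3:41   run")

# " +" means one or more spaces, so runs of spaces act as a single separator.
fields <- strsplit(plays, " +")
fields[[1]]   # "Q1" "12:05" "pass"
fields[[2]]   # "Q2" "3:41" "run"

# fixed = TRUE treats the separator literally rather than as a regex,
# which matters for characters such as "." that have a special meaning.
parts <- strsplit("a.b.c", ".", fixed = TRUE)[[1]]
parts         # "a" "b" "c"
```

strsplit always returns a list with one element per input string, which is why the results are indexed with `[[ ]]`.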
Demonstration/Resources
We’ll go through a quick Demo! (time permitting)
Resources:
R: http://cran.r-project.org/
R::XML: http://cran.r-project.org/web/packages/XML/index.html
Duncan Temple Lang: http://www.stat.ucdavis.edu/~duncan/
XPath Tutorial: http://www.w3schools.com/xpath/default.asp
RegEx: http://www.regular-expressions.info/reference.html, Wikipedia
This Presentation: http://www.stat.berkeley.edu/~luis/