Data Storytelling Studio: Getting & Cleaning Data
CMS.631/831, Rahul Bhargava
Agenda
[10] Review data logs
[10] Getting data
[10] Grad student presentation on open data papers
[20] Cleaning data
[10] Presentation crit
[5] Homework prep
Data log pair & share
the most nefarious
the most benign
the most surprising
Getting data
Sources of Data
Official sources (i.e. government agency)
Advocacy / interest groups
Personal knowledge
Make it yourself
If the data doesn't exist
Document Object Model (i.e. the DOM)
Tools to Scrape
(chart axes: easy to learn vs. hard to learn; does lots of things vs. does one thing)
easier to learn: copy & paste, Chrome scraper extension, import.io
harder to learn: BeautifulSoup, mechanize, jQuery in the browser, requests
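As a rough illustration of what the harder-to-learn scraping tools do, the sketch below walks the tags of a page and pulls the cells out of an HTML table. It uses Python's standard-library html.parser instead of BeautifulSoup so that it runs without installing anything. The HTML fragment, its values, and the TableScraper class are all made up for this example; a real scraper would first download the page (for instance with requests).

```python
from html.parser import HTMLParser

# Made-up page fragment standing in for a downloaded web page.
SAMPLE_HTML = """
<table id="emissions">
  <tr><th>Country</th><th>Year</th><th>CO2</th></tr>
  <tr><td>Country A</td><td>2014</td><td>5.2</td></tr>
  <tr><td>Country B</td><td>2014</td><td>7.9</td></tr>
</table>
"""


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row being built, or None when outside a <tr>
        self._cell = None  # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_table(html):
    """Return the table in the HTML as a list of rows (lists of strings)."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML):
        print(row)
```

The trade-off matches the chart: this takes more effort than copy & paste, but the same few lines can be pointed at many pages.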
Open data papers
Joel Gurin. 2014. Open Governments, Open Data: A New Lever for Transparency, Citizen Engagement, and Economic Growth. SAIS Review of International Affairs 34, 1 (2014), 71–82.
Michael B. Gurstein. 2011. Open data: Empowering the empowered or effective data use for everyone? First Monday 16, 2 (January 2011).
How have you seen data stored?
Storage strategies
CSV files
relational databases
non-relational databases
text files
PDF files
HTML tables
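To make two of these strategies concrete, here is a small sketch that reads CSV text and loads it into a relational database. It relies only on Python's built-in csv and sqlite3 modules and an in-memory database. The table name, column names, and rows are invented for illustration.

```python
import csv
import io
import sqlite3

# Invented sample data standing in for a .csv file on disk.
CSV_TEXT = """city,year,avg_temp_c
Springfield,2015,11.5
Springfield,2016,12.0
Shelbyville,2016,10.4
"""


def load_csv_into_sqlite(csv_text):
    """Parse CSV text and return an in-memory SQLite connection holding it."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE readings (city TEXT, year INTEGER, avg_temp_c REAL)"
    )
    # Named placeholders are filled from each row's dictionary keys.
    conn.executemany(
        "INSERT INTO readings VALUES (:city, :year, :avg_temp_c)",
        rows,
    )
    conn.commit()
    return conn


if __name__ == "__main__":
    conn = load_csv_into_sqlite(CSV_TEXT)
    # Once the data is relational, questions become SQL queries.
    query = "SELECT city, COUNT(*) FROM readings GROUP BY city ORDER BY city"
    for city, n in conn.execute(query):
        print(city, n)
```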
What is clean data?
Clean Data
Consistency: are observations always entered the same way?
Completeness: do you have coverage of the topic?
Usability: machine readability
Atomicity: row-based normalization
Since we use machines to operate on data, machine-readability is a strong criterion.
And don't forget about the metadata!
See the Quartz guide to bad data.
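The first two criteria can be checked mechanically, at least in a rough way. The sketch below counts the distinct spellings in a column (consistency) and the values that look missing (completeness). The rows, the column names, and the list of "missing" markers are assumptions made up for this example, not a standard.

```python
from collections import Counter

# Invented rows showing the kinds of problems the checklist warns about.
ROWS = [
    {"state": "MA", "population": "6800000"},
    {"state": "Mass.", "population": "6,800,000"},
    {"state": "massachusetts", "population": ""},
    {"state": "MA", "population": "n/a"},
]

# Assumed conventions for "no value"; adjust to match your own data.
MISSING_MARKERS = {"", "n/a", "na", "null", "-"}


def is_missing(value):
    """True if the value matches one of the assumed missing-data markers."""
    return value.strip().lower() in MISSING_MARKERS


def column_report(rows, column):
    """Count missing values and distinct spellings in one column."""
    values = [row.get(column, "") for row in rows]
    missing = sum(1 for v in values if is_missing(v))
    variants = Counter(v for v in values if not is_missing(v))
    return {"missing": missing, "variants": dict(variants)}


if __name__ == "__main__":
    for name in ("state", "population"):
        print(name, column_report(ROWS, name))
```

Many variants for what should be one value signals a consistency problem; a high missing count signals a completeness problem.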
About Tidy Data
Hadley Wickham. 2014. Tidy Data. Journal of Statistical Software 59, 10 (August 2014).
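One idea from the Tidy Data paper is that each variable forms a column and each observation forms a row. The sketch below reshapes a "wide" table, where years are spread across the column headers, into that long form. The function name melt follows common usage in data libraries; the table and its values are invented for illustration.

```python
# Invented wide table: one row per city, one column per year.
WIDE = [
    {"city": "Springfield", "2015": 11.5, "2016": 12.0},
    {"city": "Shelbyville", "2015": 10.1, "2016": 10.4},
]


def melt(rows, id_column, var_name, value_name):
    """Turn one wide row per id into one long row per (id, variable) pair."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the id is repeated on every output row instead
            tidy.append(
                {id_column: row[id_column], var_name: column, value_name: value}
            )
    return tidy


if __name__ == "__main__":
    for record in melt(WIDE, "city", "year", "avg_temp_c"):
        print(record)
```

After reshaping, "year" is an ordinary column that can be filtered, grouped, or plotted, rather than information hidden in the header row.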
Tools to clean
(chart axes: easy to learn vs. hard to learn; does lots of things vs. does one thing)
easier to learn: find & replace, BatchGeo, Excel, Tabula, geocod.io, OpenRefine, Trifacta Wrangler
in the middle: regular expressions
harder to learn: textract, messytables, pdftables
Cleaning geographic data
Geoparsing: finding references to geographic places in text
tricky, but my Cliff tool does some of this
Geocoding: turning an address into latitude/longitude coordinates
BatchGeo can do a lot for you for free
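To show the difference between the two steps, here is a toy sketch: it finds place names in text (geoparsing) and then looks up coordinates for a name (geocoding). The three-entry gazetteer is a hypothetical stand-in with approximate coordinates. Real tools handle ambiguity, misspellings, and full street addresses, none of which this attempts.

```python
import re

# Tiny hypothetical gazetteer: place name -> approximate (latitude, longitude).
GAZETTEER = {
    "Boston": (42.36, -71.06),
    "Nairobi": (-1.29, 36.82),
    "Jakarta": (-6.21, 106.85),
}


def geoparse(text):
    """Return gazetteer place names mentioned in the text, in order of appearance."""
    names = "|".join(re.escape(name) for name in GAZETTEER)
    return re.findall(r"\b(" + names + r")\b", text)


def geocode(place):
    """Look up coordinates for a place name; None if it is not in the gazetteer."""
    return GAZETTEER.get(place)


if __name__ == "__main__":
    sentence = "Flooding hit Jakarta weeks before the storm reached Boston."
    for place in geoparse(sentence):
        print(place, geocode(place))
```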
Getting data out of PDF files
If the text is readable:
Let's open up an example PDF and try out Tabula (the best I've seen so far)
If you're a programmer, pdftables is a useful option
If it is an image:
you're in trouble; the automated OCR toolchain isn't great
Cleaning text & numbers
misspellings: try OpenRefine to cluster them
extracting data: try regular expressions (use a cheatsheet) (learn it yourself)
splitting columns: remember, Excel can do some of this
anonymizing: scrubadub.io is an in-progress tool to help
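Two of these steps can be sketched in a few lines of Python. The first part imitates the idea behind key-collision clustering: reduce each value to a normalized "fingerprint" and group the values that share one. This is a simplification, not OpenRefine's exact algorithm, and it only catches differences in case, punctuation, and word order, not letter-level typos. The second part uses a regular expression to pull numbers out of free text. All function names and sample values are invented for illustration.

```python
import re
import string
from collections import defaultdict


def fingerprint(value):
    """Lowercase, strip punctuation, then sort and de-duplicate the words."""
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values whose fingerprints collide; keep groups with 2+ spellings."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Optional minus sign, digits with optional thousands commas, optional decimals.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Find numbers such as 1,200 or 3.2 in text and return them as floats."""
    return [float(match.replace(",", "")) for match in NUMBER.findall(text)]


if __name__ == "__main__":
    print(cluster(["New York, NY", "NY, New York", "Boston"]))
    print(extract_numbers("Sea level rose 3.2 mm; 1,200 homes were flooded"))
```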
Using image data
You can analyze images qualitatively and quantitatively by repurposing tools like Google Photos
Another critique
A more complex example: bit.ly/climate123
homework
install Tableau
read stuff
grad student to present reading on machine learning & big data
MIT OpenCourseWare https://ocw.mit.edu
CMS.631 Data Storytelling Studio: Climate Change, Spring 2017
For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms
Storage strategies
csv files
relational databases
non-relational databases
text files
pdf files
HTML tables
11
What is clean data
12
Clean Data
Consistency are observations always entered the same
Completeness do you have coverage of the topic
Usability machine readability
Atomicity row-based normalization
Since we use machines to operate on data machine-readability is a strong criteria
And dont forget about the metadata
See the Quartz guide to bad data 13
About Tidy Data
Hadley Wickham 2014 Tidy Data Journal of Statistical Software 59 10 (August 2014)
14
Tools to clean easy to learn
find amp replace
BatchGeo Excel Tabula geocodio
Open Refine Trifacta
Wrangler
does lots of things regular does one thing expressions
hard to learn
textract messytables pdftables
15
Cleaning geographic data
Geoparsing finding references to geographic places in text
tricky but my Cliff tool does some of this
Geocoding turning an address into latitudelongitude coordinates
BatchGeo can do a lot for you for free
16
Getting data out of PDF files
assuming the text is readable
Lets open up an example PDF and try out Tabula (the best Ive seen so far)
If youre a programmer pdftables is a useful option
if it is an image
youre in trouble - the automated OCR toolchain isnt great
17
Cleaning text and numbers
Misspellings: try OpenRefine to cluster them
Extracting data: try regular expressions (use a cheatsheet) (learn it yourself)
Splitting columns: remember Excel can do some of this
Anonymizing: scrubadub.io is an in-progress tool to help
18
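For programmers, the first three tasks above can be sketched in plain Python. The clustering is loosely modeled on key-collision "fingerprint" matching, one of the clustering methods OpenRefine offers, but this is a simplified stand-in and not OpenRefine's code. All function names and sample strings are assumptions for illustration.

```python
import re
import string
from collections import defaultdict

def fingerprint(value):
    """Normalize to a key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower().translate(
        str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

def cluster(values):
    """Misspellings: group values whose fingerprints collide."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

def extract_numbers(text):
    """Extracting data: pull numbers (with optional thousands commas) out of text."""
    matches = re.findall(r"\d[\d,]*(?:\.\d+)?", text)
    return [float(match.replace(",", "")) for match in matches]

def split_name(full_name):
    """Splitting columns: turn 'Last, First' into (first, last)."""
    last, first = [part.strip() for part in full_name.split(",", 1)]
    return first, last

print(cluster(["New York", "new york ", "York, New", "Boston"]))
print(extract_numbers("Rainfall rose from 1,200 mm to 1,350.5 mm"))
print(split_name("Bhargava, Rahul"))
```

Fingerprinting only catches differences in case, spacing, punctuation, and word order. True typos such as "Bostn" need a fuzzier method, such as the nearest-neighbor clustering OpenRefine also provides.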
Using image data
You can analyze images qualitatively and quantitatively by repurposing tools like Google Photos
19
Another critique
20
A more complex example: bit.ly/climate123
21
homework
install Tableau
read stuff
grad student to present reading on machine learning & big data
22
MIT OpenCourseWare https://ocw.mit.edu
CMS.631 Data Storytelling Studio: Climate Change, Spring 2017
For information about citing these materials or our Terms of Use, visit https://ocw.mit.edu/terms
23