Post on 15-Jul-2020

Data Storytelling Studio: getting & cleaning data

CMS.631/831, Rahul Bhargava

Agenda

[10] Review data logs

[10] Getting data

[10] Grad student presentation on open data papers

[20] Cleaning data

[10] Presentation crit

[5] Homework prep

Data log pair & share

the most nefarious

the most benign

the most surprising

Getting data

Sources of Data

Official sources (i.e., govt. agency)

Advocacy / interest groups

Personal knowledge

Make it yourself

If the data doesn't exist

Document Object Model (i.e., the DOM)

Tools to Scrape

(chart with two axes: "easy to learn" to "hard to learn", and "does lots of things" to "does one thing")

copy & paste

Chrome scraper extension

import.io

BeautifulSoup

mechanize

jQuery in the browser

requests
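A minimal sketch of what scraping through the DOM looks like in code, assuming the data you want sits in an HTML table. It uses Python's built-in html.parser as a stand-in for BeautifulSoup so that it runs without installing anything; the TableScraper class, the scrape_table function, and the sample page are all made up for illustration.

```python
from html.parser import HTMLParser


class TableScraper(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def scrape_table(html):
    """Return the table rows found in an HTML string as lists of strings."""
    parser = TableScraper()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical page fragment; the values are invented for illustration.
SAMPLE_HTML = """
<table>
  <tr><th>City</th><th>Year</th><th>Reports</th></tr>
  <tr><td>Boston</td><td>2016</td><td>12</td></tr>
  <tr><td>Cambridge</td><td>2016</td><td>5</td></tr>
</table>
"""

if __name__ == "__main__":
    for row in scrape_table(SAMPLE_HTML):
        print(row)
```

In practice you would probably fetch the page with requests and walk it with BeautifulSoup, which handles messy real-world HTML far better than this sketch.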

Open data papers

Joel Gurin. 2014. Open Governments, Open Data: A New Lever for Transparency, Citizen Engagement, and Economic Growth. SAIS Review of International Affairs 34, 1 (2014), 71–82.

Michael B. Gurstein. 2011. Open data: Empowering the empowered or effective data use for everyone? First Monday 16, 2 (January 2011).

How have you seen data stored?

Storage strategies

CSV files

relational databases

non-relational databases

text files

PDF files

HTML tables
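To make two of these storage strategies concrete, here is a small sketch that reads a CSV file and copies it into a relational database so it can be queried. It assumes Python's standard csv and sqlite3 modules; the table name, column names, and values are hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; the values are invented for illustration.
CSV_TEXT = """city,year,complaints
Boston,2015,120
Boston,2016,95
Cambridge,2016,40
"""


def load_csv_into_sqlite(csv_text):
    """Read CSV rows and copy them into an in-memory relational table."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reports (city TEXT, year INTEGER, complaints INTEGER)"
    )
    # Named placeholders pull each value out of the row dictionaries.
    conn.executemany(
        "INSERT INTO reports VALUES (:city, :year, :complaints)", rows
    )
    return conn


def total_by_city(conn):
    """Sum the complaints column per city, sorted by city name."""
    query = (
        "SELECT city, SUM(complaints) FROM reports "
        "GROUP BY city ORDER BY city"
    )
    return conn.execute(query).fetchall()


if __name__ == "__main__":
    print(total_by_city(load_csv_into_sqlite(CSV_TEXT)))
```

The same rows are easy to eyeball as CSV and easy to aggregate once they are in a database, which is one reason data often moves between these formats.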

What is clean data?

Clean Data

Consistency: are observations always entered the same way?

Completeness: do you have coverage of the topic?

Usability: machine readability

Atomicity: row-based normalization

Since we use machines to operate on data, machine-readability is a strong criterion.

And don't forget about the metadata.

See the Quartz guide to bad data.
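Two of these criteria, consistency and completeness, can be checked roughly in code. The sketch below is one possible approach, not a standard method; the records, the list of "missing" markers, and both function names are assumptions made up for illustration.

```python
from collections import Counter

# Hypothetical records; the values are invented for illustration.
RECORDS = [
    {"state": "MA", "count": "12"},
    {"state": "ma", "count": "7"},
    {"state": "Massachusetts", "count": ""},
    {"state": " MA ", "count": "n/a"},
]

# Assumed placeholders that people type when a value is missing.
MISSING_MARKERS = {"", "n/a", "na", "null", "none", "-"}


def variant_spellings(records, field):
    """Consistency: find raw values that only differ by case or spacing."""
    groups = {}
    for rec in records:
        raw = rec[field]
        groups.setdefault(raw.strip().lower(), Counter())[raw] += 1
    # Keep only the normalized keys that were entered in more than one way.
    return {key: dict(c) for key, c in groups.items() if len(c) > 1}


def missing_share(records, field):
    """Completeness: fraction of records with an empty or placeholder value."""
    missing = sum(
        1 for rec in records if rec[field].strip().lower() in MISSING_MARKERS
    )
    return missing / len(records)


if __name__ == "__main__":
    print(variant_spellings(RECORDS, "state"))
    print(missing_share(RECORDS, "count"))
```

Note that this check treats "MA" and "Massachusetts" as different values; catching that kind of inconsistency needs a lookup table or a human.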

About Tidy Data

Hadley Wickham. 2014. Tidy Data. Journal of Statistical Software 59, 10 (August 2014).
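One common untidy layout is a table whose column headers are really values (years, for example). A minimal sketch of reshaping such a table so that each row is one observation, assuming plain Python with the csv module; the melt helper and the sample table are hypothetical, not taken from the paper.

```python
import csv
import io

# Hypothetical "wide" table: the year values are stored in the column headers.
WIDE_CSV = """city,2015,2016
Boston,10,12
Cambridge,4,5
"""


def read_wide(csv_text):
    """Parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def melt(rows, id_field, var_name, value_name):
    """Turn every non-id column into its own row (wide to long)."""
    tidy = []
    for row in rows:
        for column, value in row.items():
            if column == id_field:
                continue
            tidy.append(
                {id_field: row[id_field], var_name: column, value_name: value}
            )
    return tidy


if __name__ == "__main__":
    for record in melt(read_wide(WIDE_CSV), "city", "year", "value"):
        print(record)
```

After reshaping, every row answers one question (which city, which year, what value), which is the form most charting and analysis tools expect.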

Tools to clean

(chart with two axes: "easy to learn" to "hard to learn", and "does lots of things" to "does one thing")

find & replace

BatchGeo

Excel

Tabula

geocod.io

OpenRefine

Trifacta Wrangler

regular expressions

textract

messytables

pdftables

Cleaning geographic data

Geoparsing: finding references to geographic places in text

tricky, but my Cliff tool does some of this

Geocoding: turning an address into latitude/longitude coordinates

BatchGeo can do a lot for you for free
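A toy sketch of the two ideas, assuming a tiny hand-made gazetteer instead of a real service. It is not how Cliff or BatchGeo work; the place list, the approximate coordinates, and both function names are made up for illustration, and the lookup works on place names rather than full street addresses.

```python
import re

# Tiny hypothetical gazetteer; coordinates are approximate, for illustration only.
GAZETTEER = {
    "Boston": (42.36, -71.06),
    "Cambridge": (42.37, -71.11),
}


def geoparse(text, gazetteer=GAZETTEER):
    """Return gazetteer place names mentioned in the text, in order of appearance."""
    names = "|".join(re.escape(name) for name in gazetteer)
    pattern = re.compile(r"\b(" + names + r")\b")
    return [match.group(1) for match in pattern.finditer(text)]


def geocode(place, gazetteer=GAZETTEER):
    """Look up (latitude, longitude) for a place name; None if it is unknown."""
    return gazetteer.get(place)


if __name__ == "__main__":
    sentence = "Flooding closed roads in Boston and parts of Cambridge."
    for place in geoparse(sentence):
        print(place, geocode(place))
```

The sketch also shows why geoparsing is tricky: a name like Cambridge matches more than one real place, and nothing here decides which one the text means.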

Getting data out of PDF files

assuming the text is readable:

Let's open up an example PDF and try out Tabula (the best I've seen so far)

If you're a programmer, pdftables is a useful option

if it is an image:

you're in trouble - the automated OCR toolchain isn't great
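When the text is readable, a table copied out of a PDF often arrives as lines with columns separated by runs of spaces. A minimal sketch of splitting such text into cells, assuming columns are separated by at least two spaces; this is a rough stand-in for what Tabula or pdftables do much more carefully, and the sample text is invented.

```python
import re

# Hypothetical text, as it might look after copying a table from a readable PDF.
PDF_TEXT = """Station          Year   Reports
Logan Airport    2016   12
Blue Hill        2016   5
"""


def split_columns(text):
    """Split each non-empty line into cells on runs of two or more spaces."""
    return [
        re.split(r"\s{2,}", line.strip())
        for line in text.splitlines()
        if line.strip()
    ]


if __name__ == "__main__":
    for row in split_columns(PDF_TEXT):
        print(row)
```

Splitting on two or more spaces keeps multi-word cells such as "Logan Airport" together, but it breaks as soon as a cell is empty, so check the row lengths before trusting the result.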

Cleaning text / numbers

misspellings: try OpenRefine to cluster them

extracting data: try regular expressions (use a cheatsheet) (learn it yourself)

splitting columns: remember Excel can do some of this

anonymizing: scrubadub.io is an in-progress tool to help
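Three of these chores can be sketched in a few lines of Python. The clustering function is a simplified version of the key-collision "fingerprint" idea that OpenRefine offers as one of its clustering methods; it catches differences in case, punctuation, and word order, not letter-level typos. All function names and sample values are hypothetical.

```python
import re
import string
from collections import defaultdict


def fingerprint(value):
    """Key that ignores case, punctuation, word order, and repeated words."""
    no_punct = str.maketrans("", "", string.punctuation)
    cleaned = value.strip().lower().translate(no_punct)
    return " ".join(sorted(set(cleaned.split())))


def cluster(values):
    """Group values that share a fingerprint; keep only groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


def extract_years(text):
    """Pull four-digit years (1900 to 2099) out of free text with a regex."""
    return re.findall(r"\b(?:19|20)\d{2}\b", text)


def split_name(full_name):
    """Split 'Last, First' into two columns, like Excel's Text to Columns."""
    last, _, first = full_name.partition(",")
    return last.strip(), first.strip()


if __name__ == "__main__":
    print(cluster(["New York", "new york", "York, New", "Boston"]))
    print(extract_years("Published in 2014, revised January 2011."))
    print(split_name("Wickham, Hadley"))
```

For real misspellings (a dropped or swapped letter) you would need a distance-based method, which OpenRefine also provides.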

Using image data

You can analyze images qualitatively and quantitatively by repurposing tools like Google Photos

Another critique

A more complex example: bit.ly/climate123

Homework

install Tableau

read stuff

grad student to present reading on machine learning & big data

MIT OpenCourseWare https://ocw.mit.edu

CMS.631 Data Storytelling Studio: Climate Change, Spring 2017

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms

Page 10: Data Storytelling Studio getting & cleaning data · BatchGeo can do a lot for you for free Getting data out of s : assuming the text is readable: Let's open up : an example PDF and

How have you seen data stored

10

Storage strategies

csv files

relational databases

non-relational databases

text files

pdf files

HTML tables

11

What is clean data

12

Clean Data

Consistency are observations always entered the same

Completeness do you have coverage of the topic

Usability machine readability

Atomicity row-based normalization

Since we use machines to operate on data machine-readability is a strong criteria

And dont forget about the metadata

See the Quartz guide to bad data 13

About Tidy Data

Hadley Wickham 2014 Tidy Data Journal of Statistical Software 59 10 (August 2014)

14

Tools to clean easy to learn

find amp replace

BatchGeo Excel Tabula geocodio

Open Refine Trifacta

Wrangler

does lots of things regular does one thing expressions

hard to learn

textract messytables pdftables

15

Cleaning geographic data

Geoparsing: finding references to geographic places in text

tricky, but my Cliff tool does some of this

Geocoding: turning an address into latitude/longitude coordinates

BatchGeo can do a lot for you for free

16
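To make the two terms concrete, here is a toy sketch that is not how Cliff or BatchGeo work. It assumes a tiny hand-made gazetteer with approximate coordinates. Real geoparsers must also disambiguate place names, and real geocoders resolve full street addresses through a service. The `GAZETTEER`, `geoparse`, and `geocode` names are hypothetical.

```python
import re

# Tiny hand-made gazetteer (approximate coordinates, for illustration only).
GAZETTEER = {
    "Boston": (42.36, -71.06),
    "Nairobi": (-1.29, 36.82),
}

def geoparse(text):
    """Return gazetteer place names mentioned in the text, in order of appearance."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, GAZETTEER)) + r")\b")
    return pattern.findall(text)

def geocode(place):
    """Look up (latitude, longitude) for a known place name, or None."""
    return GAZETTEER.get(place)

places = geoparse("Flooding was reported in Nairobi and later in Boston.")
coords = [geocode(p) for p in places]
```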

Getting data out of PDF files

assuming the text is readable:

Let's open up an example PDF and try out Tabula (the best I've seen so far)

If you're a programmer, pdftables is a useful option

if it is an image:

you're in trouble - the automated OCR toolchain isn't great

17

Cleaning text/numbers

misspellings: try OpenRefine to cluster them

extracting data: try regular expressions (use a cheatsheet) (learn it yourself)

splitting columns: remember Excel can do some of this

anonymizing: scrubadub.io is an in-progress tool to help

18
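The four tasks on this slide can each be sketched in a few lines of Python. This is an illustration under stated assumptions, not what the named tools do internally. The `fingerprint` function is a simplified version of the key-collision idea behind OpenRefine's clustering. The email pattern is a rough toy and does not use scrubadub. All sample strings are made up.

```python
import re
import string
from collections import defaultdict

def fingerprint(value):
    """Normalize a string so near-duplicate spellings share one key."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group values whose fingerprints collide."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(group) > 1]

# Misspellings: hypothetical messy entries that should be one value.
names = ["New York", "new york ", "York, New", "Boston"]
clusters = cluster(names)

# Extracting data: pull a four-digit year out of free text.
YEAR = re.compile(r"\b(19|20)\d{2}\b")
match = YEAR.search("Report published in March 2014 by the agency")
year = match.group(0) if match else None

# Splitting columns: break "City, State" into two fields.
city, state = [part.strip() for part in "Cambridge, MA".split(",", 1)]

# Anonymizing: replace anything that looks like an email address.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
redacted = EMAIL.sub("{{EMAIL}}", "Contact jane@example.org today")
```

Clustering only proposes groups. A person still has to confirm that the grouped spellings really mean the same thing before merging them.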

Using image data

You can analyze images qualitatively and quantitatively by repurposing tools like Google Photos

19

Another critique

20

A more complex example: bit.ly/climate123

21

homework

install Tableau

read stuff

grad student to present reading on machine learning & big data

22

MIT OpenCourseWare https://ocw.mit.edu

CMS.631 Data Storytelling Studio: Climate Change, Spring 2017

For information about citing these materials or our Terms of Use, visit https://ocw.mit.edu/terms

23
