1

Build the UK’s COINS in the Data Science Library Cloud

Brand Niemann
US EPA

June 9, 2010
http://semanticommunity.net

Disclaimer: These slides do not reflect the views of the U.S. Environmental Protection Agency and do not constitute endorsement by the EPA of the standards or products mentioned.

2

Overview

• The Challenge
• The Data.gov.uk Program
• The Expert and His Advice
• The Cloud Tools
• The Inspiration
• The Data Sources
• Other Sources of Data
• The Process
• The Results
• Comments
• Acknowledgements
• References

3

The Challenge

• Tim Berners-Lee "Bag of Chips" talk:
– http://www.youtube.com/watch?v=ga1aSJXCFe0
• To get five stars: 1-Expose your data, 2-Provide in machine readable format (Excel), 3-Provide as CSV, 4-Provide at permanent URL, and 5-Provide metadata.
• Nigel_Shadbolt: Lots of eyeballs pouring over COINS:
– http://bit.ly/b8XQGB - opendata in the wild - more functionality all the time.
• http://twitter.com/Nigel_Shadbolt/status/15419573652
• bniemannsr: @jahendler Hope data.gov evolves from quantity (500,000 datasets) to quality (data science applications):
– http://twitter.com/bniemannsr/status/15334914269
• Note: Now data.gov says only 272,677.
• jahendler: @bniemannsr sure, but check out the Sem Web and Apps sections - lots of stuff there that prototypes what we could do #websci:
– http://twitter.com/jahendler/status/15335026437
• bniemannsr: @jahendler Did, but neat prototypes don't improve data quality-data science does:
– http://radar.oreilly.com/2010/06/wha...a-science.html.
• http://twitter.com/bniemannsr/status/15549816659
• eGovernment Interest Group Teleconference, 04 Jun 2010:
– http://www.w3.org/2010/06/04-egov-minutes.html Excerpts: Cory Casanave: Can't see the Web of Data:
• Cory to write up requirements/wishlist for generic Web of Data browser. See Supporting the Linked Data Consumer.

http://gaininitiative.wik.is/United_Kingdom#The_Challenge

4

The Data.gov.uk Program

• Advised by Sir Tim Berners-Lee and Professor Nigel Shadbolt and others, government is opening up data for reuse. This site seeks to give a way into the wealth of government data and is under constant development. We want to work with you to make it better.

• We’re very aware that there are more people like you outside of government who have the skills and abilities to make wonderful things out of public data. These are our first steps in building a collaborative relationship with you.

http://gaininitiative.wik.is/United_Kingdom#The_Data.gov_Program

5

The Data.gov.uk Program

http://data.gov.uk/

7

The Expert and His Advice

• Edward Tufte Presidential appointment announced by White House, March 5, 2010.

• Tufte Comment on iPhone interface design: Better to have users looking over material adjacent in space within our eyespan rather than stacked in time. This is especially the case for statistical data, where the fundamental analytical task is to make comparisons. Also see page 159 in the above book reference.

http://gaininitiative.wik.is/United_Kingdom#The_Expert_and_His_Advice

8

The Cloud Tools

http://cloud.mindtouch.com/

9

The Cloud Tools

http://gaininitiative.wik.is/United_Kingdom

10

The Cloud Tools

http://spotfire.tibco.com/

11

The Cloud Tools

http://ondemand.spotfire.com/public/Help/index.htm

13

The Inspiration

http://www.wheredoesmymoneygo.org/dashboard/

14

The Inspiration

• What is data science? Analysis: The future belongs to the companies and people that turn data into products. Mike Loukides.
– http://radar.oreilly.com/2010/06/wha...a-science.html.

• My Response: Please see my Data Science Library in the Cloud: http://ondemand.spotfire.com/public/...VL-4372/public and my suggestion that The 2010 Health 2.0 Developer Challenge should build a community health data science library. See http://federaldata.wik.is/ June 3rd: http://twitter.com/bniemannsr/status/15482514867 and http://www.hhs.gov/open/discussion/chdi.html.

http://gaininitiative.wik.is/United_Kingdom#The_Inspiration

15

The Data Sources

http://data.gov.uk/dataset/coins

Scroll down to Full Description (see next slide)

16

The Data Sources

http://hm-treasury.gov.uk/coins

17

The Data Sources

• Tried the zipped 2009/10 Adjustment table, 31 MiB (405 MiB uncompressed): got a 405 MB text file that, when imported into Spotfire, gave three columns with no headers and 317,346 rows (with the last row saying "(316,119 row(s) affected)")!
– See next slide.

• Read Comments: Saw where others had had trouble using these datasets.
– Is this CSV?

• I unzipped the (non-torrent) version of the 09/10 adjustment table and it wasn't CSV but rather @-sign delimited (think tab-delim with an @ instead of a tab). Also the data wasn't clean for import to something like Excel, as it had some lines of non-table data at the end - just the sort of thing to upset already hard-pushed spreadsheet importers on non-high end rigs.

– Posted on: Fri, 04/06/2010 - 14:18 — Anonymous

http://gaininitiative.wik.is/United_Kingdom#The_Data_Sources

18

The Data Sources

COINS: Adjustment_table_extract_2009_10 in Spotfire-PC

19

The Data Sources

• Should have first read: The structure of the data is similar to that in a .csv file with a string of characters being formed to represent each row, using the following delimiters:
– Line: carriage return (so lines are presented separately); and
– Fields: @ .

• http://gaininitiative.wik.is/United_Kingdom/Understanding_the_COINS_data#The_Data_Files_and_Downloading

• And read: COINS contains millions of rows of data; as a consequence the files are large and the data held within the files complex. Using these download files will require some degree of technical competence and expertise in handling and manipulating large volumes of data. As such it is likely that this data will be most easily used by organisations that have such expertise, rather than individuals. More directly useful and accessible datasets that draw on the contents of the COINS database will be made available by August 2010.

– http://gaininitiative.wik.is/United_Kingdom/Understanding_the_COINS_data#Who_might_find_the_data_useful
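The delimiters quoted on this slide (one row per line, fields separated by @) and the trailing "row(s) affected" line seen on slide 17 suggest a simple reader. The following is a minimal Python sketch, not the method used for these slides: the function name `read_at_delimited`, the sample text, and its column names are all hypothetical, and it assumes that every table row has the same number of fields as the first non-blank line and that field values never contain an @.

```python
import io


def read_at_delimited(lines, delimiter="@"):
    """Yield rows (lists of strings) from '@'-delimited text.

    Assumption: the first non-blank line fixes the expected field count;
    any later line with a different count (for example a trailing
    "(N row(s) affected)" footer) is treated as non-table data and skipped.
    """
    expected = None
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(delimiter)
        if expected is None:
            expected = len(fields)
        if len(fields) != expected:
            continue  # skip non-table lines such as the footer
        yield fields


# Hypothetical sample in the described layout (not real COINS data).
SAMPLE = (
    "code@description@value\n"
    "A1@Example row one@10.5\n"
    "A2@Example row two@-3\n"
    "\n"
    "(2 row(s) affected)\n"
)

rows = list(read_at_delimited(io.StringIO(SAMPLE)))
for row in rows:
    print(row)
```

Because the function is a generator that reads one line at a time, the same approach could be pointed at an open file handle for the 405 MB extract without loading it all into memory; the file's text encoding is not stated on the slides and would need to be checked first.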

20

Other Sources of Data

For "Output all as CSV", could get only 5,000 of 72,644 rows. Sent question: Why?

http://gaininitiative.wik.is/United_Kingdom#Other_Sources_of_Data

http://coins.guardian.co.uk/coins-explorer/search

21

Other Sources of Data

COINS: Data Explorer in Spotfire-PC

Huge expenditure for Financial Stability for Northern Rock Refinancing!

22

Other Sources of Data

http://coins.wheredoesmymoneygo.org/?items_per_page=100&page=1

Each has a link to a detailed table (see next slide).

Could only get 100 rows per page. Sent question: How to get all 3,897,330?
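To give a sense of scale for the question above, the arithmetic can be sketched in Python. This is only an illustration under stated assumptions: it reuses the `items_per_page` and `page` query parameters visible in the URL on this slide, the helper names `page_count` and `page_url` are hypothetical, and it makes no request to the site, so whether the server would accept a larger page size or serve every page is not known.

```python
from urllib.parse import urlencode

# Values taken from this slide; the parameter names come from the slide's URL.
BASE_URL = "http://coins.wheredoesmymoneygo.org/"
TOTAL_ROWS = 3_897_330
ROWS_PER_PAGE = 100


def page_count(total_rows, rows_per_page):
    """Number of pages needed to cover total_rows (integer ceiling division)."""
    return (total_rows + rows_per_page - 1) // rows_per_page


def page_url(page, rows_per_page=ROWS_PER_PAGE):
    """Build the URL for one page of results (no network access here)."""
    query = urlencode({"items_per_page": rows_per_page, "page": page})
    return BASE_URL + "?" + query


pages = page_count(TOTAL_ROWS, ROWS_PER_PAGE)
first_url = page_url(1)
last_url = page_url(pages)

print(pages)
print(first_url)
print(last_url)
```

At 100 rows per page, covering all 3,897,330 rows would take tens of thousands of page requests, which is why a bulk download was requested instead.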

23

Other Sources of Data

http://coins.wheredoesmymoneygo.org/coins/fact_table_extract_2009_10.1361871

24

Other Sources of Data

COINS: Where Does the Money Go? in Spotfire-PC

25

The Process

• The Basic Steps:
– Inventory Data Sources and Plan Application
– Prepare and Import Data and Metadata
– Implement Layout and Analytics
– Add Bookmarks and Create Data Stories
– Publish and Test in Web Player
– Get Feedback and Improve

• First create visualizations, faceted search (filters), and analytics for each individual data source and then look for relationships between the data sources.

http://gaininitiative.wik.is/United_Kingdom#The_Process

26

The Results

• Recall The Challenges in slide 3:
– TBL – Get 5 stars.
– NS – Get more eyeballs on COINS.
– JH – Data.gov/semantic prototypes what we could do with Web Science.
– BN – Evolve from quantity of datasets to quality data science applications.
– CC – Can't see the Web of Data – Support the Linked Data Consumer.

http://gaininitiative.wik.is/United_Kingdom#The_Results

27

The Results

• Tried to accomplish all five challenges.

• Waiting to hear back on requests for full data sets.

• Want to emulate the Dashboard for Where Does My Money Go?

• Want to work with other data sources in Data.gov.uk:
– E.g. Climate Change.

http://gaininitiative.wik.is/United_Kingdom#The_Results

28

Comments

• The initial objective was to see how fast one could create this basic application. I am waiting to hear back on requests for full data sets. I want to emulate the Dashboard for Where Does My Money Go? I want to work with other data sources in Data.gov.uk, e.g. Climate Change.

• Please use the Add Comment feature at the bottom of this wiki page to provide feedback and suggest additional analyses you would like to see. To use the Add Comment feature you first need to register by providing your email address. Your privacy will be respected and your email address will not be available to others or used for any other purpose. You can also download the Spotfire file from this wiki and a 30-day free evaluation copy from http://spotfire.tibco.com/ to reuse these analyses, or add your own data to this file or to new Spotfire files that you create. Have fun and give us your feedback!

http://gaininitiative.wik.is/United_Kingdom#Comments

29

Acknowledgements

• The author acknowledges gratefully Dean Allemang, Cory Casanave, Sean Connors, Mills Davis, Li Ding, David Eng, Lee Feigenbaum, Aaron Fulkerson, Jim Hendler, Ralph Hodgson, Kevin Kirby, Kevin Jackson, Bob Marcus, John McMahon, Richard Murphy, Brand Niemann, Jr., Barry Nussbaum, Matthew Phoenix, Tony Shaw, Jeff Stein, George Strawn, George Thomas, Pete Tseronis, and Edward Tufte.

http://gaininitiative.wik.is/United_Kingdom#Acknowledgements

30

References

• Brand L. Niemann, Put Your Desktop in the Cloud to Support the Open Government Directive and Data.gov/semantic, April 19, 2010, Semantic Universe.

• Brand L. Niemann, Build Your Own Data.gov (Spotfire) and EPA Microsite (Spotfire) with Semantics and Statistics in the Cloud, May 15, 2010. Slides.

• Brand L. Niemann, Build Your Community Health Information "Design for America" Using Mindtouch and Spotfire Analytics, May 17, 2010. Slides.

• Brand Niemann, Build Your Own Data.gov/semantic with Spotfire in the Cloud: The White House Visitor Database, May 22, 2010. Slides. See Data.gov takes the 'Mumsy' test, FCW, May 26, 2010.

• Edward R. Tufte, Beautiful Evidence (2006), Graphics Press LLC. 

http://gaininitiative.wik.is/United_Kingdom#References