1 Scientific (Data-Oriented) Workflow for the Semantic Community Brand Niemann Data Science Class Final Assignment December 7, 2010



Scientific (Data-Oriented) Workflow for the Semantic Community

Brand Niemann

Data Science Class Final Assignment

December 7, 2010


Assignments

• General assignment:

– Stewardship: Workflow construction for preservation. Use the data workflow from Assignment 4.

• Specific assignment:

– 1. Construct a scientific (data-oriented) workflow. Present the workflow in diagram form with suitable annotations (documentation) for someone else to review. The workflow should be an accurate depiction of the data flow from the chosen project but can be for one specific instance, i.e., it need not be general. The diagram should be included in what you submit. Provide a minimum of a 5-6 sentence description of what was required to carry this out. Hint: the diagram can be hand drawn, computer drawn, or developed in a workflow tool; in the latter case the workflow need not be ‘executable’.

– 2. Describe each major stage of the workflow and indicate how well (or poorly) data and information preservation is enabled or accommodated. You may include URLs to information sources, etc.

– 3. Describe how the workflow and your assessment of existing documentation enabled (or not) the data stewardship. Hint: Review notes from class 2 as well as class 12.


Full Life Cycle of Data

• 1. Data: Research, Creation, Gathering, and Discovery

• 2. Information: Presentation and Organization

• 3. Knowledge (Experience): Conversation, Storytelling, and Integration

• 4. Wisdom: Contemplation (Evaluation), Interpretation, and Retrospection

Source: Professor Peter Fox, Slide 18 of Data Science, Week 1, August 31, 2010.

Note: 1 and 2 by Data Producers and 2 and 3 for Data Consumers.

Note: The Context for these spans Global, Local, and Personal.


Curation Stages

Source: Professor Peter Fox, Slide 19 of Data Science, Week 1, August 31, 2010.

Note: Need to do it all. Often just ‘get’ to the Metadata. Too many ‘get’ it only to the End User.


Scientific Data Workflows

Source: Professor Peter Fox, Slide 2 of Data Science Week 12, November 23, 2010.


Semantic Web Methodology and Technology Development Process

• Establish and improve a well-defined methodology vision for Semantic Technology based application development

• Leverage controlled vocabularies, etc.

• Steps in Iteration:

– Open World: Evolve, Iterate, Redesign, Redeploy

– Rapid Prototype

– Leverage Technology Infrastructure

– Adopt Technology Approach

– Science/Expert Review & Iteration

– Use Tools

– Develop model/ontology

– Analysis

– Evaluation: Small Team, mixed skills

– Use Case

• Some Observations:

– Scientist does not have to write all the code to perform the analysis

– Can compose workflows that utilize distributed data/services

– Can share the workflow with others to collaborate, reuse and modify

Source: Professors Deborah McGuinness and Joanne Luciano, Slide 35 of Semantic eScience Week 1, August 30, 2010.


Assignment Four

• Migration of 40 Deki.wik.is to the new MindTouch 2010 OnDemand Technical Communication Suite (Open Source Software based on a "fork" of the MediaWiki software, running on the Amazon Cloud with support services from MindTouch) at http://semanticommunity.net/.

• Mashup-of-Mashups Catalog of the RPI Data-gov and Linked Open Government Data Demos at http://logd.tw.rpi.edu/demos and http://data-gov.tw.rpi.edu/wiki/Demos

See http://semanticommunity.info/Data_Science_Class_Fall_2010/Assignment_4


Scientific (Data-Oriented) Workflow Phases

• Phase 1: Implement the New Curation Environment and Develop the Migration Plan

• Phase 2: Implement the Migration Plan in a Spreadsheet

• Phase 3: Prioritize the Content for Migration

• Phase 4: Evolve a Process for Standardizing the Migration of Each Type of Content

• Phase 5: Ask Questions of MindTouch Support

• Phase 6: Annotate the Migrated Content for Unfinished Tasks

• Phase 7: Thoroughly Check the Migrated Content Before Deactivating the Original Content

• Phase 8: Report on One Instance in Semantic eScience Class Final Assignment


Content Strategy: How and Why to Curate Content

• Curate When You Can't Create:

– Like any good museum exhibit, it’s important to have your visitor understand the relevance of the information provided. By making it relatable, curated content can guide a reader’s interest, while conveying important information.

• The Wisdom on Curated Content:

– Curated content allows users to be more selective about the content they choose to read. You could find it all on your own, but what for? Your time is valuable so why not let others guide your search. Conversely, if you are able to synthesize lots of content because you not only know where to look, but you understand how specific information can affect the larger picture, why not share that with others?

Source: http://www.cmswire.com/cms/web-engagement/content-strategy-how-and-why-to-curate-content--009454.php


Phase 1: Implement the New Curation Environment and Develop the Migration Plan

• 1. Ordering the Wiki and Domain Services

• 2. User Manual

• 3. Some Feature Highlights

• 4. Dashboard

• 5. Home Page

• 6. Privacy and Permissions

• 7. Migration Plan

See http://semanticommunity.info/@api/deki/files/4496/=BrandNiemann11052010.ppt


Phase 2: Implement the Migration Plan in a Spreadsheet

• The Spreadsheet contains the following:

– Cover Page: URL of each Wiki, total pages of each and overall (6338), and status information (see next slide).

– Strategy: Steps in strategy and lessons learned

– Master Log: Status of work on the highest priority content migration pages

– Individual Sheets for the 40 Deki Wiki Contents: The Deki Wiki Site Maps were copied to a metadata format (Name, Description, URL, and Comment)

• Note: As the migration progressed, it was decided that this much detail was not needed.
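
As an illustration only, the site-map-to-metadata step could be sketched in Python as follows. The four field names come from the slide; the class name `PageRecord`, the helper `to_csv`, and the sample row (including its placeholder URL) are hypothetical and are not taken from the actual spreadsheet.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class PageRecord:
    """One site-map entry in the (Name, Description, URL, Comment) format."""
    name: str
    description: str
    url: str
    comment: str = ""


def to_csv(records):
    """Serialize records to CSV text, one row per wiki page."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=[f.name for f in fields(PageRecord)],
        lineterminator="\n",
    )
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Hypothetical sample row; the URL is a placeholder, not a real wiki page.
sample = [
    PageRecord("Home", "Top-level page", "http://example.org/Home", "need link fixes")
]
csv_text = to_csv(sample)
```

A flat CSV like this can be opened directly as one sheet per wiki, which is roughly the layout the slide describes.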


Phase 2: Implement the Migration Plan in a Spreadsheet

See http://semanticommunity.info/@api/deki/files/1045/=MindTouchDekiWikis10092010.xls


Phase 3: Prioritize the Content for Migration

• The Deki Wiki with the largest number of pages and the most interlinked content (EPA Ontology) was chosen as the one free migration included with MindTouch 2010, and its migration was carefully checked before proceeding.

• The EPA Ontology was then moved down one level to make room for a top-level home page that would feature the highest priority content in a standardized metadata format, namely, Topic, Metadata, Data Analytics, Presentation, and Comments (see next slide). It was subsequently moved down another level to be part of more than 30 EPA data science applications.


Phase 3: Prioritize the Content for Migration

http://semanticommunity.info/


Phase 4: Evolve a Process for Standardizing the Migration of Each Type of Content

• I continued the migration by going from the Deki Wikis with the smallest to the largest number of pages, reasoning that the larger ones generally linked to the smaller ones and that I would gain experience that would help with the largest.

• This proved to be a great migration strategy, as I picked up speed from repetition and generally standardized on the following steps:

– Copy the Deki Wiki Title and first page, and download and upload all its attachments.

– Then start at the bottom of the Deki Wiki and work upwards through its pages in the left pane, being careful to copy all of the content and attachments or to leave annotations at the top of the page as to what needs to be completed.

– I used the Table of Contents Deki Script (see Phase 5) to collapse almost all of the multiple-page documents and reports into one page, which saved an enormous amount of time: instead of recreating all those sub-pages, I only had to copy their content under the table of contents. The Heading Levels had to be assigned carefully to retain the original multiple-page structure. MindTouch 2010 supports up to 5 Heading Levels.

– I also made sure to download and upload every graphic, insert it in the new content page, and link the attachments. Fortunately, I had learned early in the Deki Wiki work to carefully label the graphics and attachments. I also retained the attachment descriptions if they had been created.

– Finally, I checked everything carefully by redoing the hyperlinks on the Home Page and on every page with annotations that something still needed to be done. Where I had linked to attachments that were not a part of the page, I usually downloaded and uploaded that attachment to that page, reasoning that the attachment was not likely to change and/or that I really wanted to capture it in the context in which I had originally linked to it from that page. As more Deki Wiki content was added in the migration, I could usually find those attachments, but I still thought it was good curation policy to attach a separate version to that page to preserve the context, as mentioned before.
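
Two of the rules above, ordering the wikis from smallest to largest and mapping sub-page depth to heading levels, can be sketched in Python as below. This is a minimal illustration, not the process actually used: the wiki names and page counts are made up, and capping deeper pages at level 5 is an assumption about how to handle overflow that the slide does not specify.

```python
MAX_HEADING_LEVEL = 5  # the slide states that MindTouch 2010 supports 5 heading levels


def migration_order(page_counts):
    """Return wiki names sorted from fewest to most pages."""
    return sorted(page_counts, key=lambda name: page_counts[name])


def heading_level(depth):
    """Map a sub-page depth (1 = top) to a heading level, capped at the maximum."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    return min(depth, MAX_HEADING_LEVEL)


# Hypothetical wiki names and page counts, for illustration only.
counts = {"wiki-a": 120, "wiki-b": 15, "wiki-c": 48}
order = migration_order(counts)
levels = [heading_level(d) for d in (1, 3, 7)]
```

Here `order` lists the smallest wiki first, and a page nested seven levels deep would share heading level 5 with its ancestors, so very deep hierarchies would lose some structure under this assumption.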


Phase 5: Ask Questions of MindTouch Support

• I wondered where the original Deki Wiki Table of Contents feature was in the MindTouch 2010 and learned that it was available only when using certain ‘skins’ but could be readily used in every page with the Deki Script {{wiki.toc(page.path)}}!

• This made a tremendous difference in reducing the amount of work in the migration – now elaborate multiple page ‘drill downs’ in the Deki Wiki could become single pages in the MindTouch 2010!


Phase 6: Annotate the Migrated Content for Unfinished Tasks

• Since there is no ‘perfect order’ in which to do a large content migration in which all the hyperlinks in any page have already been migrated, it was decided to annotate the top of pages that ‘need link fixes’ and even use that to lead the detailed migration strategy to migrate other pages that needed to be linked to from higher priority pages.

• These ‘need link fixes’ annotations can be easily found using the search feature of MindTouch 2010 and fixed later on as part of the perpetual job of ‘Wiki gardening’.
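
The slide relies on the built-in MindTouch search to find these annotations. As a rough stand-in for that search, the sketch below scans page text held locally for the marker near the top of each page. The function name `pages_needing_fixes`, the `top_lines` parameter, and the sample pages are hypothetical.

```python
MARKER = "need link fixes"


def pages_needing_fixes(pages, marker=MARKER, top_lines=3):
    """Return the paths of pages whose first few lines contain the marker."""
    flagged = []
    for path, text in pages.items():
        # Only the top of the page is checked, since that is where the
        # annotation is placed; matching is case-insensitive.
        head = "\n".join(text.splitlines()[:top_lines]).lower()
        if marker in head:
            flagged.append(path)
    return sorted(flagged)


# Hypothetical page paths and contents, for illustration only.
pages = {
    "EPA/Overview": "NEED LINK FIXES\nOverview text.",
    "EPA/Air": "Air quality indicators.",
    "EPA/Water": "Note: need link fixes\nWater indicators.",
}
todo = pages_needing_fixes(pages)
```

The resulting list doubles as a work queue: pages drop out of it once their links are repaired and the annotation is removed.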


Phase 7: Thoroughly Check the Migrated Content Before Deactivating the Original Content

• I forgot that graphics embedded in the Deki Wiki content cannot just be copied and pasted to the MindTouch 2010, but need to be downloaded from the original Deki Wiki and then uploaded and inserted into the MindTouch 2010.

• Fortunately, I discovered this early on: when the MindTouch 2010 prompted me for a log-in on a copied page, I realized that I had closed each Deki Wiki from public view as I completed its migration, so the graphics were no longer viewable without a log-in because they still resided on the Deki Wiki and not on the MindTouch 2010.
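
A check for this problem could be automated along the following lines. The sketch uses only the Python standard library to list image sources in a page's HTML that still point at the old host. The host name `old.example.org`, the sample HTML, and the function name `images_on_old_host` are hypothetical; this is not a description of a MindTouch feature.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ImageSourceCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def images_on_old_host(html, old_host):
    """Return image URLs in the page that are still served from old_host."""
    collector = ImageSourceCollector()
    collector.feed(html)
    return [s for s in collector.sources if urlparse(s).hostname == old_host]


# Hypothetical page: one image still on the old wiki, one already re-uploaded.
page_html = (
    '<p>Figure 1</p><img src="http://old.example.org/files/chart.png">'
    '<img src="/@api/deki/files/12/=map.png">'
)
stale = images_on_old_host(page_html, "old.example.org")
```

Any URL returned here would break once the old wiki is closed to the public, so an empty result for every page is a reasonable precondition for deactivating the original content.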


Phase 8: Report on One Instance in Semantic eScience Class Final Assignment

• My goal for the past 5 years or so has been to develop an EPA Ontology based on the best subject matter expertise and content my agency has produced, which is its 2008 Report on the Environment. I have been involved in State of the Environment Reporting since its inception at EPA in the early 1990s, when I was part of the Center for Environment Statistics, which had the goal of becoming the Bureau of Environmental Statistics (BES). While a BES was never formed, after many iterations and an extensive peer review process that consumed about 5 years, a Report on the Environment was produced in 2008 and has been updated periodically. My 2008 EPA Performance Appraisal and Recognition System (PARS) contained the Critical Element to develop an EPA Ontology, which I did using the 2008 Report on the Environment. The report was first distributed as a large number of PDF and Excel files, and the PDF files were then converted to a Web format (jQuery).


Phase 8: Report on One Instance in Semantic eScience Class Final Assignment

http://semanticommunity.info/EPA/EPA_Ontology