Using DCO Data (Infrastructure, Management, Analysis, Visualization, …)

Post on 22-Feb-2016


DESCRIPTION

Using DCO Data (Infrastructure, Management, Analysis, Visualization, …). Data Science. Peter Fox @taswegian, pfox@cs.rpi.edu (Marshall Ma) and the Data Science Team, Tetherless World Constellation, Rensselaer Polytechnic Institute. DCO Summer School, July 14, 2014. Big Sky, MT.

TRANSCRIPT

Using DCO Data (Infrastructure, Management, Analysis, Visualization, …)

Peter Fox @taswegian, pfox@cs.rpi.edu (Marshall Ma) and the Data Science Team

Tetherless World Constellation, Rensselaer Polytechnic Institute

DCO Summer School, July 14, 2014. Big Sky, MT

Data Science

https://deepcarbon.net/group/dco-summer-school-2014

Deep Carbon Observatory

Global community of ‘Carbon’ scientists (~1000 from ~40 countries) contributing to a Deep Earth Computer (data legacy) comprising:

• Global Earth Mineral Laboratory
• Global Census of Deep Fluids
• Global Volcano Gas Emissions
• Global Census of Deep Microbial Life
• Global State of High Pressure and Temperature Carbon and Related Materials
• Global Inventory of Diamonds with Inclusions
• …

Data Science is …

• Doing science with someone else’s data …
– across datasets
– with models
– multi-dimensional, multi-scale, multi-mode
– complex data-types
– needing new analytic and visual approaches

• Especially in multiple “dimensions” (functional)
– E.g. Detection/attribution methods/algorithms
– Visual exploration


You may see many diagrams like


Physical quantity versus measured as quantity

[Figure annotations: Value and units? (shown twice) Reference frame? Reference units?]
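The questions on this slide (value, units, reference frame) can be made concrete with a small record type. The following is a minimal sketch only: the `Measurement` class, its field names, and the example values are hypothetical and are not a DCO schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    """Hypothetical record: a number only becomes data once units and frame are stated."""

    quantity: str         # physical quantity being reported, e.g. "depth"
    value: float          # the measured value
    units: str            # units of the value, e.g. "m"
    reference_frame: str  # what the value is measured relative to

    def describe(self) -> str:
        """Return a one-line, human-readable statement of the measurement."""
        return f"{self.quantity} = {self.value} {self.units} ({self.reference_frame})"


# Illustrative values only, not taken from any DCO dataset.
m = Measurement("depth", 1250.0, "m", "below sea floor")
print(m.describe())
```

Keeping units and reference frame as required fields means a value cannot be stored without answering the questions the slide asks.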

Use case: How DCO Finds Out About Data

[Diagram (Fleischer, 2011). Labels: data; a scientist bringing new data (spreadsheet, diagram, digital map, report); a data manager transforming data; transformed data ready for import; repository staff/data librarian; importing tool; a data repository; Internet.]

Data-Information-Knowledge “Ecosystem”

[Diagram. Stages: Data, Information, Knowledge. Roles: Producers, Consumers. Other labels: Context, Presentation, Organization, Integration, Conversation, Creation, Gathering, Experience.]

[Diagram. Roles: Producers, Consumers. Labels: Quality Control, Quality Assessment, Fitness for Purpose, Fitness for Use, Trustee, Trustor.]

Spreadsheets

• E.g. Excel – import data

Documentation?

Census of Deep Life

• Substantial metadata – how to visualize THIS?

Bias

• To incline to one side; to give a particular direction to; to influence; to prejudice; to prepossess. [1913 Webster]
• A partiality that prevents objective consideration of an issue or situation [syn: prejudice, preconception]
• For acquisition – sampling bias is your enemy
• Cognitive bias is (due to) YOU!
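Sampling bias at acquisition can be illustrated with synthetic numbers. The sketch below assumes a made-up population and a hypothetical detection threshold; it only shows how recording values above a cutoff shifts the sample mean, and it says nothing about any real DCO dataset.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic "population": 10,000 values, made up for illustration.
population = [random.gauss(50.0, 10.0) for _ in range(10_000)]

# Unbiased acquisition: every member is equally likely to be sampled.
random_sample = random.sample(population, 500)

# Biased acquisition: only values above a detection threshold are recorded.
threshold = 55.0  # hypothetical instrument detection limit
biased_sample = [x for x in population if x > threshold][:500]

pop_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {pop_mean:.1f}")
print(f"random sample mean: {random_mean:.1f}")
print(f"biased sample mean: {biased_mean:.1f}")
```

The random sample tracks the population mean, while the thresholded sample is necessarily higher because every recorded value exceeds the cutoff.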

Provenance*

• Origin or source from which something comes, intention for use, who/what generated for, manner of manufacture, history of subsequent owners, sense of place and time of manufacture, production or discovery, documented in detail sufficient to allow reproducibility
– Internal
– External

How you find DCO data…?

• http://deepcarbon.net/dco_datasets
– Will soon be a window into community-based sources
• http://metpetdb.rpi.edu
• http://earthchem.org/
• http://www.earthchem.org/petdb
• http://vamps.mbl.edu/portals/deep_carbon/cdl.php
• …

Browser

All information is linked and traceable!


E.g. Deep Life (CoDL)

New tools: R (statistics, visualization, modeling), D3.js (visualization), NOT just of the data, but of all types of information and knowledge! IPython Notebooks?

When You Use Data – Science 2.0

• Version/subsetting and converting to a format you are familiar with is very common but mysterious
– Take notes – document – provenance
• Software – what did you use and how?
• Derived products – what did you create, how, why, etc.
• Use the metadata every chance you get, e.g. filenames!
• Place them in a Web-accessible folder, consider getting an identifier
• Use social media, blogs, etc. to discuss it.
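One way to "take notes – document – provenance" is to write a small note next to every file you derive. The sketch below is an assumption, not a DCO tool: the `write_provenance` function, the `.prov.json` suffix, and the field names are all hypothetical.

```python
import hashlib
import json
import platform
import sys
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def write_provenance(data_path: Path, source: str, steps: list) -> Path:
    """Write a JSON "sidecar" note next to a data file (hypothetical layout)."""
    record = {
        "file": data_path.name,
        # Checksum ties the note to this exact version of the file.
        "sha256": hashlib.sha256(data_path.read_bytes()).hexdigest(),
        "source": source,   # where the original data came from
        "steps": steps,     # what was done to it, in order
        "software": {"python": platform.python_version(), "platform": sys.platform},
        "recorded_utc": datetime.now(timezone.utc).isoformat(),
    }
    sidecar = data_path.with_name(data_path.name + ".prov.json")
    sidecar.write_text(json.dumps(record, indent=2))
    return sidecar


# Demonstration on a throwaway file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "subset.csv"
    data.write_text("site,depth_m\nA,120\n")
    sidecar = write_provenance(
        data,
        source="hypothetical download",
        steps=["kept rows with depth_m > 100"],
    )
    note = json.loads(sidecar.read_text())
    print(note["file"], note["sha256"][:12])
```

The note records the software used and the steps taken, which are the two things the slide asks you to document.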

4 R’s … Goble and others

Exercise 1

• Search for and access a dataset that you are not familiar with:
• Can you read it?
• Can you make sense of it?
• Can you assess quality, uncertainty?
• Any sources of bias?
• What would you need to do to make it useful?
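For tabular data, a first pass at "can you read it?" and "can you assess quality?" could be a per-column profile. This is a minimal sketch under stated assumptions: the `profile_csv` function is hypothetical, the sample table is made up, and treating `NA` or `-999` as missing is a guess that must be checked against the dataset's own documentation.

```python
import csv
import io


def profile_csv(text: str, missing=("", "NA", "NaN", "-999")) -> dict:
    """Count rows, missing cells, and numeric cells for each column."""
    reader = csv.DictReader(io.StringIO(text))
    report = {
        name: {"rows": 0, "missing": 0, "numeric": 0}
        for name in (reader.fieldnames or [])
    }
    for row in reader:
        for name in report:
            cell = (row.get(name) or "").strip()
            report[name]["rows"] += 1
            if cell in missing:
                report[name]["missing"] += 1
                continue
            try:
                float(cell)
                report[name]["numeric"] += 1
            except ValueError:
                pass  # non-numeric text such as a site label
    return report


# Made-up table; the column names and values are illustrative only.
sample = """site,depth_m,ch4_ppm
A,120,3.2
B,NA,4.1
C,300,-999
"""

report = profile_csv(sample)
for column, counts in report.items():
    print(column, counts)
```

A column with many missing cells, or a numeric column that contains text, is a prompt to go back to the metadata before using the data.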

When You Generate Data – Science 2.0

• How the data was generated, why, for what, when and in what format
– Take notes – document – provenance
• Software – what did you use and how?
• Derived products – what did you create, how, why, etc.
• Use the metadata every chance you get, e.g. filenames!
• Place them in a Web-accessible folder, consider getting an identifier
• Use social media, blogs, etc. to discuss it.
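Putting metadata into filenames works best when the naming pattern is fixed and can be read back by a program. The sketch below assumes a hypothetical five-field convention (`project_site_variable_date_version`); it is not a DCO naming standard, and the example values are made up.

```python
from datetime import date

# Hypothetical naming convention, in order of appearance in the filename.
FIELDS = ("project", "site", "variable", "date", "version")


def build_name(project: str, site: str, variable: str, day: date,
               version: int, ext: str = "csv") -> str:
    """Build a filename that carries its own basic metadata."""
    parts = [project, site, variable, day.strftime("%Y%m%d"), f"v{version:02d}"]
    if any("_" in part for part in parts):
        raise ValueError("fields must not contain the '_' separator")
    return "_".join(parts) + "." + ext


def parse_name(filename: str) -> dict:
    """Recover the metadata fields from a filename built by build_name."""
    stem, _, ext = filename.rpartition(".")
    values = stem.split("_")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    record["ext"] = ext
    return record


# Illustrative values only.
name = build_name("demo", "site12", "ch4", date(2014, 7, 14), 1)
print(name)
print(parse_name(name))
```

A filename is not a substitute for a full metadata record, but it survives copying and emailing when the accompanying notes do not.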

Make it visible to DCO (can be private)

https://deepcarbon.net/dco/dco-open-access-and-data-policies
https://deepcarbon.net/page/submit-community-data

You get an identifier! DCO-ID, can be cited, rewarded and much more…

Share…

DCO checklist: what people have to do (courtesy UC3)

[Diagram. Checklist steps: Your data management plan; Funding agency requirements; Creating your data; Organizing your data; Managing your data; Sharing your data. Roles: Domain Scientist, Data manager, Repository staff, Data Scientist. Also labeled: Curation Services & Tools.]

Domain scientists often also take up these two roles, which is neither efficient nor effective (i.e., the 80-20 rule).

DCO checklist: a service & tool perspective

[Diagram. Checklist steps: Your data management plan; AP Sloan requirements+ (e.g., NSF New Proposal and Award Policies and Procedures Guide, effective January 14, 2013); Creating your data; Organizing your data; Managing your data; Sharing your data. Services: Object Modeling, Identity Services, Storage Services, Ingest Services, Discovery Service, Characterization Services, Access Services. Tools and approaches: CKAN, community (shown twice); Faceted search and Drupal etc.; DCO-ID (Handle+DOI); Linked Data, community; Schema.org, etc.; Use cases, info. model.]

Exercise 2

• Begin with a recent dataset that you generated or were involved in generating
• Can someone else read it?
• Can someone make sense of it?
• Have you asserted quality, uncertainty?
• Have you described known sources of bias?
• What else would you now do to make it more useful?

Breakout Session Today

• Exercises 1 and 2
• Discussion

Friday

• Marshall (Xiaogang) Ma will round out the data discussion

• DCO goal for data: in the interim,
– help you become data scientists (as well as your specialty)
• Then, in time…
– you can drop “data” because you will handle data as easily as you do field work, use instruments, etc…
