please read (hidden slide) · common problems with data to use data from different sources o...

[email protected] http://research.microsoft.com


Post on 29-Sep-2020



TRANSCRIPT

Page 2: PLEASE READ (hidden slide) · Common Problems with Data To use data from different sources o Non-standard formats, scales, and units o Lack of data quality control o Lack of metadata

A Tidal Wave of Scientific Data


• Experimental Science

• Theoretical Science

• Newton's Laws, Maxwell's Equations…

• Computational Science

• Simulation of complex phenomena

• Data-Intensive Science

• captured by instruments

• generated by simulations

• generated by sensor networks

$$\left(\frac{\dot{a}}{a}\right)^2 = \frac{4\pi G\rho}{3} - K\frac{c^2}{a^2}$$


http://fourthparadigm.org


An edited collection of 26 short technical essays, divided into 4 sections


The Problem for the e-Scientist

• Data ingest

• Managing a petabyte

• Common schema

• How to organize it

• How to reorganize it

• How to share with others

• Query and visualization tools

• Building and executing models

• Integrating data and literature

• Documenting experiments

• Curation and long-term preservation

The Generic Problems

(With thanks to Jim Gray)

[Diagram: Experiments & Instruments, Simulations, Literature, and Other Archives each supply facts; Questions go in and Answers come out]


Monitoring

Collation

Quality assurance

Aggregation

Analysis

Reporting

Forecasting

Distribution

Done poorly, but a few notable counter-examples

Done poorly to moderately, not easy to find

Sometimes done well, generally discoverable and available, but could be improved

Integration

(I. Zaslavsky & CSIRO, BOM, WMO)


Environmental Ecosystem

[Diagram labels: Analysis, Insight, Publish, Data, Action, Knowledge, Communicate, Decide, Implement, Inform]


Data Variety – The Spice of Life

Manual Measurement

Automated Measurement

Sample Collection

Historical Photographs

Counting

Satellite

Relatively Ubiquitous Motes

Aircraft Surveys

Model Output

Typing


Source Data (Swath format)

Reprojected Data (Sinusoidal format)
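
The slide names the two formats but not the arithmetic behind the reprojection. As a rough illustration only, the sinusoidal (equal-area) projection maps a latitude/longitude sample to grid coordinates with x = R·λ·cos φ and y = R·φ. The sketch below assumes a spherical Earth with the radius used by the MODIS sinusoidal grid and a central meridian of 0°; the function names are hypothetical and this is not the pipeline's own code.

```python
import math

# Sphere radius (metres) used by the MODIS sinusoidal grid.
MODIS_SPHERE_RADIUS_M = 6371007.181


def sinusoidal_forward(lat_deg, lon_deg, radius=MODIS_SPHERE_RADIUS_M, lon0_deg=0.0):
    """Project geographic coordinates (degrees) to sinusoidal x, y (metres)."""
    lat = math.radians(lat_deg)
    dlon = math.radians(lon_deg - lon0_deg)
    x = radius * dlon * math.cos(lat)  # east-west distance shrinks with cos(latitude)
    y = radius * lat                   # north-south distance is linear in latitude
    return x, y


def sinusoidal_inverse(x, y, radius=MODIS_SPHERE_RADIUS_M, lon0_deg=0.0):
    """Recover latitude and longitude (degrees); not defined exactly at the poles."""
    lat = y / radius
    lon = x / (radius * math.cos(lat))
    return math.degrees(lat), math.degrees(lon) + lon0_deg
```

A swath sample at (45°N, 90°E) can be projected with `sinusoidal_forward(45.0, 90.0)` and then binned into whichever grid cell contains the resulting x, y.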


Why Make this Distinction?

• Provenance and trust vary widely

• Data acquisition, early processing, and reporting are done by everyone from large government agencies to individual scientists

• Smaller data sets are often passed around in email; big data downloads can take days (if they complete at all)

• Data sharing concerns and patterns vary

• Open access followed by (non-repeatable and tedious) pre-processing

• Truly science-ready data sets, but with concerns about misuse and misunderstanding, particularly for hard-won data

• Computational tools differ

• Not everyone can get an account at a supercomputer center

• Very large computations require engineering (error handling)

• Space and time aren't always simple dimensions

Complex shared detector

Simple instrument (if any)

Complex and heavy processing by experts

Ad hoc observations and models

KB, GB, TB, PB

Science happens when PBs, TBs, GBs, and KBs can be mashed up simply


Scientist

Source Metadata

Scientific Results

AzureMODIS

Service Web Role Portal

Request Queue

Source Imagery Download Sites


Reprojection Queue

Reduction Queue

Data Collection Stage

Reprojection Stage

Analysis/Reduction Stage

Catharine van Ingen (Microsoft Research), Jie Li, Marty Humphrey (UVA), Youngryel Ryu (UCB), Deb Agarwal (BWC/LBL)
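
The queue and stage labels above suggest a staged pipeline in which each stage takes work from one queue and feeds the next. The following is a minimal single-process sketch of that shape, assuming three stages connected by in-memory queues; `run_pipeline` and its callback parameters are hypothetical names, and the real service presumably runs many workers per stage on cloud queues rather than a loop like this.

```python
from queue import Queue


def run_pipeline(requests, collect, reproject, reduce_):
    """Push each request through collection, reprojection, and reduction in turn."""
    request_q, reprojection_q, reduction_q = Queue(), Queue(), Queue()

    for request in requests:
        request_q.put(request)

    # Data collection stage: fetch source data for each request.
    while not request_q.empty():
        reprojection_q.put(collect(request_q.get()))

    # Reprojection stage: convert each collected item to the target grid.
    while not reprojection_q.empty():
        reduction_q.put(reproject(reprojection_q.get()))

    # Analysis/reduction stage: turn each reprojected item into a result.
    results = []
    while not reduction_q.empty():
        results.append(reduce_(reduction_q.get()))
    return results
```

Because the stages only communicate through queues, each one could be scaled or retried independently, which is the usual reason for this design.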


Continuation “R2”: 2012+

Science and engineering objectives

“Solve the carbon balance problem”

“Build an interoperable data system”

Pilot study “R1”: 2009

20 million observations

Engineering success

Collaborators:

Humberto da Rocha (USP)

Andreas Terzis (JHU)

Juliana Salles, Rob Fatland (MSR)

Brito Cruz (FAPESP)


Common Problems with Data

To use data from different sources

o Non-standard formats, scales, and units

o Lack of data quality control

o Lack of metadata

o Difficult to repurpose data for different (my) tools

To share data

o Lack of incentive (no credit)

o Need extra resources and tools

Hidden problems, seldom addressed

o Versioning

o Provenance

o Curation
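
To make the first group of problems concrete, here is a small sketch of what combining records from different sources tends to involve: checking that the metadata is present and converting units to a common scale before anything can be compared. The field names, unit table, and `harmonize` function are all illustrative assumptions, not part of any tool named in these slides.

```python
# Conversion factors to a common unit (metres); the unit list is illustrative.
TO_METRES = {"m": 1.0, "cm": 0.01, "ft": 0.3048}

# Fields every record must carry before it can be combined with others.
REQUIRED_FIELDS = ("source", "value", "unit")


def harmonize(records):
    """Split records into unit-normalized ones and ones that cannot be used."""
    clean, rejected = [], []
    for record in records:
        missing = [field for field in REQUIRED_FIELDS if field not in record]
        if missing or record["unit"] not in TO_METRES:
            # Lack of metadata or a non-standard unit: set aside for review.
            rejected.append(record)
            continue
        clean.append({
            "source": record["source"],
            "value_m": record["value"] * TO_METRES[record["unit"]],
        })
    return clean, rejected
```

Records without a `source` field, or with a unit the table does not know, end up in `rejected` instead of silently contaminating the combined data set.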

[Diagram: Data Sources (SQL, CSV, Data Cube, XML)]


Current State of Data Ecosystem

Data Sources: SQL, CSV, Data Cube, XML

Cloud Service, HPC Cluster, DB Server, Data Server, …, Web server, …

Applications: …, Android, iPhone, Windows Phone, WebOS, Java, Silverlight, .NET, AJAX, PHP, Excel, MATLAB


Advance data discoverability, accessibility, and consumability

Marketplace SQL Spatial

PivotViewer

maps http://www.odata.org

OData is a Web protocol for querying and updating data that provides a way to unlock your data and free it from data silos.

It does this by building upon Web technologies such as HTTP, Atom Publishing Protocol (AtomPub) and JSON to provide access to information from a variety of applications, services, and stores.

In Open Source/Specifications Promise

An application of a set of internet standards: HTTP, Atom (RFC 4287), AtomPub (RFC 5023), REST semantics

Existing standards + easy data access API

Adding Geospatial data support – Feedback from the Community encouraged – www.odata.org

It allows you to form URLs based on what you know about the underlying data
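
As a rough illustration of forming such URLs, the sketch below assembles a query from the standard OData system query options `$filter`, `$orderby`, and `$top`. The service root `https://example.org/odata`, the `Stations` entity set, and the `Depth` property are made-up placeholders, and `odata_url` is a hypothetical helper rather than part of any OData library.

```python
from urllib.parse import quote, urlencode


def odata_url(service_root, entity_set, filter_expr=None, orderby=None, top=None):
    """Build an OData query URL from an entity set and optional query options."""
    options = {}
    if filter_expr:
        options["$filter"] = filter_expr
    if orderby:
        options["$orderby"] = orderby
    if top is not None:
        options["$top"] = str(top)

    url = f"{service_root.rstrip('/')}/{quote(entity_set)}"
    if options:
        # Keep "$" readable in option names and encode spaces as %20.
        url += "?" + urlencode(options, safe="$", quote_via=quote)
    return url


print(odata_url("https://example.org/odata", "Stations",
                filter_expr="Depth gt 10", top=5))
```

Knowing only the entity set name and one property is enough to write the query; nothing else about how the data is stored is needed.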


NodeXL

Binary and source code:

http://nodexl.codeplex.com

Network graph visualization


NodeXL: Network Overview, Discovery and Exploration add-in for Excel 2007/2010

A minimal network can illustrate the ways different locations have different values for centrality and degree
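
The point about a minimal network can be tried out without NodeXL. The sketch below computes degree and one common centrality measure (closeness, chosen here only for illustration; the slide does not say which measures were shown) on a made-up five-node graph. The graph and function names are assumptions.

```python
from collections import deque


def degree(graph):
    """Number of direct neighbours for each node."""
    return {node: len(neighbours) for node, neighbours in graph.items()}


def closeness(graph):
    """Closeness centrality: reachable nodes divided by total shortest-path distance."""
    scores = {}
    for start in graph:
        distance = {start: 0}
        pending = deque([start])
        while pending:  # breadth-first search from the start node
            node = pending.popleft()
            for neighbour in graph[node]:
                if neighbour not in distance:
                    distance[neighbour] = distance[node] + 1
                    pending.append(neighbour)
        total = sum(distance.values())
        scores[start] = (len(distance) - 1) / total if total else 0.0
    return scores


# Hypothetical undirected network: B is a hub, E hangs off D.
graph = {
    "A": {"B"},
    "B": {"A", "C", "D"},
    "C": {"B"},
    "D": {"B", "E"},
    "E": {"D"},
}
print(degree(graph))
print(closeness(graph))
```

In this graph A, C, and E all have degree 1, yet E scores lower on closeness than A and C because it sits farther from the hub, which is the kind of difference the slide refers to.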


interactive exploration

cinematic narrative

http://www.digitalnarratives.net


• Speech recognition technologies used to 'crack' audio files

• Indexing automatic transcripts as text does not work

• 'Real' enterprise automatic transcription accuracy is only 50-80%

• 50-140% accuracy improvement over indexing automatic transcripts

• Index word alternatives – robust to recognizer errors

• Index timing – navigate to exact point in video

• No need to invest in hardware infrastructure
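
The two indexing ideas above (word alternatives and timing) can be sketched as an inverted index whose entries keep every candidate word the recognizer considered, together with its start time. This is an illustrative toy under assumed input, not the system's actual data structure; the lattice format, confidences, and function names are hypothetical.

```python
from collections import defaultdict


def build_index(lattice):
    """Index every word alternative with its start time and confidence.

    `lattice` is a list of (start_seconds, [(word, confidence), ...]) slots.
    """
    index = defaultdict(list)
    for start, alternatives in lattice:
        for word, confidence in alternatives:
            index[word.lower()].append((start, confidence))
    return index


def search(index, word):
    """Return (start_seconds, confidence) hits, most confident first."""
    hits = index.get(word.lower(), [])
    return sorted(hits, key=lambda hit: -hit[1])


# Made-up recognizer output: the top guess at 47.3 s is wrong, but the
# correct word survives as a lower-confidence alternative.
lattice = [
    (12.0, [("whale", 0.6), ("wail", 0.4)]),
    (12.5, [("song", 0.9)]),
    (47.3, [("wail", 0.7), ("whale", 0.3)]),
]
index = build_index(lattice)
print(search(index, "whale"))
```

Indexing only the top-ranked transcript would find "whale" at 12.0 s and miss the occurrence at 47.3 s; keeping the alternatives recovers it, and the stored start time gives the point in the recording to jump to.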


ScienceCinema

• Use NLP and Bing Search to expand word dictionary

• Enables discovery of speech content

• 1,000 hours of AV content currently available

http://www.osti.gov/sciencecinema

http://research.microsoft.com/mavis


Seamless Rich Social Media Virtual Sky

Web application for science and education

Goals

Integration of data sets and one-click contextual access

Easy access and use

Tours for sharing information/insights

Updates

API for extensibility

Excel Add-in for easy data integration

We invite you to experience it! www.worldwidetelescope.org


Natural User Interfaces (NUI)

• Rethinking ways in which people will interact with computers/technologies of the future

• Re-evaluating everything from their (non-)physical design to the human needs and interaction models

• Revolutionizing the way we think about technology and what it can do on our behalf

http://research.microsoft.com/en-us/um/redmond/projects/kinectsdk/


NUI – Kinect SDK and WWT


http://fourthparadigm.org
