berlin 6 open access conference: tony hey

eScience and Open Access Supporting Data-Centric Research with Client + Cloud Tony Hey Corporate Vice President Microsoft Research

Upload: cornelius-puschmann

Post on 07-May-2015


DESCRIPTION

www.berlin6.org

TRANSCRIPT

Page 1: Berlin 6 Open Access Conference: Tony Hey

eScience and Open Access
Supporting Data-Centric Research with Client + Cloud

Tony Hey
Corporate Vice President, Microsoft Research

Page 2: Berlin 6 Open Access Conference: Tony Hey

The Fourth Paradigm: Data-Centric Science

Page 3: Berlin 6 Open Access Conference: Tony Hey

• Data collection
  – Sensor networks, satellite surveys, high throughput laboratory instruments, observation devices, supercomputers, LHC …

• Data processing, analysis, visualization
  – Legacy codes, workflows, data mining, indexing, searching, graphics …

• Archiving
  – Digital repositories, libraries, preservation, …

SensorMap
Functionality: Map navigation
Data: sensor-generated temperature, video camera feed, traffic feeds, etc.

Scientific visualizations
NSF Cyberinfrastructure report, March 2007

Page 4: Berlin 6 Open Access Conference: Tony Hey

1. Thousand years ago – Experimental Science
  – Description of natural phenomena

2. Last few hundred years – Theoretical Science
  – Newton’s Laws, Maxwell’s Equations …

3. Last few decades – Computational Science
  – Simulation of complex phenomena

4. Today – Data-centric Science
  – Scientists overwhelmed with data sets from many different sources
    • Data captured by instruments
    • Data generated by simulations
    • Data generated by sensor networks
  – eScience is the set of tools and technologies to support data federation and collaboration
    • For analysis and data mining
    • For data visualization and exploration
    • For scholarly communication and dissemination

(With thanks to Jim Gray)

$$\left(\frac{\dot{a}}{a}\right)^{2} = \frac{4\pi G\rho}{3} - K\,\frac{c^{2}}{a^{2}}$$

Page 5: Berlin 6 Open Access Conference: Tony Hey

• The Sloan Digital Sky Survey is the first major astronomical survey project:
  – 5 color images of ¼ of the sky
  – Pictures of 300 million celestial objects
  – Distances to the closest 1 million galaxies

• Jim Gray from Microsoft Research worked with astronomer Alex Szalay to build the public ‘SkyServer’ archive for the survey

New model of scientific publishing
  – Have to publish the data before astronomers publish their analysis

Page 6: Berlin 6 Open Access Conference: Tony Hey

• Poster child in 21st-century data publishing
  – 380 million web hits in 6 years
  – 930,000 distinct users vs 10,000 astronomers
  – 1600 scientific papers
  – Delivered 50,000 hours of lectures to high schools
  – Delivered 100B rows of data

World’s most used astronomy facility over last 2 years

Page 7: Berlin 6 Open Access Conference: Tony Hey

• Goal of 1 million visual galaxy classifications by the public

• Enormous publicity (CNN, Times, Washington Post, BBC)

• 100,000 people participating, blogs, poems …

Allows general public to search for photographs and classify different types of galaxies

Page 8: Berlin 6 Open Access Conference: Tony Hey
Page 9: Berlin 6 Open Access Conference: Tony Hey

Seamless Rich Social Media Virtual Sky
Web application for science and education

Participants
Alyssa Goodman; Harvard University
Alex Szalay; Johns Hopkins University
Curtis Wong, Jonathan Fay; Microsoft Research

Goals
Integration of data sets and one-click contextual access
Easy access and use

In just over two months, a million users have downloaded, installed and launched the application (2,206,497 unique sessions)

We invite you to experience it! www.worldwidetelescope.org

Page 10: Berlin 6 Open Access Conference: Tony Hey
Page 11: Berlin 6 Open Access Conference: Tony Hey

• Journal subscriptions rising faster than library budgets. No freedom for new journals in new and emerging fields.

• Web technology and digital media now make dissemination of knowledge ‘easy’ and ‘free’ without the traditional paper journals.

• As Dean of Engineering at Southampton:
  – Supposed to monitor the research output of over 200 Faculty and 500 Post Docs and Grad Students
  – University library could not afford to subscribe to all the journals that my staff published in, not to mention conference proceedings and workshop contributions …

Page 12: Berlin 6 Open Access Conference: Tony Hey
Page 13: Berlin 6 Open Access Conference: Tony Hey
Page 14: Berlin 6 Open Access Conference: Tony Hey

Requests for ETDs grew from around 220,000 in 1997/98 to nearly 20M by 2006/07
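
To put the two quoted figures in perspective, the short calculation below works out the overall growth and the average yearly growth they imply, which comes to roughly a 90-fold increase over the period. It is only a rough sketch: both numbers are rounded on the slide, and the assumption that growth compounded evenly over nine academic years is ours, not something the slide states.

```python
# Figures as quoted on the slide (both are rounded in the source).
requests_1997_98 = 220_000      # "around 220,000"
requests_2006_07 = 20_000_000   # "nearly 20M"
years_elapsed = 9               # academic years between 1997/98 and 2006/07

# Overall growth between the two data points.
growth_factor = requests_2006_07 / requests_1997_98

# Assumption (not from the slide): growth compounded evenly each year,
# so the yearly rate is the ninth root of the overall factor, minus one.
annual_growth = growth_factor ** (1 / years_elapsed) - 1

print(f"Overall growth: about {growth_factor:.0f}x")
print(f"Implied average annual growth: about {annual_growth:.0%}")
```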

Page 15: Berlin 6 Open Access Conference: Tony Hey

SciELO (Scientific Electronic Library Online) is a virtual library for Latin America, the Caribbean, Spain and Portugal.

It consists of a network of:

• Regional collections (SciELO Brazil, SciELO Chile, SciELO Cuba, SciELO Colombia, etc.)

• Thematic areas (SciELO Public Health)

The library forms part of a project being developed by FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) in collaboration with BIREME (Centro Latinoamericano y del Caribe de Información en Ciencias de la Salud).

The FAPESP/BIREME project envisages developing a common methodology for preparing, storing, disseminating and evaluating scientific literature in electronic form.

Page 16: Berlin 6 Open Access Conference: Tony Hey
Page 17: Berlin 6 Open Access Conference: Tony Hey

[Chart: SciELO pages translated, by month]

Page 18: Berlin 6 Open Access Conference: Tony Hey

Supporting researchers worldwide

The Research Lifecycle

Page 19: Berlin 6 Open Access Conference: Tony Hey

Open access Open source Open data

http://www.microsoft.com/interop/

“In order to help catalyze and facilitate the growth of advanced CI, a critical component is the adoption of open access policy for data, publications and software.”

NSF Advisory Committee on Cyberinfrastructure (ACCI)

Microsoft Interoperability Principles
• Open Connections to Microsoft Products
• Support for Standards
• Data Portability
• Engagement with Open Source community

Page 20: Berlin 6 Open Access Conference: Tony Hey

What does this mean?

You go to a great web site

It supports OpenID

No need to create/manage yet another account

You can now use Live ID to authenticate

Page 21: Berlin 6 Open Access Conference: Tony Hey

• Great for interoperability
  – Many sites support OpenID-based authentication
  – Other major ID providers are working on v2.0 of the protocol (e.g. Yahoo, Google)

• No need to manage account-related information on multiple sites
  – Name, email, web site, interests, etc.

• You control which sites are allowed to access your profile information and/or authenticate
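
As a rough illustration of the flow described above, the sketch below builds the redirect URL a web site would send the browser to when asking an OpenID 2.0 provider to authenticate a user. The parameter names come from the OpenID Authentication 2.0 specification. The provider endpoint and site URLs are hypothetical placeholders, and discovery, associations and verification of the provider's response are left out.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Namespace and "let the provider pick the identity" values defined by OpenID 2.0.
OPENID2_NS = "http://specs.openid.net/auth/2.0"
IDENTIFIER_SELECT = "http://specs.openid.net/auth/2.0/identifier_select"


def build_auth_request(op_endpoint: str, return_to: str, realm: str) -> str:
    """Return the URL the site redirects the browser to for interactive login."""
    params = {
        "openid.ns": OPENID2_NS,
        "openid.mode": "checkid_setup",      # interactive authentication request
        "openid.claimed_id": IDENTIFIER_SELECT,
        "openid.identity": IDENTIFIER_SELECT,
        "openid.return_to": return_to,       # where the provider sends the user back
        "openid.realm": realm,               # the site the user is asked to trust
    }
    separator = "&" if "?" in op_endpoint else "?"
    return op_endpoint + separator + urlencode(params)


# Hypothetical endpoint and site URLs, for illustration only.
example_url = build_auth_request(
    op_endpoint="https://openid.example.org/server",
    return_to="https://site.example.com/openid/return",
    realm="https://site.example.com/",
)
example_params = parse_qs(urlsplit(example_url).query)
print(example_params["openid.mode"][0])
```

The site never sees a password in this exchange. It only learns, after checking the provider's signed response, which identifier the user controls.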

Page 22: Berlin 6 Open Access Conference: Tony Hey

Insert Creative Commons licenses from any Office 2007 application

Incorporate license information in the OOXML so that the license can be read even without Office installed

Integration with the Creative Commons Web API so that new licenses can be created

Page 23: Berlin 6 Open Access Conference: Tony Hey

• Data Acquisition and Modeling
  – Data capture from source, cleaning, storage, etc.
  – SQL Server, SSIS, Windows WF

• Support Collaboration
  – Allow researchers to work together, share context, facilitate interactions
  – SharePoint Server, OneNote 2007 (shared)

• Data Analysis, Modeling, and Visualization
  – Mining techniques (OLAP, cubes) and visual analytics
  – SQL Analysis Services, BI, Excel, Optima, SILK (MSR-A)

• Disseminate and Share Research Outputs
  – Publish, Present, Blog, Review and Rate
  – Word, PowerPoint

• Archiving
  – Published literature, reference data, curated data, etc.
  – SQL Server

Microsoft has technologies that can offer end-to-end support

eScience is the set of tools and technologies to support Data-centric Science

Page 24: Berlin 6 Open Access Conference: Tony Hey
Page 25: Berlin 6 Open Access Conference: Tony Hey

Semantic Annotations in Word

• Phil Bourne and Lynn Fink, UCSD

Goals
• Semantic mark-up using ontologies and controlled vocabularies
• Facilitate/automate referencing to PDB (and other resources) from manuscript
• Conversion of manuscript to NLM DTD for direct submission to publisher

Scenario
• Authors do not need to be aware of the use of semantic technologies
• A domain-specific ontology is downloaded and made available from within Microsoft Word 2007
• Authors can record their intention, the meaning of the terms they use based on their community’s agreed vocabulary

Attribution: Richard Cyganiak

Page 26: Berlin 6 Open Access Conference: Tony Hey

Chemistry Drawing for Office

• Peter Murray-Rust, Univ. of Cambridge
• Murray Sargent, Office
• Geraldine Wade, Advanced Reading Technologies

Goals
• Support students/researchers in simple chemistry structure authoring/editing
• Enable ecosystem of tools around lifecycle of chemistry-related scholarly works
• Support the Chemistry Markup Language
• Proof of concept plug-in

Execution
• MSR Developer working on the proof of concept
• Post-doc in Cambridge using prototype plug-in and giving feedback
• Advanced Reading Technologies creating necessary glyphs

Page 27: Berlin 6 Open Access Conference: Tony Hey

<?xml version="1.0" ?>
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" x2="-2.9149999618530273" y2="0.7699999809265137" />
      <atom id="a2" elementType="C" x2="-1.5813208400249916" y2="1.5399999809265137" />
      <atom id="a3" elementType="O" x2="-0.24764171819695613" y2="0.7699999809265134" />
      <atom id="a4" elementType="O" x2="-1.5813208400249912" y2="3.0799999809265137" />
      <atom id="a5" elementType="H" x2="-4.248679083681063" y2="1.5399999809265137" />
      <atom id="a6" elementType="H" x2="-2.914999961853028" y2="-0.7700000190734864" />
      <atom id="a7" elementType="H" x2="-4.248679083681063" y2="-1.907348645691087E-8" />
      <atom id="a8" elementType="H" x2="1.0860374036310796" y2="1.5399999809265132" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" />
      <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" />
      <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" />
      <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>

Molecule added in Word*

* This is just a screenshot from a very early prototype. We are actively working on significantly improving the quality of the rendering.

CML stored in OOXML
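
Because the molecule is stored as plain CML, any tool can read it back without Office. The sketch below shows one possible way to do that with Python's standard library: it parses an abbreviated copy of the slide's CML (the 2D coordinates are dropped for brevity) and derives the molecular formula and bond list. The function names and the Hill-order formatting are our own illustrative choices, not part of the prototype.

```python
import xml.etree.ElementTree as ET
from collections import Counter

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Abbreviated copy of the CML on the slide: same atoms and bonds, coordinates omitted.
CML_DOCUMENT = """
<cml version="3" convention="org-synth-report" xmlns="http://www.xml-cml.org/schema">
  <molecule id="m1">
    <atomArray>
      <atom id="a1" elementType="C" /> <atom id="a2" elementType="C" />
      <atom id="a3" elementType="O" /> <atom id="a4" elementType="O" />
      <atom id="a5" elementType="H" /> <atom id="a6" elementType="H" />
      <atom id="a7" elementType="H" /> <atom id="a8" elementType="H" />
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a2" order="1" /> <bond atomRefs2="a2 a3" order="1" />
      <bond atomRefs2="a2 a4" order="2" /> <bond atomRefs2="a1 a5" order="1" />
      <bond atomRefs2="a1 a6" order="1" /> <bond atomRefs2="a1 a7" order="1" />
      <bond atomRefs2="a3 a8" order="1" />
    </bondArray>
  </molecule>
</cml>
"""


def molecular_formula(cml_text: str) -> str:
    """Count elementType attributes and format them in Hill order (C, H, then A-Z)."""
    root = ET.fromstring(cml_text.strip())
    counts = Counter(atom.get("elementType") for atom in root.findall(".//cml:atom", CML_NS))
    symbols = sorted(counts)
    if "C" in counts:
        hydrogen = ["H"] if "H" in counts else []
        symbols = ["C"] + hydrogen + [s for s in symbols if s not in ("C", "H")]
    return "".join(f"{s}{counts[s] if counts[s] > 1 else ''}" for s in symbols)


def bonds(cml_text: str) -> list[tuple[str, str, int]]:
    """Return (atom id, atom id, bond order) for every bond element."""
    root = ET.fromstring(cml_text.strip())
    result = []
    for bond in root.findall(".//cml:bond", CML_NS):
        first, second = bond.get("atomRefs2").split()
        result.append((first, second, int(bond.get("order"))))
    return result


formula = molecular_formula(CML_DOCUMENT)
bond_list = bonds(CML_DOCUMENT)
print(formula, len(bond_list))
```

The atom and bond lists on the slide (two carbons, two oxygens, four hydrogens, one C=O double bond) are consistent with acetic acid.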

Page 28: Berlin 6 Open Access Conference: Tony Hey

• Large collaboration project focusing on interoperability
• OAI-ORE as the focus
• Model for semantically representing eChemistry research work
• Repository to store generated models
• Models will be generated and consumed by tools used by chemistry researchers (e.g. PubChem, eCrystals, TableSheer, Excel, Word)

Project Collaborators
• Lee Dirks, Alex Wade (Microsoft)
• Carl Lagoze (Cornell University)
• Geoffrey Fox, Marlon Pierce (Indiana University)
• Peter Murray-Rust (University of Cambridge)
• Herbert Van de Sompel (LANL)*
• Steve Bryant (PubChem)*
• C. Lee Giles, Prasenjit Mitra, Karl Mueller (Penn State)
• Jeremy Frey, Simon Coles (University of Southampton)

* Advisory roles

Page 29: Berlin 6 Open Access Conference: Tony Hey

Organization
• High-profile EU Commission Project, €14M for 4 years
• Consortium of 5 national libraries, 4 national archives, 4 universities and 4 industry partners

Goals
• Preservation of Office Documents based on OpenXML
• Deliver converters for MS Office binary formats
• Funded open source project for ODF to/from OpenXML converter
• Deliver Preservation Toolkit

PLANETS
Tools and methods for sustainable long-term preservation of digital objects

Page 30: Berlin 6 Open Access Conference: Tony Hey

Cloud Computing

Page 31: Berlin 6 Open Access Conference: Tony Hey

• A Windows OS for the Cloud
  – Platform for developing and hosting Cloud applications
  – Built to take advantage of datacenters
  – Exposes services that hosted applications can leverage
    • Compute, storage, messaging, etc.

• VMs on the horizon

Source: Azure Services Platform whitepaper

http://www.azure.com/

Page 32: Berlin 6 Open Access Conference: Tony Hey

• Live Services
  – Live Mesh, Desktop-in-the-cloud, files in the cloud
  – Mostly consumer-oriented services
  – Developer APIs for building consumer-oriented services and applications in the Cloud

• .NET Services
  – Enterprise services
    • Workflow, Identity and authentication, etc.
  – Can be integrated by business applications

Page 33: Berlin 6 Open Access Conference: Tony Hey

• A collaboration space in the Cloud
• Current functionality
  – Upload documents, invite people to join the workspace, share, collaborate on documents
• More great features coming soon
  – Office Web applications announced at PDC08

http://workspace.officelive.com/

Page 34: Berlin 6 Open Access Conference: Tony Hey

• Expect scientific research environments will follow similar trends to the commercial sector
  – Leverage computing and data storage in the cloud
  – Scientists already experimenting with services

• For many of the same reasons
  – Siloed research teams, no resource sharing across labs
  – High storage costs
  – Low resource utilization
  – Excess capacity
  – High costs of reliably keeping machines up-to-date
  – Little support for developers, system operators

Page 35: Berlin 6 Open Access Conference: Tony Hey

RIC VRE VISION

VIRTUAL RESEARCH ENVIRONMENT

Online tools and services to enhance the research process.

Facilitate collaboration among researchers.
Provide a more effective means to work together.
• Available as a service with minimal barrier to entry…

Most of the technology comprising a VRE is evolutionary.
• It is the convergence of these technologies that is unique…

Page 36: Berlin 6 Open Access Conference: Tony Hey

CONTENT MANAGEMENT

Tools to organize, manage and control digital content.

Typical features include automated templates, organization, versioning, workflow management, document management, and content virtualization.

KNOWLEDGE MANAGEMENT

Tools for individuals and teams to distribute and share knowledge.
Tools such as blogs and wikis for a more unstructured, self-governing approach to knowledge transfer, and the capture and creation of knowledge through the development of new forms of community (rating, ranking, etc.).

SOCIAL NETWORKING

Individuals or teams can express identity and identify collaborators.
Primary features are profile pages, groups, self-expression, content creation tools, content sharing, blogs and forums, recommendations and tagging.

ONLINE COLLABORATION

Tools that facilitate working together to achieve a common goal.

Enable individuals to find each other and the information they need, communicate, and work together to achieve a common goal. Core elements are messaging, groupware, real-time collaboration and communication.

Page 37: Berlin 6 Open Access Conference: Tony Hey


Plan The Research

Search for study ideas, plan the study, and apply for funding.

Network

Connect with fellow researchers for sharing ideas, resources etc.

Experiment

Use online tools to achieve faster results.

Publish

Disseminate the study results for the public.

British Library for Research

A one-stop solution for carrying out research studies in a planned and phased manner and networking with fellow community members

Currently in beta evaluation, directed by The British Library.

Page 38: Berlin 6 Open Access Conference: Tony Hey

• Exchange, SharePoint, Live Meeting, Dynamics CRM, etc.
• No need to build your own infrastructure or maintain/manage servers
• Moving forward, even research services could move to the Cloud

http://www.microsoft.com/online/

Page 39: Berlin 6 Open Access Conference: Tony Hey

• Important/key considerations
  – Formats or “well-known” representations of data/information
  – Pervasive access protocols are key (e.g. HTTP)
  – Data/information is uniquely identified (e.g. URIs)
  – Links/associations between data/information

• Data/information is inter-connected through machine-interpretable information (e.g. paper X is about star Y)

• Social networks are a special case of ‘data meshes’

Attribution: Richard Cyganiak
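
The "paper X is about star Y" idea can be sketched as a small set of subject–predicate–object statements in which every item is named by a URI. The example below is a toy illustration only: all URIs and the vocabulary terms (`isAbout`, `hasAuthor`, `knows`) are made up for this sketch, and a real system would more likely use an RDF library and established vocabularies.

```python
# Hypothetical URIs: each paper, star and person gets a unique, web-style identifier.
PAPER_X = "http://example.org/papers/X"
PAPER_Z = "http://example.org/papers/Z"
STAR_Y = "http://example.org/stars/Y"
ALICE = "http://example.org/people/alice"
BOB = "http://example.org/people/bob"

# Hypothetical vocabulary terms describing the links between items.
IS_ABOUT = "http://example.org/terms/isAbout"
HAS_AUTHOR = "http://example.org/terms/hasAuthor"
KNOWS = "http://example.org/terms/knows"

# Machine-interpretable statements: (subject, predicate, object).
triples = [
    (PAPER_X, IS_ABOUT, STAR_Y),
    (PAPER_X, HAS_AUTHOR, ALICE),
    (PAPER_Z, IS_ABOUT, STAR_Y),
    (PAPER_Z, HAS_AUTHOR, BOB),
    (ALICE, KNOWS, BOB),  # a social network is just more links in the same mesh
]


def subjects(predicate: str, obj: str) -> list[str]:
    """Everything that points at `obj` through `predicate`, in insertion order."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(subject: str, predicate: str) -> list[str]:
    """Everything `subject` points at through `predicate`, in insertion order."""
    return [o for s, p, o in triples if s == subject and p == predicate]


papers_about_star = subjects(IS_ABOUT, STAR_Y)
authors_of_those_papers = [a for paper in papers_about_star for a in objects(paper, HAS_AUTHOR)]
print(papers_about_star)
print(authors_of_those_papers)
```

Literature links and social links sit in the same structure, so one query style answers both "which papers discuss this star?" and "who does this author know?".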

Page 40: Berlin 6 Open Access Conference: Tony Hey

Vision of Future Research Environment with both Software + Services

• scholarly communications
• domain-specific services
• instant messaging
• identity
• document store
• blogs & social networking
• mail
• notification
• search
• books
• citations
• visualization and analysis services
• storage/data services
• compute services
• virtualization
• Project management
• Reference management
• knowledge management
• knowledge discovery

Page 41: Berlin 6 Open Access Conference: Tony Hey

• http://research.microsoft.com/
• MSR downloads: http://research.microsoft.com/research/downloads/
• http://www.microsoft.com/science
• http://www.microsoft.com/mscorp/tc/scholarly_communication.mspx
• CodePlex: http://www.codeplex.com/
• The Faculty Connection: http://www.microsoft.com/education/facultyconnection/
• MSDN Academic Alliance: http://msdn.microsoft.com/en-us/academic/

Page 42: Berlin 6 Open Access Conference: Tony Hey