e-mail: [email protected]
© David Shotton, 2007
David Shotton
Image BioInformatics Research Group, Oxford e-Research Centre and Department of Zoology, University of Oxford, UK
UK Electronic Information Group
Image Management in Bio- and Environmental Sciences: New Directions
John Rylands Library, University of Manchester
Thursday 31st May 2007
Research images as first-class publication objects
Outline
The nature of scientific data and image publication
What data do we actually publish?
Relationship between publications and databases
Improving journal authoring
Data lenses, semantic lenses and live journal content
Integrating distributed data
Data webs
ImageWeb
ImageBLAST
Preserving biological research images
The ImageStore Project
Characteristics of biological research data
Bottom-up data flow, lacking central control
Very large research community with diverse research topics
Highly distributed research activities and publication structures
Research data heterogeneous and largely unstructured, often with little by way of semantic mark-up
An open world, where change is as ubiquitous as consensus is elusive
Where to store research data?
Research results may represent ‘universal truths’, e.g. the sequence of a particular gene
These form bounded data sets
The data need only be discovered once
Such information is typically published in a large global bioinformatics database
Research data can also be ‘particulars’ rather than ‘universals’, for example individual assay results, microscopy images and wildlife photos
These data form unbounded data sets
Data collection will never be complete
Such image information is not (yet) widely available on-line
It is not appropriate to submit such data to centralized global databases
The data are too heterogeneous
Such activities would not scale
What data do we publish?
A scientific paper does not just report scientific observations
Rather, as Anita de Waard of Elsevier has pointed out, a scientific paper is an exercise in rhetoric, designed to convince readers of the truth of a particular scientific hypothesis or belief
The goal of the article is not to state facts, but rather to convince
Facts are selected to support the argument, and are embedded in a rhetorical structure with the purpose of persuasion
“These observations support theories that defects of the muscle plasma membrane are important for dystrophic pathogenesis.”
. . . but what about the original research data?
While selected findings that support hypotheses appear in research articles, the majority of original research data are never published
Historically, in the paper age, there was no easy method for doing this
Journals had limited space
Other publication avenues were not available
Now, in this digital age, ‘supplementary information’ can be put on-line
However, this facility is not widely used
Furthermore, such supplementary data are usually poorly structured, with insufficient metadata, and may not be discoverable by external search engines
Depositing data as supplementary information may thus be consigning them to costly data graveyards, from which resurrection is difficult
How might we improve on this situation?
(my take-home messages!!)
We need to start treating experimental research data sets as first-class publication objects, of equal value to the journal papers based upon them
We need to work towards better interoperability between papers and data
First, two examples of work in progress
Then my suggestions for new developments
Convergence between papers and databases
Philip Bourne, Editor-in-Chief of PLoS Computational Biology and Co-director of the Protein Data Bank, wrote a stimulating paper:
“Will a biological database be different from a biological journal?” PLoS Comp. Biol. 2005 1(3) e34
In this, he contends that the distinction between an on-line paper and a database is diminishing
He calls for “seamless integration” between papers reporting results and the data used to compute those results
Journal publication                        Database deposition
Author submission via the Web              Depositor submission via the Web
Syntax checking                            Syntax checking
Review by scientists & editors             Review by annotators
Corrections by author                      Corrections by depositor
Publish – Web accessible                   Release – Web accessible
Similar Processes Lead to Similar Resources
Credit: Philip Bourne
My critique of Philip Bourne’s ideas
I agree with his central analysis of the processes involved. However, this similarity of process should not blind us to essential differences in purpose
We must maintain a clear distinction between the journal publication
peer reviewed
a dated record of the authors’ view at the time of publication
while errata are permitted, the original version should be immutable
and the research database
should contain the most reliable up-to-date information
data quality is initially the responsibility of the depositor
errors subsequently discovered should be corrected by the curator
Thus “seamless integration” is not desirable
One needs to approach publications and data sets with different presuppositional spectacles – the first rhetorical, the other analytical
Researchers really want the “seams” to be very clear, not covered over
Improving the authoring process
Richard O’Beirne of Oxford Journals has stressed that, for publishers to enable their publications to be better used in the digital world, they need to expose finer-grained metadata that identifies the component pieces of papers
For images, this means figures and their legends
Such mark-up is typically present during the production phases of a paper’s publication, usually in the form of XML, but is ‘lost’ upon publication as PDF
Such metadata need to be exposed to facilitate interoperability with data resources (see the extraction sketch below)
Anita de Waard of the Elsevier Advanced Technology Group is currently developing a system, in conjunction with the editors and authors of Cell, whereby the authors are enabled to create such mark-up while writing the paper
What we need is an easy-to-use plug-in for MS-Word, accepted by all leading publishers, for the creation of suitable text mark-up at the time of authoring
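To make the figure-level granularity concrete, here is a minimal sketch (in Python) of extracting figure metadata from an article’s production XML, assuming a simplified JATS-like layout; the element and attribute names are illustrative only, not those of any particular publisher’s DTD.

    import xml.etree.ElementTree as ET

    # Simplified, JATS-like article fragment (illustrative only)
    ARTICLE_XML = """
    <article>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption>Crystal structure of Polo-like Kinase 1.</caption>
        <graphic href="fig1.tif"/>
      </fig>
    </article>
    """

    def figure_metadata(xml_text):
        """Yield (id, label, caption, image file) for each figure."""
        root = ET.fromstring(xml_text)
        for fig in root.iter("fig"):
            yield (fig.get("id"),
                   fig.findtext("label"),
                   fig.findtext("caption"),
                   fig.find("graphic").get("href"))

    for record in figure_metadata(ARTICLE_XML):
        print(record)

Metadata exposed at this granularity could then be published alongside the PDF, rather than being discarded at the end of production.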
Live (or at least lively) journal content
The norm that the online version of a journal article is a PDF file is antithetical to the spirit of the Web, and ignores its great potential
PDF is an electronic embodiment of a static printed page
Rather, what we need are on-line journals that include tools to deliver renderable interactive views of otherwise static images, and interpretive ‘data lenses’ or ‘semantic lenses’ over published data, thereby enabling new levels of reader comprehension
Semantic lenses specifically provide viewpoints onto RDF data, presenting users with information from selected semantic perspectives
This will require Web delivery of information from multiple resources, involving proper integration of the published paper with research data archives
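To illustrate the idea (a sketch of the concept, not of any existing lens implementation), a semantic lens can be modelled in Python with rdflib as nothing more than a set of predicates: applying the lens presents only those RDF statements about an image that match the chosen perspective. All namespaces and predicate names here are invented.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/imageweb/")   # invented namespace

    g = Graph()
    img = EX["image42"]
    g.add((img, EX.depicts, Literal("microvilli of intestinal epithelial cells")))
    g.add((img, EX.technique, Literal("transmission electron microscopy")))
    g.add((img, EX.magnification, Literal("60,000x")))
    g.add((img, EX.copyrightHolder, Literal("University of Oxford")))

    # Each lens is simply the set of predicates visible from one perspective
    LENSES = {
        "biology":   {EX.depicts},
        "technique": {EX.technique, EX.magnification},
        "rights":    {EX.copyrightHolder},
    }

    def apply_lens(graph, subject, lens):
        """Return only the statements about subject that the lens admits."""
        return [(p, o) for p, o in graph.predicate_objects(subject) if p in lens]

    for p, o in apply_lens(g, img, LENSES["technique"]):
        print(p, o)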
A data lens showing tsunami damage
A data lens applying a high-pass filter
A data lens for image analysis
Electron micrograph showing cross sections of microvilli on the surface of intestinal epithelial cells
A live semantic lens demonstration
http://www.cc.gatech.edu/gvu/ui/sub_arctic/sub_arctic/test/sem_lens_test.html
An example from a recent issue of Biochemistry
Report of a crystallographic structure
Figure 1 from the on-line version of the paper, showing the protein structure
The PDB entry for Polo-like Kinase 1 (PDB ID 2OU7)
Interactive Jmol representation of Polo-like Kinase 1
http://molvis.sdsc.edu/fgij/fg.htm?mol=2ou7
Another example, from The Plant Cell
All the images in the paper should be clickable videos!
Fusiform bodies within the ER network of Arabidopsis stem cortical cells
http://www.brookes.ac.uk/schools/lifesci/research/molcell/hawes/gfpmoviepage.htm
Integrating distributed data
The problems of achieving semantic interoperability between distributed heterogeneous archives of digital data are well known
Previous approaches to solving the problem have involved
distributed query processing
repository federation, or
portals
All share a common reliance on mainstream technologies such as Z39.50, XML and Web Services, some of which might be considered dated or heavyweight
None have applied the Semantic Web and Web 2.0 approaches that I now wish to describe to the problems of data integration
Web and Semantic Web standards and tools
We favour the World Wide Web Consortium standards:
RDF as the standard format for sharable metadata
SPARQL as the universal query language for RDF (see the query sketch after this list)
Software such as D2R Server for abstracting RDF from relational databases in response to SPARQL queries
OWL-DL as the standard web ontology language;
and for software development and integration:
use of agile programming techniques
Ruby or Python to provide a lightweight development environment
loose coupling between the Model, View and Controller software components, based on a simple RESTful approach to component integration (Fielding 2000, Representational State Transfer)
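As a concrete illustration of these choices, here is a minimal sketch of a SPARQL query issued from Python with the SPARQLWrapper library; the endpoint URL and vocabulary are hypothetical placeholders. A D2R Server exposing a relational image database could answer such a query unchanged.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint; a D2R Server over a relational image
    # database would expose the same SPARQL interface
    endpoint = SPARQLWrapper("http://example.org/imageweb/sparql")
    endpoint.setQuery("""
        PREFIX ex: <http://example.org/imageweb/>
        SELECT ?image ?caption
        WHERE {
            ?image ex:depicts ex:Trypanosome ;
                   ex:caption ?caption .
        }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["image"]["value"], "-", row["caption"]["value"])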
publication@source
With the advent of the Semantic Web, the possibility exists to extend the Web paradigm that anyone can publish to include data publication
We are entering the age of distributed data publication
Most research data will in future not be submitted to centralized databases
Rather, data will be published locally by individual research groups, by institutional repositories and by journal publishers, complete with semantically rich metadata that can be harvested and indexed
The database gives way to a distributed ‘data space’
The trick then is to create mechanisms whereby such heterogeneous distributed data can be integrated and made cross-searchable
One mechanism we are now exploring is the data web
Data integration – the lightweight data web approach
The data web is a novel concept for digital information integration involving Semantic Web technologies
The data are held locally, with metadata published on local Web servers
Separately for each data web serving a particular knowledge domain, automated lightweight software tools will be used to integrate the distributed data
separate metadata schemas will be mapped to a core ontology
instance metadata describing the distributed data will be made available for harvesting as RDF by creating a SPARQL endpoint at each resource
This overcomes syntactic and semantic differences between data providers
Resources can then be discovered by distributed SPARQL queries across the data web
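A minimal sketch of such discovery, assuming each provider exposes a SPARQL endpoint whose metadata have already been mapped to the core ontology; the endpoint URLs and vocabulary are invented, and a production service would add error handling and de-duplication.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Invented endpoints standing in for distributed data providers
    ENDPOINTS = [
        "http://publisher.example.org/sparql",
        "http://museum.example.org/sparql",
        "http://repository.example.org/sparql",
    ]

    QUERY = """
        PREFIX core: <http://example.org/imageweb/core#>
        SELECT ?image WHERE { ?image core:depicts core:Microtubule . }
    """

    def discover(endpoints, query):
        """Send the same query to every provider and merge the answers."""
        found = set()
        for url in endpoints:
            sparql = SPARQLWrapper(url)
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            for row in sparql.query().convert()["results"]["bindings"]:
                found.add(row["image"]["value"])
        return found

    print(discover(ENDPOINTS, QUERY))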
Data web services
Web 2.0 aspects of data webs
Use of the Web as the platform
Small pieces, loosely coupled
Programmatic access, giving ‘hackability’ and the right to remix
Tagging:
Data webs are predicated on a formal core ontology, but we see vital roles for user annotations to supplement formal metadata (see the tagging sketch after this list)
Trusting our users:
Data providers control their own primary image data and metadata
Data consumers are free to use the data web service in whatever way they think fit, including building secondary services, and providing annotations
The Long Tail:
Data webs enable discovery of ‘long tails’ of hard-to-find data – this is particularly true for research particulars rather than research universals
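The tagging sketch referred to above: one simple way (an assumption, not a committed design) to let free-text user tags sit alongside formal metadata is to store each tag as a small RDF record carrying its own provenance. The vocabulary is invented.

    from rdflib import BNode, Graph, Literal, Namespace

    EX = Namespace("http://example.org/imageweb/")   # invented namespace
    g = Graph()

    def add_tag(graph, image_uri, user, text):
        """Attach a free-text user tag to an image, keeping provenance."""
        tag = BNode()
        graph.add((tag, EX.annotates, image_uri))
        graph.add((tag, EX.taggedBy, Literal(user)))
        graph.add((tag, EX.tagText, Literal(text)))
        return tag

    add_tag(g, EX["image42"], "dshotton", "brush border")
    print(g.serialize(format="turtle"))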
The ImageWeb Project
Image webs are data webs for research images
We desire to integrate and make cross-searchable research images held by publishers, research organizations, museums and institutional repositories, which are currently in isolated data silos
We desire to enable these information resources
to become a more integral part of day-to-day research, and
for published images to be more fully used than at present, including combination and re-use for meta-research
The same images might be accessed by more than one data web
For example, cellular images might be accessed by one data web illustrating confocal microscopy techniques, and alternatively by another data web concerned with cancer therapy
ImageBLAST – an image web secondary service
I originally imagined that ImageWeb users would directly query the ImageWeb, and from there be led to relevant images
However, I now believe that it might be even more useful for a user to be able to click on an image within an online paper she is reading, and have semantically related images from other sources presented as a ranked list
This service would resemble the basic bioinformatics BLAST service for finding related biological sequences (http://www.ncbi.nlm.nih.gov/BLAST/)
This ‘ImageBLAST’ service would locate images related to the first image not in terms of visual appearance, but in terms of being about the same thing (see the ranking sketch below)
e.g. the same gene expressed in a different organism
or the same biological concept demonstrated in a different system
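The ranking sketch referred to above: one deliberately simple way to realise this ‘aboutness’ ranking (a stand-in for whatever semantic similarity measure a real ImageBLAST would use) is Jaccard overlap between the sets of ontology concepts attached to each image. All image names and concept sets below are invented.

    def imageblast(query_concepts, candidates):
        """Rank candidate images by concept overlap (Jaccard) with the query."""
        def score(concepts):
            return len(query_concepts & concepts) / len(query_concepts | concepts)
        return sorted(((name, round(score(c), 2)) for name, c in candidates.items()),
                      key=lambda pair: pair[1], reverse=True)

    query = {"GFP", "stem cell", "liver", "mouse"}
    candidates = {
        "imgA": {"GFP", "stem cell", "zebrafish"},
        "imgB": {"GFP", "liver", "mouse", "hepatocyte"},
        "imgC": {"microtubule", "trypanosome"},
    }
    print(imageblast(query, candidates))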
An example – transplanted GFP-labelled stem cells
Fig. 2. (A and B) Immunohistochemical staining for EGFP on livers of (A) Z/EG x Cre–into-Cre and (B) Z/EG-into-Cre transplants. (C) Immunofluorescence staining with cytokeratin (green) and Y chromosome FISH (red) in the same Z/EG-into-Cre transplant, showing the presence of a donor-derived Y-positive hepatocyte (arrow). (D and E) Immunofluorescence staining of (D) untransplanted positive control (Z/EG x Cre F1) and (E) experimental (Z/EG into Cre) epidermal sections with antibodies against EGFP (green) and cytokeratin AE1/AE3 (red). (F) Immunofluorescence staining with cytokeratin AE1/AE3 (red) and Y chromosome FISH (green), showing the presence of a donor-derived Y-positive keratinocyte (arrow) in the epidermis of a Z/EG-into-Cre transplant recipient.
Related images
How might a data web improve on conventional Web search?
It permits access to database information hidden in the ‘Deep Web’
It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio
It provides integration of information with ontological underpinning, semantic coherence, and truth propagation
It permits programmatic access, enabling secondary services to be built on top of one or more data webs
Our present objective
DW-40: data webs for frictionless interoperability between scientific publications and research datasets
References
In addition to the papers shown in my presentation itself, please find further details in:
Presentations by Philip Bourne, Anita de Waard, David Karger and David Shotton given at the Research Information Network workshop “Data Webs: new visions for research data on the Web”, 28 June 2006, available at http://www.rin.ac.uk/data-webs.
Erika Darling, Chris Newbern and Nikhil Kalghatgi (Mitre Corporation IR&D) (2005) Reducing visual clutter with semantic lenses. ESRI User Conference July 2005. http://www.themitrecorporation.org/tech/nlvis/pdf/esri_user_conference.pdf.
Anita de Waard (2006) Semantic authoring for scientific publication. Downloadable from www.cs.uu.nl/people/anita/talks/deWaardSWDays0410.pdf.
Anita de Waard and H. van Oostendorp (2005). Development of a semantic structure for scientific articles. Presented at Werkgemeenschap Informatiewetenschap, Antwerp, the Netherlands. http://www.cs.uu.nl/people/anita/papers/deWvanOWIG2710.pdf.
Anita de Waard, Leen Breure, Joost G. Kircz and Herre van Oostendorp (2006) Modeling rhetoric in scientific publications. Presented at INSCIT 2006. http://www.instac.es/inscit2006/papers/pdf/133.pdf.
Roy Thomas Fielding (2000) Architectural styles and the design of network-based software architectures. Chapter 5: Representational state transfer (REST). Ph. D. thesis. Department of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
Requirements analyses for building a data web for images: http://imageweb.zoo.ox.ac.uk/wiki/index.php/Defining_Image_Access.
Details of the ImageWeb Consortium: http://imageweb.zoo.ox.ac.uk/wiki/index.php/BioImageWeb_Consortium.
The Internet and the flow of information
What struck me after compiling that list is that it did not contain a single journal publication!
Why is this?
“The Internet treats censorship as damage, and routes around it” (John Gilmore)
Anything that impedes the free flow of information, including journals, will suffer the same fate
Unless journals adapt to provide the quality and depth of information that users require, they will become increasingly marginalized, as users go elsewhere on the Web to find it
The ImageStore Project
ImageStore: Curation requirements for legacy analogue and ‘born digital’ scientific image data
Purpose: To research the requirements for effective digital curation and re-use of scientific research images from the biological domain
Part of the Digital Curation Centre’s JISC-funded SCARP Project
To adopt a discipline-specific approach to problems of sharing, curation, re-use and preservation of data
To determine curation needs by embedding curation staff within research teams
To give the ImageStore project specific focus, we are investigating the curation requirements for four distinct types of images, two sets of historical analogue records and two sets of modern ‘born digital’ images
The history of molecular and cell biology
Molecular and cell biology began as research disciplines in the 1950s, when the combination of findings from biochemistry, biophysics and electron microscopy gave us the DNA double helix and the first visions of cell ultrastructure and function
Many of the pioneers of molecular and cell biology have now retired or are close to retirement
Their analogue data constitute our scientific cultural heritage, yet most of it will almost certainly be lost if nothing is done soon to curate and archive it
The cost of having to repeat these research observations would far outweigh the cost of preserving the original data
How much data should we save?
It is now technically possible to store as much research data as we wish
But how much is enough?
When is it right not to save data?
For electron microscopy, a good rule of thumb is that for every 1000 EM images taken, 100 will be good, 10 will be superb, and 1 or 2 will make it into print, as figures in a scientific paper
While we should be happy to discard the 900 poor negatives, what we should do with the 98 unpublished good images is a pressing question
Electron microscopy of trypanosomes
Trypanosomes are the causative agents of sleeping sickness
Hundreds of electron micrograph negatives – glass photographic plates – taken over the last 25 years by Professor Keith Gull (Dunn School of Pathology), during his life-long studies of microtubules in trypanosomes
From Broadhead et al., Flagellar motility is required for the viability of the bloodstream trypanosome. Nature 440, 224-227 (9 March 2006)
Tsetse fly
Wildlife videos
Wildlife videos of British and African mammals, including badgers and Ethiopian wolves
Created by Professor David Macdonald’s Wildlife Conservation Research Unit (Department of Zoology) over the last 20 years
There are hundreds of analogue videotapes in a variety of formats
Haydon et al., Low-coverage vaccination strategies for the conservation of endangered species. Nature 443, 692-695 (12 October 2006)
Computer simulations of the human heart
These models, created by Professor Denis Noble and colleagues (Department of Physiology), permit understanding of heart disease
They form part of the OeRC Integrative Biology e-Science Project
Both the computational models and the resulting digital videos recording the simulations are important artefacts that are shared with overseas collaborators and that require long-term curation
In situ images of gene expression
In situ images revealing the time and place of gene expression in the testes of the fruit fly, Drosophila melanogaster, are important for understanding male sterility in humans
These images are currently being acquired by my colleague Dr Helen White-Cooper (Department of Zoology), as part of a BBSRC project on which I am co-investigator
They are ‘born digital’ true-colour light microscopy images
DNA array images that quantify gene expression also form part of the data
[In situ expression images for the genes aly, cyclinB and Mst87F, and a photograph of a male fruit fly]
The end
Acknowledgement: I am indebted to Graham Klyne, with whom my data web ideas have been developed