e-mail: [email protected]
© David Shotton, 2007
David Shotton
Image BioInformatics Research Group, Oxford e-Research Centre and Department of Zoology, University of Oxford, UK
UK Electronic Information Group
Image Management in Bio- and Environmental Sciences: New Directions
John Rylands Library, University of Manchester
Thursday 31st May 2007
Research images as first-class publication objects
Outline
The nature of scientific data and image publication
What data do we actually publish?
Relationship between publications and databases
Improving journal authoring
Data lenses, semantic lenses and live journal content
Integrating distributed data
Data webs
ImageWeb
ImageBLAST
Preserving biological research images
The ImageStore Project
Characteristics of biological research data
Bottom-up data flow, lacking central control
Very large research community with diverse research topics
Highly distributed research activities and publication structures
Research data heterogeneous and largely unstructured, often with little by way of semantic mark-up
An open world, where change is as ubiquitous as consensus is elusive
Where to store research data?
Research results may represent ‘universal truths’, e.g. the sequence of a particular gene
These form bounded data sets
The data need only be discovered once
Such information is typically published in a large global bioinformatics database
Research data can also be ‘particulars’ rather than ‘universals’, for example individual assay results, microscopy images and wildlife photos
These data form unbounded data sets
Data collection will never be complete
Such image information is not (yet) widely available on-line
It is not appropriate to submit such data to centralized global databases
The data are too heterogeneous
Such activities would not scale
What data do we publish?
A scientific paper does not just report scientific observations
Rather, as Anita de Waard of Elsevier has pointed out, a scientific paper is an exercise in rhetoric, designed to convince readers of the truth of a particular scientific hypothesis or belief
The goal of the article is not to state facts, but rather to convince
Facts are selected to support the argument, and are embedded in a rhetorical structure with the purpose of persuasion
“These observations support theories that defects of the muscle plasma membrane are important for dystrophic pathogenesis.”
. . . but what about the original research data?
While selected findings that support hypotheses appear in research articles, the majority of original research data are never published
Historically, in the paper age, there was no easy method for doing this
Journals had limited space
Other publication avenues were not available
Now, in this digital age, ‘supplementary information’ can be put on-line
However, this facility is not widely used
Furthermore, such supplementary data are usually poorly structured, with insufficient metadata, and may not be discoverable by external search engines
Depositing data as supplementary information may thus be consigning them to costly data graveyards, from which resurrection is difficult
How might we improve on this situation?
(my take-home messages!!)
We need to start treating experimental research data sets as first-class publication objects, of equal value to the journal papers based upon them
We need to work towards better interoperability between papers and data
First, two examples of work in progress
Then my suggestions for new developments
Convergence between papers and databases
Philip Bourne, Editor-in-Chief of PLoS Computational Biology and Co-director of the Protein Data Bank, wrote a stimulating paper:
“Will a biological database be different from a biological journal?” PLoS Comp. Biol. 2005 1(3) e34
In this, he contends that the distinction between an on-line paper and a database is diminishing
He calls for “seamless integration” between papers reporting results and the data used to compute those results
Journal publication                        Database deposition
Author submission via the Web              Depositor submission via the Web
Syntax checking                            Syntax checking
Review by scientists & editors             Review by annotators
Corrections by author                      Corrections by depositor
Publish – Web accessible                   Release – Web accessible
Similar Processes Lead to Similar Resources
Credit: Philip Bourne
My critique of Philip Bourne’s ideas
I agree with his central analysis of the processes involved. However, this similarity of process should not blind us to essential differences in purpose
We must maintain a clear distinction between the journal publication
peer reviewed
a dated record of the authors’ view at the time of publication
while errata are permitted, the original version should be immutable
and the research database
should contain the most reliable up-to-date information
data quality is initially the responsibility of the depositor
errors subsequently discovered should be corrected by the curator
Thus “seamless integration” is not desirable
One needs to approach publications and data sets with different presuppositional spectacles – the first rhetorical, the other analytical
Researchers really want the “seams” to be very clear, not covered over
Improving the authoring process
Richard O’Beirne of Oxford Journals has stressed that, for publishers to enable their publications to be better used in the digital world, they need to expose finer-grained metadata that identifies the component pieces of papers
For images, this means figures and their legends
Such mark-up is typically present during the production phases of a paper’s publication, usually in the form of XML, but is ‘lost’ upon publication as PDF
Such metadata need to be exposed to facilitate interoperability with data resources (see the extraction sketch below)
Anita de Waard of the Elsevier Advanced Technology Group is currently developing a system, in conjunction with the editors and authors of Cell, whereby the authors are enabled to create such mark-up while writing the paper
What we need is an easy-to-use plug-in for MS-Word, accepted by all leading publishers, for the creation of suitable text mark-up at the time of authoring
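To make the figure-level granularity concrete, here is a minimal sketch (in Python) of extracting figure metadata from an article’s production XML, assuming a simplified JATS-like layout; the element and attribute names are illustrative only, not those of any particular publisher’s DTD.

    import xml.etree.ElementTree as ET

    # Simplified, JATS-like article fragment (illustrative only)
    ARTICLE_XML = """
    <article>
      <fig id="fig1">
        <label>Figure 1</label>
        <caption>Crystal structure of Polo-like Kinase 1.</caption>
        <graphic href="fig1.tif"/>
      </fig>
    </article>
    """

    def figure_metadata(xml_text):
        """Yield (id, label, caption, image file) for each figure."""
        root = ET.fromstring(xml_text)
        for fig in root.iter("fig"):
            yield (fig.get("id"),
                   fig.findtext("label"),
                   fig.findtext("caption"),
                   fig.find("graphic").get("href"))

    for record in figure_metadata(ARTICLE_XML):
        print(record)

Metadata exposed at this granularity could then be published alongside the PDF, rather than being discarded at the end of production.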
Live (or at least lively) journal content
The norm that the online version of a journal article is a PDF file is antithetical to the spirit of the Web, and ignores its great potential
PDF is an electronic embodiment of a static printed page
Rather, what we need are on-line journals that include tools to deliver renderable interactive views of otherwise static images, and interpretive ‘data lenses’ or ‘semantic lenses’ over published data, thereby enabling new levels of reader comprehension
Semantic lenses specifically provide viewpoints onto RDF data, presenting users with information from selected semantic perspectives
This will require Web delivery of information from multiple resources, involving proper integration of the published paper with research data archives
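To illustrate the idea (a sketch of the concept, not of any existing lens implementation), a semantic lens can be modelled in Python with rdflib as nothing more than a set of predicates: applying the lens presents only those RDF statements about an image that match the chosen perspective. All namespaces and predicate names here are invented.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/imageweb/")   # invented namespace

    g = Graph()
    img = EX["image42"]
    g.add((img, EX.depicts, Literal("microvilli of intestinal epithelial cells")))
    g.add((img, EX.technique, Literal("transmission electron microscopy")))
    g.add((img, EX.magnification, Literal("60,000x")))
    g.add((img, EX.copyrightHolder, Literal("University of Oxford")))

    # Each lens is simply the set of predicates visible from one perspective
    LENSES = {
        "biology":   {EX.depicts},
        "technique": {EX.technique, EX.magnification},
        "rights":    {EX.copyrightHolder},
    }

    def apply_lens(graph, subject, lens):
        """Return only the statements about subject that the lens admits."""
        return [(p, o) for p, o in graph.predicate_objects(subject) if p in lens]

    for p, o in apply_lens(g, img, LENSES["technique"]):
        print(p, o)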
A data lens showing tsunami damage
A data lens applying a high-pass filter
A data lens for image analysis
Electron micrograph showing cross sections of microvilli on the surface of intestinal epithelial cells
A live semantic lens demonstration
http://www.cc.gatech.edu/gvu/ui/sub_arctic/sub_arctic/test/sem_lens_test.html
An example from a recent issue of Biochemistry
Report of a crystallographic structure
Figure 1 from the on-line version of the paper, showing the protein structure
The PDB entry for Polo-like Kinase 1 (PDB ID 2OU7)
Interactive Jmol representation of Polo-like Kinase 1
http://molvis.sdsc.edu/fgij/fg.htm?mol=2ou7
Another example, from The Plant Cell
All the images in the paper should be clickable videos!
Fusiform bodies within the ER network of Arabidopsis stem cortical cells
http://www.brookes.ac.uk/schools/lifesci/research/molcell/hawes/gfpmoviepage.htm
Integrating distributed data
The problems of achieving semantic interoperability between distributed heterogeneous archives of digital data are well known
Previous approaches to solving the problem have involved
distributed query processing
repository federation, or
portals
All share a common reliance on mainstream technologies such as Z39.50, XML and Web Services, some of which might be considered dated or heavyweight
None have applied the Semantic Web and Web 2.0 approaches that I now wish to describe to the problems of data integration
Web and Semantic Web standards and tools
We favour the World Wide Web Consortium standards:
RDF as the standard format for sharable metadata
SPARQL as the universal query language for RDF (see the query sketch after this list)
Software such as D2R Server for abstracting RDF from relational databases in response to SPARQL queries
OWL-DL as the standard web ontology language;
and for software development and integration:
use of agile programming techniques
Ruby or Python to provide a lightweight development environment
loose coupling between the Model, View and Controller software components, based on a simple RESTful approach to component integration (Fielding 2000, Representational State Transfer)
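As a concrete illustration of these choices, here is a minimal sketch of a SPARQL query issued from Python with the SPARQLWrapper library; the endpoint URL and vocabulary are hypothetical placeholders. A D2R Server exposing a relational image database could answer such a query unchanged.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint; a D2R Server over a relational image
    # database would expose the same SPARQL interface
    endpoint = SPARQLWrapper("http://example.org/imageweb/sparql")
    endpoint.setQuery("""
        PREFIX ex: <http://example.org/imageweb/>
        SELECT ?image ?caption
        WHERE {
            ?image ex:depicts ex:Trypanosome ;
                   ex:caption ?caption .
        }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)

    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["image"]["value"], "-", row["caption"]["value"])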
publication@source
With the advent of the Semantic Web, the possibility exists to extend the Web paradigm that anyone can publish to include data publication
We are entering the age of distributed data publication
Most research data will in future not be submitted to centralized databases
Rather, data will be published locally by individual research groups, by institutional repositories and by journal publishers, complete with semantically rich metadata that can be harvested and indexed
The database gives way to a distributed ‘data space’
The trick then is to create mechanisms whereby such heterogeneous distributed data can be integrated and made cross-searchable
One mechanism we are now exploring is the data web
Data integration – the lightweight data web approach
The data web is a novel concept for digital information integration involving Semantic Web technologies
The data are held locally, with metadata published on local Web servers
Separately for each data web serving a particular knowledge domain, automated lightweight software tools will be used to integrate the distributed data
separate metadata schemas will be mapped to a core ontology
instance metadata describing the distributed data will be made available for harvesting as RDF by creating a SPARQL endpoint at each resource
This overcomes syntactic and semantic differences between data providers
Resources can then be discovered by distributed SPARQL queries across the data web
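A minimal sketch of such discovery, assuming each provider exposes a SPARQL endpoint whose metadata have already been mapped to the core ontology; the endpoint URLs and vocabulary are invented, and a production service would add error handling and de-duplication.

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Invented endpoints standing in for distributed data providers
    ENDPOINTS = [
        "http://publisher.example.org/sparql",
        "http://museum.example.org/sparql",
        "http://repository.example.org/sparql",
    ]

    QUERY = """
        PREFIX core: <http://example.org/imageweb/core#>
        SELECT ?image WHERE { ?image core:depicts core:Microtubule . }
    """

    def discover(endpoints, query):
        """Send the same query to every provider and merge the answers."""
        found = set()
        for url in endpoints:
            sparql = SPARQLWrapper(url)
            sparql.setQuery(query)
            sparql.setReturnFormat(JSON)
            for row in sparql.query().convert()["results"]["bindings"]:
                found.add(row["image"]["value"])
        return found

    print(discover(ENDPOINTS, QUERY))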
Data web services
Web 2.0 aspects of data webs
Use of the Web as the platform
Small pieces, loosely coupled
Programmatic access, giving ‘hackability’ and the right to remix
Tagging:
Data webs are predicated on a formal core ontology, but we see vital roles for user annotations to supplement formal metadata (see the tagging sketch after this list)
Trusting our users:
Data providers control their own primary image data and metadata
Data consumers are free to use the data web service in whatever way they think fit, including building secondary services, and providing annotations
The Long Tail:
Data webs enable discovery of ‘long tails’ of hard-to-find data – this is particularly true for research particulars rather than research universals
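The tagging sketch referred to above: one simple way (an assumption, not a committed design) to let free-text user tags sit alongside formal metadata is to store each tag as a small RDF record carrying its own provenance. The vocabulary is invented.

    from rdflib import BNode, Graph, Literal, Namespace

    EX = Namespace("http://example.org/imageweb/")   # invented namespace
    g = Graph()

    def add_tag(graph, image_uri, user, text):
        """Attach a free-text user tag to an image, keeping provenance."""
        tag = BNode()
        graph.add((tag, EX.annotates, image_uri))
        graph.add((tag, EX.taggedBy, Literal(user)))
        graph.add((tag, EX.tagText, Literal(text)))
        return tag

    add_tag(g, EX["image42"], "dshotton", "brush border")
    print(g.serialize(format="turtle"))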
The ImageWeb Project
Image webs are data webs for research images
We desire to integrate and make cross-searchable research images held by publishers, research organizations, museums and institutional repositories, which are currently in isolated data silos
We desire to enable these information resources
to become a more integral part of day-to-day research, and
for published images to be more fully used than at present, including combination and re-use for meta-research
The same images might be accessed by more than one data web
For example, cellular images might be accessed by one data web illustrating confocal microscopy techniques, and alternatively by another data web concerned with cancer therapy
ImageBLAST – an image web secondary service
I originally imagined that ImageWeb users would directly query the ImageWeb, and from there be led to relevant images
However, I now believe that it might be even more useful for a user to be able to click on an image within an online paper she is reading, and have semantically related images from other sources presented as a ranked list
This service would resemble the basic bioinformatics BLAST service for finding related biological sequences (http://www.ncbi.nlm.nih.gov/BLAST/)
This ‘ImageBLAST’ service would locate images related to the first image not in terms of visual appearance, but in terms of being about the same thing (see the ranking sketch below)
e.g. the same gene expressed in a different organism
or the same biological concept demonstrated in a different system
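The ranking sketch referred to above: one deliberately simple way to realise this ‘aboutness’ ranking (a stand-in for whatever semantic similarity measure a real ImageBLAST would use) is Jaccard overlap between the sets of ontology concepts attached to each image. All image names and concept sets below are invented.

    def imageblast(query_concepts, candidates):
        """Rank candidate images by concept overlap (Jaccard) with the query."""
        def score(concepts):
            return len(query_concepts & concepts) / len(query_concepts | concepts)
        return sorted(((name, round(score(c), 2)) for name, c in candidates.items()),
                      key=lambda pair: pair[1], reverse=True)

    query = {"GFP", "stem cell", "liver", "mouse"}
    candidates = {
        "imgA": {"GFP", "stem cell", "zebrafish"},
        "imgB": {"GFP", "liver", "mouse", "hepatocyte"},
        "imgC": {"microtubule", "trypanosome"},
    }
    print(imageblast(query, candidates))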
An example – transplanted GFP-labelled stem cells
Fig. 2. (A and B) Immunohistochemical staining for EGFP on livers of (A) Z/EG x Cre–into-Cre and (B) Z/EG-into-Cre transplants. (C) Immunofluorescence staining with cytokeratin (green) and Y chromosome FISH (red) in the same Z/EG-into-Cre transplant, showing the presence of a donor-derived Y-positive hepatocyte (arrow). (D and E) Immunofluorescence staining of (D) untransplanted positive control (Z/EG x Cre F1) and (E) experimental (Z/EG into Cre) epidermal sections with antibodies against EGFP (green) and cytokeratin AE1/AE3 (red). (F) Immunofluorescence staining with cytokeratin AE1/AE3 (red) and Y chromosome FISH (green), showing the presence of a donor-derived Y-positive keratinocyte (arrow) in the epidermis of a Z/EG-into-Cre transplant recipient.
Related images
How might a data web improve on conventional Web search?
It permits access to database information hidden in the ‘Deep Web’
It involves specific targeting to a particular knowledge domain, thus achieving a significantly higher signal-to-noise ratio
It provides integration of information with ontological underpinning, semantic coherence, and truth propagation
It permits programmatic access, enabling secondary services to be built on top of one or more data webs
Our present objective
DW-40: data webs for frictionless interoperability between scientific publications and research datasets
References
In addition to the papers shown in my presentation itself, please find further details in:
Presentations by Philip Bourne, Anita de Waard, David Karger and David Shotton given at the Research Information Network workshop “Data Webs: new visions for research data on the Web”, 28 June 2006, available at http://www.rin.ac.uk/data-webs.
Erika Darling, Chris Newbern and Nikhil Kalghatgi (Mitre Corporation IR&D) (2005) Reducing visual clutter with semantic lenses. ESRI User Conference July 2005. http://www.themitrecorporation.org/tech/nlvis/pdf/esri_user_conference.pdf.
Anita de Waard (2006) Semantic authoring for scientific publication. Downloadable from www.cs.uu.nl/people/anita/talks/deWaardSWDays0410.pdf.
Anita de Waard and H. van Oostendorp (2005). Development of a semantic structure for scientific articles. Presented at Werkgemeenschap Informatiewetenschap, Antwerp, the Netherlands. http://www.cs.uu.nl/people/anita/papers/deWvanOWIG2710.pdf.
Anita de Waard, Leen Breure, Joost G. Kircz and Herre van Oostendorp (2006) Modeling rhetoric in scientific publications. Presented at INSCIT 2006. http://www.instac.es/inscit2006/papers/pdf/133.pdf.
Roy Thomas Fielding (2000) Architectural styles and the design of network-based software architectures. Chapter 5: Representational state transfer (REST). Ph. D. thesis. Department of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~fielding/pubs/dissertation/top.htm
Requirements analyses for building a data web for images: http://imageweb.zoo.ox.ac.uk/wiki/index.php/Defining_Image_Access.
Details of the ImageWeb Consortium: http://imageweb.zoo.ox.ac.uk/wiki/index.php/BioImageWeb_Consortium.
The Internet and the flow of information
What struck me after compiling that list is that it did not contain a single journal publication!
Why is this?
“The Internet treats censorship as damage, and routes around it” (John Gilmore)
Anything that impedes the free flow of information, including journals, will suffer the same fate
Unless journals adapt to provide the quality and depth of information that users require, they will become increasingly marginalized, as users go elsewhere on the Web to find it
The ImageStore Project
ImageStore: Curation requirements for legacy analogue and ‘born digital’ scientific image data
Purpose: To research the requirements for effective digital curation and re-use of scientific research images from the biological domain
Part of the Digital Curation Centre’s JISC-funded SCARP Project
To adopt a discipline-specific approach to problems of sharing, curation, re-use and preservation of data
To determine curation needs by embedding curation staff within research teams
To give the ImageStore project specific focus, we are investigating the curation requirements for four distinct types of images, two sets of historical analogue records and two sets of modern ‘born digital’ images
The history of molecular and cell biology
Molecular and cell biology began as research disciplines in the 1950s, when the combination of findings from biochemistry, biophysics and electron microscopy gave us the DNA double helix and the first visions of cell ultrastructure and function
Many of the pioneers of molecular and cell biology have now retired or are close to retirement
Their analogue data constitute our scientific cultural heritage, yet most of it will almost certainly be lost if nothing is done soon to curate and archive it
The cost of having to repeat these research observations would far outweigh the cost of preserving the original data
How much data should we save?
It is now technically possible to store as much research data as we wish
But how much is enough?
When is it right not to save data?
For electron microscopy, a good rule of thumb is that for every 1000 EM images taken, 100 will be good, 10 will be superb, and 1 or 2 will make it into print, as figures in a scientific paper
While we should be happy to discard the 900 poor negatives, what we should do with the 98 unpublished good images is a pressing question
Electron microscopy of trypanosomes
Trypanosomes are the causative agents of sleeping sickness
Hundreds of electron micrograph negatives – glass photographic plates – taken over the last 25 years by Professor Keith Gull (Dunn School of Pathology), during his life-long studies of microtubules in trypanosomes
From Broadhead et al., Flagellar motility is required for the viability of the bloodstream trypanosome. Nature 440, 224-227 (9 March 2006)
Tsetse fly
Wildlife videos
Wildlife videos of British and African mammals, including badgers and Ethiopian wolves
Created by Professor David Macdonald’s Wildlife Conservation Research Unit (Department of Zoology) over the last 20 years
There are hundreds of analogue videotapes in a variety of formats
Haydon et al., Low-coverage vaccination strategies for the conservation of endangered species. Nature 443, 692-695 (12 October 2006)
Computer simulations of the human heart
These models, created by Professor Denis Noble and colleagues (Department of Physiology), permit understanding of heart disease
They form part of the OeRC Integrative Biology e-Science Project
Both the computational models and the resulting digital videos recording the simulations are important artefacts that are shared with overseas collaborators and that require long-term curation
In situ images of gene expression
In situ images revealing the time and place of gene expression in the testes of the fruit fly, Drosophila melanogaster, are important for understanding male sterility in humans
These images are currently being acquired by my colleague Dr Helen White-Cooper (Department of Zoology), as part of a BBSRC project on which I am co-investigator
They are ‘born digital’ true-colour light microscopy images
DNA array images that quantify gene expression also form part of the data
[In situ expression images for the genes aly, cyclinB and Mst87F, and a photograph of a male fruit fly]
The end
Acknowledgement: I am indebted to Graham Klyne, with whom my data web ideas have been developed