Peter Buneman
Research Director
Digital Curation Centre and
School of Informatics, University of Edinburgh
Funders:
The Research Agenda
Digital Curation Centre
a centre of expertise in data curation and preservation
2
What is Digital Curation?
• Preserving stuff?
– Librarians and archivists
– Scientists (with huge amounts of regular experimental data)
• Publishing stuff?
– Publishers of “reference” data: compendia, dictionaries, bibliographies, gazetteers, etc.
– Scientists (with lots of complex annotated data)
Both communities call themselves “curators” but at first sight they have almost orthogonal concerns
3
Their concerns look orthogonal, but…
• Shouldn’t the “publishers” be concerned about the long-term usefulness of their findings?
• The “preservers” do more than preserve – they classify and annotate.
– Shouldn’t they publish (and preserve) their own work?
As you dig deeper you find that there is a lot of commonality.
4
Curated Databases are Central
Much/most scientific data is now in databases
• They often do not contain source experimental data. Sometimes just annotation/metadata
• They borrow extensively from, and refer to, other databases
• You are now judged by your databases as well as your (paper) publications!!
• These databases are built and maintained with a great deal of human or computational effort.
What makes a database?
– It has internal structure or it changes. Size alone doesn’t qualify
5
The Research Agenda
• Data integration and publishing
– Slowly coming to market. Publishing in community formats is a new twist
• Annotation
– Everybody agrees this is important. No-one understands it.
• Metadata extraction
– Semantic or otherwise, it’s a key part of annotation
• Archiving and Appraisal
– What do we do about databases – they change!
• Legal issues
– Can we at least help to clarify what is going on?
• Provenance and data quality
– Again, we don’t fully understand it.
• Organisational dynamics of repositories
• Economic analyses of curation
• Ontologies, performance, registries, structure evolution…
6
Archiving (preserving) databases
• How do you preserve something that changes every hour or minute?
– Important for the scientific record – someone might have cited your data at time t.
• Current practice
– Create versions (how often?)
– Log changes
– Use diffs
– Do nothing (common!)
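The "use diffs" practice can be sketched as follows: keep the first version in full, store only line-level deltas between consecutive versions, and replay the deltas to recover version n. This is a minimal illustration, not a description of any particular tool; the function names and the delta format are assumptions made for the example.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the changed regions needed to turn `old` into `new`.

    Both arguments are lists of lines. Each delta entry is
    (start, end, replacement): lines old[start:end] become `replacement`.
    """
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            delta.append((i1, i2, new[j1:j2]))
    return delta


def apply_delta(old, delta):
    """Rebuild the newer version from `old` and a delta from make_delta."""
    result, position = [], 0
    for i1, i2, replacement in delta:
        result.extend(old[position:i1])   # unchanged lines before the edit
        result.extend(replacement)        # inserted or replacing lines
        position = i2                     # skip the lines that were replaced
    result.extend(old[position:])
    return result


def build_repository(versions):
    """Store version 0 in full plus one delta per later version."""
    deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]
    return versions[0], deltas


def checkout(base, deltas, n):
    """Recover version n by replaying the first n deltas over the base."""
    lines = base
    for delta in deltas[:n]:
        lines = apply_delta(lines, delta)
    return lines
```

Note the cost this design carries: recovering a late version means replaying every earlier delta, and the history of a single object is scattered across all of them.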
7
A Sequence of Versions
8
[Driscoll, Sarnak, Sleator, Tarjan: “Making Data Structures Persistent.”]
This relies on a deterministic / keyed model
Pushing time down
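One way to picture "pushing time down" under a keyed model: instead of storing a sequence of whole versions, merge them into one archive in which each keyed object carries the list of versions where it held a given value. The sketch below uses a flat Python dictionary of keyed records as a stand-in for a hierarchical archive; the data layout and function names are assumptions for illustration only.

```python
def merge_version(archive, version, t):
    """Merge version number `t` (a dict key -> value) into the archive.

    The archive maps each key to a history: a list of
    (value, [version numbers in which the key had that value]).
    Unchanged values only gain a version number, so they are stored once.
    """
    for key, value in version.items():
        history = archive.setdefault(key, [])
        if history and history[-1][0] == value:
            history[-1][1].append(t)      # same value as before: extend its timestamp
        else:
            history.append((value, [t]))  # new or changed value: new entry
    return archive


def retrieve(archive, t):
    """Recover version `t` with a single linear scan of the archive."""
    return {
        key: value
        for key, history in archive.items()
        for value, times in history
        if t in times
    }


def history_of(archive, key):
    """Temporal query on one object: every value it took, and when."""
    return archive.get(key, [])
```

A deleted object needs no special marker here: its key simply has no entry for the later version numbers.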
9
100 days of OMIM
[Figure: size (bytes × 10^6) over 100 days of OMIM. Series: version; archive; inc diff; compressed inc diff, gzip(inc diff); compressed archive, XMill(archive).]
Uncompressed
• Archive size is
– 1.01 times diff repository size
– 1.04 times size of largest version
Compressed
• Archive size is between 0.94 and 1 times compressed diff repository size
• gzip: Unix compression tool
• XMill: XML compression tool
10
The Bottom Line
• Can archive a whole year of Swissprot or OMIM with < 15% overhead (size of current file)
• Retrieval is a linear scan
• Works well with compression to less than 30% of current file. Archive is an XML file
• Archive as often as you like! (Almost)
• Works well with indexing
• Permits temporal queries on objects
11
How do we cite data?
• A URL or citation to an article is already unsatisfactory.
– DCC client complaint: “I spend a lot of time searching [electronic documents] for the part that is relevant to the citation.”
• The problem is much worse when you are citing something in a very large database.
• How do you use a citation to locate data?
• How do you ensure that the citation persists?
– Connections with DB archiving and DOIs
12
• File and directory names that contain data
/timit/train/dr1/fcjf0/sa1.wav
speaker-id: cjf0
sex: f
sentence-id: sa1
file-type: waveform
dialect-region: 1
type: training
corpus: timit
• Compound keys traditionally indicated location: BL MS Cotton Nero A.ix
Manuscript in the British Library, which used to be in the library of a Mr. Cotton [which burnt down] under a statue of Nero, top shelf, nine books along from the left.
Location is typically informative?
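The TIMIT path on this slide can be read as a compound key. The sketch below decodes such a path into the fields listed on the slide. The field layout is inferred from that single example, and the function name and lookup tables are assumptions for illustration.

```python
def parse_timit_path(path):
    """Decode a path like 'timit/train/dr1/fcjf0/sa1.wav' into key fields.

    Assumes the last five path components are
    corpus / usage / dialect region / sex + speaker id / sentence.ext,
    as in the slide's example. Any leading directories are ignored.
    """
    parts = path.strip("/").split("/")
    corpus, usage, region, speaker, filename = parts[-5:]
    sentence, extension = filename.rsplit(".", 1)
    return {
        "corpus": corpus,
        "type": {"train": "training"}.get(usage, usage),
        "dialect-region": int(region[2:]),   # 'dr1' -> 1
        "sex": speaker[0],                   # first letter of the directory name
        "speaker-id": speaker[1:],
        "sentence-id": sentence,
        "file-type": {"wav": "waveform"}.get(extension, extension),
    }
```

The location is informative, but only to a reader who already knows the convention; nothing in the path itself declares what each component means.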
13
Keys for XML
• Implicit keys are ubiquitous in scientific data formats (easily converted to XML)
• Some proposals for key specifications in XML work (DTD IDs, XML-Schema)
• “Deep citation” in digital libraries.
• Natural consequence of translating back from deterministic model to XML (node-labeled)
• Interactions with data models/formats
14
Relative keys
General form: Q{P1, ... , Pn }. Q’{P’1, ... , P’n’ } ...
Example: book{name}.chapter{number}.verse{number}
number specifies verse only within chapter
number specifies chapter only within book
Also: bible{}.book{name}.chapter{number}.verse{number}
empty key: at most one bible node
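A relative key such as book{name}.chapter{number}.verse{number} can be checked mechanically: at each level, the key values need to be unique only among siblings under the same parent. The sketch below does this with Python's standard `xml.etree.ElementTree`. It assumes key values are stored as XML attributes, and the sample document, `KEY_SPEC`, and function names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Relative key book{name}.chapter{number}.verse{number}, one (tag, key attributes) per level.
KEY_SPEC = [("book", ["name"]), ("chapter", ["number"]), ("verse", ["number"])]

SAMPLE = ET.fromstring("""
<bible>
  <book name="Genesis">
    <chapter number="1"><verse number="1">g-1-1</verse><verse number="2">g-1-2</verse></chapter>
    <chapter number="2"><verse number="1">g-2-1</verse></chapter>
  </book>
  <book name="Exodus">
    <chapter number="1"><verse number="1">e-1-1</verse></chapter>
  </book>
</bible>
""")


def check_relative_keys(parent, spec):
    """Return (tag, key) pairs that repeat among siblings under one parent.

    An empty attribute list gives the empty key (), so a second sibling
    with that tag is reported: at most one such node is allowed.
    """
    if not spec:
        return []
    tag, attributes = spec[0]
    seen, violations = set(), []
    for child in parent.findall(tag):
        key = tuple(child.get(name) for name in attributes)
        if key in seen:
            violations.append((tag, key))
        seen.add(key)
        violations.extend(check_relative_keys(child, spec[1:]))
    return violations


def resolve(root, steps):
    """Follow a key path, e.g. [('book', {'name': 'Genesis'}), ...], to one node.

    Returns None when a step matches no node or more than one node.
    """
    node = root
    for tag, key in steps:
        matches = [c for c in node.findall(tag)
                   if all(c.get(k) == v for k, v in key.items())]
        if len(matches) != 1:
            return None
        node = matches[0]
    return node
```

When the keys hold, `resolve` acts as a deep citation: the key path identifies exactly one node, independent of its position in the file.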
15
Keys and file formats
• Understanding and registering formats is only a first step
• The real issue is still integration and transformation.
• Keys and other constraints may help
Remember: structured files are databases!
16
Data exchange on the Web
All members of a community (industry) agree on a DTD and then exchange data w.r.t. it: e-commerce, health-care, ...
XML Publishing:
• mapping relational data to XML
• conforming to the predefined DTD
[Diagram labels: DB1, DB2; Q: XML view; XML; DTD; Web]
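As a rough sketch of XML publishing, the example below maps two relational tables to a nested XML view using only the Python standard library. The schema, the table and element names, and the target structure (`patients` / `patient` / `name`, `treatment`) are hypothetical; a real exchange would take the structure from the community's agreed DTD and validate the output against it, which this sketch does not do.

```python
import sqlite3
import xml.etree.ElementTree as ET


def make_example_db():
    """Build a small in-memory relational database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE treatment (patient_id INTEGER, drug TEXT);
        INSERT INTO patient VALUES (1, 'Ann'), (2, 'Bob');
        INSERT INTO treatment VALUES (1, 'aspirin'), (1, 'codeine'), (2, 'aspirin');
    """)
    return conn


def publish(conn):
    """Map the flat tables to a nested XML view: one <patient> per row,
    with that patient's treatments nested inside it."""
    root = ET.Element("patients")
    for patient_id, name in conn.execute("SELECT id, name FROM patient ORDER BY id"):
        patient = ET.SubElement(root, "patient", id=str(patient_id))
        ET.SubElement(patient, "name").text = name
        rows = conn.execute(
            "SELECT drug FROM treatment WHERE patient_id = ? ORDER BY drug",
            (patient_id,),
        )
        for (drug,) in rows:
            ET.SubElement(patient, "treatment").text = drug
    return root
```

Calling `ET.tostring(publish(make_example_db()), encoding="unicode")` serialises the view for exchange. The join that a relational query would return as repeated rows becomes nesting in the XML.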
17
Progress report on DCC research (funding period: -2 weeks)
• Four new research fellows at Edinburgh:
– Mags McGinley (legal practice) IP, copyright in databases
– James Cheney (Cornell) Programming Languages, Digital Libraries, XML compression
– Tasos Kementsietsidis (Toronto) Data integration, P2P databases
– Rajendra Bose (UCSB) Earth sciences data. “Workflow” provenance in scientific data.
• At UKOLN
– Michael Day, metadata and interoperability
• At CCLRC
– Shoaib Sufi, data portals and metadata
• At Glasgow
– Position in metadata extraction advertised
18
Progress report on DCC research (continued)
• Pleasant DCC space (thanks to Edina and Informatics) to house DCC and database group.
• Collaboration with
– biologists (EBI & Edinburgh) on data publishing and
– astronomers (Edinburgh) on XML manipulation & representation of large data sets.
• First DCC research visitor (Michael Lesk)
• Work with partners in progress on
– annotation
– DOIs
Please join us!!!
19
DCC has research positions in databases, digital curation, XML, web technology, fundamentals.
Top-rated department. World-class database group. Good connections with logical foundations, scientific DBs, distributed computation (Grid)
Edinburgh is a great place!!
Contact Peter Buneman