
The Open Access Publisher

Manage a Variety of Life Sciences Data

Matthew Cockerill

Technical Director

BioMed Central

Session id: 40363

BioMed Central and Oracle

BioMed Central is an Open Access publisher of biomedical research

Oracle database technology used to deliver a cost-effective online publishing solution

Goals

– Make the publishing process more efficient through online tools and automation

– Increase accessibility of research by removing subscription barriers

Key technologies used

Oracle technology used by BioMed Central

– XML DB

– Oracle interMedia

– Java

– Real Application Clusters

– Data Guard

– Oracle Text

BioMed Central’s database

– 70 gigabytes of data (and growing rapidly)

– Lots of traditional relational data (e.g. 250,000 registered users)

– Also serves as a repository for images, movies, PDFs and other rich media

Why Open Access?

Subscription-only access to scientific research is a legacy of the economics of print

Scientists do all the hard work

– performing the research

– writing up the article

– acting as peer reviewers

– acting as journal editors

Traditional publishers take ownership of the copyright and sell limited access back to the scientific community

In the age of the web that makes no sense for science

Open Access publishers make research freely accessible and redistributable by scientists

Benefits of Open Access

Research instantly accessible to the entire scientific community

Digital permanence (many copies)

A route off the subscriptions treadmill

– Subscriptions to traditional journals have increased at 10-15% per annum

Data mining

Grid computing

Tony Blair

“[The] national e-science grid … intends to make access to computing power, scientific data repositories and experimental facilities as easy as the web makes access to information.”

- Tony Blair, May 2002

The Open Access movement

Public Library of Science

– New not-for-profit publisher formed by a group of scientists

– Has received $9m from the Gordon and Betty Moore Foundation to start new Open Access journals

Soros Foundation

– Has provided $3m to support Open Access publishing in developing and transitional countries

Sabo bill

– Congressman Martin Sabo recently introduced the Public Access to Science Act in Congress

– If passed it would ensure that all US federally funded research would be published with Open Access

Business model

Minimize costs via smart use of technology

Cover costs of publication via Article Processing Charge ($500)

Free submission for authors from institutions that become members

More than 300 institutional members already

– e.g. US National Institutes of Health, the National Health Service of England, all campuses of the University of California, Harvard, Yale, Princeton, and every academic institution in the UK

Economic benefits

Cost per article access for Elsevier journals estimated at approx US $11 (Washington State University estimate)

A typical BioMed Central open-access research article receives 2000+ full text accesses in the 2 years following publication, and a similar number of accesses again via mirrors (e.g. PubMed Central)

Cost to the scientific community = $500. So cost per access: $500/(2000*2) = $0.125

Cost per access reduced almost 100-fold!
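The arithmetic behind this comparison can be reproduced in a few lines. This is only a sketch using the figures quoted on the slide; the variable names are illustrative and not part of the original presentation.

```python
# Figures quoted on the slide; the variable names are illustrative.
article_processing_charge = 500.0      # USD, paid once per article
direct_accesses = 2000                 # full text accesses in the 2 years after publication
mirror_factor = 2                      # a similar number of accesses again via mirrors
subscription_cost_per_access = 11.0    # USD, Washington State University estimate

total_accesses = direct_accesses * mirror_factor
open_access_cost_per_access = article_processing_charge / total_accesses
reduction_factor = subscription_cost_per_access / open_access_cost_per_access

print(f"Open access cost per access: ${open_access_cost_per_access:.3f}")
print(f"Reduction factor: {reduction_factor:.0f}x")
```

With these inputs the factor comes out at 88, which is the basis for "almost 100-fold".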

BioMed Central architecture

Oracle9i Database

– Stores relational data (e.g. user registration info)

– Also acts as repository for media files associated with submitted manuscripts and published articles

Web server farm

– Runs many different journal websites, all driven by the same Oracle database

– Extensive use of Java and XSLT

– Media content streamed from the database using servlets
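The idea of streaming media out of the database, rather than loading a whole file into memory, can be sketched as follows. This is not BioMed Central's servlet code: it swaps in Python and an in-memory SQLite database for Java and Oracle, and the `media` table, its columns, and the `stream_media` function are all hypothetical names chosen for illustration.

```python
import sqlite3
from typing import Iterator

CHUNK_SIZE = 4  # deliberately tiny for the demo; a real server would use kilobytes


def stream_media(conn: sqlite3.Connection, media_id: int,
                 chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield a stored file piece by piece instead of reading it whole."""
    offset = 1  # SQL substr() positions are 1-based
    while True:
        row = conn.execute(
            "SELECT substr(content, ?, ?) FROM media WHERE id = ?",
            (offset, chunk_size, media_id),
        ).fetchone()
        # Stop when the record is missing or we have read past the end.
        if row is None or not row[0]:
            break
        yield row[0]
        offset += chunk_size


# Hypothetical schema: one row per stored file.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE media (id INTEGER PRIMARY KEY, mime_type TEXT, content BLOB)"
)
conn.execute(
    "INSERT INTO media VALUES (1, 'application/pdf', ?)",
    (b"%PDF-1.4 demo bytes",),
)

# A web handler would write each chunk to the HTTP response as it arrives.
chunks = list(stream_media(conn, 1))
```

Each chunk is fetched with its own query, so memory use stays bounded by the chunk size regardless of how large the stored file is.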


Key Oracle Technologies used by BioMed Central

Real Application Clusters

Data Guard

Oracle Text

XML DB

Oracle interMedia

Java


Importance of high availability

Science is a global enterprise, so BioMed Central’s websites are busy 24 hours a day

Scientists entrust their research and reputation to us, so they must have confidence that their research will be available

Major institutional customers demand high reliability

BioMed Central delivers high availability using a combination of RAC and Data Guard

Real Application Clusters

BioMed Central was one of the first organizations in the UK to deploy 9i RAC

Main database runs on a pair of dual CPU Sun Fire V480 servers

Delivers high availability in the event of single node failure

However, Oracle upgrades and patches currently require downtime (for now!)

Data Guard

BioMed Central uses Data Guard to maintain a standby database

Standby database kept up to date by automated application of log files

Standby database can be used for reporting (in read-only mode)

If a prolonged outage of the live database occurs (planned or unplanned), the standby database can be activated

Data Guard makes it easy to roll back to the live configuration after planned outages

RAC/Data Guard configuration

[Diagram labels: RAC cluster, standby DB (Data Guard), web server farm, main hosting location, standby location, reporting, log files]



Use of Oracle Text

High performance full text article search

Key benefits

– Ease of maintenance (incremental online indexing)

– Structured searching of XML

– XPath support

– Unicode aware (smart base-character indexing)

– Filter procedures can be used to transform XML to be indexed

Structured search

XPath search

Prior to Oracle9i Database Release 2, relatively basic field restrictions based on XML tags were possible

Complex nesting of tags, or specific attribute values were difficult or impossible to search for

Oracle9i Database Release 2 support for XPath field restrictions takes XML searching to another level

Now possible to search for all XML articles that contain a certain path (HASPATH), or that match a certain text expression at that path (INPATH)

XPath example

Article metadata identifying a series of related articles

<meta>
  <classifications>
    <classification type="BMC" subtype="review_series_title" id="ar-cell-cell">Cell-cell interactions in synovitis</classification>
  </classifications>
</meta>

SQL syntax to retrieve all articles in that review series

SELECT ARX_ID
FROM ARX
WHERE CONTAINS (ARX_FULL, 'HASPATH (//classification[@type="BMC" AND @subtype="review_series_title" AND @id="ar-cell-cell"])') > 0;
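To see what the path predicate is testing, the same check can be run outside the database. The sketch below uses Python's standard `xml.etree.ElementTree` rather than Oracle Text, so the path syntax differs slightly (chained predicates instead of AND), and the `has_path` helper is a hypothetical name, not an Oracle function.

```python
import xml.etree.ElementTree as ET

# The article metadata from the slide.
article_meta = """
<meta>
  <classifications>
    <classification type="BMC" subtype="review_series_title"
                    id="ar-cell-cell">Cell-cell interactions in synovitis</classification>
  </classifications>
</meta>
"""

# ElementTree's limited XPath dialect expresses "attribute A and attribute B"
# as chained predicates.
REVIEW_SERIES_PATH = (
    ".//classification[@type='BMC']"
    "[@subtype='review_series_title'][@id='ar-cell-cell']"
)


def has_path(xml_text: str, path: str) -> bool:
    """Return True if at least one element in the document matches the path."""
    root = ET.fromstring(xml_text)
    return root.find(path) is not None


in_review_series = has_path(article_meta, REVIEW_SERIES_PATH)
```

The difference in the database is that the index answers this question for every article at once, without parsing each document at query time.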

Smart handling of Unicode
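The base-character indexing mentioned among the Oracle Text benefits can be illustrated with a rough sketch: accented letters are reduced to their base letters so that a query typed without diacritics still matches. This is an approximation of the concept using Python's standard `unicodedata` module, not Oracle Text's implementation; the function names and the case folding step are assumptions made for the example.

```python
import unicodedata


def base_characters(text: str) -> str:
    """Reduce text to lower-case base letters by dropping combining marks."""
    # NFD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()


def matches(query: str, indexed_term: str) -> bool:
    """Compare a query against an indexed term on base characters only."""
    return base_characters(query) == base_characters(indexed_term)


# A search typed on a plain keyboard still finds the accented name.
found = matches("Sjogren", "Sjögren")
```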


XML DB

Oracle’s support for XML standards in the database allows BioMed Central to manage article XML data within the database

Examples of use

– Re-validate article XML against DTD after any update

– Application of XSLT transformations within database (e.g. as a pre-indexing filter)

Article XML (pre-transform)

<bibl>
  <title>Genetic variability in MCF-7 sublines</title>
  <aug>
    <au id="A1">
      <snm>Nugoli</snm>
      <fnm>Melanie</fnm>
      <mi>JK</mi>
      <email>[email protected]</email>
    </au>
    <au id="A2">
      <snm>Chuchana</snm>
      <fnm>Paul</fnm>
      <email>[email protected]</email>
    </au>
  </aug>
  <source>BMC Medical Research Methodology</source>
  …
</bibl>

Article XML (post-transform)

<bibl>
  <title>Genetic variability in MCF-7 sublines</title>
  <aug>
    <au id="A1">
      <snm>Nugoli</snm>
      <fnm>Melanie</fnm>
      <mi>JK</mi>
      <bnm>Nugoli_MJK</bnm>
      <email>[email protected]</email>
    </au>
    <au id="A2">
      <snm>Chuchana</snm>
      <fnm>Paul</fnm>
      <bnm>Chuchana_P</bnm>
      <email>[email protected]</email>
    </au>
  </aug>
  <source>
    <sourcefull>BMC Medical Research Methodology</sourcefull>
    <sourceabbr>BMC Med Res Methodol</sourceabbr>
  </source>
  …
</bibl>
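The effect of the pre-indexing transform can be sketched in a few lines. BioMed Central applied XSLT inside the database; the version below swaps in Python's `xml.etree.ElementTree` purely to show the before and after. The rule for building `<bnm>` (surname, underscore, first initial, middle initials) is inferred from the two authors on the slide, the one-entry abbreviation table is hypothetical, and the email elements are left out because they are not legible in this transcript.

```python
import xml.etree.ElementTree as ET

pre_transform = """
<bibl>
  <title>Genetic variability in MCF-7 sublines</title>
  <aug>
    <au id="A1"><snm>Nugoli</snm><fnm>Melanie</fnm><mi>JK</mi></au>
    <au id="A2"><snm>Chuchana</snm><fnm>Paul</fnm></au>
  </aug>
  <source>BMC Medical Research Methodology</source>
</bibl>
"""

NAME_PARTS = ("snm", "fnm", "mi")

# Hypothetical lookup table; the single entry is the pair shown on the slide.
JOURNAL_ABBREVIATIONS = {
    "BMC Medical Research Methodology": "BMC Med Res Methodol",
}


def add_bnm(bibl: ET.Element) -> None:
    """Add a <bnm> element to each author, after the existing name parts."""
    for au in bibl.findall(".//au"):
        surname = au.findtext("snm", default="")
        first_name = au.findtext("fnm", default="")
        middle_initials = au.findtext("mi", default="")
        bnm = ET.Element("bnm")
        # Assumed rule: surname + "_" + first initial + middle initials.
        bnm.text = f"{surname}_{first_name[:1]}{middle_initials}"
        position = 0
        for index, child in enumerate(au):
            if child.tag in NAME_PARTS:
                position = index + 1
        au.insert(position, bnm)


def expand_source(bibl: ET.Element) -> None:
    """Split <source> into <sourcefull> and, when known, <sourceabbr>."""
    source = bibl.find("source")
    if source is None or source.text is None:
        return
    full_title = source.text.strip()
    source.text = None
    ET.SubElement(source, "sourcefull").text = full_title
    abbreviation = JOURNAL_ABBREVIATIONS.get(full_title)
    if abbreviation is not None:
        ET.SubElement(source, "sourceabbr").text = abbreviation


transformed = ET.fromstring(pre_transform)
add_bnm(transformed)
expand_source(transformed)
post_transform = ET.tostring(transformed, encoding="unicode")
```

Adding these derived elements before indexing means a search on an author name or on either journal title form can be answered directly from the indexed XML.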

XML in action: Faculty of 1000

Literature awareness service for scientists

More than 1000 experts submit evaluations of the best new scientific research (via the web)

Evaluations rank articles by level of interest, and classify them by type and by subject

Faculty of 1000 website digests this info into a listing of ‘hot articles’

XML use is critical to performance

Faculty of 1000 - typical article

XML improves performance of Faculty of 1000

Navigating deeply relational data can be slow

Web application data is searched and accessed frequently, but changes relatively rarely

Solution:

Use Oracle triggers to regenerate an XML summary column for each live record, whenever data affecting that record changes

This kills two birds with one stone

– Structure of XML tuned to allow any required search/browse to be done efficiently as a pure Oracle Text query

– XML summary can easily be converted to HTML using XSLT for display on the web
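The trigger-maintained summary column can be sketched with a small stand-in. The example below uses SQLite through Python's `sqlite3` module instead of Oracle, and the `article` and `evaluation` tables, their columns, and the rating values are hypothetical. It handles only inserts and does no XML escaping, both of which a real implementation would need to cover.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE article (
    id          INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    summary_xml TEXT              -- denormalized summary kept current by the trigger
);
CREATE TABLE evaluation (
    id         INTEGER PRIMARY KEY,
    article_id INTEGER NOT NULL REFERENCES article(id),
    rating     TEXT NOT NULL
);
-- Whenever an evaluation is added, rebuild the summary of the affected article.
CREATE TRIGGER evaluation_added AFTER INSERT ON evaluation
BEGIN
    UPDATE article
       SET summary_xml =
           '<article id="' || id || '"><title>' || title || '</title>'
           || (SELECT group_concat('<rating>' || rating || '</rating>', '')
                 FROM evaluation
                WHERE article_id = NEW.article_id)
           || '</article>'
     WHERE id = NEW.article_id;
END;
""")

conn.execute(
    "INSERT INTO article (id, title) "
    "VALUES (1, 'Genetic variability in MCF-7 sublines')"
)
conn.execute("INSERT INTO evaluation (article_id, rating) VALUES (1, 'Recommended')")
conn.execute("INSERT INTO evaluation (article_id, rating) VALUES (1, 'Exceptional')")

# Reads never join the tables: they fetch the precomputed summary.
summary_xml = conn.execute(
    "SELECT summary_xml FROM article WHERE id = 1"
).fetchone()[0]
summary = ET.fromstring(summary_xml)
ratings = sorted(r.text for r in summary.findall("rating"))
```

The cost of assembling the summary is paid once per write, which suits data that is read far more often than it changes.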


interMedia: Oracle as a media repository

Manuscript submission and workflow involves a complex interplay of files and metadata

Storing files directly in the database as BLOBs makes their management and manipulation much simpler

interMedia provides a powerful set of tools to work with images in the database

– Extracting image metadata– Scaling/cropping/format conversion

Full text article

Figure streamed from db

PDF streamed from database

Processing submitted files

Using interMedia to manipulate images


Java in the database

Java stored procedures offer flexibility

Facilitate transport of code between tiers

Allow use of standard Java libraries

One example:

– Oracle’s original XSLT implementation performed poorly with BioMed Central’s large article XML files, and this was rate limiting for our article XML indexing/filtering process

– Thanks to the JVM built into the database, we were able to make use of the open source XSLTC implementation

– End result was that we cut re-index time in half

Find out more…

See a demonstration

– BioMed Central’s use of Oracle interMedia technology is being demonstrated on the conference floor

Or take a look for yourself

– http://www.biomedcentral.com/

– http://www.facultyof1000.com/


Q&A