www.sanger.ac.uk
The Wellcome Trust Sanger Institute
Using Oracle interMedia for Large Scale Image Storage
Martin Widlake, Database Manager
Tony Webb, Real DBA
The Wellcome Trust Sanger Institute
[email protected]
Abstract
• The Wellcome Trust Sanger Institute has several projects under development which require the handling and annotation of very large quantities of image data. With Oracle Corp, we are implementing an Oracle 10G solution using the interMedia functionality, which has been extensively developed for 10G and continues to be enhanced.
This talk will cover WTSI’s need for image handling, the proof of concept and the system being implemented now. It also covers the planned development of both the system and interMedia.
The presentation will conclude with a more technical overview of what interMedia is capable of and will be followed by a workshop.
Format
• Martin Widlake from WTSI will introduce the talk and explain why we need to handle so much image data and why we are using Oracle interMedia.
• Tony Webb from WTSI will explain how we implemented the proof-of-concept, the issues we hit and the plans for the Gene Atlas project.
• Tony will then cover the current plans and the development areas being looked at with Oracle.
• Melli Annamalai from Oracle will then present the general technical capabilities of Oracle 10G interMedia.
• Following the presentation, Melli will host a workshop.
• The Sanger Institute is a research centre funded primarily by the Wellcome Trust. WTSI is located at Hinxton Hall, Cambridge (UK) in 55 acres of parkland.
• Founded in 1993. WTSI currently has over 800 staff members, of whom about 170 are IT and 430 are scientific, although the boundary can be rather blurred.
• Our purpose is to further the knowledge of the biology of organisms, particularly through large scale sequencing, analysis of their genomes and post-genomic large-scale studies.
• Our lead project has been to sequence a third of the human genome as part of the international Human Genome Project.
June 2000
“In the last decade of the twentieth century, scientists from around the world initiated one of the most significant scientific projects of all time: to determine the DNA sequence of the entire human genome, the human genetic blueprint.”
Joint statement, Clinton and Blair
Photo credits: Bill Clinton, photo Rod Edmonds; Tony Blair, photo Christine Nesbitt; Sir John Sulston, photo Matthew Fearn
Why the Wellcome Trust Sanger Institute needs to handle so much image data
Gene Atlas Project
• Develop a set of antibodies to every known gene in the mouse genome.
• Link the antibodies to dye molecules in order to stain cytological samples.
• Identify where the gene is expressed in the mouse embryo during the development process.
• Create an atlas of where each gene is expressed within the developing mouse, when, and the level of expression.
• An information resource available to the world scientific community.
Gene Atlas Project
Immunostaining
Use antibodies to detect any of the 30k genes.
Highlight where the proteins (and thus genes) are expressed
Not just where, but when.
Gene Atlas Project
• Images are composed of camera frames that may overlap
[Figure: four overlapping camera frames, numbered 1 to 4, that together make up one image]
Gene Atlas Project: Data Volumes
[Workflow diagram. Stages shown: antigen selection; creation of protein expression constructs; protein production; antibody generation and selection; tissue collection; generation of tissue arrays; high throughput ICC and image capture; image analysis and annotation.]
[Throughput labels on the diagram: 100/month, 1000/month, 1000/month, 500,000/month!]
Gene Atlas Project
Tissue microarray cores (1 mm) = 6 MB x 400 = 2.4 GB per slide; x 60 slides = 144 GB
Tissue sections (10 x 15 mm) = 900 MB x 6 sections = 5.4 GB per slide; x 60 slides = 324 GB
…so 5 x 10^6 files of ~6 MB = ~30 x 10^12 bytes
= ~30 TB
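The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes decimal units (1 GB = 1000 MB, 1 TB = 10^12 bytes), and the variable names are ours for illustration. Note that 6 sections of 900 MB come to 5.4 GB per slide, which is the figure consistent with the 324 GB total for 60 slides.

```python
# Back-of-envelope check of the Gene Atlas storage figures.
# Assumption (ours): decimal units, i.e. 1 GB = 1000 MB and 1 TB = 10**12 bytes.
MB = 10**6
GB = 10**9
TB = 10**12

SLIDES = 60  # number of slides used in the estimate on the slide

# Tissue microarray: 400 cores per slide, ~6 MB per core image.
core_per_slide = 400 * 6 * MB
core_total = core_per_slide * SLIDES

# Tissue sections: 6 sections per slide, ~900 MB per section image.
section_per_slide = 6 * 900 * MB
section_total = section_per_slide * SLIDES

# Whole project: ~5 million files of ~6 MB each.
project_total = 5 * 10**6 * 6 * MB

print(f"microarray cores: {core_per_slide / GB:.1f} GB per slide, {core_total / GB:.0f} GB in total")
print(f"tissue sections:  {section_per_slide / GB:.1f} GB per slide, {section_total / GB:.0f} GB in total")
print(f"project estimate: {project_total / TB:.0f} TB")
```

Changing the per-image sizes or counts at the top is enough to see how sensitive the ~30 TB figure is to each assumption.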
Image Handling – GeneTrap Project
• Interrupt known genes by inserting a new piece of DNA into the gene.
• Grow tissue samples with interrupted genes to confluence and take an image of the cell culture.
• Also, during the cell growth process, images of the growing colonies are taken each day to assess colony size and when the culture sample can be taken.
• Keep a record of the plate images to assess culture growth and where satellite colonies are occurring.
• 2-3 TB of images will be gathered and stored for the usual, scientific “period of time”.
Image analysis e.g. AD0697
Low power (2.5 x) and high power (20 x)
• level of expression
• subcellular localisation
• cell type specificity
• proportion of staining cells (e.g. 15%)
X-Gal staining shows that trapped gene expression is variable
Wellscan names cell lines
AD0333 AD0334 AD0335 AD0336 AD0337 AD0338 AD0339 AD0340
AD0341 AD0342 AD0343 AD0344 AD0345 AD0346 AD0347 AD0348
AD0365 AD0366 AD0367 AD0368 AD0369 AD0370 AD0371 AD0372
AD0373 AD0374 AD0375 AD0376 AD0377 AD0378 AD0379 AD0380
AD0349 AD0350 AD0351 AD0352 AD0353 AD0354 AD0355 AD0356
AD0357 AD0358 AD0359 AD0360 AD0361 AD0362 AD0363 AD0364
Image Handling – GeneTrap Project
• 2-3TB of images will be generated by the automated image capturing system.
• A further system will need to capture images of colonies growing on a plate to judge when the colonies are mature.
• The captured and image-processed data will need to link to the LIMS systems we build using Oracle.
• The images also need to link to data held in the MySQL databases that the project team brought with them to the institute.
• Generate a resource for all 30,000 mouse genes, with samples available for each gene that can be ordered and used by any research group.
Image Handling - C. elegans Phenotyping
• C. elegans are small nematode worms used as model organisms in studying development.
• Mutations alter the manner in which C. elegans either develops or grows.
• Images are captured via standard light microscopy and then processed by in-house algorithms.
• Relatively simple image processing allows generation of numerical phenotypic data.
• Now that the system works, the project team want to keep the images. There will be lots of them…
Image Requirements at WTSI
• The Gene Trap project grows tissue to confluence and wishes to keep the images as a reference: 3 TB.
• Gene Atlas will produce 20-plus TB of image data over the next two years.
• C. elegans phenotype project: modest data requirements but interesting image processing.
• The images need to be linked into our LIMS, sample storage, robotics systems and gene variation databases, all developed in Oracle.
• Systems need to hold large data volumes, be protected against any data loss and potentially allow large numbers of interactive users.
Why use Oracle interMedia?
• We could have handled the images in the file system with gatekeeper software, but backup and control become major issues.
• We want multi-user, interactive access.
• We already store massive data volumes in Oracle, and our LIMS is in Oracle… maybe use BLOBs.
• We were already testing the Oracle 10 Beta, and Oracle Corp came and asked us if we had any interest in images…
• We altered the Beta test program to include the interMedia suite.
And they get bigger… Development, Genes and Tissue Patterns
• Will use a similar system to Gene Atlas but with much larger samples.
• Images of 10-100 GB made up of many tiles.
• Multiple slices through one sample, say 20 through a mouse brain.
• Repeat for 30,000 genes, for 100s of stages of development.
• Predicted data volume?
1.6 petabytes
Implementing interMedia
Tony Webb
Senior DBA at The Wellcome Trust Sanger Institute
[email protected]
The Pilot Scheme
• The Project chosen as the pilot was The Gene ATLAS Project headed up by Dr. Gareth Maslen.
• O/S chosen was RedHat Linux Advanced Server (appropriately patched).
• Initial hardware used was a DL380 Dual Xeon CPU server with 4 Gigs of RAM and 100 Gigs of Enterprise Virtual Array (EVA) disk space.
The Pilot Scheme
• Recently we moved onto non-BETA releases of interMedia and Oracle (Oracle 10.1.0.2) running on new hardware.
• Earlier this month the original server was decommissioned. This talk, however, covers elements from both environments.
• One of the new machines is a cluster pair of DL380s (atlasdb1a and atlasdb1b) still using 4 Gig of RAM and running OCFS on RedHat Linux Advanced Server.
…
The Pilot Scheme
• Disk setup on the new cluster is very different from the original machine.
• There is more of it: terabytes, not gigabytes!
• Disk space has changed to include a mix of the EVA store (expensive and fast SAN storage) and Bladestore (cheap IDE disks).
• Bladestore is fine for backups, bulk writes, bulk reads etc., although it should be used with caution for datafiles.
• OCFS has been installed as a prerequisite for RAC although we are not currently using RAC on this machine.
Getting The Data into interMedia
• This Is What Currently Happens…
[Diagram. Labels shown: Image capture; Workstation; Image Files (AITIFF); XML Creation; XML Files; SQLServer DB; SQLServer2000 Cluster; SAN; Linux Cluster; Image Import; interMedia write; Oracle (HSV); Oracle (Bladestore).]
Installing interMedia
• For the original environment, the database server software was downloaded from the Oracle beta tester website and installed without errors. Two databases (BETA and TBETA) were then created.
• Installation of the interMedia software was a little troublesome. This was largely attributable to user (or rather DBA!) error.
• Installation on the new environment was, however, trouble free, thanks largely to use of the Database Configuration Assistant (dbca).
Installing interMedia
• Installation via dbca is strongly recommended
Installing interMedia
• interMedia installation should be verified by running the following script:
@<ORACLE_HOME>/ord/im/admin/imchk.sql
Component                                          Status
-------------------------------------------------- ---------
COLORFREQUENCIESLIST Public Grant                  VALID
COLORFREQUENCIESLIST Varray                        VALID
……
Adding Go-faster Stripes
• The interMedia Image Accelerators are not technically essential but we would certainly recommend their use.
• Installing them dramatically improved the performance of one Java program, specifically its use of:
OrdImage.processCopy(String, OrdImage)
Adding Go-faster Stripes
• Software installation in general is much easier, partly due to a slicker installation process.
Database Considerations
• Tablespace organisation needs careful consideration (BIGFILEs; disk types).
• DB initialisation parameters should be checked against recommendations in the interMedia User Guide (db_nk_block_size; shared_pool_size).
• How are backups going to be implemented? (OS; RMAN).
• Table creation also needs to be fully understood (in-line storage; chunk size; LOB location)
• Will partitioning be used? (Range; Hash; List). This is not an interMedia-specific consideration.
interMedia Demos
• Additional software from the OTN website, http://otn.oracle.com, was also downloaded and installed.
e.g. Imgdemo.c
The Photo Album Servlet
• Another example - The PhotoAlbum Java Servlet
The Java Environment
• Sample code was ‘tinkered with’ and recompiled into class files, e.g. $ORACLE_HOME/oc4j/j2ee/home/default-web-app/WEB-INF/classes/MelliPhotoRequest.class
• There were some initial problems setting up the environment to run Java, but nothing specific was required for interMedia beyond changes to CLASSPATH.
• CLASSPATH:$ORACLE_HOME/ord/jlib/ordhttp.jar:$ORACLE_HOME/ord/jlib/ordim.jar:$ORACLE_HOME/jdbc/lib/classes12.jar:$ORACLE_HOME/sqlj/lib/runtime12.jar:$ORACLE_HOME/oc4j/j2ee/home/lib/servlet.jar:/usr/opt/j2sdk1.4.2_04/lib/rt.jar:$ORACLE_HOME/jdbc/lib/classes12.jar
Prototyping
Going Off At A Tangent..
• But what Does interMedia Data Look Like?
CREATE TABLE PHOTOS
( ID NUMBER NOT NULL,
  DESCRIPTION VARCHAR2(40) NOT NULL,
  LOCATION VARCHAR2(40),
  IMAGE ORDSYS.ORDIMAGE,
  THUMB ORDSYS.ORDIMAGE,
  DISPLAYORDER VARCHAR2(2),
  UPLOADTIMESTAMP TIMESTAMP(9))
Going Off At A Tangent..
• So what is ORDSYS.ORDIMAGE?
• It is an object type (database type).
• Owned by database user ORDSYS.
• Has attributes and methods.
• Attributes are:
  source (type: ORDSource)
  height, width and contentLength (type: INTEGER)
  fileFormat, contentFormat, compressionFormat and mimeType (type: VARCHAR2(4000))
• Methods include: setProperties, checkProperties, getHeight, getFileFormat.
Going Off At A Tangent..
…and what is ORDSYS.ORDSOURCE?
• It is another object type.
• Also owned by database user ORDSYS.
• Again, it has attributes and methods.
• Attributes are:
  localData (type: BLOB)
  srcType, srcLocation, srcName (type: VARCHAR2(4000))
  updateTime (type: DATE)
  local (type: NUMBER)
• Methods include: getSourceInformation, import, export.
What Else Besides interMedia?
• Tools Currently Being Used By Our Developers include:
• Eclipse (Java IDE)
• Apache Webserver
• Tomcat (Servlet Container)
• OC4J (Oracle Containers for Java)
• Xerces (parses XML files -> export files)
• JDeveloper
• PL/SQL
• SQL*Loader
Implementing interMedia
• File Formats Tested In The Pilot:
• TIFF
• AITIFF (3rd Party Variant of TIFF)
• JPEG
• JPEG2000 (although not currently an interMedia datatype)
Implementing interMedia
“Warts And All”
• <Metalink Bug:3292143> JAVAVM STILL DOESN'T RELEASE MEMORY IN SOME CIRCUMSTANCES WITHIN A SESSION. Seems to be related to specific images but once it happens all subsequent images in that java session will fail.
• Cursors are not being closed when running under JDeveloper and OC4J. The same code does not error in Tomcat.
Future Plans for ATLAS
• Overall tests have been encouraging.
• Live system to be created in July with a 400 Gig data load followed by Web Deployment from August?
• Will it scale? Need to investigate partitioning and backup options.
• Probable use of XMLDB, RAC and Oracle Application Server.
• More Automation for ATLAS (SQLServer replaced by Oracle?).
• Working with Oracle to develop new interMedia functionality.
Future Plans For interMedia
• The next few slides cover some of the features (in no particular order) that we would like to see incorporated into interMedia. We are working with Oracle in these areas.
Future Plans: Image Stitching
• Automated microscopy systems produce tiles of pictures, often several tiles per whole slide.
• Image Stitching is not a simple process and different microscopy systems will implement different algorithms or may not do it at all.
• Carrying out the stitching with one piece of code gives consistency.
• RDBMS products do not support this today.
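To make the idea of stitching concrete, here is a deliberately naive Python sketch. It assumes the position of every tile is already known and simply averages pixels where tiles overlap. It is not how interMedia or any particular microscopy system does it (real stitching also has to register the tiles against each other), and the function and variable names are hypothetical.

```python
from typing import List, Tuple

Tile = List[List[int]]  # grey-scale pixel values, row-major


def stitch(tiles: List[Tuple[int, int, Tile]]) -> Tile:
    """Paste tiles onto one canvas at known (top, left) offsets.

    Where tiles overlap, the result is the integer mean of all
    contributing pixels. Pixels covered by no tile are left at 0.
    """
    height = max(top + len(tile) for top, _, tile in tiles)
    width = max(left + len(tile[0]) for _, left, tile in tiles)
    total = [[0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for top, left, tile in tiles:
        for r, row in enumerate(tile):
            for c, value in enumerate(row):
                total[top + r][left + c] += value
                count[top + r][left + c] += 1
    return [
        [total[r][c] // count[r][c] if count[r][c] else 0 for c in range(width)]
        for r in range(height)
    ]


# Two 2x3 camera frames that overlap by one pixel column.
frame1 = [[10, 10, 10], [10, 10, 10]]
frame2 = [[30, 30, 30], [30, 30, 30]]
mosaic = stitch([(0, 0, frame1), (0, 2, frame2)])
for row in mosaic:
    print(row)
```

Even in this toy version the choice of blending rule changes the pixels in the overlap, which is why using one piece of code for all stitching gives more consistent results.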
Future Plans: Image Stitching Requirements
Oracle is currently collecting requirements:
• Stitching: Done by the RDBMS or externally.
– It’s more flexible if the same algorithm is available both ways.
– A single algorithm limits the artifacts introduced in the final image, and any auto-annotation stands a good chance of not annotating them!
• Storage options: store tiles and stitch in real-time, or stitch and store the final image
• Annotations: Support for stitched images
Future Plans: Metadata Requirements
• As seen earlier, Oracle interMedia associates physically descriptive metadata with images, e.g., size, creation date, etc.
• interMedia updates the associated metadata when an image format transformation occurs.
Requirements:
• Metadata association: When images are stitched or cut, the association should be maintained.
• User-defined metadata support: extend interMedia metadata to include user-defined attributes.
Future Plans: Layers Requirements
• Layers can be used to add annotations over an existing image, leaving the original data unaltered
– e.g., a cytological slide can be “drawn over” to highlight those areas that are of interest.
• A reference template can be laid over an image
– e.g., an idealised embryo outline laid over a 7-day mouse embryo slide, opening a way to automated annotation
Requirement:
• Allow one or more annotation layers to be associated with a single image to create an annotation set.
Future Plans: New Image Format Requirements
interMedia supports popular image formats today.
Requirements:
• JPEG 2000: A new format with many interesting features: lossless compression, supported by standard browsers (unlike TIFF), extends to ‘movie’ data, allows encryption…
• DICOM: a format being used by medical organisations. Includes specifications for metadata.
Future Plans: Automated Annotation
• The Gene Atlas project will be able to produce several thousand images a day.
• An experienced cytologist can annotate a few hundred images a day…
• Charitable organisations can’t afford to employ 183 cytologists…
• The solution is automated annotation…
… which is as likely to actually work as 3D protein structure prediction ☺
Future Plans: Automated Annotation Requirements
• Screening Criteria: Support multiple criteria such as colour depth and distribution to reduce the results set.
• Remote Annotation: Augment annotation resources by allowing the wider scientific community to annotate images remotely.
• Life science community call to action:
– Develop image annotation standards
– Encourage 3rd party tool and application vendors to adopt these standards and support Oracle Database as a storage option for images and annotations
• WTSI will have a massive data set, which may well encourage 3rd parties to develop automated systems.
End Of My Bit
• Over to Melli.
This Page Deliberately Left Blank
• ..But Not This One ☺
Implementing interMedia - Appendix
• Loading data via SQL*Loader:
  sqlldr parfile=sangerex.par
• sangerex.par:
  userid=intermedia/intermedia
  control=/oracle/home/oracle/intermedia/melli/sangerex.ctl
  log=/oracle/home/oracle/intermedia/melli/sangerex.log
  direct=y
• Sample table:
  create table stockphotos
  (photo_id integer, image ordsys.ordimage);
Implementing interMedia - Appendix
• Sample ctl file:
LOAD DATA
INFILE *
INTO TABLE stockphotos
REPLACE
FIELDS TERMINATED BY ','
(photo_id,
 image column object
 (source column object
  (localData_fname FILLER CHAR(12),
   localData LOBFILE(image.source.localData_fname) raw terminated by EOF
  )
 )
)
BEGINDATA
1,core1.tif,
2,core2.tif
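With many image files, a control file like the sample above would probably be generated rather than typed by hand. The following Python sketch shows one way that might be done. The function name, its defaults and the numbering scheme are our assumptions for illustration, not part of the WTSI setup. The FILLER field is widened beyond CHAR(12) when a file name is longer than 12 characters.

```python
from typing import Iterable

# Control-file header modelled on the sample ctl file above.
CONTROL_TEMPLATE = """LOAD DATA
INFILE *
INTO TABLE {table}
REPLACE
FIELDS TERMINATED BY ','
(photo_id,
 image column object
 (source column object
  (localData_fname FILLER CHAR({name_width}),
   localData LOBFILE(image.source.localData_fname) raw terminated by EOF
  )
 )
)
BEGINDATA
"""


def build_control_file(filenames: Iterable[str],
                       table: str = "stockphotos",
                       first_id: int = 1) -> str:
    """Return SQL*Loader control-file text that loads one image per row.

    Hypothetical helper: ids are numbered sequentially from first_id.
    """
    names = list(filenames)
    if not names:
        raise ValueError("no image files given")
    for name in names:
        if "," in name:
            # A comma would be read as a field terminator.
            raise ValueError(f"file name contains a comma: {name!r}")
    # Keep the sample's width of 12 unless a longer name needs more room.
    name_width = max(12, max(len(name) for name in names))
    header = CONTROL_TEMPLATE.format(table=table, name_width=name_width)
    rows = [f"{first_id + i},{name}" for i, name in enumerate(names)]
    return header + "\n".join(rows) + "\n"


ctl_text = build_control_file(["core1.tif", "core2.tif"])
print(ctl_text)
```

The returned text can be written to a file such as sangerex.ctl and passed to sqlldr through the parameter file.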
Implementing interMedia - Appendix
• Storing LOBs in a different tablespace:
CREATE TABLE NEW_IMAGE(
  ID NUMBER,
  NAME VARCHAR2(256),
  DESCRIPTION VARCHAR2(255),
  IMG ORDSYS.ORDIMAGE,
  CREATOR NUMBER NOT NULL,
  ATLAS_GROUP NUMBER NOT NULL,
  CREATED DATE NOT NULL,
  INSERTED DATE NOT NULL,
  PATH VARCHAR2(256),
  HOST VARCHAR2(256),
  URL VARCHAR2(256),
  THUMBNAIL ORDSYS.ORDIMAGE
)
TABLESPACE DATA_01
STORAGE (INITIAL 504K
         MAXEXTENTS UNLIMITED
         PCTINCREASE 0)
LOB (IMG.source.localData, THUMBNAIL.source.localData) STORE AS (
  TABLESPACE DATA_02
  STORAGE (INITIAL 1M NEXT 1M)
  CHUNK 4);
Implementing interMedia - Appendix
select tablespace_name from user_tables where table_name = 'NEW_IMAGE';
TABLESPACE_NAME
------------------------------
DATA_01
select column_name, tablespace_name from user_lobs where table_name = 'NEW_IMAGE';
COLUMN_NAME                        TABLESPACE_NAME
---------------------------------- ------------------------------
"IMG"."SOURCE"."LOCALDATA"         DATA_02
"THUMBNAIL"."SOURCE"."LOCALDATA"   DATA_02