EMBL-EBI

Database

Replication - Distribution

Relational public databases

EBI's mission is to provide freely accessible information in the public domain

Data formats and technologies should not contradict this policy

Adopt widely accepted, successful standards that are well known and used

Free access not only to the information content, but also to the supporting technologies

Users should need only a reasonable investment in resources and expertise, so that the data is accessible to a wider audience
  But without severely restricting the benefits to the users
  A trade-off situation: different users, different needs

Relational databases are an industry standard
  Vendors have different implementations, but there are underlying formal standards
  ANSI-SQL for query expression
  ODBC, JDBC for APIs

RDBs versus flat files

Relational databases are flexible, powerful and consistent

They are a lot more complex
  They impose a data organisation that can't easily be vertically partitioned
  Organising and inter-exchanging data on a per-entry basis does not come by default

Physical implementations are not standard
  Remember the days of (or imagine) flat files without a common character encoding standard (without ASCII around)

Vendors support migration of other databases to their own, but not the other way round

There is no common vendor-independent exchange or dump format
  This is not trivial due to differences in implementation details and extensions to the standards

Why Replicate?

To take advantage of local hardware and CPU time – some operations are simply not possible on-line

To avoid continuous dependency on network and EBI resources

To extend or merge information with other databases or data sources

To utilise the information in new innovative ways

To ensure confidentiality of research

MSD replication options

We offer MSDSD in Oracle
  With indexes pre-built
  Implementation uses Oracle import-export
  With frequent (weekly) incrementals so that new entries become available soon
  Users need to have an Oracle licence
  We have more experience and offer better support

Or in mySQL
  In compressed myIsam format without indexes
  We provide the mySQL data files directly (they are platform and version independent)
  We don't offer weekly increments but new full releases every few months

We recommend the Oracle distribution for advanced users
  But mySQL is great if they can't afford Oracle
  Or want to evaluate the MSDSD database

Replication Components

Database copy on Sun Solaris

Schema export-import plus SQL*Loader files for creating the database initially for Oracle on other platforms

Possibility to import to non-Oracle databases (MySQL)

Periodic synchronisation with the MSD master database using periodic incremental scripts for all Oracle platforms

Use of two schemas, main search database and incremental

Incremental Data Export – Import

Why Incremental Updates

Implemented in server-side JavaScript

Data is exported as Oracle Export files organised in marts

Data files on the FTP server

Aim for weekly updates

Mechanism flexible enough to adapt to different data mart combinations

Prerequisites: Rhino, Java, Oracle-JDBC driver, oracle-export-import

The user just has to download and run the periodic incremental import script of a data mart for their database

Database version, data version and data mart maintenance are controlled via the administration tables through synchronisation
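The version bookkeeping described above could look roughly like the sketch below. It is only an illustration written in Java (the slides say the real scripts are server-side JavaScript run under Rhino). The class name `IncrementPlanner`, the method `pending` and the use of plain integers as data versions are all assumptions, not part of the MSD scripts. It shows one plausible rule: apply every published increment that is newer than the data version recorded in the local administration tables, oldest first.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: decides which published increments a local copy still needs.
class IncrementPlanner {

    // localVersion: data version recorded locally (assumed to be an integer).
    // published: versions of the increments available on the FTP server, in any order.
    // Returns the versions that are newer than the local one, sorted oldest first.
    static List<Integer> pending(int localVersion, List<Integer> published) {
        List<Integer> todo = new ArrayList<>();
        for (int version : published) {
            if (version > localVersion) {
                todo.add(version);
            }
        }
        Collections.sort(todo);
        return todo;
    }

    public static void main(String[] args) {
        // Made-up version numbers for illustration only.
        System.out.println(pending(3, Arrays.asList(5, 2, 4, 3)));
    }
}
```

Applying increments strictly in version order keeps the local copy consistent even if a weekly run is missed.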

Incremental Replication Mechanism

[Diagram. Export side: MSD Search Database, Admin Tables, Increment log, crontab, PERIODIC EXPORT SCRIPT (JDBC), Oracle dump files, DATA MARTS, Web-FTP service. Import side: DATA MARTS, crontab, PERIODIC IMPORT SCRIPT (JDBC), Admin Tables, Target Database.]

Replication overview

[Diagram. Schema export: MSD in Oracle, Oracle dictionary, JDBC metadata; schema creation SQL scripts for mySQL, PostgreSQL and Oracle. Import/export configuration: structure, SELECT statements, INSERT statements. Data export: source database, Java serialised data files. Data import: target database, MSD in mySQL.]

JDBC and Java

Java is one of the best environments regarding portability
  Compiled Java bytecode runs unchanged on any platform with a Java virtual machine
  Java serialisation is machine independent

The JDBC standard is well defined and detailed
  Maps database types to Java object types
  Not all implementations are complete in every detail

JDBC offers metadata services
  Easy to get information about schemas, tables and columns through JDBC

Java offers data compression

Implementing a database vendor-independent export-import is trivial
  Could not find one available, so developed a simple and flexible mechanism at MSD
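As a rough illustration of how serialisation and compression combine, the sketch below packs a block of rows into a gzip-compressed Java serialised array and reads it back. It uses only standard library classes. The class name `CompressedRows`, the `Object[][]` row layout and the sample values are assumptions made for this example; the actual MSD file format is not described in these slides.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: rows are held as Object[][] (one Object[] per row).
class CompressedRows {

    // Serialise the rows and gzip the resulting byte stream.
    static byte[] pack(Object[][] rows) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new GZIPOutputStream(buffer))) {
            out.writeObject(rows);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The streams are closed here, so the gzip trailer has been written.
        return buffer.toByteArray();
    }

    // Reverse of pack: gunzip and deserialise.
    static Object[][] unpack(byte[] packed) {
        try (ObjectInputStream in = new ObjectInputStream(
                 new GZIPInputStream(new ByteArrayInputStream(packed)))) {
            return (Object[][]) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample rows; null stands for an SQL NULL.
        Object[][] rows = { {1, "1abc", 1.8}, {2, "2xyz", null} };
        Object[][] copy = unpack(pack(rows));
        System.out.println(Arrays.deepEquals(rows, copy));
    }
}
```

Because the byte array is independent of the machine that wrote it, the same bytes could be written to a data file or handed straight to an import step.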

MSD cross-replication

Inputs: JDBC metadata and Oracle dictionary

Exports schema creation scripts into SQL files
  Gathers information from JDBC metadata and the Oracle dictionary
  Takes care of type implementation details of the various databases (maximum size of varchar etc.)
  Works with standard ANSI-SQL types only (not object types, nested tables, blobs etc.)

Exports configuration files
  Table and column names of the target database can be different
  Can export subsets of the data

Exports the data in compressed Java serialised arrays
  In data files or directly piped into the import mechanism
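The handling of type implementation details when writing schema creation scripts could be sketched as follows. This is not the MSD code: the `TypeMapper` class, the `Dialect` enum and in particular the size limits are illustrative assumptions (real limits depend on the database version), and falling back to a long text type for oversized strings is one possible policy among several. The type codes come from the standard `java.sql.Types` class.

```java
import java.sql.Types;

// Hypothetical mapper from JDBC type codes to vendor column types.
class TypeMapper {

    // Illustrative dialect settings; the varchar limits are assumptions
    // and differ between database versions.
    enum Dialect {
        ORACLE("VARCHAR2", 4000, "CLOB", "NUMBER"),
        MYSQL("VARCHAR", 255, "TEXT", "DECIMAL");

        final String varcharName;
        final int maxVarchar;
        final String longTextName;
        final String numericName;

        Dialect(String varcharName, int maxVarchar,
                String longTextName, String numericName) {
            this.varcharName = varcharName;
            this.maxVarchar = maxVarchar;
            this.longTextName = longTextName;
            this.numericName = numericName;
        }
    }

    // Returns the column type text to place in a CREATE TABLE statement.
    static String columnType(Dialect dialect, int jdbcType, int size, int scale) {
        switch (jdbcType) {
            case Types.VARCHAR:
                // Strings longer than the dialect's limit need a different type.
                return size <= dialect.maxVarchar
                        ? dialect.varcharName + "(" + size + ")"
                        : dialect.longTextName;
            case Types.NUMERIC:
            case Types.DECIMAL:
                return dialect.numericName + "(" + size + "," + scale + ")";
            case Types.DATE:
                return "DATE";
            default:
                throw new IllegalArgumentException(
                        "Unsupported JDBC type code: " + jdbcType);
        }
    }

    public static void main(String[] args) {
        System.out.println(columnType(Dialect.ORACLE, Types.VARCHAR, 80, 0));
        System.out.println(columnType(Dialect.MYSQL, Types.VARCHAR, 2000, 0));
    }
}
```

Rejecting unsupported type codes outright mirrors the restriction to standard ANSI-SQL types noted above.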

Cross-replication details

Potentially for any relational database with ANSI-SQL support

Has been tested for PostgreSQL, MS-Access, Mckoi (java RDB)

Flexible configuration
  Target tables can be different
  The SELECT and INSERT statements are kept in configuration files
  This is how merged (partitioned) tables were built

Includes support for incrementals
  This option is still not used in production

The information in the data files can be examined off-line

Foreign keys have to be disabled during the load
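To make the idea of configurable SELECT and INSERT statements concrete, here is a minimal sketch that derives both statements from a source-to-target column mapping. The class `TableMapping` and the table and column names in `main` are invented for illustration; how the real configuration files are laid out is not shown in these slides.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical mapping between one source table and one target table.
class TableMapping {
    final String sourceTable;
    final String targetTable;
    // Source column name -> target column name, in a fixed order.
    final Map<String, String> columns;

    TableMapping(String sourceTable, String targetTable,
                 LinkedHashMap<String, String> columns) {
        this.sourceTable = sourceTable;
        this.targetTable = targetTable;
        this.columns = columns;
    }

    // Statement run against the source database.
    String selectSql() {
        return "SELECT " + String.join(", ", columns.keySet())
                + " FROM " + sourceTable;
    }

    // Parameterised statement run against the target database;
    // one placeholder per mapped column, in the same order as the SELECT.
    String insertSql() {
        String placeholders =
                String.join(", ", Collections.nCopies(columns.size(), "?"));
        return "INSERT INTO " + targetTable
                + " (" + String.join(", ", columns.values()) + ")"
                + " VALUES (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // Made-up table and column names.
        LinkedHashMap<String, String> cols = new LinkedHashMap<>();
        cols.put("ENTRY_ID", "entry_id");
        cols.put("TITLE", "entry_title");
        TableMapping mapping = new TableMapping("ENTRY", "entry_part1", cols);
        System.out.println(mapping.selectSql());
        System.out.println(mapping.insertSql());
    }
}
```

Keeping the column order identical in both statements lets each exported row be bound to the INSERT positionally, and pointing several mappings at different target tables is one way a single source table could be split into merged tables.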

Oracle versus mySQL

mySQL has several underlying database engines
  InnoDB
    Transactions & referential integrity
    Not the best performance, inefficient disk space usage
  myIsam
    Good performance but no foreign keys
  myIsam compressed
    Efficient I/O, good use of disk space but read-only
    Can't build indexes without uncompressing

Support for VLDBs
  Merged tables are similar to Oracle partitioning but implemented by the user
  Harder to simulate hash partitioning, range partitioning by default
  Problems with using the indexes of the merged tables

Query optimiser of mySQL
  Seems primitive compared with Oracle

MSD mySQL experience

We used myIsam compressed tables without any indexes
  The configuration that required the least disk space
  Faster to download
  Once the data are local, users can uncompress the data and build the recommended or any other indexes locally

We used merged tables
  To also avoid data files larger than 8GB
  And for performance reasons

Character sets and collation
  Textual data in mySQL are by default case insensitive
  Only some character collations allow behaviour similar to Oracle

Other details
  Table names are by default case sensitive (problem with Windows-Unix file systems)
  Choosing the appropriate numeric type (Integer versus Numeric)

Summary

MSD Search Database

Database Replication

Why Replicate

Replication Overview

Components of the Replication

Incremental Data Export – Import

Incremental Replication Mechanism
