embl-ebi database replication - distribution. embl-ebi relational public databases ebi’s mission...
TRANSCRIPT
EMBL-EBI
Database
Replication - Distribution
EMBL-EBI
Relational public databases
EBI’s mission to provide freely accessible information on the public domain
Data formats and technologies, should not contradict to this policy
Adopt widely accepted, successful standards that are well known and used
Free access not only in the information content, but in the supporting technologies
Reasonable investment in resources and expertise by users so that the data is accessible to a wider audience
But without a severe restriction to the benefits to the usersA trade-off situation, different users, different needs
Relational databases are an industry standardVendors have different implementations but there are underlying
formal standardsANSI-SQL for query expressionODBC, JDBC for API’s
EMBL-EBI
RDB’s versus flat files
Relational databases are flexible, powerful and consistent
They are a lot more complex They impose data organisation that can’t be easily
vertically partitioned Organising and inter-exchanging data on a per-entry basis does not
come by default
Physical implementations are not standard Remember the days (or imagine) flat files without a common
character encoding standard (without ASCII around)
Vendors support migration of other databases to their own but not the other way-round
There is not a common vendor-independent exchange or dump format
This is not trivial due to differences in implementation details and extensions on the standards
EMBL-EBI
Why Replicate?
To take advantage of local hardware and CPU time – some operations are simply not possible on-line
To avoid continuous dependency on network and EBI resources
To extend or merge information with other databases or data sources
To utilise the information in new innovative ways
To ensure confidentiality of research
EMBL-EBI
MSD replication options
We offer MSDSD in OracleWith indexes pre-builtImplementation uses Oracle import-exportWith frequent (weekly) incrementals so that new entries are
becoming available soonUsers need to have Oracle licenceWe have more experience and offer better support
Or in mySQLIn compressed myIsam format without indexesWe give directly the mySQL data-files (they are platform and version
independent)We don’t offer weekly increments but new full releases every few
monthsWe recommend the Oracle distribution for advanced usersBut mySQL is great if they can’t afford OracleOr want to evaluate the MSDSD database
EMBL-EBI
Replication Components
Database copy on Sun Solaris
Schema export-import plus sql-loader files for creating the database initially for Oracle on other platforms
Possibility to Import to Non Oracle databases (MySQL)
Periodic synchronisation with the MSD master database using periodic incremental scripts for all Oracle platforms
Use of two schemas, main search database and incremental
EMBL-EBI
Incremental Data Export – Import
Why Incremental Updates Implemented in server side JavaScript Data is exported as Oracle Export files organised in marts Data files on the FTP server Aim for weekly updates Mechanism flexible enough to adapt on different data martCombinations Prerequisites: Rhino, Java, Oracle-JDBC driver, oracle-export-
import The user has just to download and run the periodic incremental
import script of a data mart for his database Database version, Data version, Data mart maintenance is
controlled via the administration tables through synchronisation
EMBL-EBI
Incremental Replication Mechanism
DATA MARTS
Increment log
crontab
OracleDumpFiles
MSD Search Database
Admin Tables
Web-FTPService
PERIODICEXPORT SCRIPT
DATA MARTS
crontab
Admin Tables
PERIODICIMPORT SCRIPT
Target Database
JDBC
JDBC
EMBL-EBI
Replication overview
MSD in Oracle
SchemaExportOracle
DictionaryJDBC metadata
mySQL
postgreSQLOracle
Schema creation SQL scripts
MSD in mySQL
SchemaExport
INSERTstatements
SELECTstatements
Structure
Import Export Configuration
DataExport
Java serialised data files
DataImport
Source database
Target database
EMBL-EBI
JDBC and Java
Java is one of the best environments regarding portability
Java compiled machine code works directly on all platforms Java serialisation is machine independent
JDBC standard is well defined and detailed Maps database types to Java object types Not all implementations are full in all details
JDBC offers metadata services Easy to get information about schemas, tables and columns through
JDBC
Java offers data compression Implementing a database vendor independent export-
import is trivial Could not find one available so developed a simple
and flexible mechanism at MSD
EMBL-EBI
MSD cross-replication
Inputs JDBC metadata and Oracle dictionary Exports schema creation scripts into SQL files
Gathers information from JDBC metadata and oracle dictionary Takes care of type implementation details of the various databases
(maximum size of varchar etc) Works with standard ANSI-SQL types only (not object-types, nested
tables, blobs etc)
Exports configuration files Table, column names of target database can be different Can export subsets of the data
Exports the data in compressed java serialised arrays In data files or directly piped into the Import
mechanism
EMBL-EBI
Cross-replication details
Potentially for any relational database with ANSI-SQL support
Has been tested for PostgreSQL, MS-Access, Mckoi (java RDB)
Flexible configuration Target tables can be different different The SELECT and INSERT statements are kept in configuration files This is how merged (partitioned) tables where built
Includes support for incrementals This option is still not used in production
The information in the data files can be examined off-line
Foreign keys have to be disabled during the load
EMBL-EBI
Oracle versus mySQL
mySQL has several underlying database engines InnoDB
Transactions & referential integrity Not best performance, inefficient disk space usage
myIsam Good performance but not foreign keys
myIsam compressed Efficient I/O, good use of disk space but read-only Can’t build indexes without uncompressing
Support for VLDB’s Merged tables are similar to Oracle partitioning but implemented by
the user Harder to simulate hash partitioning, range partitioning by default Problems of using the indexes of the merged tables
Query optimiser of mySQL Compared with Oracle seems primitive
EMBL-EBI
MSD mySQL experience
We used myIsam compressed tables without any indexes
The configuration that required the less disk spaceFaster to downloadOnce the data are local users can uncompress the data and build the
recommended or any other indexes locally
We used merged tables To also avoid data files larger than 8GBAnd for performance reasons
Character-sets - collation Textual data in mySQL are by default case insensitive Only some character collations allow a similar behaviour with Oracle
Other details Table names are by default case sensitive (problem with windows-
unix file systems)Choosing the appropriate numeric type (Integer versus Numeric)
EMBL-EBI
Summary
MSD Search Database Database Replication Why Replicate Replication Overview Components of the Replication Incremental Data Export – Import Incremental Replication
Mechanism