Graph databases in computational biology: case of Neo4j and TitanDB

Upload: andrei-kucharavy

Post on 16-Apr-2017

Graph databases in computational biology: Neo4j and TitanDB

Andrei Kucharavy

23/08/2013

Rigid structure of Interactions = Interactome

Knowledge access structure = GO

Why even bother?

Those are Graphs

Why even bother?

~1 GB of raw data from Reactome

~300 MB of data from UniProt / GO / Ensembl / mappings

=> well over the conventional 1024 MB JVM heap limit => heap crash

~ 15 minutes to load

Nightmare to visualize and debug

Relational Databases

Intro to neo4j presentation jexp @ slideshare

Data models

Graph databases

Intro to neo4j presentation jexp @ slideshare

Core abstractions

Objects:

Nodes (vertices)

Relationships between nodes (edges)

Properties for vertices and edges

[Figure: three nodes (Node1, Node2, Node3), each carrying its own properties]

Operations:

Immediate relations

Traversals:

Get the shortest path from j to k

Get the path with least weight from j to k, ...
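The three operations above can be sketched on a toy in-memory property graph. This is only an illustration in plain Python: the graph, its node names and its `weight` property are invented, and a real graph database would run these traversals server-side.

```python
import heapq
from collections import deque

# Toy property graph: node -> {neighbour: edge properties}.
# Node names and weights are invented for illustration.
GRAPH = {
    "j": {"a": {"weight": 1.0}, "b": {"weight": 5.0}},
    "a": {"b": {"weight": 1.0}, "k": {"weight": 10.0}},
    "b": {"k": {"weight": 1.0}},
    "k": {},
}


def neighbours(graph, node):
    """Immediate relations: the nodes one edge away from `node`."""
    return sorted(graph[node])


def shortest_path(graph, start, goal):
    """Path with the fewest edges (breadth-first search); None if unreachable."""
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for nxt in sorted(graph[node]):
            if nxt not in previous:
                previous[nxt] = node
                queue.append(nxt)
    return None


def least_weight_path(graph, start, goal):
    """Path with the smallest summed edge weight (Dijkstra); None if unreachable."""
    heap = [(0.0, start, [start])]
    done = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == goal:
            return cost, path
        if node in done:
            continue
        done.add(node)
        for nxt, props in graph[node].items():
            if nxt not in done:
                heapq.heappush(heap, (cost + props["weight"], nxt, path + [nxt]))
    return None
```

On this toy graph the two traversals disagree: the fewest-edges path goes j, a, k, while the least-weight path takes the longer route j, a, b, k.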

Main advantages promised

Increased speed for graph-type applications:

Avoid a join on 10M rows to get ~20 related elements

Traversals

Simplified programming:

Java objects

XML / RDF / OWL

Schema alterations

Why join millions of rows if only 10 relationships are interesting?

What to do if we want traversals?
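A rough sketch of the point about joins, in plain Python. The flat `EDGE_ROWS` list stands in for an unindexed relational link table and the adjacency dictionary for a graph store; both are invented for illustration, and a real relational database would normally use an index rather than a full scan.

```python
from collections import defaultdict

# Relational-style storage: one flat table of (source, target) rows.
# 10 000 filler rows plus 20 rows that actually concern the node "SRC".
EDGE_ROWS = [("P%d" % i, "P%d" % (i + 1)) for i in range(10000)]
EDGE_ROWS += [("SRC", "P%d" % i) for i in range(20)]


def related_by_scan(rows, node):
    """Touch every row of the table to find the few that mention `node`."""
    return sorted(target for source, target in rows if source == node)


def build_adjacency(rows):
    """Graph-style storage: each node keeps the list of its own edges."""
    adjacency = defaultdict(list)
    for source, target in rows:
        adjacency[source].append(target)
    return adjacency


def related_by_adjacency(adjacency, node):
    """Follow only the edges attached to `node`; the rest of the graph is never read."""
    return sorted(adjacency.get(node, []))
```

Both functions return the same 20 neighbours of `SRC`; the difference is how much unrelated data each one has to touch to get there.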

Main advantages promised

Ease of deployment / maintenance:

Scalability

Complexity

Modifications

Schema migrations

neo4j

Started in 2003

Schema-free

ACID transactions

Reasonably scalable, reasonably replicatable

10 000 open source projects, 1000 commercial customers

100 % open source

https://github.com/neo4j/neo4j

Master-slave replication

AGPL v3 license: free for open-source projects, even the support

Plus a graphical interface => debugging!

Deployment Demo

cd to specific DB location (better as a special user)

./neo4j start

./neo4j stop

=> Serves localhost:7474

40 000 files => mainly indexes / user accesses

Under the hood

Java & JVM

Split in two:

In-RAM, pre-heated vs. entirely on HDD

Scalability:

32 G nodes / 32 G relations / 64 G properties

1 M traversals / sec, independent of graph size

Lucene index: instant search

Interfaces

Two-fold interface:

REST server

Local instance

Specific query language: Cypher

Interoperability: support for the TinkerPop stack

TinkerPop stack

REST APIs

What is Gremlin

Domain-specific graph language

Built atop Groovy:

JVM

Dynamically evaluated

~ scripting in Java

Core = Java:

Java

Scala / Clojure

JPype / Jython / JRuby

Supported by most graph databases

Interfaces

Native bindings:

Java

Python, PHP, Ruby / Rails, node.js, .NET

Scala, Clojure, Haskell, ...

My stack:

Native Python, and Python through Bulbs and REST

Python + Bulbs + REST + neo4j

Bulbs = Pythonic wrapper for Gremlin

Portability (Blueprints + Rexster):

Titan DB (will be discussed later on)

Bitsy

Infinite Graph

Sqrrl

ArangoDB

Class inheritance and DDT:

Java-like class inheritance

Demo 2

Datatype declaration

GraphDB connection and declaration

Fill-in

Graphical Interface
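The first three demo steps (datatype declaration, graph declaration, fill-in) can be approximated without a running server. The sketch below is plain Python and deliberately does not use the Bulbs API: `Node`, `PhysicalEntity`, `Protein`, `InMemoryGraph` and the field names are all hypothetical stand-ins that only illustrate how Java-like class inheritance carries property declarations down to subclasses.

```python
import itertools


class Node:
    """Base datatype: each subclass lists the property names it adds in `fields`."""
    element_type = "node"
    fields = ("name",)

    def __init__(self, **properties):
        unknown = set(properties) - set(self.all_fields())
        if unknown:
            raise ValueError("undeclared properties: %s" % sorted(unknown))
        self.properties = properties

    @classmethod
    def all_fields(cls):
        # Collect the fields declared anywhere along the inheritance chain,
        # starting from the most generic class.
        collected = []
        for klass in reversed(cls.__mro__):
            for field in vars(klass).get("fields", ()):
                if field not in collected:
                    collected.append(field)
        return collected


class PhysicalEntity(Node):
    element_type = "physical_entity"
    fields = ("display_name",)


class Protein(PhysicalEntity):
    element_type = "protein"
    fields = ("uniprot_id",)


class InMemoryGraph:
    """Stand-in for the graph connection: stores nodes and labelled edges."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.nodes = {}
        self.edges = []

    def create(self, node):
        node_id = next(self._ids)
        self.nodes[node_id] = node
        return node_id

    def relate(self, source_id, label, target_id):
        self.edges.append((source_id, label, target_id))


# Fill-in step with two illustrative records.
graph = InMemoryGraph()
src_id = graph.create(Protein(name="SRC", uniprot_id="P12931"))
entity_id = graph.create(PhysicalEntity(name="SRC complex member"))
graph.relate(src_id, "is_a", entity_id)
```

`Protein` accepts `name`, `display_name` and `uniprot_id` without redeclaring the inherited ones, and rejects any property that no class in its chain has declared.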

neo4j-specific

Lucene index in the backend:

Exact indexing => constant-time retrieval

Full-text indexing => searching partial names and adding the missing links: SRC = SRC_HUMAN = SRC1
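A minimal sketch of the two lookup styles, assuming nothing about Lucene itself: a dictionary plays the role of the exact index, and a linear substring scan stands in for full-text search. `NameIndex`, `alias_links` and the `same_as` label are hypothetical names.

```python
class NameIndex:
    """Toy name index over node ids."""

    def __init__(self):
        self._exact = {}

    def add(self, name, node_id):
        self._exact[name] = node_id

    def get(self, name):
        """Exact lookup: a single hash-table probe, independent of index size."""
        return self._exact.get(name)

    def search(self, fragment):
        """Partial-name search; a linear scan standing in for a full-text index."""
        fragment = fragment.lower()
        return sorted(name for name in self._exact if fragment in name.lower())


def alias_links(index, fragment):
    """Propose 'same_as' edges between every pair of nodes matching `fragment`."""
    hits = index.search(fragment)
    return [
        (index.get(a), "same_as", index.get(b))
        for i, a in enumerate(hits)
        for b in hits[i + 1:]
    ]


index = NameIndex()
for node_id, name in enumerate(["SRC", "SRC_HUMAN", "SRC1", "EGFR"], start=1):
    index.add(name, node_id)
```

A plain substring match will also catch unrelated names that merely contain the fragment, so proposed links of this kind would still need review before being written back to the graph.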

Demo3

Constant node retrieval time / internode connection distance time

Performing the partial search

Adding missing links

Neo4j server v.s. Local database

Performing simple Gremlin queries

Use Case:

Existing map of correlations:

[Figure: graph of Protein, Domain, Domain Type and Protein function nodes]

Wanted map of correlations:

[Figure: the Protein, Domain, Domain Type and Protein function nodes, drawn once for the existing map and once for the wanted map]

Use Case

SQL Python / SQLAlchemy:

Create new table

Add ForeignKeys, Primary key, indexes, ...

Add the table to the data model,

Create functions for access/update,

...

Use Case

Bulbs / Neo4j => Live demo
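Since the live demo is not part of the transcript, here is a hedged sketch of the contrast it was making. The standard-library `sqlite3` module stands in for the SQLAlchemy setup, a plain list of labelled edges stands in for the graph, and the new correlation (a direct protein to domain-type link) is a hypothetical example, not necessarily the one shown in the talk.

```python
import sqlite3

# Relational side: a new kind of correlation needs a new link table,
# with its own keys, before a single row can be stored.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE protein (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE domain_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE protein_domain_type (
        protein_id INTEGER NOT NULL REFERENCES protein(id),
        domain_type_id INTEGER NOT NULL REFERENCES domain_type(id),
        PRIMARY KEY (protein_id, domain_type_id)
    );
""")
db.execute("INSERT INTO protein VALUES (1, 'SRC')")
db.execute("INSERT INTO domain_type VALUES (1, 'SH2')")
db.execute("INSERT INTO protein_domain_type VALUES (1, 1)")
sql_rows = db.execute("""
    SELECT p.name, d.name
    FROM protein p
    JOIN protein_domain_type l ON l.protein_id = p.id
    JOIN domain_type d ON d.id = l.domain_type_id
""").fetchall()

# Graph side: the existing edges stay untouched, and the new correlation
# is just one more edge with a new label. No schema change is involved.
edges = [
    ("SRC", "has_domain", "SRC SH2 domain"),
    ("SRC SH2 domain", "is_of_type", "SH2"),
]
edges.append(("SRC", "has_domain_type", "SH2"))
graph_rows = [(s, t) for s, label, t in edges if label == "has_domain_type"]
```

Both sides end up answering the same question, but only the relational side required a data-model change first.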

Use case 2

In the human proteome, find all chemical groups A and B separated by less than x

Database structure:

Suppose all the proteins are connected to a Type node

Each protein is linked to its domains, each domain to its amino acids, each amino acid to its chemical groups and ultimately to atoms

Chemical groups have an assigned distance to the groups they are close to

Algorithm:

Select a protein of interest

Get all of its chemical groups: 1000 (a.a.) * 3 (ch.gr. / a.a.)

Filter out all of the relations longer than x: 1000 * 3 * 100 (possible contacts per ch.gr.)

Recover the proteins: 1000 * 3 * 100 * 2

With 1M traversals per second => 0.6 sec to execute the query

With TitanDB, ElasticSearch and geo-queries (everything within a circle of radius x), higher speeds are possible
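The back-of-the-envelope estimate above, written out step by step. The constant names are mine; the numbers are the ones on the slide.

```python
AMINO_ACIDS_PER_PROTEIN = 1000
GROUPS_PER_AMINO_ACID = 3
CONTACTS_PER_GROUP = 100          # possible contacts per chemical group
RECOVERY_FACTOR = 2               # the "* 2" of the "recover the proteins" step
TRAVERSALS_PER_SECOND = 1_000_000

# Step 2: all chemical groups of the protein of interest.
chemical_groups = AMINO_ACIDS_PER_PROTEIN * GROUPS_PER_AMINO_ACID

# Step 3: every candidate contact has to be inspected to apply the distance filter.
candidate_relations = chemical_groups * CONTACTS_PER_GROUP

# Step 4: walking back from the retained groups to their proteins.
total_traversals = candidate_relations * RECOVERY_FACTOR

query_seconds = total_traversals / TRAVERSALS_PER_SECOND
```

This gives 3000 chemical groups, 300 000 candidate relations and 600 000 traversals, hence 0.6 sec at 1M traversals per second.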

Limitations

Node number:

32 giga nodes / edges is a lot on servers:

~100 TB of data

1 Unix partition

40 000+ simultaneously open files (indexes + users)

32 giga edges is relatively small in biology:

~43 M nodes in UniProt only

GO x UniProt x EMBL x gene names x interaction maps x localisations x names & accesses ...

All potentially druggable molecules, all protein atoms, all atom-atom interactions

Limitations

Absence of parallelism / distribution:

One process at a time: 1 traversal at a time

ACID => database locks

Though master-slave distribution

Single-partition replication:

100 TB + RAID!?

Though full support for AWS and VMs

Limitations

Bulbs: Python over Gremlin scripts

Gremlin => Groovy => JVM => do what you want => SQL-style (Gremlin) injections

Request sanitization needed:

Hashes of the queries without variables

Pre-filtering before the query is forwarded to the server
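One possible reading of the two sanitization steps, as a sketch only: query templates (the query text without its variables) are whitelisted by hash, and parameter values are pre-filtered before anything is forwarded. The Gremlin strings, the whitelist and the `check_request` helper are all illustrative assumptions, and nothing here talks to a real server.

```python
import hashlib
import re

# Hypothetical query templates; variables such as `name` are bound separately
# and never spliced into the script text.
ALLOWED_TEMPLATES = [
    "g.V('name', name).out('has_domain')",
    "g.V('name', name).both('interacts_with').count()",
]


def template_hash(template):
    """Hash of the query text without its variable values."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()


ALLOWED_HASHES = {template_hash(t) for t in ALLOWED_TEMPLATES}

# Pre-filter: only short identifier-like values may reach the server.
SAFE_VALUE = re.compile(r"[A-Za-z0-9_\-]{1,64}")


def check_request(template, params):
    """Return (template, params) if safe to forward; raise ValueError otherwise."""
    if template_hash(template) not in ALLOWED_HASHES:
        raise ValueError("unknown query template")
    for key, value in params.items():
        if not isinstance(value, str) or not SAFE_VALUE.fullmatch(value):
            raise ValueError("rejected parameter %r" % key)
    return template, params
```

A request only passes if its script matches a known template byte for byte, so arbitrary Groovy cannot be smuggled in either through the script or through a parameter value.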

Limitations

Bulk insert is not natively implemented in Bulbs:

Insertion rate ~10 nodes/sec

Naive Python binding tests:

~60 msec for ACID compliance (HDD write)

~1.8 msec/node for cold insertion routines (HDD sequential write)

~0.3 msec/node for hot write insertion routines (RAM buffer)

500-1500 nodes/sec in batches of 1000 => 6 h to fill the database up to the theoretical limit

github.com/chefjerome/graphalchemy implements an efficient flush based on Bulbs (alpha and thus unstable right now)
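A sketch of why batching helps, using the timings quoted above. `batches` and `bulk_insert` are hypothetical helpers, with `write_batch` standing in for whatever call commits one transaction, and `nodes_per_second` is a simple additive cost model of my own (one fixed commit cost plus a per-node cost), not the author's measurement method.

```python
def batches(items, size=1000):
    """Yield successive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_insert(items, write_batch, size=1000):
    """Call `write_batch` once per batch, so the commit cost is paid once per batch."""
    commits = 0
    for batch in batches(items, size):
        write_batch(batch)
        commits += 1
    return commits


def nodes_per_second(batch_size, commit_ms=60.0, per_node_ms=1.8):
    """Estimated throughput if each batch costs one commit plus per-node writes."""
    batch_seconds = (commit_ms + batch_size * per_node_ms) / 1000.0
    return batch_size / batch_seconds
```

Under this model a commit per node gives roughly 16 nodes/sec, the same order of magnitude as the ~10 nodes/sec observed, while batches of 1000 give roughly 540 nodes/sec with cold writes and more with hot writes, in line with the lower end of the 500-1500 nodes/sec range.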

Port to TitanDB

TitanDB

HBase / Cassandra / BerkeleyDB as storage backend

Lucene / ElasticSearch as indexing backend

Served over Rexster server

Full distribution

> 500 simultaneous connections (5000 is still stable)

Automatic replication (Hadoop)

Multiple simultaneous queries

Sky is the only limit for storage quantities

=> TitanDB / HBase is stable up to 5 PB in production

Neo4j for bioinformatics:
parsing and curating Reactome.org

Reactome.org structure: BioPAX: XML / RDF / OWL

Physical entities:

Proteins, small molecules, complexes, RNA, DNA

Fragments of physical entities

Interaction:

Degradation / polymerisation / biochemical reactions

Molecular interaction

Genetic interaction

Pathways, genes, post-translational modifications...

Protégé
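To make the structure concrete, here is a minimal sketch of reading a BioPAX file as RDF/XML with the Python standard library. The inline `SAMPLE` document is invented, and the use of the BioPAX Level 3 namespace is an assumption, since the transcript does not say which BioPAX level was parsed. Typed top-level elements become nodes and `rdf:resource` references become edges.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: BioPAX Level 3 and RDF.
BP = "http://www.biopax.org/release/biopax-level3.owl#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Invented three-element document in the shape of a BioPAX export.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bp="http://www.biopax.org/release/biopax-level3.owl#">
  <bp:Protein rdf:ID="Protein1">
    <bp:displayName>SRC</bp:displayName>
  </bp:Protein>
  <bp:SmallMolecule rdf:ID="SmallMolecule1">
    <bp:displayName>ATP</bp:displayName>
  </bp:SmallMolecule>
  <bp:BiochemicalReaction rdf:ID="BiochemicalReaction1">
    <bp:left rdf:resource="#Protein1"/>
    <bp:left rdf:resource="#SmallMolecule1"/>
  </bp:BiochemicalReaction>
</rdf:RDF>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.split("}", 1)[1]


def parse_biopax(text):
    """Return (nodes, edges) extracted from a BioPAX RDF/XML string."""
    root = ET.fromstring(text)
    nodes, edges = {}, []
    for element in root:
        node_id = element.get("{%s}ID" % RDF)
        nodes[node_id] = {
            "type": local_name(element.tag),
            "name": element.findtext("{%s}displayName" % BP),
        }
        for child in element:
            target = child.get("{%s}resource" % RDF)
            if target is not None:
                # Each rdf:resource reference becomes a labelled edge.
                edges.append((node_id, local_name(child.tag), target.lstrip("#")))
    return nodes, edges


nodes, edges = parse_biopax(SAMPLE)
```

The resulting `nodes` and `edges` are in the shape a graph database loader would expect: one record per entity, one labelled link per reference.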

Reality of Reactome.org:

Main connected component: ~22 000 entities, but 6 others with > 100 elements

Presence of generic classes: groups of objects

Proteins = mix of proteins, domains, groups, groups of domains

15 000 proteins, 5000 UniProt references

156 genes, 56 RNA molecules => translation / transcription regulation is not well described
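The connected-component figures above are the kind of result a short traversal produces. A minimal sketch in plain Python, assuming the graph is available as a list of undirected edges; `component_sizes` is a hypothetical helper and the example edges are invented.

```python
from collections import defaultdict, deque


def component_sizes(edges, nodes=()):
    """Sizes of the connected components, largest first.

    `edges` is an iterable of (a, b) pairs treated as undirected;
    `nodes` optionally lists nodes that have no edges at all.
    """
    adjacency = defaultdict(set)
    for node in nodes:
        adjacency[node]  # register isolated nodes
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    sizes = []
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search over everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        sizes.append(size)
    return sorted(sizes, reverse=True)


example_sizes = component_sizes([(1, 2), (2, 3), (4, 5)], nodes=[6])
```

On a real dataset the first entry would be the main component and the following ones the smaller islands worth inspecting during curation.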

Reality of Reactome.org: heavily comment-based: case of SRC

Completed with HINT protein-protein interactions from the Yu lab at Cornell

Re-indexed:

SwissProt protein names

Full names from SwissProt

Gene names

KEGG, GO, EMBL, ChEBI cross-references

PDB implemented, not re-run

Example of pathway Parsing

Conclusion

Systems biology is more about graphs than about systems of tables

Graph Databases are awesome

Neo4j is terrific

TitanDB is cool

You should definitely pick one of them, load Reactome.org dataset or whatever you are interested in and play with it.

Questions ?

Thanks

Pr. Philip Bourne, Pr. Bart Deplanke, Cedric Merlot, Li Xie, Spencer Blieven

Jiang Wang, Julia Ponomarenko, Cole Christie, Andreas Prilic, Lilia Iakoucheva