Copyright © 2004 Synamatix sdn bhd (538481-U)

SynaBASE™: A novel structured-network pattern database platform for storage, ultra-high-throughput and sensitive data analysis

October 03 2006

Copyright © 2005 Synamatix sdn bhd (538481-U)

Aims

To learn about current research priorities and bioinformatics initiatives

To review Synamatix science and technologies

To demonstrate Synamatix performance capabilities

To explore potential fit and research synergies

Synamatix Introductions

Robert Hercus - Synamatix, MD and Inventor
Australian, over 30 years IT Sciences experience
Pioneered many large-scale IT projects

Dr. Arif Anwar – Synamatix, CEO
British, Ph.D. Oxford Uni./UCL
12 yrs+ post-Ph.D. US and EU genomics background
Silicon Genetics, Becton-Dickinson-CLONTECH

Poh Yang Ming – Synamatix, Senior Bioinformatician
Malaysian, B.Sc. Biotechnology, M.Sc. IT
6 yrs Biotechnology industry and research
IMCB, Singapore
MUST

Johan Poole-Johnson – Synamatix, Accounts Manager
Australian, B.Com – Murdoch University, Australia
8 yrs+ Multinational and Start-up Technology Companies
4 yrs Experience in science informatics

Core IP Patented World-wide

SynaBASE™ Database Platform for high-throughput genomics

Market shifting towards very high-throughput genomics

High-growth market

Investing heavily in Personalised Genome and Healthcare revolution

Who’s who list of customers

USA, Europe, Australia and Singapore

Core competencies

Algorithm development

Software and UI

Bioinformatics and HPC know-how

Training/Support

International Collaborations

Database platform flexibility

New Customers in 2006

Command line interface

CORE Database platform

SynaRex Bulk

SynaProbe Bulk

SynaSearch Bulk

SynaMer

SynaFrag

SXSequenceRefs

SXLRESearch

SXFuzzyPatternSearch

SXPet

SXParse

Data analysis

Develop Tools

Another 20+ apps

Graphical Interface

Open platform approach

Applications and Research

User or Synamatix

Internal/Custom development

Modify Synamatix Applications at source level

IP owned by User:

Why?

Current database platforms will not be able to scale to manage ever-increasing data volume and complexity

Novel database platform to meet needs, not a:
Suffix tree
Relational database
Suffix array

How?

What do we know about data?

Similarity & association

Common PATTERNS and functionality

[Figure: pattern network built from the sequence A T G C A T G A A T…, with the single bases A, T, G, C at the first level and overlapping patterns of increasing length at successive levels (for example AT, TG, GC; ATG, TGC, GCA; ATGC, TGCA; ATGCA, TGCAT).]

1. SynaBASE is very efficient – scales very well

When more data is added, the database does not grow proportionally, as sub-patterns may already exist

Only leaf nodes are added, and references are stored

More efficient with more data

Every overlapping pattern, at every position is stored

Patterns are extended until they become unique
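
The two points above (every overlapping pattern is stored, and patterns are extended until they become unique) can be illustrated with a minimal sketch. It uses a plain Python dictionary rather than the SynaBASE network, whose internals are not described here; the function name `build_pattern_index` is hypothetical.

```python
from collections import defaultdict


def build_pattern_index(sequence):
    """Map each stored pattern to the start positions where it occurs.

    Illustrative only: at every position the pattern is extended one base
    at a time, and extension stops once the pattern occurs exactly once.
    """
    index = {}
    n = len(sequence)
    active = list(range(n))  # start positions whose pattern is still repeated
    length = 1
    while active:
        groups = defaultdict(list)
        for start in active:
            if start + length <= n:
                groups[sequence[start:start + length]].append(start)
        active = []
        for pattern, starts in groups.items():
            index[pattern] = starts
            if len(starts) > 1:
                active.extend(starts)  # repeated pattern: extend by one base
        length += 1
    return index


index = build_pattern_index("ATGCATGAAT")
print(index["ATG"])   # start positions of the repeated pattern ATG
print(index["ATGC"])  # ATGC occurs once, so it is not extended further
```

Because a repeated pattern is stored once with a list of positions, adding sequence that reuses existing patterns mostly adds positions rather than new entries, which is the intuition behind the scaling claim on this slide.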

[Figure: database size (Mbytes, axis 0 to 250) against number of Streptococcus pneumoniae R6 genomes (axis 0 to 100), comparing SynaBASE with a flat file. S. pneumoniae R6 genome size = 2.068 Mbytes.]

2. SynaBASE enables very fast access

Number of levels is small

For a query:
Match the first longest pattern
Follow an Eulerian path through the network, picking up the longest matching pattern for each position in the query

Processing time is:
Proportional to query size to obtain all unique subpatterns

[Figure: example pattern network with levels A, C, T; AA, AC, CT, TC; AAC, ACT, CTC; AACT, ACTC, CTCG, TCGA; AACTC, ACTCG, CTCGA.]
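
The query step described on this slide (pick up the longest matching pattern at each position in the query) can be sketched as follows. This is a simplified stand-in that uses a set of substrings instead of the network traversal, so it does not reproduce the claimed running time; `all_patterns`, `longest_matches` and the `max_len` cap are hypothetical names introduced for illustration.

```python
def all_patterns(sequence, max_len):
    """Collect every substring of `sequence` up to `max_len` bases."""
    return {
        sequence[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(sequence) - k + 1)
    }


def longest_matches(query, patterns, max_len):
    """For each query position, return the longest stored pattern starting there."""
    hits = []
    for pos in range(len(query)):
        best = ""
        for k in range(1, min(max_len, len(query) - pos) + 1):
            candidate = query[pos:pos + k]
            if candidate not in patterns:
                break  # every prefix of a stored substring is stored, so stop here
            best = candidate
        hits.append((pos, best))
    return hits


patterns = all_patterns("AACTCGA", max_len=5)
for pos, match in longest_matches("ACTCGT", patterns, max_len=5):
    print(pos, match)
```

An empty match at a position means that even the single base at that position is absent from the stored patterns.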

Q * log N (base A)

[Figure: speed (milliseconds, axis 100 to 900) against size of database (axis 1, 10, 100, 1000), comparing Conventional and SynaBASE.]
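
As a rough worked example of the Q * log N (base A) expression, the sketch below assumes that Q is the query length, N the database size and A the alphabet size (4 for DNA). The slide does not define these symbols, so this reading is an assumption, and `estimated_steps` is a hypothetical helper.

```python
import math


def estimated_steps(query_len, db_size, alphabet_size=4):
    """Evaluate Q * log_A(N) under the assumed meaning of Q, N and A."""
    return query_len * math.log(db_size, alphabet_size)


small = estimated_steps(100, 4 ** 10)  # database of about one million symbols
large = estimated_steps(100, 4 ** 20)  # database about a million times larger
print(small, large, large / small)
```

Under this reading, the cost grows with the logarithm of the database size: squaring the database size only doubles the estimate.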

Case Study - Comparison of Human v Mouse genome

[Figure: run-time comparison of SynaBASE, BLAST and PatternHunter; values shown are 6h, 22 days and 3 yrs.]

3. Increased sensitivity

BLASTN vs. SynaSearch-Bulk

Cumulative number of hits shows SynaSearch Bulk found extra hits at low-mid identities

SynaBASE and BLAST DB of 700,000 Bacterial ORFs queried with 100 1 kb sequences

Novel hits

4. Novel annotation using SynaBASE

The elephant and the giraffe walked up the mountain

A graph showing the frequency of “string (word)” patterns in a sentence does not reflect meaning

A graph showing the probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning

[Figure: pattern network with a1, a2, a3 at the first level, a1a2 and a2a3 at the second level, and a1a2a3 at the third level.]

Expected Frequency

$$\mathrm{Ef}(a_1a_2a_3) = \frac{F(a_1a_2)\,F(a_2a_3)}{F(a_2)}$$

SIGNIFICANCE (Actual Freq / Expected Freq)

$$\mathrm{Sig}(a_1a_2a_3) = \frac{F(a_1a_2a_3)}{\mathrm{Ef}(a_1a_2a_3)} = \frac{F(a_1a_2a_3)\,F(a_2)}{F(a_1a_2)\,F(a_2a_3)}$$
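
The significance ratio on this slide can be computed directly from pattern counts. The following is a minimal sketch for three-symbol patterns only, using simple overlapping substring counts as F; the helper names `pattern_counts` and `significance` are hypothetical and not part of any SynaBASE API.

```python
from collections import Counter


def pattern_counts(text, max_len=3):
    """Count every overlapping substring of length 1..max_len (this is F)."""
    counts = Counter()
    for k in range(1, max_len + 1):
        for i in range(len(text) - k + 1):
            counts[text[i:i + k]] += 1
    return counts


def significance(pattern, counts):
    """Sig(a1a2a3) = F(a1a2a3) * F(a2) / (F(a1a2) * F(a2a3)).

    Assumes the pattern and its sub-patterns all occur in the counted text.
    """
    if len(pattern) != 3:
        raise ValueError("this sketch only handles three-symbol patterns")
    left, middle, right = pattern[:2], pattern[1], pattern[1:]
    expected = counts[left] * counts[right] / counts[middle]  # Ef(a1a2a3)
    return counts[pattern] / expected


counts = pattern_counts("AAAB")
print(significance("AAB", counts))
```

A value above 1 means the pattern occurs more often than its two overlapping halves would predict; a value below 1 means it occurs less often.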

Gene models correlate with “SIGNIFICANCE”

On-going Research Case Studies & Performance Demonstration

Case Study 1 – contamination identification

High-throughput identification of contaminant reads on the basis of over-representation in a SynaBASE

Major problem as vector databases incomplete and/or not updated

Causes bottlenecks in sequence finishing pipeline

Analysis Steps

1. Build SynaBASE of 5239 Lamprey sequences using SXBuild

2. Extract patterns of length 40mer and above using SXPet
SynaBASE identifies 18,389 patterns

3. Filter patterns to remove polynucleotide repeats of more than 75% identical base composition
475 patterns removed, resulting in 17,914 Lamprey patterns

Function definitions

SXBuild: A SynaBASE API call for building SynaBASEs from raw sequence data

SXPet: A SynaBASE API call for reporting patterns based on frequency
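
Steps 2 and 3 amount to a length filter followed by a base-composition filter. The sketch below shows one possible reading in plain Python: a pattern is dropped when a single base makes up more than 75% of it. That interpretation of "identical base composition" is an assumption, and the functions are hypothetical illustrations, not the SXPet call.

```python
from collections import Counter


def dominant_base_fraction(pattern):
    """Fraction of the pattern made up of its most common base."""
    counts = Counter(pattern)
    return max(counts.values()) / len(pattern)


def filter_patterns(patterns, min_length=40, max_fraction=0.75):
    """Keep patterns of at least `min_length` whose dominant base is at most `max_fraction`."""
    return [
        p for p in patterns
        if len(p) >= min_length and dominant_base_fraction(p) <= max_fraction
    ]


candidates = ["A" * 40, "ACGT" * 10, "ACGT"]
print(filter_patterns(candidates))  # only the mixed 40-mer is kept
```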

Verification (optional)

Search resulting 17,914 patterns against UniVec SynaBASE using Bulksearch

Map patterns back against vector source references

Unique vector contaminated sequences: 3374 / 5239 (60%)

Function definitions

Bulksearch: A SynaBASE API call for batch searching of sequences

By using an approach based upon filtering of over-represented patterns in SynaBASE, 100% of the vector contaminant sequences are identified.

This obviates the requirement for using the UniVec database for screening, in one step.

Case Study 2 – Overlapper

Task to accomplish

Original user data set and requirement was:
To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads, i.e. 50 billion bp
Report n-mers that have a frequency >2 and <m

Using conventional software and approaches, the user needed 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps

Hence the standard approach limits usage to 32-mers

Longer mers help bridge repetitive and low-complexity regions
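
The reporting requirement (n-mers with a frequency >2 and <m) can be sketched with an in-memory counter. This is the naive approach and would not scale to the 50 billion bp data set described above; it is shown only to make the task concrete, and `count_nmers` and `report_nmers` are hypothetical names.

```python
from collections import Counter


def count_nmers(reads, n):
    """Count every overlapping exact n-mer across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - n + 1):
            counts[read[i:i + n]] += 1
    return counts


def report_nmers(reads, n, m):
    """Return the n-mers whose frequency is greater than 2 and less than m."""
    counts = count_nmers(reads, n)
    return {nmer: c for nmer, c in counts.items() if 2 < c < m}


reads = ["ACGTAC", "CGTACG", "GTACGT", "TTTTTT"]
print(report_nmers(reads, n=4, m=4))
```

The upper bound m filters out highly repeated n-mers, which carry little information about which reads genuinely overlap.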

Long v Short n-mers: advantages and disadvantages

100-mer

+ve
Fewer false positives
Improvement in final assembly

-ve
Errors in reads may lead to false negatives
Slow to process with conventional software

Explanation of advantages

[Figure: overlap diagram with labels A, B and “Low-complexity region”.]

A shorter overlap results in more false positives

A longer overlap results in fewer false positives

Final assembly improved

Using SynaMer there is no time increase with longer n-mers

Conclusions

Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB

Finding overlapping sequences for 33,000 900 bp reads of a bacterial WGSS took less than 20 s

100-fold faster than conventional method

Allows use of longer n-mers

Potentially increases quality of assembly

SynaMer will be released as a product later this summer

Case Study 3 – 454 Life Sciences

Rapid genome assembly from 454-generated reads

Conventional approach to Genome Assembly

Cluster by sequence overlaps

Filter out repeats and detectable errors

Assemble each cluster into one or more contigs

Derive contig consensus

Validate results by comparison to a reference genome sequence (if available)

FragBASE – using the SynaBASE structure…

Select patterns of high coverage

Use corrected FragBASE

Use FragBASE network* to extend patterns

Increase pattern size to overcome shorter repeat sections

Stage 3 - error correction

Build a database of patterns - FragBASE

Compare patterns with M. genitalium and analyse

Database consists of:
Total patterns – f/r
Genitalium patterns – f/r
Error patterns – f/r

Fragments

Correct errors using significance

Corrected fragments

454 assembly result

400,000 reads assembled into 11 contigs in 11 minutes, 2 minutes for error correction
Genome coverage 99.89%

Case Study 4 - Plant Comparative Genomics

RefSeq plant release
Covers complete and partially sequenced genomes
74 898 419 bp in 205 780 sequences
Generate sequence alignments
Sequence-based clustering using common K-mers
Whole genome phylogeny

Performance Results

Sequence clustering based on shared K-mers
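
One simple way to cluster sequences by shared K-mers is single-linkage grouping: two sequences join the same cluster when they share at least a threshold number of K-mers. The sketch below is a generic illustration under that assumption, not the method used to produce the slide's results; all names and the `min_shared` parameter are hypothetical.

```python
from itertools import combinations


def kmers(sequence, k):
    """Set of distinct k-mers in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def cluster_by_shared_kmers(sequences, k, min_shared=1):
    """Single-linkage clustering; `sequences` maps a name to its sequence."""
    names = list(sequences)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    profiles = {name: kmers(seq, k) for name, seq in sequences.items()}
    for a, b in combinations(names, 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


seqs = {"s1": "ACGTACGT", "s2": "TTACGTAA", "s3": "GGGGCCCC"}
print(cluster_by_shared_kmers(seqs, k=5))
```

The all-pairs comparison here is quadratic in the number of sequences, so it is only suitable for small inputs.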

Case study 5 - Pattern Frequency statistics and SynaBASE

SynaBASE stores all patterns from data
Pattern frequencies and offsets on source sequences
Characterize/annotate data
Sequence clustering
Conserved regions
Simple and Complex repeats
Genome segmental duplications
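
The idea of storing pattern frequencies together with their offsets on the source sequences can be sketched with a dictionary keyed by pattern. This uses fixed-length k-mers for simplicity, which is an assumption; `index_patterns` and `repeated_patterns` are hypothetical helpers, not SynaBASE calls.

```python
from collections import defaultdict


def index_patterns(sequences, k):
    """Map each k-mer to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for offset in range(len(seq) - k + 1):
            index[seq[offset:offset + k]].append((name, offset))
    return index


def repeated_patterns(index, min_frequency=2):
    """Patterns occurring at least `min_frequency` times, with their locations."""
    return {p: hits for p, hits in index.items() if len(hits) >= min_frequency}


index = index_patterns({"chr1": "ACGACG", "chr2": "TACG"}, k=3)
for pattern, hits in repeated_patterns(index).items():
    print(pattern, len(hits), hits)
```

The frequency of a pattern is the length of its location list, so repeats within one sequence and regions shared between sequences can both be read from the same index.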

Yeast Genes SynaBASE Frequency Statistics

Arabidopsis thaliana (thale cress)

Human

Mus musculus

All Bacteria genomes

Summary

Cutting-edge Bioinformatics: SynaBASE novel database PLATFORM

Unique
Patented worldwide
Leads to massive increases in speed and scalability
Accuracy and sensitivity enhanced