Post on 15-Dec-2015

Copyright © 2004 Synamatix sdn bhd (538481-U)

SynaBASE™: A novel structured-network pattern database platform for storage, ultra-high-throughput and sensitive data analysis

October 03 2006

Aims

To learn about current research priorities and bioinformatics initiatives

To review Synamatix science and technologies

To demonstrate Synamatix performance capabilities

To explore potential fit and research synergies

Synamatix Introductions

Robert Hercus - Synamatix, MD and Inventor

Australian, over 30 years IT Sciences experience

Pioneered many large-scale IT projects

Dr. Arif Anwar – Synamatix, CEO

British, Ph.D. Oxford Uni./UCL

12 yrs+ post-Ph.D. US and EU genomics background

Silicon Genetics, Becton-Dickinson-CLONTECH

Poh Yang Ming – Synamatix, Senior Bioinformatician

Malaysian, B.Sc. Biotechnology, M.Sc. IT

6 yrs Biotechnology industry and research

IMCB, Singapore

MUST

Johan Poole-Johnson – Synamatix, Accounts Manager

Australian, B.Com – Murdoch University, Australia

8 yrs+ Multinational and Start-up Technology Companies

4 yrs Experience in science informatics

Core IP Patented World-wide

SynaBASE™ Database Platform for high-throughput genomics

Market shifting towards very high-throughput genomics

High-growth market

Investing heavily in Personalised Genome and Healthcare revolution

Who’s who list of customers

USA, Europe, Australia and Singapore

Core competencies

Algorithm development

Software and UI

Bioinformatics and HPC know-how

Training/Support

International Collaborations

Database platform flexibility

New Customers in 2006

Command line interface

CORE Database platform

SynaRex Bulk

SynaProbe Bulk

SynaSearch Bulk

SynaMer

SynaFrag

SXSequenceRefs

SXLRESearch

SXFuzzyPatternSearch

SXPet

SXParse

Data analysis

Develop Tools

Another 20+ apps

Graphical Interface

Open platform approach

Applications and Research

User or Synamatix

Internal/Custom development

Modify Synamatix Applications at source level

IP owned by User

Why?

Current database platforms will not be able to scale to manage ever-increasing data volume and complexity

Novel database platform to meet needs, not a:

Suffix tree

Relational database

Suffix array

How?

What do we know about data?

Similarity & association

Common PATTERNS and functionality

[Figure: pattern network for the sequence A T G C A T G A A T……, shown as levels of overlapping patterns of increasing length]

A T G C

AT TG GC CA GA AA

ATG TGC GCA CAT TGA GAA AAT

ATGC TGCA GCAT CATG

ATGCA TGCAT

1. SynaBASE is very efficient – scales very well

When more data is added, the database does not grow in proportion, because sub-patterns may already exist

Only new leaf nodes are added; for existing patterns only references are stored

More efficient with more data

Every overlapping pattern, at every position is stored

Patterns are extended until they become unique
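The scaling argument on this slide can be illustrated with a toy index. This is only a minimal sketch and not the SynaBASE data structure: it stores every overlapping pattern at every position in a plain dictionary, and it caps pattern length with a hypothetical `max_len` parameter instead of extending patterns until they become unique. The function name `add_sequence` and the example sequences are assumptions made for illustration.

```python
from collections import defaultdict


def add_sequence(index, seq_id, seq, max_len=8):
    """Store every overlapping pattern (length 1..max_len) at every position.

    Returns the number of patterns that were new to the index. A pattern that
    already exists only gains one more (seq_id, offset) reference.
    """
    new_patterns = 0
    for start in range(len(seq)):
        for end in range(start + 1, min(start + max_len, len(seq)) + 1):
            pattern = seq[start:end]
            if pattern not in index:
                new_patterns += 1
            index[pattern].append((seq_id, start))
    return new_patterns


index = defaultdict(list)
first = add_sequence(index, "g1", "ATGCATGAAT")
second = add_sequence(index, "g2", "ATGCATGAAT")  # identical data
third = add_sequence(index, "g3", "ATGCATGTAT")   # one base changed

print(first, second, third)
```

In this toy setting the identical second sequence adds no new patterns, only references, and the third sequence adds new patterns only around the changed base.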

[Figure: database size (Mbytes, 0 to 250) against number of Streptococcus pneumoniae R6 genomes (0 to 100), comparing SynaBASE with a flat file. S. pneumoniae R6 genome size = 2.068 Mbytes]

2. SynaBASE enables very fast access

Number of levels is small

For a query:

Match the first longest pattern

Follow an Eulerian path through the network, picking up the longest matching pattern for each position in the query

Processing time is proportional to query size to obtain all unique subpatterns

[Figure: pattern network with levels A C T; AA AC CT TC; AAC ACT CTC; AACT ACTC CTCG TCGA; AACTC ACTCG CTCGA]
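The idea of picking up the longest matching pattern for each position in a query can be sketched as follows. This is a simplified stand-in, not the Eulerian-path traversal of the SynaBASE network: the stored patterns are held in a Python set, and the helper names `all_patterns` and `longest_matches`, the `max_len` cap and the example strings are assumptions for illustration. Because the work at each position is bounded by `max_len`, the total work grows in proportion to the query length.

```python
def all_patterns(sequence, max_len):
    """Every substring of `sequence` with length 1..max_len."""
    return {
        sequence[i:j]
        for i in range(len(sequence))
        for j in range(i + 1, min(i + max_len, len(sequence)) + 1)
    }


def longest_matches(query, patterns, max_len):
    """For each query position, find the longest stored pattern starting there."""
    hits = []
    for pos in range(len(query)):
        best = ""
        for end in range(pos + 1, min(pos + max_len, len(query)) + 1):
            candidate = query[pos:end]
            if candidate not in patterns:
                # All substrings are stored, so no longer extension can match.
                break
            best = candidate
        hits.append((pos, best))
    return hits


patterns = all_patterns("AACTCGA", max_len=5)
hits = longest_matches("ACTCGT", patterns, max_len=5)
for pos, pattern in hits:
    print(pos, pattern)
```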

Q * log N (base A)

[Figure: speed in milliseconds (100 to 900) against size of database (axis ticks 1, 10, 100, 1000), comparing Conventional with SynaBASE]

Case Study - Comparison of Human v Mouse genome

[Figure: time taken per tool: SynaBASE 6 h, PatternHunter 22 days, BLAST 3 yrs]

3. Increased sensitivity

BLASTN vs. SynaSearch-Bulk

Cumulative number of hits shows SynaSearch Bulk found extra hits at low-mid identities

SynaBASE and BLAST DB of 700000 Bacterial ORFs queried with 100 1kb sequences

Novel hits

4. Novel annotation using SynaBASE

The elephant and the giraffe walked up the mountain

A graph showing the frequency of “string (word)” patterns in a sentence does not reflect meaning

A graph showing the probabilities of predicting predecessor and successor characters/events (string significance) reflects meaning

SIGNIFICANCE = Actual Freq / Expected Freq

Pattern network: a1, a2, a3; a1a2, a2a3; a1a2a3

Expected Frequency

$$E_f(a_1a_2a_3) = \frac{F(a_1a_2)\,F(a_2a_3)}{F(a_2)}$$

$$\mathrm{Sig}(a_1a_2a_3) = \frac{F(a_1a_2a_3)}{E_f(a_1a_2a_3)} = \frac{F(a_1a_2a_3)\,F(a_2)}{F(a_1a_2)\,F(a_2a_3)}$$
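The significance ratio (actual frequency divided by expected frequency) can be worked through on a small example. The sketch below counts substring frequencies with a plain `Counter`; the function names `pattern_counts` and `significance` and the example sequence are assumptions for illustration, not Synamatix API calls.

```python
from collections import Counter


def pattern_counts(text, max_len=3):
    """Frequency F of every substring of length 1..max_len."""
    counts = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def significance(pattern, counts):
    """Sig(p) = F(p) / Ef(p), where Ef(p) = F(prefix) * F(suffix) / F(middle).

    For a1a2a3 the prefix is a1a2, the suffix is a2a3 and the middle is a2.
    """
    if len(pattern) < 3:
        raise ValueError("pattern needs at least three symbols")
    prefix, suffix, middle = pattern[:-1], pattern[1:], pattern[1:-1]
    if counts[prefix] == 0 or counts[suffix] == 0:
        return 0.0  # the pattern cannot occur, so there is nothing to score
    expected = counts[prefix] * counts[suffix] / counts[middle]
    return counts[pattern] / expected


counts = pattern_counts("ATGCATGAAT")
sig_atg = significance("ATG", counts)  # F(ATG)=2, F(AT)=3, F(TG)=2, F(T)=3
sig_gaa = significance("GAA", counts)  # F(GAA)=1, F(GA)=1, F(AA)=1, F(A)=4
print(sig_atg, sig_gaa)
```

Here ATG occurs exactly as often as its sub-patterns predict (ratio 1), while GAA occurs four times more often than expected from GA, AA and A.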

Gene models correlate with “SIGNIFICANCE”

On-going Research Case Studies & Performance Demonstration

Case Study 1 – contamination identification

High-throughput identification of contaminant reads on the basis of over-representation in a SynaBASE

Major problem as vector databases incomplete and/or not updated

Causes bottlenecks in sequence finishing pipeline

Analysis Steps

1. Build SynaBASE of 5239 Lamprey sequences using SXBuild

2. Extract patterns of length 40mer and above using SXPet

SynaBASE identifies 18,389 patterns

3. Filter patterns to remove polynucleotide repeats of more than 75% identical base composition

475 patterns removed, resulting in 17,914 Lamprey patterns

Function definitions

SXBuild: A SynaBASE API call for building SynaBASEs from raw sequence data

SXPet: A SynaBASE API call for reporting patterns based on frequency
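The extract-and-filter steps can be approximated in a few lines. This is a rough sketch under stated assumptions, not SXBuild or SXPet: patterns are fixed-length k-mers, "over-represented" is taken to mean "present in at least `min_reads` reads", and the 75% rule is read as "a single base makes up more than 75% of the pattern". The function names, the toy reads and the `vector` fragment are hypothetical.

```python
from collections import Counter


def overrepresented_patterns(reads, k, min_reads):
    """Return the k-mers that occur in at least `min_reads` different reads."""
    seen_in = Counter()
    for read in reads:
        # A set, so each read counts a k-mer once however often it repeats.
        seen_in.update({read[i:i + k] for i in range(len(read) - k + 1)})
    return {kmer for kmer, n in seen_in.items() if n >= min_reads}


def is_low_complexity(pattern, max_fraction=0.75):
    """True if one base makes up more than `max_fraction` of the pattern."""
    most_common_count = Counter(pattern).most_common(1)[0][1]
    return most_common_count / len(pattern) > max_fraction


vector = "GGATCCTCTAGAGTC"  # hypothetical shared contaminant fragment
reads = [
    "TTGACCGTA" + vector + "A" * 10,
    "CAGT" + vector + "GGCT" + "A" * 11,
    vector + "TTACGGCAT" + "A" * 10,
    "ACGTACGGTTCAGCATTGCAAGCT",
]

candidates = overrepresented_patterns(reads, k=8, min_reads=3)
kept = {p for p in candidates if not is_low_complexity(p)}
print(sorted(kept))
```

On this toy input the poly-A pattern is over-represented but removed by the composition filter, leaving only k-mers from the shared fragment.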

Verification (optional)

Search the resulting 17,914 patterns against the UniVec SynaBASE using Bulksearch

Map patterns back against vector source references

Unique vector contaminated sequences:

3374 / 5239 (60%)

Function definitions

Bulksearch: A SynaBASE API call for batch searching of sequences

By using an approach based upon filtering of over-represented patterns in SynaBASE, 100% of the vector contaminant sequences are identified.

This obviates the requirement for using the UniVec database for screening, in one step.

Case Study 2 – Overlapper

Task to accomplish

Original user data set and requirement was:

To find all overlapping exact 100-mers in 50 million 1 kb sequencing reads – i.e. 50 billion bp

Report n-mers that have a frequency >2 and <m

Using conventional software and approaches the user took 500 hrs and 1.5 TB of disc space to find all 100-mer overlaps

Hence standard approach limits usage to 32mers

Longer mers help bridge repetitive and low-complexity regions
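The reporting requirement (n-mers with a frequency >2 and <m) can be stated compactly in code. This is a minimal in-memory sketch, not SynaMer: a `Counter` over all n-mers would not scale to billions of bases, and the function name `repeated_nmers`, the toy reads and the small `n` are assumptions for illustration.

```python
from collections import Counter


def repeated_nmers(reads, n, max_freq):
    """Return {n-mer: frequency} for n-mers with 2 < frequency < max_freq."""
    counts = Counter(
        read[i:i + n]
        for read in reads
        for i in range(len(read) - n + 1)
    )
    return {nmer: c for nmer, c in counts.items() if 2 < c < max_freq}


reads = ["ACGTACGT", "TACGTT", "GGACGTA"]
shared = repeated_nmers(reads, n=4, max_freq=10)
capped = repeated_nmers(reads, n=4, max_freq=4)
print(shared, capped)
```

The upper bound is what screens out highly repetitive n-mers: lowering `max_freq` to 4 drops the n-mer that occurs four times.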

Long v Short n-mers: advantages and disadvantages

100 mer

+ve

Fewer false positives

Improvement in final assembly

-ve

Errors in reads may lead to false negatives

Slow to process with conventional software

Explanation of advantages

[Figure labels: Low-complexity region; A; B]

A shorter overlap results in more false positives

A longer overlap results in fewer false positives

Final assembly improved

Using SynaMer there is no time increase with longer n-mers

Conclusions

Processing 30 million 1 kb reads took 5 hours on a dual-CPU Itanium machine, with temporary file size less than 200 GB

Finding overlapping sequences for 33000 900 bp reads from a bacterial WGSS took less than 20 s

100-fold faster than conventional method

Allows use of longer n-mers

Potentially increases quality of assembly

SynaMer will be released as a product later this summer

Case Study 3 – 454 Life Sciences

Rapid genome assembly from 454 generated reads

Conventional approach to Genome Assembly

Cluster by sequence overlaps

Filter out repeats and detectable errors

Assemble each cluster into one or more contigs

Derive contig consensus

Validate results by comparison to a reference genome sequence (if available)

FragBASE – using the SynaBASE structure….

Select patterns of high coverage

Use corrected FragBASE

Use FragBASE network* to extend patterns

Increase pattern size to overcome shorter repeat sections

Stage 3 - error correction

Build a database of patterns - FragBASE

Compare patterns with M. genitalium and analyse

Database consists of:

Total patterns – f/r

Genitalium patterns – f/r

Error patterns – f/r

Fragments

Correct errors using significance

Corrected fragments

454 assembly result

400,000 reads assembled into 11 contigs in 11 minutes, 2 minutes for error correction

Genome coverage 99.89%

Case Study 4 - Plant Comparative Genomics

RefSeq plant release

Covers complete and partially sequenced genomes

74 898 419 bp in 205 780 sequences

Generate sequence alignments

Sequence-based clustering using common K-mers

Whole genome phylogeny

Performance Results

Sequence clustering based on shared K-mers
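One simple way to cluster sequences by shared K-mers is single-linkage grouping: two sequences join the same cluster when they share at least a threshold number of K-mers. The sketch below is an assumption about how such clustering could look, not the method used for these results; the function names, the `min_shared` threshold and the toy sequences are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k):
    """All distinct k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_by_shared_kmers(seqs, k, min_shared):
    """Single-linkage clusters of sequence names sharing >= min_shared k-mers."""
    names = list(seqs)
    parent = {name: name for name in names}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    kmers = {name: kmer_set(seqs[name], k) for name in names}
    for a, b in combinations(names, 2):
        if len(kmers[a] & kmers[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in clusters.values())


seqs = {
    "s1": "ATGCATGCAAGT",
    "s2": "TTGCATGCAAGG",
    "s3": "CCCGGGTTTAAA",
}
clusters = cluster_by_shared_kmers(seqs, k=5, min_shared=3)
print(clusters)
```

The all-pairs comparison is quadratic in the number of sequences, so this form only suits small inputs; an index from K-mer to sequences would avoid comparing unrelated pairs.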

Case study 5 - Pattern Frequency statistics and SynaBASE

SynaBASE stores all patterns from data

Pattern frequencies and offsets on source sequences

Characterize/annotate data

Sequence clustering

Conserved regions

Simple and complex repeats

Genome segmental duplications

Yeast Genes SynaBASE Frequency Statistics

Arabidopsis thaliana (thale cress)

Human

Mus musculus

All Bacteria genomes

Summary

Cutting-edge Bioinformatics: SynaBASE novel database PLATFORM

Unique

Patented worldwide

Leads to massive increases in speed and scalability

Accuracy and sensitivity enhanced