Andrew Meade ([email protected])
School of Biological Sciences
Molecular sequence growth rates
From 600 to 100 million sequences in 25 years
Human Genome project
Molecular sequence growth rates
18 million new sequences a year (2007 – 2008)
Rate of growth is accelerating
Doubling every 2 years
Likely to continue with new sequencing technology
Cost, time and technical ability required have all fallen
It's worse than it looks
Lack of suitable tools for sequence analysis
Analysis methods don't always scale linearly
Methods have changed
Simple heuristics -> Statistical methods
Simple rules -> More realistic models
Descriptive results -> Biological process
Sub system analysis -> Systems biology
Computing power is a major rate-limiting step
The gap between data and analytical methods is widening
Tools for genomic analysis
Current tools
Co-opted for purpose
Designed for smaller data sets
Limited to a single computer
External data required
Hard to generalise
Required tools
Custom built
Limited by available hardware
Use available computers
Models derived from data
Identify what is informative in the data
454 parallel sequencing
Fast: 400-600 million bases per 10 hours
Human genome in 100 hours; HGP took 13 years
Cheap: 20¢ per kb, currently $12
Human genome for $100,000; HGP cost $10 billion
Accurate: 99% accurate on the 400th base
Small chunks: 400 – 800 bases per sequence
Similar to parallel computing: hard to convert raw power into useful results
The catch: analysis
454 sequencing
Sequence populations of bacteria (16S) taken from cow guts under different experimental conditions
Identify how changes in feed affect bacterial populations
332,000 sequences in total
£8,000 using 454, previously over £2 million
454 sequencing analysis
Find how closely related the sequences are to each other
Perform an approximate match between all pairs of sequences, allowing for insertions, deletions and mutations
332,000^2 * 0.5 = 5.5 * 10^10 comparisons
874 years on a single computer
Trivially parallel task, easy to distribute over nodes, different clusters, different OS / hardware
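A minimal sketch of the all-pairs step, not the code used in the project. It assumes a plain Levenshtein edit distance as the approximate match (the slides do not name the actual method), and the names edit_distance, total_comparisons and pairs_for_node are hypothetical. Each node takes every n_nodes-th row of the comparison matrix, which is one simple way to split a trivially parallel task.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def total_comparisons(n_sequences: int) -> int:
    """The slide's estimate, n^2 * 0.5 (slightly above the exact n(n-1)/2)."""
    return n_sequences * n_sequences // 2


def pairs_for_node(n_sequences: int, node: int, n_nodes: int):
    """Yield the (i, j) pairs, i < j, assigned to one node (rows dealt round-robin)."""
    for i in range(node, n_sequences, n_nodes):
        for j in range(i + 1, n_sequences):
            yield i, j


if __name__ == "__main__":
    print(f"{total_comparisons(332_000):.2e} comparisons")

    # Toy sequences, invented for illustration only.
    seqs = ["ACGTACGT", "ACGTTCGT", "ACGACGT", "TTTTACGT"]
    for node in range(2):
        for i, j in pairs_for_node(len(seqs), node, 2):
            print(f"node {node}: {i} vs {j} -> {edit_distance(seqs[i], seqs[j])}")
```

No pair depends on any other, so nodes never need to communicate during this step. Taken together, the slide's figures (874 years for 5.5 * 10^10 comparisons) imply roughly half a second per comparison on a single computer.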
454 sequencing analysis 2
Cluster sequences from the previous step to find which species are present and in what quantities
102 GB of data
Distributed code to reduce memory and processing requirements
Linear scaling (memory, CPU) up to 200 nodes
Problems with disk access
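A sketch of how the clustering step could look on a single machine. The slides do not say which clustering algorithm was used, so this assumes single-linkage grouping: any two sequences whose distance from the previous step falls under some chosen threshold end up in the same cluster. The names cluster_by_threshold and cluster_sizes, and the notion of a list of "close pairs", are hypothetical.

```python
from collections import Counter


def cluster_by_threshold(n_items, close_pairs):
    """Single-linkage clustering with union-find.

    close_pairs holds (i, j) index pairs already judged close enough;
    the result is one cluster label per item.
    """
    parent = list(range(n_items))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for i, j in close_pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i  # merge the two clusters

    return [find(i) for i in range(n_items)]


def cluster_sizes(labels):
    """Cluster sizes, largest first: a rough 'what quantities' summary."""
    return sorted(Counter(labels).values(), reverse=True)


if __name__ == "__main__":
    # Toy input: six sequences, three pairs under the threshold.
    labels = cluster_by_threshold(6, [(0, 1), (1, 2), (3, 4)])
    print(labels, cluster_sizes(labels))
```

This version keeps one array entry per sequence in memory and nothing more, but it is not distributed; the 102 GB of pairwise data would still need to be streamed from disk or split across nodes.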
Bayesian phylogenetic inference
Infer evolutionary histories (phylogenies) from molecular data
Widely used in all areas of biology
Used to investigate how genes and proteins change and adapt to their environment
How viruses spread and mutate
Reconstruct ancestral genes and proteins
Used in conservation studies to identify species that are most at risk of extinction and most valuable to conserve
Mammal Mitochondrial
44 taxa
13 protein-coding regions
16,400 nucleotides
Number of computers
1 computer: ~70 days
60 computers: ~2 days
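The two run times above can be turned into the usual speedup and parallel efficiency figures. This is a small worked calculation, assuming the approximate times on the slide are taken at face value; the function names are hypothetical.

```python
def speedup(serial_time: float, parallel_time: float) -> float:
    """How many times faster the parallel run is than the single-computer run."""
    return serial_time / parallel_time


def efficiency(serial_time: float, parallel_time: float, n_nodes: int) -> float:
    """Speedup per computer; 1.0 would be perfect linear scaling."""
    return speedup(serial_time, parallel_time) / n_nodes


if __name__ == "__main__":
    serial_days, parallel_days, computers = 70, 2, 60  # approximate slide figures
    print(f"speedup: {speedup(serial_days, parallel_days):.0f}x")
    print(f"efficiency: {efficiency(serial_days, parallel_days, computers):.0%}")
```

With these rounded inputs the speedup is 35x on 60 computers, so a bit under 60% efficiency. Because both times are approximate, the true figure could differ noticeably.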
Mammal Mitochondrial scaling
[Figure: scaling plot]