Bioinformatics for high-throughput DNA sequencing
Gabor Marth, Boston College Biology
New grad student orientation, Boston College, September 8, 2009
DNA sequence variations
The Human Genome Project has determined a reference sequence of the human genome
However, every individual is unique, and is different from others at millions of nucleotide locations
Why do we care about variations?
underlie phenotypic differences
cause inherited diseases
allow tracking ancestral human history
Human genetic variation
The first “famous” genomes
Genome sequencing
Genome sizes shown: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb
New sequencing technologies…
Next-gen sequencing – a revolution
[Figure: bases per machine run (1 Mb to 100 Gb) plotted against read length (10 bp to 1,000 bp)]
ABI capillary sequencer
Roche/454 pyrosequencer (100-400 Mb in 200-450 bp reads)
Illumina/Solexa, AB/SOLiD sequencers (10-30 Gb in 25-100 bp reads)
The re-sequencing informatics pipeline
[Figure: pipeline diagram; labels REF, IND, GigaBayes]
(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation
Tools
2. Read mapping
Read mapping is like a jigsaw puzzle…
…you get the pieces…
… and they give you the picture on the box
Big and unique pieces are easier to place than others…
The MOSAIK read mapping program
• Reads from repeats cannot be uniquely mapped back to their true region of origin
Michael Strömberg (Wan-Ping Lee)
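The point about repeats can be illustrated with a toy exact-match mapper. This is a minimal sketch, not MOSAIK's actual algorithm: the reference string, the seed length `K`, and the function names `build_index`, `map_read`, and `classify` are all made up for illustration, and real mappers also tolerate mismatches and gaps.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read: str, reference: str, index: dict, k: int) -> list:
    """Return every reference position where the read matches exactly."""
    seed = read[:k]  # the first k bases of the read act as the seed
    hits = []
    for pos in index.get(seed, []):
        # Extend the seed hit: accept it only if the whole read matches.
        if reference[pos:pos + len(read)] == read:
            hits.append(pos)
    return hits


def classify(hits: list) -> str:
    """Label a read by how many places it can be put."""
    if not hits:
        return "unmapped"
    return "unique" if len(hits) == 1 else "ambiguous"


# Toy reference (hypothetical): the motif ACGTACGT occurs twice, like a repeat.
REFERENCE = "TTGACGTACGTCCATGGCAACGTACGTGGA"
K = 4
INDEX = build_index(REFERENCE, K)

unique_read = "CCATGGCA"   # taken from the non-repeated middle of the reference
repeat_read = "ACGTACGT"   # taken from the repeated motif

for read in (unique_read, repeat_read):
    hits = map_read(read, REFERENCE, INDEX, K)
    print(read, hits, classify(hits))
```

The read drawn from the repeated motif lands in two places, so on its own it cannot be assigned to its true region of origin. The read from unique sequence has exactly one placement.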
SNP discovery
GigaBayes
Marth et al. Nature Genetics 1999
Quinlan et al. in prep.
(Amit Indap, Wen Fung Leong)
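As a rough illustration of what Bayesian SNP calling involves, the sketch below computes posterior probabilities for the three diploid genotypes at one site from the aligned bases and their Phred qualities. It is a simplified textbook-style model, not GigaBayes itself: the function name `genotype_posteriors`, the prior value `prior_het`, and the uniform error model are assumptions made for illustration.

```python
import math


def genotype_posteriors(bases, quals, ref, alt, prior_het=0.001):
    """Posterior probabilities of ref/ref, ref/alt and alt/alt at one site."""
    # Assumed priors (hypothetical values, chosen for illustration only).
    priors = {
        "ref/ref": 1.0 - 1.5 * prior_het,
        "ref/alt": prior_het,
        "alt/alt": 0.5 * prior_het,
    }
    # Fraction of alt alleles carried by each diploid genotype.
    alt_fraction = {"ref/ref": 0.0, "ref/alt": 0.5, "alt/alt": 1.0}

    log_post = {}
    for genotype, prior in priors.items():
        f = alt_fraction[genotype]
        log_lik = 0.0
        for base, qual in zip(bases, quals):
            # Phred quality -> error probability, capped at 0.75, the point
            # where a base call carries no information about the true allele.
            err = min(10 ** (-qual / 10), 0.75)
            # Assumed error model: a miscall is equally likely to produce
            # any of the three other bases.
            p_given_ref = 1 - err if base == ref else err / 3
            p_given_alt = 1 - err if base == alt else err / 3
            log_lik += math.log((1 - f) * p_given_ref + f * p_given_alt)
        log_post[genotype] = math.log(prior) + log_lik

    # Normalize in log space for numerical stability.
    top = max(log_post.values())
    weights = {g: math.exp(lp - top) for g, lp in log_post.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


# Eight reads covering one site: four show the reference base, four the alternate.
posteriors = genotype_posteriors("AAAAGGGG", [30] * 8, ref="A", alt="G")
best_call = max(posteriors, key=posteriors.get)
print(best_call, posteriors)
```

With an even split of high-quality reference and alternate bases, the heterozygous genotype dominates despite its small prior. With all eight bases matching the reference, the same function favors ref/ref.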
Structural variation discovery
Navigation bar
Fragment lengths in selected region
Depth of coverage in selected region
Stewart et al. in prep.
(Deniz Kural, Jiantao Wu)
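The fragment-length panel hints at one common signal for structural variation: read pairs that map much farther apart, or much closer together, than the library's typical fragment length. The sketch below flags such pairs with a simple robust outlier test. It is only an illustration, not the lab's SV caller: the function name `flag_discordant_pairs`, the `n_mads` threshold, and the example lengths are hypothetical.

```python
from statistics import median


def flag_discordant_pairs(fragment_lengths, n_mads=5.0):
    """Return (index, length, label) for pairs whose mapped length is an outlier."""
    center = median(fragment_lengths)
    # Median absolute deviation: a spread estimate that tolerates outliers.
    mad = median(abs(x - center) for x in fragment_lengths)
    mad = max(mad, 1.0)  # avoid a zero-width band when most lengths are identical
    flagged = []
    for i, length in enumerate(fragment_lengths):
        if length > center + n_mads * mad:
            flagged.append((i, length, "longer than expected: candidate deletion"))
        elif length < center - n_mads * mad:
            flagged.append((i, length, "shorter than expected: candidate insertion"))
    return flagged


# Hypothetical mapped fragment lengths (bp) for read pairs in one region.
lengths = [398, 402, 405, 395, 400, 401, 399, 1450, 403, 397]
flagged = flag_discordant_pairs(lengths)
for index, length, label in flagged:
    print(index, length, label)
```

A pair that spans a deletion in the sequenced individual maps too far apart on the reference, which is why the 1,450 bp pair is flagged. In practice several independent pairs, and ideally a matching drop in depth of coverage, would be expected to support a call.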
Sequence alignment viewers
Huang et al. Genome Research 2008
(Derek Barnett)
Data mining
Mutational profiling in deep 454 data
• Pichia stipitis is a yeast that efficiently converts xylose to ethanol (bio-fuel production)
• one specific mutagenized strain had especially high conversion efficiency
• goal was to determine where the mutations were that caused this phenotype
• we analyzed 10 runs (~3 million reads) of 454 data (~20x coverage of the 15 Mb genome)
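The coverage figure follows from a simple relation: mean coverage = number of reads × mean read length / genome size. The sketch below applies it to the rounded numbers on the slide. The mean read length is not stated in the source, so it is back-calculated here rather than assumed, and the function names are made up for illustration.

```python
def mean_coverage(n_reads, mean_read_length_bp, genome_size_bp):
    """Average number of reads covering each base of the genome."""
    return n_reads * mean_read_length_bp / genome_size_bp


def implied_read_length(n_reads, coverage, genome_size_bp):
    """Mean read length needed to reach a given coverage."""
    return coverage * genome_size_bp / n_reads


N_READS = 3_000_000       # ~3 million reads (from the slide)
GENOME_BP = 15_000_000    # 15 Mb genome (from the slide)
COVERAGE = 20             # ~20x coverage (from the slide)

read_length = implied_read_length(N_READS, COVERAGE, GENOME_BP)
print(read_length)                                     # 100.0
print(mean_coverage(N_READS, read_length, GENOME_BP))  # 20.0
```

Taken at face value, the slide's figures imply a mean read length of about 100 bp. Since all three inputs are approximate, this is an order-of-magnitude check and not a measured value.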
Pichia stipitis reference sequence
• found 39 mutations
• informatics analysis in < 24 hours (including manual checking of all candidates)
Image from JGI web site
Smith et al. Genome Research 2008
SNP calling in short-read coverage
C. elegans reference genome (Bristol, N2 strain)
Pasadena, CB4858 (1 ½ machine runs)
Bristol, N2 strain (3 ½ machine runs)
• goal was to evaluate the Solexa/Illumina technology for the complete resequencing of large model-organism genomes
• 5 runs (~120 million Illumina reads) were collected by Washington Univ.
SNP
• we found 45,000 SNPs with a very high validation rate
Hillier et al. Nature Methods 2008
Current focus
1000 Genomes Project
• data quality assessment
• project design (# samples, depth of read coverage)
• read mapping
• SNP calling
• structural variation discovery
SV discovery in autism
deletion
amplification
Lab
People
Resources
• computer cluster (72 servers)
• 128 GB RAM server
• ~200 TB disk space
• 2 R01 grants (NHGRI/NIH)
• 1 R21 grant (NIAID/NIH)
• a BC RIG grant
• 2 RC2 grants (NHGRI/NIH) starting September 2009
Collaborations
Baylor HGSC
Wash. U. GSC
Genome Canada
UBC GSC
Cornell
UC Davis
UCSF
NCBI @ NIH
NCI @ NIH
Marshfield Clinic
UCLA
Pfizer
Graduate student rotations
• Looking for new graduate students
• Spots are available for all three rotations
• Lots of projects
• Caveat: you need to be able to program…
• Check us out at: http://bioinformatics.bc.edu/marthlab/
• If you are interested, please talk to me