Post on 03-Feb-2016
Developing Accessible Application Software for Individual de novo Genome Projects
Vince Forgetta, PhD Candidate
Ken Dewar, PhD, Supervisor
Department of Human Genetics, McGill University, Montreal, Quebec, Canada
December 8th, 2011
Next-Gen Gap
Bacterial genome in < 1 week for ~ $3000 (+ Genome Assembly)
(Nature Methods 6, S2-S5 (2009)):
“Unfortunately, the software and computer hardware demands on these analyses are not much less than those of the large Genome Centers. From this perspective, the gap between large-scale genome centers and individual investigators may seem to be growing, not shrinking, as the next-generation platforms’ apparent promise of a ‘Genome Center in a box’ may have only been half delivered, providing data without a full suite of tools.”
Download Data → Learn *NIX → Install Software and Dependencies → Run Software → … Wait? … Problems?
Three Common Methodologies in de novo Genome Analysis
1. Display and analysis of genome annotations
2. Quality assessment of a genome assembly
3. Comparison and mining of genomic data from public repositories.
One or more methodologies were used to address needs in three specific projects; the projects served as a vehicle to develop software:
Project                            Software        Methodology
C. difficile 14 Genome Comparison  cgb             1. Genome Display
Multi-centre WGS of O. novo-ulmi   ContiGo         2. Assembly QA
E. fergusonii ECD-227              BLAST in Pivot  3. Data Mining
Assembly Quality Assessment
Assembly Analysis
• Researchers should have easy access to their assembly so they can assess its quality and perform simple analyses.
[Diagram: Researcher and Sequencing Centre exchanging DNA and Assembly]
• Delays and limits on data access exist:
  - Viewers must be installed and have specific software (e.g. Linux) or hardware (e.g. RAM) requirements.
  - Assembly data (multiple GBs) must be downloaded.
Objective
• Develop a simple assembly viewer that operates within a web-browser, allowing a researcher to rapidly analyze and access their data.
Method
Parser/Converter: Python is used to parse, analyze, and convert assembly data into web-accessible formats (HTML, JSON, JPG images), which are stored on sequencing centre servers.
Interface: A browser-based interface (HTML) dynamically accesses the data on the servers (JavaScript). It incorporates pre-existing web technologies (jQuery, Seadragon Deep Zoom, AJAX).
Usage:
- After genome assembly, the parser/converter is run on sequencing centre servers.
- The researcher accesses the interface over the internet using a modern web browser.
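The conversion step described above might look roughly like the following. This is a minimal sketch, not ContiGo's actual code: it assumes the assembly is available as FASTA text, and the function names and JSON field names (`contigs`, `name`, `length`, `gc_percent`) are hypothetical choices for illustration.

```python
import json


def parse_fasta(text):
    """Yield (name, sequence) pairs from FASTA-formatted text."""
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            # Use the first word of the header as the contig name.
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def contig_record(name, seq):
    """Summarize one contig as a JSON-serializable dict (hypothetical fields)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    gc_percent = round(100.0 * gc / len(seq), 2) if seq else 0.0
    return {"name": name, "length": len(seq), "gc_percent": gc_percent}


def assembly_to_json(fasta_text):
    """Convert FASTA text into a JSON document a browser could fetch."""
    records = [contig_record(n, s) for n, s in parse_fasta(fasta_text)]
    return json.dumps({"contigs": records}, indent=2)
```

A server-side script would write the returned string to a file next to the HTML interface, so the browser can request it with AJAX instead of downloading the full assembly.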
Performance
Parser/Converter:
– Multiple platforms (Windows/OS X/Linux).
– Multi-processor support.
– Low memory usage (< 250 MB per processor).
User interface:
– Client-side programming decreases server load.
– Data is downloaded on demand, which suits limited-bandwidth users.
– Sole system requirement is a modern web browser (Firefox, Opera, Google Chrome), for ease of installation.
– Low memory usage (peaks at ~ 250 MB).
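On-demand download of a large contig image can be illustrated with the tile pyramid used by Deep Zoom style viewers: the image is stored at successively halved resolutions, each cut into tiles, and the browser requests only the tiles in view. The sketch below computes such a pyramid. The slides do not give ContiGo's tiling parameters, so the 256-pixel tile size and the function name are assumptions.

```python
import math


def deepzoom_levels(width, height, tile_size=256):
    """Return (level, width, height, columns, rows) for each pyramid level.

    Level 0 is a 1x1 image; the highest level is the full-resolution image.
    tile_size=256 is an assumed value, not one taken from the slides.
    """
    if width < 1 or height < 1:
        raise ValueError("width and height must be positive")
    # Integer form of ceil(log2(n)), which avoids floating-point error.
    max_level = (max(width, height) - 1).bit_length()
    levels = []
    for level in range(max_level + 1):
        scale = 2 ** (max_level - level)
        w = math.ceil(width / scale)
        h = math.ceil(height / scale)
        cols = math.ceil(w / tile_size)
        rows = math.ceil(h / tile_size)
        levels.append((level, w, h, cols, rows))
    return levels
```

For a 20000 x 1000 pixel contig image this gives 16 levels; a zoomed-out view needs a single tile, while the full-resolution level is split into 79 x 4 tiles of which only the visible ones are fetched.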
The Interface
Table of contig/scaffold statistics:
• Sort/filter by column
• Access to contig sequence/quality and read sequences
Assembly statistics, batch download of sequence and statistical data.
Dynamic Charts:
• Toggle axis value
• Identify points
• Summarize regions
Contig Assembly:
- Pan/Zoom
- Identify position, read names, mismatches
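The kind of assembly statistics and column sorting/filtering listed above can be sketched as follows. The slides do not say which statistics ContiGo reports; N50 is used here only because it is a common assembly summary, and all function and field names are hypothetical.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def assembly_summary(contigs):
    """Summarize a list of {'name': ..., 'length': ...} contig records."""
    lengths = [c["length"] for c in contigs]
    return {
        "contig_count": len(lengths),
        "total_length": sum(lengths),
        "largest_contig": max(lengths) if lengths else 0,
        "n50": n50(lengths),
    }


def filter_and_sort(contigs, min_length=0, key="length", descending=True):
    """Keep contigs of at least min_length, sorted by the chosen column."""
    kept = [c for c in contigs if c["length"] >= min_length]
    return sorted(kept, key=lambda c: c[key], reverse=descending)
```

In a browser-based viewer the same sorting and filtering would run client-side in JavaScript; the Python version only illustrates the logic.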
Demo
3. Data Mining
Microsoft Research Summer Internship
Microsoft Biology Foundation
Redmond, Washington, USA
Mentor - Simon Mercer
[Slide graphic: BLAST, Pivot]
blip.codeplex.com
BLAST
[Figure: a query sequence (ACGTCACTGACTGACTAGCTAGCTAGCTAGCATCGATCGATCGATCGATCGATCGACGTAACTAGCACGACTGACTCT) of unknown origin ("??") is searched with BLAST, at NCBI or locally, to obtain Species, Function, …]
Limitation
[Figure: E. coli, ~5000 genes; Scientist + Programmer]
>gi|301326298|ref|ZP_07219671.1| TIM-barrel protein, nifR3 family [Escherichia coli MS 78-1]
Length=321

 Score = 583.563 bits (1503), Expect = 8.65371E-165
 Identities = 280/281 (100%), Positives = 280/281 (100%), Gaps = 0/281 (0%)
 Frame = 0

Query  1    MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  60
            MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC
Sbjct  41   MMSSNPQVWESDKSRLRMVHIDEPGIRTVQIAGSDPKEMADAARINVESGAQIIDINMGC  100

Query  61   PAKKVNRKLAGSALLQYPDVVKSILTEVVNAVDVPVTLKIRTGWAPEHRNCEEIAQLAED  120
            PAKKVNRKLAGSALLQYPDVVKSILTEVVN VDVPVTLKIRTGWAPEHRNCEEIAQLAED
Sbjct  101  PAKKVNRKLAGSALLQYPDVVKSILTEVVNTVDVPVTLKIRTGWAPEHRNCEEIAQLAED  160

Query  121  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  180
            CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA
Sbjct  161  CGIQALTIHGRTRACLFNGEAEYDSIRAVKQKVSIPVIANGDITDPLKARAVLDYTGADA  220

Query  181  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  240
            LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR
Sbjct  221  LMIGRAAQGRPWIFREIQHYLDTGELLPPLPLAEVKRLLCAHVRELHDFYGPAKGYRIAR  280

Query  241  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  281
            KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA
Sbjct  281  KHVSWYLQEHAPNDQFRRTFNAIEDASEQLEALEAYFENFA  321
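Reading species and function out of thousands of text reports like the one above is the step that normally needs a programmer. As a rough illustration of that step, the sketch below pulls a few fields out of a single hit with regular expressions. It is not BL!P's implementation: the function name and the returned field names are hypothetical, and it assumes the defline ends with the species in square brackets, as in this example.

```python
import re

DEFLINE = re.compile(r">(\S+)\s+(.*?)\s*\[([^\]]+)\]")
EXPECT = re.compile(r"Expect\s*=\s*([0-9.eE+-]+)")
IDENTITIES = re.compile(r"Identities\s*=\s*(\d+)/(\d+)")


def summarize_hit(report):
    """Extract accession, function, species, E-value, and identity from one hit."""
    defline = DEFLINE.search(report)
    expect = EXPECT.search(report)
    identities = IDENTITIES.search(report)
    if not (defline and expect and identities):
        raise ValueError("unrecognized BLAST hit format")
    ids, function, species = defline.groups()
    matches, aligned = (int(x) for x in identities.groups())
    return {
        # Last non-empty field of e.g. 'gi|301326298|ref|ZP_07219671.1|'.
        "accession": ids.strip("|").split("|")[-1],
        "function": function,
        "species": species,
        "evalue": float(expect.group(1)),
        "percent_identity": round(100.0 * matches / aligned, 1),
    }
```

Applied to the hit above, this yields the species "Escherichia coli MS 78-1" and the function "TIM-barrel protein, nifR3 family"; fields of this kind are what a visual browser could then expose for sorting and filtering.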
BLAST in Pivot
[Figure: numbered workflow (1, 2, 3) linking query sequences of unknown origin ("??"), BLAST, and Pivot]
E. coli ECD-227
Acknowledgement: Moussa Diarra, Heidi Rempel
[Figure: E. coli ECD-227 compared against E. coli. Questions: Species? Function? Findings: Antibiotic Resistant! Divergent Strain.]
Demo
Conclusions
ContiGo: used by clients of the Genome Centre at McGill (release soon).
BL!P: >500 downloads (blip.codeplex.com).
Acknowledgements

C. difficile
Ken Dewar
Andre Dascal, Matthew Oughton
Joana Dias, Gary Leveque
Pascale Marquis, Corina Nagy
Amelie Villeneuve, Ivan Brukner, Mark Miller
Vivian Loo, Mike Mulvey, Dale Gerding, Maya Rupnik, Elaine Mardis
V. Magrini, M. Hickenbotham
K. Haub, C. Markovic, J. Nelson

Ophiostoma novo-ulmi
Jan Kieleczawa, Michael Zianni, Robert Steen
Deborah Grove, Anoja Perera
Robert Lyons Jr., Sushmita Singh, Doug Bintzler, Scottie Adams, Deborah Grove, Gregory Grove
Robert Lyons Jr., Suzanne Genik
Chris Wright, Alvaro Hernandez, Sharon Bachman
Lorie Hetrick, Sushmita Singh
Nichole Peterson, Gary Leveque
Joana Dias, Clotilde Teiling, Tim Harkins

E. coli ECD-227
H. Rempel
Andrew Metcalfe, M. S. Diarra

BL!P/Microsoft
Simon Mercer
Xin-Yi Chua
Mauro Luigi Drago
Beatriz Diaz Acosta
Vivek Kumar
Bob Davidson
Mike Zyskowski, Xiaoji Chen
Bob Silverstein, Vikram Bapat, Jared Jackson
Wei Lu, The Pivot Team