Can Python Save Next Generation Sequencing?
Chris Mueller, Life Technologies, June 30, 2010
SciPy 2010, Austin, TX
Human Genome Project
10 Years; Thousands of Sequencers; $3,000,000,000; 21 Billion Base Pairs (Gbp)
Modern Sequencing
2 weeks; One Sequencer; $6,000; 100‐200 Gbp
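The contrast between the two slides can be put on a per-Gbp basis. The sketch below only divides the figures as stated on the slides; it assumes the 21 Gbp figure is the total amount sequenced by the Human Genome Project, which the slide does not spell out.

```python
# Figures as stated on the slides (the interpretation of 21 Gbp as
# "total sequenced" is an assumption).
hgp_cost_usd = 3_000_000_000
hgp_gbp = 21
modern_cost_usd = 6_000
modern_gbp_low, modern_gbp_high = 100, 200

hgp_per_gbp = hgp_cost_usd / hgp_gbp
modern_per_gbp_worst = modern_cost_usd / modern_gbp_low    # 60.0
modern_per_gbp_best = modern_cost_usd / modern_gbp_high    # 30.0

print(f"HGP:    ${hgp_per_gbp:,.0f} per Gbp")
print(f"Modern: ${modern_per_gbp_best:,.0f} to ${modern_per_gbp_worst:,.0f} per Gbp")
print(f"Reduction: {hgp_per_gbp / modern_per_gbp_worst:,.0f}x "
      f"to {hgp_per_gbp / modern_per_gbp_best:,.0f}x")
```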
TCA-AGCAGCAGGA ||| || ||| ||| TCATAG-AGCGGGA
10 GB – 2 TB of raw data is transferred
from the instrument to the analysis cluster.
Next Generation Sequencing (NGS)
...
>187_29_706_F3
T23302010303131123123022203111123200210100122001102
>187_29_800_F3
T31120012213222002222130121121122112032220323121202
>187_29_824_F3
T22211130023020133231323302310303131123123022201211
>187_29_829_F3
T23302010003130123123022203111120122123202132301212
>187_29_858_F3
T23302010303131123123022203111123222123122122321212
>187_29_885_F3
T23302010303131123123022203111121220013212122021222
...
The sequence from each bead is reported in a data file. Files
can be over 100 GB.
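Because these files can exceed 100 GB, a reader in Python would normally stream records rather than load the file. The following is a minimal sketch, assuming each `>` identifier and each sequence sits on its own line and that lines starting with `#` are comments; the function name `iter_reads` is hypothetical and not part of any SOLiD™ tooling.

```python
from typing import Iterable, Iterator, Tuple


def iter_reads(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (read_id, sequence) pairs from '>id' / sequence line pairs.

    Accepts any iterable of lines, including an open file handle, so
    only one record is held in memory at a time.
    """
    read_id = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and (assumed) comment headers
        if line.startswith(">"):
            read_id = line[1:]
        elif read_id is not None:
            yield read_id, line
            read_id = None


# Two records copied from the slide.
sample = """\
>187_29_706_F3
T23302010303131123123022203111123200210100122001102
>187_29_800_F3
T31120012213222002222130121121122112032220323121202
"""

reads = list(iter_reads(sample.splitlines()))
```

With a real file, passing the open handle (`iter_reads(open(path))`, where `path` is whatever file you have) keeps memory use flat regardless of file size.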
Sample prep breaks the DNA or RNA into short
segments that are attached to 500 million to 1
billion beads.
+
The bead sequences are either assembled into a new
genome or “mapped” to a reference genome.
=
The mapped reads from RNA samples can be further analyzed to
determine which genes are active in the sample.
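As a rough illustration of that last step, expression can be approximated by counting how many mapped reads fall inside each annotated gene. This is a minimal sketch with made-up gene intervals and read positions; it assumes non-overlapping genes on a single chromosome and is not the method of any particular pipeline.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical, non-overlapping gene intervals on one chromosome:
# (start, end, name), sorted by start, coordinates inclusive.
genes = [(100, 199, "geneA"), (300, 449, "geneB"), (600, 650, "geneC")]
starts = [g[0] for g in genes]


def gene_at(position):
    """Return the name of the gene covering position, or None."""
    i = bisect_right(starts, position) - 1  # last gene starting at or before position
    if i >= 0 and genes[i][0] <= position <= genes[i][1]:
        return genes[i][2]
    return None


def count_expression(read_positions):
    """Count mapped reads per gene; reads outside all genes are ignored."""
    counts = Counter()
    for pos in read_positions:
        name = gene_at(pos)
        if name is not None:
            counts[name] += 1
    return counts


# Made-up mapped start positions for a handful of reads.
mapped = [105, 150, 199, 250, 300, 301, 620, 700]
counts = count_expression(mapped)
```

The binary search keeps each lookup logarithmic in the number of genes, which matters when the read count runs into the billions.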
Next Generation Sequencing (NGS) instruments such as Life Technologies SOLiD™ sequence genomic material in millions of small pieces, enabling a high level of throughput and sequencing depth.
Acquire a Reference
Sequence the diseased genome
GATCAACTTG AGGCCAGCCT GACCAACGTG GCAAAACTCC ATCTCTACTA AATTAGCTGA GCGTGGTGGC ACGCACCTGT CATCCCAGTT ACTCAGGAGG AGAATGGCTT GAGCCTGGGA GACAGAGGTT GCAGTGAGCT GAGATTGCAT TAGCCTGGGT GACAGAGTGA AATGGAGGGA GGAAAAAAAA AAAAAAGGAA AGGGAGCCAG CCTAGGATGG GGAAGGCTCA CCAGAAGTGG ATGCAAAGAG GAGCTCATTC TATTTGCCTA GGAAAGAAAA ACGTCCAGAA ACCTGGCCTT GCCGAGGCCC TCCAGGAAAG CCAGGCAGAC CCTGCTCCTG CTCTGACCCC
Sequence the expressed RNA
Annotate the diseased genome
Compare against the healthy sample
The Goal: Medical Genomics
2‐3 HD Movies
6% of Google’s Index
30x Google’s Daily Search Traffic
Sequence the diseased genome; 100 Gbp; 36 Hours / 4 servers
100 Gbp; 15 Billion Search Ops; Sequence the expressed RNA
Annotate the diseased genome; Millions of Features
Compare against the healthy sample; Human Driven Analysis and Interpretation
Acquire a Reference; 10 GB
GATCAACTTG AGGCCAGCCT GACCAACGTG GCAAAACTCC ATCTCTACTA AATTAGCTGA GCGTGGTGGC ACGCACCTGT CATCCCAGTT ACTCAGGAGG AGAATGGCTT GAGCCTGGGA GACAGAGGTT GCAGTGAGCT GAGATTGCAT TAGCCTGGGT GACAGAGTGA AATGGAGGGA GGAAAAAAAA AAAAAAGGAA
AGGGAGCCAG CCTAGGATGG GGAAGGCTCA CCAGAAGTGG ATGCAAAGAG GAGCTCATTC TATTTGCCTA GGAAAGAAAA ACGTCCAGAA ACCTGGCCTT GCCGAGGCCC TCCAGGAAAG CCAGGCAGAC CCTGCTCCTG CTCTGACCCC GATATGTAGA AAAGAGGAGA TGGGCTTTGG CCCAGAGGAC AGCAGCGTTA CAGTTCCCAG TCGGTTCAAG GTTGCTAGGC TCAGCGAACT GCAAGTCCCT TTTCTTCCTA AAGGTCCCCA GTTCCTCATG ATTCTTCTGA GGGTCTCATC GGGCCTGCAG TCAGCTAGCC ACCCCACTGC CCCATGCCTG CAGTGAAGAC ACCTGGAAAT GGCTGGTGAC AGAAAAGTCC TCAGGGCCAC AGCACTCTCT TTCAGGTGCC TTGCCTGATG GTGACAAGGC TGGTTTTGCA TAAACAGCTC ACCCAGATGT GGCTTCTGAC TTGAGTGGAC CCCCATGAAG AGCTGCAGAG AAAGACAAGA GAATGGAGAA AATGAGGAGA AAGGAAACTA CAGAGTGAGA ATCTGACAGG TGTGACCACA GCAGTGTGAC ATTTATGAAC AGTGTGGAAA GCCCCTGAAT TAAAAACTTC CTGGGGAAAT AAGCCACTCC ACATATCGGT AGAGTGGAGG AGGCTGTTGA CCCGTGTGTG TCCCCATGAC TCAAAGAGGG CTGTTTCCAA TATCCCGAAA TCAGTCTTGC TGGGAGAACT GGGAAAATAA
ACCCCCGTAG GAAGCTACCT TTAATCCCAA GTGCCCAAGG CTAGGAGAGA GGCGATCCAG GACACCAGTG ACTGACACAG CCAGAGGTGG GAAAGGGGAG GCACAAAAGT GAGGAGTGAG CAAGGGTCTG AGAGGGAAGG CCATGTGGGC ACCCACATCA GAGACTGACA TGAGGATTAA AGGAGAGCAT AGGTGATCGG ACAGAAGAGA GGCAGCTCTA CACCCCTTGC TTGCAATTCT GAGCATTCTG GTTTGGCCAT CAAACCAGAC CTCAATTGAG ACAAGGCTAT TTAAGCTTCC
vs.
50‐500 GB Retained Data
Personal Genomics
4 Million Servers; 1.2 million servers were sold in Q4 2009
The Large Hadron Collider generates 300 EB/year
3 Years on UT’s Ranger Supercomputer
Acquire a Reference; 1 Petabyte of Bandwidth; 1 Billion YouTube Videos
72 Million Hours
1 Exabyte/Day
Sequence the diseased genome
Sequence the expressed RNA
Annotate the diseased genome
27,000 Radiologists in the US; Compare against the healthy sample
Population Genomics
Can Python save next generation sequencing?
Probably not on its own…
Better questions:
How can Python help next generation sequencing?
How can next generation sequencing help Python?
So…
NGS Data and Workflow Components
Reads, Refs, 3°
Assembly, Mapping
Annotation, Variation
Interactions, Expression
EDA, Novel Apps
Exploratory Analysis
Undirected analysis and unforeseen sequencing applications
Standard Scientific Workflows
Understand the structure and function of genomic elements
Fundamental Algorithms
Transform the raw data into scientifically relevant forms
References and Data
Reference genomes, domain-specific data sets, raw data, analysis results
Algorithms Considerations
Data and Workflows
Graphs, index schemes, dynamic programming
Generally I/O bound with opportunities for parallelism.
Standard data formats are text‐based
Can easily span 100s of TBs for a small lab. References are in constant flux.
Data is smaller, but analysis may require round trips back to reads.
Clustering, statistical models, network analysis
Information visualization, data‐mining
Interactivity is essential.
Algorithms and Scale
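Dynamic programming, listed among the algorithms above, underlies most sequence comparison. The sketch below is a generic edit-distance (Levenshtein) routine chosen for illustration; production mappers use more elaborate scoring and indexing, and this is not taken from any specific tool.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


# One deleted base separates these two short sequences.
distance = edit_distance("ACGT", "AGT")
```

Keeping only two rows holds memory at O(len(b)), but the running time is still O(len(a) × len(b)) per comparison, which is why index schemes are needed before anything like this is applied to billions of reads.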
Software Hardware
Graph libraries, pipeline management software, job schedulers
Clusters with high‐memory nodes, fast access to storage.
Databases, ORM tools, flat files
Distributed file systems, fast SANs
Single nodes with fast access to storage
Scripting languages, data analysis libraries
Genome browsers, Matlab, stats tools, R, etc.
Workstations, laptops with fast access to storage
Data and Workflows
Software and Hardware
Current Potential
Rapid prototyping, pipeline management, utilities
Multi‐processing, queuing to build out pipeline manager
Python parsers for various formats, BioPython, SAMTools
Pygr for reference and annotation access
Disco or Hadoop for distributed read management
Galaxy, user‐developed tools
GUIs, Disco
Data and Workflows
Python and NGS
User tools, utilities shipped with assemblers and mappers, NGS libraries (HTSeq), NumPy, SciPy
More libraries!
Example: Chromosome 20 Expression Analysis
Workflow components: Reads, Refs, Annotation, Mapping, Expression, Visualization
[Figure: expression map. Tracks: Reference Genome; Positive Strand (UHR, HBR, HELA); Negative Strand (UHR, HBR, HELA)]
A portion of the Chromosome 20 expression map created using reads from SOLiD™ at Life Technologies’ Austin site. The full map spans over 260 feet when printed sequentially. Chromosome
20 represents 2.1% of the human genome.
Custom Python tool rendering to PDF using ReportLab
LifeTech WT pipeline for mapping and expression analysis
Custom Python scripts for result aggregation
14 Billion reads from three sample types. 1400+ files comprising 4TB of sequence data
Human Genome 36.3 Human RefSeq 39
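The result-aggregation step mentioned above can be pictured as summing per-file counts into per-sample totals. This is a minimal sketch under stated assumptions: the tab-separated `gene<TAB>count` layout, the gene names, and the numbers are all hypothetical, and only the sample names (UHR, HBR) come from the slide.

```python
import csv
import io
from collections import defaultdict


def aggregate_counts(named_tables):
    """Sum per-gene counts across result tables, grouped by sample.

    named_tables: iterable of (sample_name, file_like) pairs, where each
    file-like object holds tab-separated 'gene<TAB>count' rows.
    Returns {sample_name: {gene: total_count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample, handle in named_tables:
        for gene, count in csv.reader(handle, delimiter="\t"):
            totals[sample][gene] += int(count)
    return {sample: dict(per_gene) for sample, per_gene in totals.items()}


# In-memory stand-ins for result files (hypothetical contents).
tables = [
    ("UHR", io.StringIO("geneA\t10\ngeneB\t3\n")),
    ("UHR", io.StringIO("geneA\t5\n")),
    ("HBR", io.StringIO("geneB\t7\n")),
]
totals = aggregate_counts(tables)
```

Since each table is read row by row, the same loop applies unchanged to open file handles when there are well over a thousand result files.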
Cultural Challenges
• Goals – Most users care about the science and just want the technology to work
• Skills – Few biologists and bioinformaticians are also experts in HPC and tera‐scale data mining
• Expectations – Excel provides answers instantly. NGS analysis should, too.
• Cost – I just spent $500k on an instrument and I need to spend how much on a computer and developer???
As recently as five years ago, computing in the Life Sciences consisted of pencils, lab notebooks, and the occasional Excel spreadsheet.
Needless to say, NGS caught the community completely off guard.
Final Thoughts
• NGS analysis is in its infancy
– Fundamental methods are still being developed
– The scientific community is still coming to grips with its potential and challenges
– There won’t be any silver bullets in the near future
• NGS analysis is complicated by the scale of data
– Traditional supercomputing doesn’t help much
• Development paradigms that simplify large data processing will succeed in this space
• Many Python projects show promise, but there’s still work to be done!
Thank You!
Life Technologies Austin
Bioinformatics: Jeff Schageman, Joel Brockman, Penn Whitley, Dan Williams
Transcriptomics: Sheila Heater, Kelli Bramlett, Diane Ilsley
The Powers that Be: Bob Setterquist, Tim Sendera
Life Technologies Global
Software and HPC: Lee Jones, Patrick LeGresley, Somalee Datta, Asim Siddiqui, Aaron Kitzmiller, Keith Moulton, Mike Lyons, Ying Zu
IT: Michael Moore, Antoine Uzzeni, David Morgan, Joanna Curlee
Sounding Boards: Peter Wang, Glen Otero, Travis Oliphant
© 2010 Life Technologies Corporation. All rights reserved. The trademarks mentioned herein are the property of Life Technologies Corporation or their respective owners. Void where prohibited, prohibited where void. For Research Use Only. Not intended for any animal or human therapeutic or
diagnostic use.
What People are Using
• Compiled Languages:
– Equally distributed between C, C++, and Java
• Interpreted Languages:
– Almost everyone uses Perl, with about half the respondents using R or Python
– But for daily usage, about 31% used Python and 47% used Perl
• For Libraries:
– 75% used BioPerl on a regular basis. SciPy is the next most used at 33%. BioPython was low at 16%
• For Statistics:
– R (70%), Excel (53%), and Perl (35%) were the most common
• For Visualization:
– Excel was the most common, with R and GNUPlot also getting multiple responses
Compute Environment
• Peta‐scale storage environment
• Multi‐core processors and multiple nodes
– though not as many as you think
• 64+ GB of RAM helps
• Heterogeneous processors
– SIMD/GPUs
• Multiple languages
– C, C++, and Java are common for ‘fast’ code
– Many tools ship with Perl or Python utilities
• The cloud will matter at some point
Ideal Stack
• Must be usable by a wide range of developers and scientists
• Standard data formats
• Reference Management
• Optimized kernel operations
• Optimized record‐based operations
• Cluster aware pipeline management
• Horizontally scalable record storage
• Database integration for LIMs
• Interactive visualization