The Genomic HyperBrowser
Statistical genome analysis made accessible and reproducible
Sveinung Gundersen, Elixir.no, UiO
Credit
• Based on a presentation given by Assoc. Prof. Geir Kjetil Sandve at a meeting in Oxford, May 7th, 2013
Focus
• Downstream analysis of high-level genome-scale data
• You want to compare your data with existing data collections
• But...
• how to find the questions they can answer?
• how to go about answering questions at this scale?
Outline
• A bioinformatician’s view on genomics
• Analyzing genomic track data
• Under the hood of the analysis tools
• A quick tour of HyperBrowser features
What is a reference genome?
• It’s a bunch of sequence
• The human genome is a collection of ~3 billion nucleotides
• It’s a map!
• Where sequences belong in relation to each other
• Essentially makes up a line
Genome
The whiteboard and the computer file
Genome
The reference genome acts like a coordinate system for genomic data
chr21	10079666	10120808	NM_001187
chr21	13332357	13412442	NR_026916
chr21	13700575	13700652	NR_036164
chr21	13904368	13935777	NM_174981
chr21	14137324	14142556	NR_026755
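As a small illustration of the coordinate-system idea, the rows above can be read as (chromosome, start, end, name) records. The sketch below is not HyperBrowser code; the `Segment` type and `parse_line` helper are hypothetical names, and the half-open coordinate convention is an assumption borrowed from the BED format.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One genomic element, located purely by coordinates (hypothetical type)."""
    chrom: str
    start: int  # assumed 0-based, inclusive (BED convention)
    end: int    # assumed exclusive
    name: str


def parse_line(line: str) -> Segment:
    """Split a whitespace-separated row into a Segment."""
    chrom, start, end, name = line.split()[:4]
    return Segment(chrom, int(start), int(end), name)


# Two of the rows from the slide, written with tab separators.
lines = [
    "chr21\t10079666\t10120808\tNM_001187",
    "chr21\t13332357\t13412442\tNR_026916",
]
segments = [parse_line(line) for line in lines]

# Once elements are coordinates, lengths and distances are plain arithmetic.
lengths = [s.end - s.start for s in segments]
print(lengths)  # [41142, 80085]
```

No sequence is needed for any of this, which is the point of treating the genome as a line.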
DNA as a line
• This is indeed the dynamic perspective!
• DNA doesn’t change that much from hour to hour, or cell to cell
• But a lot happens along the DNA: binding by TFs, modifications of histones, ...
• Even for gene expression or SNPs we can usually abstract away from the underlying sequence
• Functional genomics typically refers to the genome as a line (map), not as sequence
The UCSC Genome Browser
Public data: ENCODE, FANTOM, GEO, Roadmap Epigenomics, ...
• By now, Big Science provides:
• Chromatin accessibility (DHSs) for ~350 cell samples
• Binding of ~100 TFs in several cell types
• Most histone modifications in several cell types
• Gene expression for thousands of setups
• TSS and active promoters in ~950 cell samples
• DNA methylation, 3D genome structure, ...
Outline
• A bioinformatician’s view on genomics
• Analyzing genomic track data
• Under the hood of the analysis tools
• A quick tour of HyperBrowser features
Exploiting the data
• Data is becoming less of a bottleneck
• With so much public data, some is likely to be relevant
• Producing large amounts of new data is often within reach
• But, asking the right questions is still tricky
• Forming interesting hypotheses is no easier than before
• The large scale complicates analysis
This can’t be it?!
Cell types and MS associated regions
• Regions of the genome are not always active
• Activity varies, e.g. between cell types
• Due to e.g. modification of histones
• In which cell types are MS associated regions active?
Cell-type specific activity of MS regions
• MS GWAS SNP locations along genome
• Histone modification-derived chromatin states along the genome, in 9 cell types
• Derived from ENCODE data (Nature, 473, 43–49)
• Are regions around MS GWAS SNPs unexpectedly active in B-cells (GM12878)?
A simple approach
• Do MS regions overlap more than expected with B-cell AP regions?
• But, this is really a bit too simple
A more reasonable approach (and still quite straightforward)
• Do MS regions overlap unexpectedly more with B-cell than e.g. stem cell regions?
• Yes!
• (“Genomic regions associated with multiple sclerosis are active in B cells”, PLoS One. 2012;7(3))
Outline
• A bioinformatician’s view on genomics
• Analyzing genomic track data
• Under the hood of the analysis tools
• A quick tour of HyperBrowser features
Delineating basic types of genomic tracks
Points
Segments
Function
Bins
Track types: 7 basic track types
Genome Partition (GP)
Step Function (SF)
Function (F)
Points (P)
Segments (S)
Valued Points (VP)
Valued Segments (VS)
Track types: 8 advanced track types
Linked Points (LP)
Linked Segments (LS)
Linked Genome Partition (LGP)
Linked Valued Points (LVP)
Linked Valued Segments (LVS)
Linked Step Function (LSF)
Linked Base Pairs (LBP)
Linked Function (LF)
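One way to see why there are exactly 7 + 8 = 15 track types is to treat each type as a combination of four yes/no properties. The sketch below assumes these properties are gaps between elements, element lengths, values, and links between elements. The slides do not spell this out, so the property names and the mapping in `BASE_NAMES` are a reading of how the type names compose, not a confirmed definition.

```python
from itertools import product

# Assumed mapping from (has_gaps, has_lengths, has_values) to a base name.
BASE_NAMES = {
    (True, False, False): "Points",
    (True, True, False): "Segments",
    (False, True, False): "Genome Partition",
    (True, False, True): "Valued Points",
    (True, True, True): "Valued Segments",
    (False, True, True): "Step Function",
    (False, False, True): "Function",
    (False, False, False): "Base Pairs",  # only meaningful once links are added
}


def track_type(gaps, lengths, values, links):
    """Name the track type for one property combination, or None if empty."""
    base = BASE_NAMES[(gaps, lengths, values)]
    if links:
        return "Linked " + base
    # A track with no gaps, lengths, values, or links carries no information.
    return None if base == "Base Pairs" else base


all_types = [
    name
    for combo in product([False, True], repeat=4)
    if (name := track_type(*combo)) is not None
]
basic = [t for t in all_types if not t.startswith("Linked ")]
linked = [t for t in all_types if t.startswith("Linked ")]
print(len(basic), len(linked))  # 7 8
```

Under this reading, the 16 possible combinations minus the empty one give the 15 types listed on the two slides.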
S-S Overlap
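Counting overlap between two segment tracks can be sketched as a sweep over sorted intervals. This is a minimal illustration for a single chromosome with half-open (start, end) intervals, not the HyperBrowser implementation; `merge` and `overlap_bp` are hypothetical helper names.

```python
def merge(segments):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(segments):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return merged


def overlap_bp(track_a, track_b):
    """Number of base pairs covered by both tracks."""
    a, b = merge(track_a), merge(track_b)
    i = j = total = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            total += hi - lo
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


track_a = [(10, 50), (100, 150)]
track_b = [(40, 120), (140, 200)]
print(overlap_bp(track_a, track_b))  # 40
```

The count itself is cheap to compute; deciding whether 40 bp is more than expected requires a null model.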
The troubling random nature
• Counting overlap is straightforward
• But statistical testing requires random data
• “The multitudes of possible genomes that evolution might have produced for our and other species”
• Must find something that is reasonable enough
• Does appropriate randomness match statistical tests?
Tracing assumptions
• Textbook Wilcoxon H0:
• Values (4) independent and symmetric around 0
• But what is assumed on the genomic track data?
A grammar for null models
• Specifying assumptions:
• Which of the tracks should be randomized?
• Which properties should still be preserved?
• How should track elements be randomized?
• Computing p-values according to model
• Exact/asymptotic test if assumptions match
• Monte Carlo with explicit randomization if needed
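The three questions of the grammar can be made concrete with a small Monte Carlo sketch. Here track A is randomized, track B is kept fixed, segment lengths are preserved, and new start positions are drawn uniformly along a single toy chromosome. These are illustrative choices rather than the null models HyperBrowser actually offers, and all function names are hypothetical.

```python
import random


def covered(segments):
    """Set of positions covered by half-open (start, end) segments."""
    return {pos for start, end in segments for pos in range(start, end)}


def randomize_positions(segments, genome_length, rng):
    """Preserve each segment's length, draw a new uniform start position.

    Randomized segments are allowed to overlap each other (an assumption).
    """
    out = []
    for start, end in segments:
        length = end - start
        new_start = rng.randrange(0, genome_length - length + 1)
        out.append((new_start, new_start + length))
    return out


def monte_carlo_p(track_a, track_b, genome_length, n_samples=1000, seed=0):
    """One-sided p-value for 'A overlaps B more than expected by chance'."""
    rng = random.Random(seed)
    fixed = covered(track_b)
    observed = len(covered(track_a) & fixed)
    at_least = sum(
        len(covered(randomize_positions(track_a, genome_length, rng)) & fixed)
        >= observed
        for _ in range(n_samples)
    )
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (at_least + 1) / (n_samples + 1)


track_a = [(100, 120), (300, 330), (700, 710)]
track_b = [(90, 130), (290, 340), (695, 715)]
observed, p_value = monte_carlo_p(track_a, track_b, genome_length=1000)
print(observed, p_value)
```

Changing what is preserved (for example, the distances between segments as well as their lengths) changes the null model and can change the p-value, which is why the assumptions need to be stated explicitly.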
Outline
• A bioinformatician’s view on genomics
• Analyzing genomic track data
• Under the hood of the analysis tools
• A quick tour of HyperBrowser features
[Figure: overview of HyperBrowser functionality (stages: Data preparation, Data customization, Analysis)]
• Data sources: external track collection (UCSC, ENCODE), Galaxy history data, spreadsheet / tabular files
• Format & convert (Table 3): 2 tools
• Generate tracks (Table 3): 6 tools
• HB track repository (Table 3): Extract track tool
• Customize tracks (Table 3): 4 tools
• Basic track representation: tracks suitable for analysis
• Descriptive statistics (Table 1), tool "Analyze genomic tracks": statistics on tracks and relations
• Hypothesis testing (Table 1), tool "Analyze genomic tracks": hypotheses supported by data
• Visualization (Table 2), 5 tools: explorative plots of tracks and relations
• Clustering analysis (Table 2), tool "Cluster tracks": unsupervised subgrouping of tracks
• 3D analysis (Table 2), tool "Analyze spatial co-localization": hypotheses on 3D co-localization supported by data
Current focus
• Main focus:
• Simple fetching of collections of genomic tracks from public sources
• Handling of multi-track collections and analysis
• Better integration of HyperBrowser with NeLS and TSD
Future directions
• Analyzing promoter-enhancer interaction, taking high-resolution chromosome conformation data into account
• Better handling of phenotype information (for pharmacology collaboration projects)
• Several other collaboration projects
Publications
• Core statistical analysis system
• “The Genomic HyperBrowser: inferential genomics at the sequence level” (Genome Biology, 2010)
• “The Genomic HyperBrowser: an analysis web server for genome-scale data” (Nucleic Acids Research, 2013)
• Types of genomic tracks
• “Identifying elemental genomic track types and representing them uniformly” (BMC Bioinformatics, 2011)
Publications
• Google maps of many-to-many analyses
• “The differential disease regulome” (BMC Genomics, 2011)
• 3D genome structure analysis
• “Handling realistic assumptions in hypothesis testing of 3D co-localization of genomic elements” (Nucleic Acids Research, 2013)
The team
Knut Liestøl
Eivind Tøstesen
Sigve Nakken
Halfdan Rydbeck
Geir Kjetil Sandve
Trevor Clancy
Fang Liu
Sveinung Gundersen
Ingrid K.
Lars
Arnoldo Frigessi
Eivind Hovig
Morten Johansen
Marit Holden
Vegard Nygaard
Egil Ferkingstad
2008
2012
Support
Conclusion
• If you want to do genome analysis, and don’t want to reinvent the wheel:
• Google “HyperBrowser” and try out the web system
• PubMed “HyperBrowser” and skim the 2013 NAR article