The Genomic HyperBrowser

Statistical genome analysis made accessible and reproducible

Sveinung Gundersen, Elixir.no, UiO

Credit

• Based on a presentation held by Assoc. Prof. Geir Kjetil Sandve at a meeting in Oxford, May 7th, 2013

Focus

• Downstream analysis of high-level genome-scale data

• You want to compare your data with existing data collections

• But ...

• how to find the questions they can answer?

• how to go about answering questions at this scale?

Outline

• A bioinformatician’s view on genomics

• Analyzing genomic track data

• Under the hood of the analysis tools

• A quick tour of HyperBrowser features

What is a reference genome?

• It’s a bunch of sequence

• The human genome is a collection of ~3 billion nucleotides

• It’s a map!

• Where sequences belong in relation to each other

• Essentially makes up a line

The whiteboard and the computer file

• The reference genome acts like a coordinate system for genomic data

chr21	10079666	10120808	NM_001187
chr21	13332357	13412442	NR_026916
chr21	13700575	13700652	NR_036164
chr21	13904368	13935777	NM_174981
chr21	14137324	14142556	NR_026755
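A minimal sketch of reading such records as coordinates on the reference genome. It assumes each record holds four whitespace-separated fields (chromosome, start, end, identifier); the `Region` type and `parse_regions` function are hypothetical names used for illustration only, not part of the HyperBrowser.

```python
from typing import List, NamedTuple


class Region(NamedTuple):
    """One record: a stretch of the reference genome with an identifier."""
    chrom: str
    start: int
    end: int
    name: str


def parse_regions(text: str) -> List[Region]:
    """Parse whitespace-separated (chrom, start, end, name) records, one per line."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        chrom, start, end, name = fields[:4]
        regions.append(Region(chrom, int(start), int(end), name))
    return regions


EXAMPLE = (
    "chr21\t10079666\t10120808\tNM_001187\n"
    "chr21\t13332357\t13412442\tNR_026916\n"
)

regions = parse_regions(EXAMPLE)
# Length of each region in base pairs, computed purely from coordinates.
lengths = [r.end - r.start for r in regions]
```

Nothing here touches the nucleotide sequence itself: once data are expressed as coordinates, positions and lengths are all that is needed.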

DNA as a line

• This is indeed the dynamic perspective!

• DNA doesn’t change that much from hour to hour, or cell to cell

• But a lot happens along the DNA: binding by TFs, modifications of histones, ...

• Even for gene expression or SNPs we can usually abstract away from the underlying sequence

• Functional genomics typically refers to the genome as a line (map), not as sequence

Public data: ENCODE, FANTOM, GEO, Roadmap Epigenomics, ...

• By now, Big Science provides:

• Chromatin accessibility (DHSs) for ~350 cell samples

• Binding of ~100 TFs in several cell types

• Most histone modifications in several cell types

• Gene expression for thousands of setups

• TSS and active promoters in ~950 cell samples

• DNA methylation, 3D genome structure, ...

Exploiting the data

• Data is becoming less of a bottleneck

• With so much public data, some is likely to be relevant

• Producing large amounts of new data is often within reach

• But, asking the right questions is still tricky

• Forming interesting hypotheses is no easier than before

• The large scale complicates analysis

This can’t be it?!

Cell types and MS-associated regions

• Regions of the genome are not always active

• Activity varies, e.g. between cell types

• Due to e.g. modification of histones

• In which cell types are MS-associated regions active?

Cell-type specific activity of MS regions

• MS GWAS SNP locations along genome

• Histone modification-derived chromatin states along the genome, in 9 cell types

• Derived from ENCODE data (Nature, 473, 43–49)

• Are regions around MS GWAS SNPs unexpectedly active in B-cells (GM12878)?

A simple approach

• Do MS regions overlap more than expected with B-cell active promoter (AP) regions?

• But, this is really a bit too simple

A more reasonable approach (and still quite straightforward)

• Do MS regions overlap unexpectedly more with B-cell than e.g. stem cell regions?

• Yes!

• (“Genomic regions associated with multiple sclerosis are active in B cells”, PLoS One. 2012;7(3))

Delineating basic types of genomic tracks

Points

Segments

Function

Bins

Track types: 7 basic track types

Genome Partition (GP)

Step Function (SF)

Function (F)

Points (P)

Segments (S)

Valued Points (VP)

Valued Segments (VS)

Track types: 8 advanced track types

Linked Points (LP)

Linked Segments (LS)

Linked Genome Partition (LGP)

Linked Valued Points (LVP)

Linked Valued Segments (LVS)

Linked Step Function (LSF)

Linked Base Pairs (LBP)

Linked Function (LF)
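One way to see why there are 7 + 8 = 15 track types is to treat each type as a combination of four yes/no properties. The sketch below assumes the properties are gaps, lengths, values and interconnections (links), as recalled from the BMC Bioinformatics 2011 paper listed under Publications; the slides themselves do not spell this out, and the `track_type` function is a hypothetical illustration.

```python
from itertools import product

# Assumed mapping from (gaps, lengths, values) to the unlinked type name.
_BASE_NAMES = {
    (True, False, False): "Points",
    (True, True, False): "Segments",
    (False, True, False): "Genome Partition",
    (True, False, True): "Valued Points",
    (True, True, True): "Valued Segments",
    (False, True, True): "Step Function",
    (False, False, True): "Function",
    (False, False, False): "Base Pairs",  # only meaningful when linked
}


def track_type(gaps: bool, lengths: bool, values: bool, links: bool) -> str:
    """Name the track type for one combination of the four assumed properties."""
    if not (gaps or lengths or values or links):
        raise ValueError("a track needs at least one property")
    base = _BASE_NAMES[(gaps, lengths, values)]
    return "Linked " + base if links else base


# Enumerate every valid combination: 2**4 - 1 = 15 types.
all_types = [
    track_type(*combo)
    for combo in product([False, True], repeat=4)
    if any(combo)
]
basic_types = [t for t in all_types if not t.startswith("Linked ")]
linked_types = [t for t in all_types if t.startswith("Linked ")]
```

Under this reading, the 8 advanced types are the 7 basic types with links added, plus Linked Base Pairs, which has links and nothing else.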

Segment-segment (S-S) overlap

The troubling random nature

• Counting overlap is straightforward

• But statistical testing requires random data

• “The multitudes of possible genomes that evolution might have produced for our and other species”

• Must find something that is reasonable enough

• Does appropriate randomness match statistical tests?
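To illustrate the first point, counting overlap really is straightforward. The sketch below computes the number of base pairs covered by both of two segment tracks on a single chromosome. It assumes half-open (start, end) intervals and that segments within each track do not overlap one another; `overlap_bp` is a hypothetical helper, not a HyperBrowser function.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open interval: start inclusive, end exclusive


def overlap_bp(track_a: List[Segment], track_b: List[Segment]) -> int:
    """Base pairs covered by both tracks (segments within a track must be disjoint)."""
    a = sorted(track_a)
    b = sorted(track_b)
    i = j = 0
    total = 0
    while i < len(a) and j < len(b):
        # Intersection of the two current segments, if any.
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            total += end - start
        # Advance whichever segment ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return total


track_a = [(10, 50), (100, 150)]
track_b = [(40, 110), (140, 200)]
shared = overlap_bp(track_a, track_b)  # 10 + 10 + 10 = 30 bp
```

The hard part is deciding what this number should be compared against.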

Tracing assumptions

• Textbook Wilcoxon H0:

• Values (4) independent and symmetric around 0

• But what is assumed about the genomic track data?

A grammar for null models

• Specifying assumptions:

• Which of the tracks should be randomized?

• Which properties should still be preserved?

• How should track elements be randomized?

• Computing p-values according to model

• Exact/asymptotic test if assumptions match

• Monte Carlo with explicit randomization if needed
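A minimal sketch of the Monte Carlo route under one explicit null model, chosen here only for illustration: the point track is randomized (points placed uniformly along the genome, preserving their number) while the segment track is kept fixed. The function names, the test statistic (number of points inside segments) and the single-chromosome coordinate space are all assumptions, not the HyperBrowser's actual implementation.

```python
import random
from bisect import bisect_right
from typing import List, Tuple

Segment = Tuple[int, int]  # half-open interval: start inclusive, end exclusive


def count_points_in_segments(points: List[int], segments: List[Segment]) -> int:
    """Count points falling inside sorted, non-overlapping segments."""
    starts = [start for start, _ in segments]
    hits = 0
    for p in points:
        k = bisect_right(starts, p) - 1  # last segment starting at or before p
        if k >= 0 and p < segments[k][1]:
            hits += 1
    return hits


def monte_carlo_p_value(points, segments, genome_length, n_samples=999, seed=0):
    """One-sided p-value for 'more points inside segments than expected by chance'."""
    rng = random.Random(seed)
    segments = sorted(segments)
    observed = count_points_in_segments(points, segments)
    at_least_as_extreme = 0
    for _ in range(n_samples):
        # Null model: same number of points, uniformly random positions.
        randomized = [rng.randrange(genome_length) for _ in points]
        if count_points_in_segments(randomized, segments) >= observed:
            at_least_as_extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return observed, (at_least_as_extreme + 1) / (n_samples + 1)


segments = [(100, 200), (500, 600)]
points = [110, 150, 190, 550, 590]
observed, p_value = monte_carlo_p_value(points, segments, genome_length=10_000)
```

Changing the answers to the three questions above (which track is randomized, what is preserved, how elements are moved) changes only the line that builds `randomized`; the p-value computation stays the same.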

[Figure: overview of HyperBrowser tools across three stages: data preparation, data customization, analysis]

• Data sources: external track collections (UCSC, ENCODE), Galaxy history data, spreadsheet / tabular files, HB track repository

• Track tools (Table 3): Format & convert (2 tools), Generate tracks (6 tools), Extract track tool (HB track repository), Customize tracks (4 tools)

• Intermediate results: basic track representation, tracks suitable for analysis

• Descriptive statistics (Table 1, Analyze genomic tracks): statistics on tracks and relations

• Hypothesis testing (Table 1, Analyze genomic tracks): hypotheses supported by data

• Visualization (Table 2, 5 tools): explorative plots of tracks and relations

• Clustering analysis (Table 2, Cluster tracks): unsupervised subgrouping of tracks

• 3D analysis (Table 2, Analyze spatial co-localization): hypotheses on 3D co-localization supported by data

Current focus

• Main focus:

• Simple fetching of collections of genomic tracks from public sources

• Handling of multi-track collections and analysis

• Better integration of HyperBrowser with NeLS and TSD

Future directions

• Analyzing promoter-enhancer interaction, taking high-resolution chromosome conformation data into account

• Better handling of phenotype information (for pharmacology collaboration projects)

• Several other collaboration projects

Publications

• Core statistical analysis system

• “The Genomic HyperBrowser: inferential genomics at the sequence level” (Genome Biology, 2010)

• “The Genomic HyperBrowser: an analysis web server for genome-scale data” (Nucleic Acids Research, 2013)

• Types of genomic tracks

• “Identifying elemental genomic track types and representing them uniformly” (BMC Bioinformatics, 2011)

Publications

• Google maps of many-to-many analyses

• “The differential disease regulome” (BMC Genomics, 2011)

• 3D genome structure analysis

• “Handling realistic assumptions in hypothesis testing of 3D co-localization of genomic elements” (Nucleic Acids Research, 2013)

The team

Knut Liestøl

Eivind Tøstesen

Sigve Nakken

Halfdan Rydbeck

Geir Kjetil Sandve

Trevor Clancy

Fang Liu

Sveinung Gundersen

Ingrid K.

Lars

Arnoldo Frigessi

Eivind Hovig

Morten Johansen

Marit Holden

Vegard Nygaard

Egil Ferkingstad

2008

2012

Support

Conclusion

• If you want to do genome analysis, and don’t want to reinvent the wheel:

• Google “HyperBrowser” and try out the web system

• Search PubMed for “HyperBrowser” and skim the 2013 NAR article
