Topologically Associating Domains identification using Hi‐C

Modified from Djekidel Mohamed Nadhir


Structural Organization of Chromatin

Interactions between TADs of the same epigenetic type give rise to compartments

Chromosome territories are formed by coalescence of compartments

A compartments are active and localize near nuclear speckles

B compartments are inactive and localize near the nuclear envelope

Chromatin is organized into TADs

From Hansen et al., Nucleus 9, 20 (2018)


Hi‐C for understanding 3D structure

• Although the sequence of the genome is known, little is known about its 3D structure

• High‐throughput chromosome conformation capture (Hi‐C) is a 3C‐based technology

• It can detect chromatin interactions between loci across the entire genome

Biological experiment:

Ming, H., et al. (2013). "Understanding spatial organizations of chromosomes via statistical analysis of Hi‐C data." Quantitative Biology 1.


Hi‐C in the chromatin conformation study map

Smallwood, A. and B. Ren (2013). "Genome organization and long‐range regulation of gene expression by enhancers." Current opinion in cell biology 25(3):387‐394.


Data Processing Pipeline

• 4 main steps:

• Read mapping: each end of a read pair (50 bp) is mapped independently to the reference genome

• Read‐level filtering

• Fragment filtering: filter out fragments with a low mappability score

• Creation of the Hi‐C contact matrix

Ming, H., et al. (2013). "Understanding spatial organizations of chromosomes via statistical analysis of Hi‐C data." Quantitative Biology 1.


Read filtering step

• The following types of reads should be removed:

• Self‐ligation reads

• Dangling reads: un‐ligated reads

• PCR amplification reads: many reads that map to the same location

• Random breaking reads: reads located far from the enzyme cutting site ($d_1 + d_2 > 500$ bp)


Fragment filtering step

• Remove fragments with a low mappability score (< 0.5)

• Fragments near centromere or telomere regions tend to contain a large proportion of repetitive sequence, which leads to a low mappability score

• Additional suggestions:

• Remove fragments shorter than 100 bp or longer than 100 kb

• Remove the 0.5% of fragments with the highest number of reads (these can be a source of PCR artifacts)
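The filters above can be sketched as a boolean mask over fragments. This is only an illustrative sketch, not the pipeline's actual implementation: the function name `filter_fragments`, its parameter names, and the array-based input layout are assumptions, and the 0.5% cut is approximated with a quantile. The thresholds are the ones listed on the slide.

```python
import numpy as np

def filter_fragments(lengths_bp, mappability, read_counts,
                     min_mappability=0.5, min_len=100, max_len=100_000,
                     top_fraction=0.005):
    """Return a boolean mask: True for fragments that pass every filter.

    All names and the input layout are hypothetical; one entry per fragment.
    """
    lengths_bp = np.asarray(lengths_bp)
    mappability = np.asarray(mappability, dtype=float)
    read_counts = np.asarray(read_counts, dtype=float)

    # Drop fragments with a low mappability score (< 0.5 on the slide).
    keep = mappability >= min_mappability
    # Drop fragments shorter than 100 bp or longer than 100 kb.
    keep &= (lengths_bp >= min_len) & (lengths_bp <= max_len)
    # Drop the fragments whose read count lies above the 99.5th percentile.
    cutoff = np.quantile(read_counts, 1.0 - top_fraction)
    keep &= read_counts <= cutoff
    return keep
```

The mask can then be used to subset the fragment table before the contact matrix is built.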


Construction of the Hi‐C interaction matrix

• The number of possible pairs of enzyme cut sites is on the order of $10^{12}$, whereas a typical Hi‐C experiment generates about $10^{8}$ reads

• Thus, we need to partition the genome into large‐scale bins.
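Binning can be sketched as follows for a single chromosome. This is a minimal sketch under stated assumptions: the function name `build_contact_matrix`, the input format (two arrays holding the mapped positions of each read pair), and the default bin size are hypothetical choices for illustration.

```python
import numpy as np

def build_contact_matrix(pos1, pos2, chrom_length, bin_size=40_000):
    """Count read pairs per pair of bins on one chromosome (symmetric matrix).

    pos1[k] and pos2[k] are the mapped positions (bp) of the two ends of
    read pair k. Names and layout are hypothetical.
    """
    n_bins = -(-chrom_length // bin_size)  # ceiling division
    matrix = np.zeros((n_bins, n_bins), dtype=np.int64)
    i = np.asarray(pos1) // bin_size
    j = np.asarray(pos2) // bin_size
    # np.add.at accumulates correctly when the same (i, j) occurs many times.
    np.add.at(matrix, (i, j), 1)
    # Mirror off-diagonal contacts so that the matrix is symmetric.
    off_diagonal = i != j
    np.add.at(matrix, (j[off_diagonal], i[off_diagonal]), 1)
    return matrix
```

Larger bins give more reads per cell of the matrix at the cost of resolution, which is the trade-off the slide points to.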

Hi‐C vs FISH


Discussed paper

• Aim :

• Investigate the 3D organization of the human and mouse genomes in ES and differentiated cells.

• Data :

• Mouse :

• Mouse embryonic stem cell (mESC)

• Cortex cell (generated by another group)

• Human :

• Human embryonic stem cell (hESC)

• IMR90


Data control (1)

• Remove cut site bias

Figure panels: raw data, normalized data


Data control (2)

• Compare with 5C‐generated data for the HoxA locus (correlation > 0.73)

• Compare with 3C data for the Phc1 locus

• Compare with FISH data for 6 loci


Data control (3)

Pearson correlation between replicates
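A replicate comparison of this kind can be sketched as below. The slide does not say which entries of the matrices were compared, so using the upper triangle of two binned, symmetric contact matrices is an assumption, as is the function name `replicate_correlation`.

```python
import numpy as np

def replicate_correlation(matrix_a, matrix_b):
    """Pearson correlation between two contact matrices of equal shape.

    Only the upper triangle (diagonal included) is used, so each
    symmetric entry is counted once. Hypothetical helper.
    """
    matrix_a = np.asarray(matrix_a, dtype=float)
    matrix_b = np.asarray(matrix_b, dtype=float)
    upper = np.triu_indices_from(matrix_a)
    return float(np.corrcoef(matrix_a[upper], matrix_b[upper])[0, 1])
```

A value close to 1 indicates that the two replicates report similar contact frequencies bin pair by bin pair.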


Visualization of interactions

A Topologically Associating Domain (TAD) structure becomes visible at bin sizes < 100 kb


Identification of topological domains

Step 1: Detection of the interaction bias

Within a TAD we notice that:

• The upstream portion is highly biased to interact downstream

• The downstream portion is highly biased to interact upstream

A directionality index (DI) was defined to quantify this bias:

• $DI < 0$: upstream bias

• $DI > 0$: downstream bias

• $|DI|$: the extent of the interaction bias


DI calculation

Steps:

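• The genome was split into bins of length 40 kb

• Let:

• A: number of reads that map from the bin to the 2 Mb upstream of it

• B: number of reads that map from the bin to the 2 Mb downstream of it

• E: expected number of reads, $E = \frac{A + B}{2}$

• Then:

$$DI = \frac{B - A}{|B - A|} \left( \frac{(A - E)^2}{E} + \frac{(B - E)^2}{E} \right)$$

Figure: a 40 kb bin with window A spanning the 2 Mb upstream (-2 Mb) and window B spanning the 2 Mb downstream (+2 Mb)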
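The DI calculation can be sketched on a binned contact matrix as follows. This is a minimal sketch, assuming a symmetric matrix at 40 kb resolution so that 2 Mb corresponds to 50 bins; the function name `directionality_index` and the choice to return 0 when A = B or when there are no reads are assumptions made for illustration.

```python
import numpy as np

def directionality_index(contacts, window_bins=50):
    """Directionality index per bin of a square contact matrix.

    window_bins=50 corresponds to 2 Mb at 40 kb resolution.
    Bins with A == B (including bins with no reads) are left at 0.
    """
    contacts = np.asarray(contacts, dtype=float)
    n = contacts.shape[0]
    di = np.zeros(n)
    for i in range(n):
        # A: reads linking bin i to the window upstream of it.
        a = contacts[i, max(0, i - window_bins):i].sum()
        # B: reads linking bin i to the window downstream of it.
        b = contacts[i, i + 1:i + 1 + window_bins].sum()
        e = (a + b) / 2.0  # expected reads under no bias
        if e == 0 or a == b:
            continue
        di[i] = np.sign(b - a) * ((a - e) ** 2 / e + (b - e) ** 2 / e)
    return di
```

On a matrix with two dense blocks, the last bin of the first block gets a negative DI and the first bin of the second block a positive DI, which is the sign change used to locate a boundary.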


Domain detection (1)

• Each bin can have 3 states:

• Upstream biased (U)

• Downstream biased (D)

• No bias (N)

• Use an HMM based on the DI to infer the hidden bias state

• We define:

• $Y = [Y_1, Y_2, \dots, Y_n]$: the observed DI

• $Q = [Q_1, Q_2, \dots, Q_n]$: the hidden bias state, $Q_i \in \{D, U, N\}$

• $M = [M_1, M_2, \dots, M_m]$: the mixture components, $m \in [1, 20]$

• The probabilities are calculated as follows:

• $P(Y_t \mid Q_t = i, M_t = m) = \mathcal{N}(Y_t; \mu_{im}, \Sigma_{im})$

• $P(M_t = m \mid Q_t = i) = C(i, m)$

• $C(i, m)$: the mixture weight

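Figure: example state sequence D D D D U U U N N N D D D U U (domain, boundary, domain), and the HMM graph linking the hidden states $Q_t$, $Q_{t+1}$ (values D, U, N), the mixture components $M_t$, $M_{t+1}$, and the observations $y_t$, $y_{t+1}$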
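How a state sequence is decoded from DI values can be sketched with the Viterbi algorithm. This sketch is deliberately simpler than the model on the slide: it uses a single Gaussian per state instead of a Gaussian mixture, and it takes the means, standard deviations, transition matrix, and start probabilities as given rather than fitting them. The function name `viterbi_gaussian` and the state order are assumptions.

```python
import numpy as np

STATES = ["U", "N", "D"]  # upstream bias, no bias, downstream bias

def viterbi_gaussian(di, means, stds, trans, start):
    """Most likely state path for a 1-D Gaussian-emission HMM.

    means[k], stds[k]: emission parameters of state k (order as in STATES).
    trans[i, j]: probability of moving from state i to state j.
    start[k]: probability of starting in state k.
    """
    di = np.asarray(di, dtype=float)
    means = np.asarray(means, dtype=float)
    stds = np.asarray(stds, dtype=float)
    n, k = len(di), len(means)
    # Log-density of every observation under every state.
    log_emit = (-0.5 * ((di[:, None] - means[None, :]) / stds[None, :]) ** 2
                - np.log(stds[None, :] * np.sqrt(2.0 * np.pi)))
    log_trans = np.log(np.asarray(trans, dtype=float))
    score = np.log(np.asarray(start, dtype=float)) + log_emit[0]
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + log_trans  # cand[i, j]: arrive in j from i
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]
    # Trace the best path backwards from the best final state.
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [STATES[s] for s in reversed(path)]
```

In the decoded path, a run of D followed by a run of U corresponds to one domain, and a switch from U back to D marks the start of the next one.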


Domain detection (2)

• The region between two TADs is termed:

• Topological boundary: if size < 400 kb

• Unorganized chromatin: if size ≥ 400 kb
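The size rule above can be sketched as a small helper that walks over called domains and flags each gap. The function name `classify_gaps` and the representation of domains as `(start, end)` tuples in base pairs are assumptions made for illustration; only the 400 kb threshold comes from the slide.

```python
def classify_gaps(domains, boundary_max_bp=400_000):
    """List the gaps between consecutive domains on one chromosome.

    domains: iterable of (start_bp, end_bp) tuples (hypothetical layout).
    Returns (gap_start, gap_end, size_bp, is_boundary) tuples, where
    is_boundary is True when the gap is smaller than 400 kb.
    """
    ordered = sorted(domains)
    gaps = []
    for (_, end), (next_start, _) in zip(ordered, ordered[1:]):
        size = next_start - end
        if size > 0:  # directly adjacent domains leave no gap to classify
            gaps.append((end, next_start, size, size < boundary_max_bp))
    return gaps
```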


What separates two TADs

• Studied the HoxA locus, which is known to be separated into two compartments

• Found that the CS5 insulator resides in the boundary

• Maybe insulators are enriched at boundaries?


CTCF role in the boundary

• Studied another known insulator protein, CTCF


Heterochromatin and boundary

• The H3K9me3 profile changed between hESC and IMR90 cells, but the boundary structure did not change

• Potential link between the topological domains and transcriptional control in the mammalian genome


Characteristics of TADs

• TADs are stable between cell lines

hESC

IMR90


Characteristics of TADs

• TADs are conserved between species


Cell type specific interactions

• A binomial test is performed for each 20 kb bin to determine whether it is cell‐type specific

• Calculate $n = I_{mESC} + I_{cortex}$, the number of possible interactions at a distance $d$

• Calculate the expected value $p = \frac{I_{mESC}}{n}$ or $p = \frac{I_{cortex}}{n}$

• Then, for each bin, do a binomial test to see whether the number of cell‐type‐specific interactions deviates from this expectation

Figure: interactions at distance d in mESC and in cortex

$n = 3 + 2 + 1 + 1 + 2 + 1 + 4 + 1 = 15$

$p = \frac{7}{15}$ or $p = \frac{8}{15}$
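The test itself can be sketched with the standard library only. This is a minimal sketch of an exact two‐sided binomial test, not necessarily the exact procedure used in the discussed paper: the function names are hypothetical, and how per‐bin counts are fed into the test (for example, the number of mESC interactions in a bin as `k` and the bin's total over both cell types as `n`) is an assumption.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def binom_two_sided_pvalue(k, n, p):
    """Exact two-sided p-value: total probability of all outcomes that are
    at most as likely as the observed count k."""
    observed = binom_pmf(k, n, p)
    tolerance = 1 + 1e-9  # guards against floating-point ties
    total = sum(
        binom_pmf(i, n, p)
        for i in range(n + 1)
        if binom_pmf(i, n, p) <= observed * tolerance
    )
    return min(1.0, total)

# Hypothetical bin: 9 of its 10 interactions come from mESC, tested
# against the expected proportion p = 7/15 from the worked example.
p_value = binom_two_sided_pvalue(9, 10, 7 / 15)
```

A small p‐value suggests that the bin's interactions are skewed towards one cell type beyond what the overall proportion would predict. With many bins tested, a multiple‐testing correction would normally be applied on top of this.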


Cell type specific interactions

• 20% of the genes that have a fold change (FC) ≥ 4 are found in dynamic interacting loci.

• > 96% of the dynamic interactions occur within the same domain.

• Model:

• Domain organization is stable between cell types

• But the regions within each domain may be dynamic


Factors forming the boundary (1)

• Boundaries are enriched for active promoter signals and gene bodies


Factors forming the boundary (2)


TAD vs A/B compartments (1)

• Loci found clustered in A compartments are generally:

• gene rich,

• transcriptionally active,

• and DNase I hypersensitive,

Lieberman-Aiden, E., et al. (2009), Science (New York, N.Y.) 326(5950): 289-293.

Compartment B

Compartment A

• Loci found clustered in B compartments are generally:

• gene poor,

• transcriptionally silent

• and DNase I insensitive

At a higher order the chromatin is organized into A and B compartments


TAD vs A/B compartments (2)

TADs are smaller than A/B compartments


Summary

• The mammalian genome is segmented into megabase‐scale domains

• Domain boundaries are stable between cell lines and species, suggesting that they are a basic property of the chromosome architecture.

• Domain boundaries:

• Are enriched for transcriptionally active genes

• Coincide with heterochromatin boundaries

• Are enriched for insulator proteins

• Are enriched for tRNA, SINE and housekeeping genes

• The study also developed many data‐analysis approaches
