Bioinformatic Analysis of Chromatin Genomic Data

Giulio Pavesi

University of Milano

[email protected]@unimi.it

“Nucleosome”

The nucleosome core particle consists of approximately 147 base pairs of DNA wrapped in 1.67 left-handed superhelical turns around a histone octamer

Octamer: 2 copies each of the core histones H2A, H2B, H3, and H4

Core particles are connected by stretches of "linker DNA", which can be up to about 80 bp long

Epigenetics

Modern experimental techniques and technologies allow for the genome-wide study of different types of histone modifications, shedding light on the role of each one

The histone code

Example: H3K4me3
- H3 is the histone
- K4 is the residue that is modified and its position (K: lysine, in position 4 of the sequence)
- me3 is the modification (three methyl groups attached to K4)

If there is no number at the end, as in H3K9ac, only one group is attached
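The naming rule above can be made concrete with a small parser. This is only a sketch: the regular expression, the `HistoneMark` type and the `parse_mark` function are illustrative names, and the pattern covers only names shaped like H3K4me3 or H3K9ac (not variants such as H2A.Z).

```python
import re
from typing import NamedTuple


class HistoneMark(NamedTuple):
    histone: str       # e.g. "H3"
    residue: str       # one-letter amino acid code, e.g. "K" for lysine
    position: int      # position of the residue in the histone sequence
    modification: str  # e.g. "me" (methyl) or "ac" (acetyl)
    count: int         # number of groups attached; 1 if no number is given


# Assumed pattern: histone, residue letter, position, modification, optional count.
_MARK_RE = re.compile(r"^(H\d[A-Z]?)([A-Z])(\d+)([a-z]+)(\d*)$")


def parse_mark(name: str) -> HistoneMark:
    match = _MARK_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognised histone mark name: {name!r}")
    histone, residue, position, modification, count = match.groups()
    return HistoneMark(histone, residue, int(position), modification,
                       int(count) if count else 1)


print(parse_mark("H3K4me3"))
print(parse_mark("H3K9ac"))
```

For H3K9ac the missing trailing number is read as a single group, matching the convention stated above.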

Different chromatin states

Chromatin structure (and thus gene expression) also depends on the post-translational modifications associated with the histones forming nucleosomes

“ChIP”

If we have the “right” antibody, we can extract (“immunoprecipitate”) from living cells the protein of interest bound to the DNA

And we can then try to identify which DNA regions were bound by the protein

This can be done for transcription factors

But it can also be done for histones, and separately for each modification

[Figure: TF ChIP vs. histone ChIP]

ChIP-Seq

Many cells: many copies of the same region bound by the protein

After ChIP

Identification of the DNA fragment bound by the protein

Sequencing

Size selection: only fragments of the “right size” (200 bp) are kept

So, if we find that a region has been sequenced many times, we can suppose that it was bound by the protein, but…

Platform       Roche (454)      Solexa - Illumina      ABI SOLiD
Sequencing     Pyrosequencing   By-synthesis           Ligation-based
Amplification  Emulsion PCR     Bridge amplification   Emulsion PCR
Mb/run         100 Mb           1300 Mb                3000 Mb
Time/run       7 h              4 days                 5 days
Read length    250 bp           32–40 bp               35 bp
Cost per run   $8439            $8950                  $17 447
Cost per Mb    $84.39           $5.97                  $5.81

Only a short fragment of the extracted DNA region can be sequenced, at either or both ends (“single” vs “paired end” sequencing), for no more than 35 (before) / 50 (now) / 75 (now) bp

Thus, original regions have to be “reconstructed”

…and, once again, bioinformaticians can be of help…

Read Mapping

Each sequence read has to be assigned to its original position in the genome

A typical ChIP-Seq experiment produces from 6 (before) to 100 million (now) reads of 50-70 or more base pairs for each sequencing “lane” (Solexa/Illumina)

Research in read alignment algorithms is booming (who is going to be the next BLAST?)

There exist efficient “sequence mappers” that align NGS reads against the genome

Read Mapping “Typical” Output

ID Sequence #0mm #1mm #2mm CHR HIT POS STR MM

>HWI-EAS413_4:1:100:825:1989 CTAGAAGCAGAAGCAGGTATTTGGGGGGAGGGTTG R0 3 0 0
>HWI-EAS413_4:1:100:1076:1671 AACTGCTTTGAGATAGGGTCTCTCTTGTTCACTTT NM 0 0 0
>HWI-EAS413_4:1:100:573:1957 TCGAGACGTAAACTAGCTAACCTACATTATCCCCT NM 0 0 0
>HWI-EAS413_4:1:100:1784:660 AATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA R0 204 255 255
>HWI-EAS413_4:1:100:133:987 CGCGATGATGTCTCAATACACCCCCCCGCTACCAG NM 0 0 0
>HWI-EAS413_4:1:100:1361:1636 CATGTCATGCGCTCTAATCTCTGGGCATCTTGAGA NM 0 0 0
>HWI-EAS413_4:1:100:1733:932 CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG U1 0 1 0 chr9 110761807 F 13A
>HWI-EAS413_4:1:100:992:1902 CAATTAAATAATAATAAACTAACACACAATACAAA NM 0 0 0
>HWI-EAS413_4:1:100:1230:1718 TCAGCAAACAAACCCCCAACATAAAATCCATTATG NM 0 0 0
>HWI-EAS413_4:1:100:324:130 TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC U0 1 0 0 chr14 33191761 F

@12_10_2007_SequencingRun_3_1_119_647 (actual sequence)
TTTGAATATATTGAGAAAATATGACCATTTTT
+12_10_2007_SequencingRun_3_1_119_647 (“quality” scores)
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 39 27 40 40 4 27 40
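A mapper output like the one above can be read record by record. The sketch below assumes whitespace-separated fields and reads the column right after the sequence as a match code (U0/U1/U2 for a unique match with 0/1/2 mismatches, R for a repeated match, NM for no match); that interpretation, and the names `MappedRead` and `parse_line`, are assumptions made here for illustration.

```python
from typing import NamedTuple, Optional


class MappedRead(NamedTuple):
    read_id: str
    sequence: str
    code: str                  # match code, e.g. "U0", "U1", "R0", "NM" (assumed meaning)
    hits: tuple                # number of hits with 0, 1 and 2 mismatches
    chrom: Optional[str]       # filled only when a position is reported
    pos: Optional[int]
    strand: Optional[str]
    mismatches: Optional[str]  # mismatch description, e.g. "13A", if present


def parse_line(line: str) -> MappedRead:
    fields = line.split()
    read_id = fields[0].lstrip(">")
    sequence, code = fields[1], fields[2]
    hits = tuple(int(x) for x in fields[3:6])
    chrom = pos = strand = mismatches = None
    if len(fields) >= 9:  # chromosome, position and strand are reported
        chrom, pos, strand = fields[6], int(fields[7]), fields[8]
        if len(fields) > 9:
            mismatches = fields[9]
    return MappedRead(read_id, sequence, code, hits, chrom, pos, strand, mismatches)


sample = [
    ">HWI-EAS413_4:1:100:1076:1671 AACTGCTTTGAGATAGGGTCTCTCTTGTTCACTTT NM 0 0 0",
    ">HWI-EAS413_4:1:100:1733:932 CCGAACTTCTGACAGGTTTGAGCCTTCTGCTCAAG U1 0 1 0 chr9 110761807 F 13A",
    ">HWI-EAS413_4:1:100:324:130 TCATCGAGAGGGGACTGAAGTGGAAGCTAGTCAGC U0 1 0 0 chr14 33191761 F",
]
reads = [parse_line(line) for line in sample]
placed = [r for r in reads if r.chrom is not None]
for r in placed:
    print(r.read_id, r.chrom, r.pos, r.strand)
```

Only the reads with a reported chromosome and position are carried forward; everything else is kept in `reads` for later inspection.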


Reads mapping more than once (repetitive regions) can be discarded:
- Never, use all matches everywhere
- If they map more than a given maximum number of times
- If they do not map uniquely in the best match
- If they do not map uniquely with 0, 1, or 2 substitutions
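The four options above can be written as one filter over the hit counts reported by the mapper (hits with 0, 1 and 2 substitutions). This is a sketch: the policy names and the `keep_read` function are made up here, and the default `max_hits` value is arbitrary.

```python
def keep_read(n0: int, n1: int, n2: int, policy: str, max_hits: int = 10) -> bool:
    """Decide whether a read is kept, given its hit counts with 0, 1 and 2 substitutions."""
    counts = (n0, n1, n2)
    total = sum(counts)
    if total == 0:
        return False                # the read does not map at all
    if policy == "all":
        return True                 # never discard: use all matches everywhere
    if policy == "max_hits":
        return total <= max_hits    # discard reads mapping too many times
    if policy == "unique_best":
        best = next(c for c in counts if c > 0)
        return best == 1            # unique at the lowest number of substitutions
    if policy == "unique_any":
        return total == 1           # unique over 0, 1 and 2 substitutions together
    raise ValueError(f"unknown policy: {policy!r}")


# Hit counts taken from the example output: "R0 3 0 0" and "U1 0 1 0".
print(keep_read(3, 0, 0, "unique_best"))
print(keep_read(0, 1, 0, "unique_best"))
```

The policies get stricter from top to bottom: a read with one exact hit and four 2-substitution hits passes `unique_best` but not `unique_any`.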


Sequence quality tends to be lower toward the 3’ end of sequence reads

Trick: if too few reads map, “trim” the reads:
- Map reads with standard parameters (two substitutions will do)
- Take all the reads that haven’t been mapped, and re-map them trimming away the first and the last nucleotides
- Repeat until no significant improvement/increase in mapped reads is obtained
- Discard reads mapping on different locations of the genome
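The trimming loop can be sketched as follows. To keep the example self-contained it uses a toy exact-match mapper on a short made-up "genome" instead of a real mapper allowing two substitutions; `find_hits`, `map_with_trimming`, `min_len` and `min_gain` are all hypothetical names and parameters.

```python
def find_hits(genome: str, seq: str) -> list:
    """Toy mapper (assumption): all exact occurrences of seq in genome."""
    hits, start = [], genome.find(seq)
    while start != -1:
        hits.append(start)
        start = genome.find(seq, start + 1)
    return hits


def _place(read_id, seq, mapper, mapped, unmapped):
    hits = mapper(seq)
    if len(hits) == 1:
        mapped[read_id] = (hits[0], seq)  # unique position: keep it
    elif not hits:
        unmapped[read_id] = seq           # candidate for trimming
    # more than one hit: the read maps on different locations and is discarded


def map_with_trimming(reads: dict, mapper, min_len: int = 8, min_gain: int = 1) -> dict:
    mapped, unmapped = {}, {}
    for read_id, seq in reads.items():    # first pass, untrimmed reads
        _place(read_id, seq, mapper, mapped, unmapped)
    while unmapped:
        # Trim away the first and the last nucleotide of every unmapped read.
        trimmed = {rid: s[1:-1] for rid, s in unmapped.items() if len(s) - 2 >= min_len}
        before, unmapped = len(mapped), {}
        for read_id, seq in trimmed.items():
            _place(read_id, seq, mapper, mapped, unmapped)
        if len(mapped) - before < min_gain:  # no significant improvement: stop
            break
    return mapped


genome = "TTGACCGTAAGCTAGGCATCGATCCGTA"
reads = {
    "r1": "ACCGTAAGCT",  # maps as it is
    "r2": "CGGCATCGAG",  # maps only after trimming both ends
    "r3": "AAAAAAAAAA",  # never maps
    "r4": "CCGTA",       # maps twice, discarded
}
result = map_with_trimming(reads, lambda seq: find_hits(genome, seq))
print(result)
```

What counts as a "significant" improvement is left to the `min_gain` parameter; here any round that rescues fewer than `min_gain` reads ends the loop.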

“Peak finding”

The critical part of any ChIP-Seq analysis is the identification of the genomic regions that produced a significantly high number of sequence reads, corresponding to the region where the protein (nucleosome) of interest was bound to DNA

Since a graphical visualization of the “piling” of reads mapping on the genome produces a “peak” at these regions, the problem is often referred to as “peak finding”

A “peak” then marks the region that was enriched in the original DNA sample


Peaks: How tall? How wide? How much enriched?
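"How tall" and "how wide" can be read directly off the pile-up of mapped reads. The sketch below builds per-base coverage from read start positions and reports every stretch at or above a fixed height. It is not a statistical peak caller: the fixed `min_height` threshold, the single chromosome, and the function names are assumptions for illustration only.

```python
def coverage(read_starts, read_len: int, chrom_len: int) -> list:
    """Per-base read depth, given 0-based read start positions on one chromosome."""
    diff = [0] * (chrom_len + 1)
    for start in read_starts:
        diff[start] += 1                             # read begins here
        diff[min(start + read_len, chrom_len)] -= 1  # read ends just before here
    depth, running = [], 0
    for delta in diff[:chrom_len]:
        running += delta
        depth.append(running)
    return depth


def call_peaks(depth, min_height: int) -> list:
    """Return (start, end, height) for each maximal run with depth >= min_height.

    Intervals are half-open: the peak covers positions start .. end - 1.
    """
    peaks, start = [], None
    for i, d in enumerate(depth):
        if d >= min_height and start is None:
            start = i
        elif d < min_height and start is not None:
            peaks.append((start, i, max(depth[start:i])))
            start = None
    if start is not None:
        peaks.append((start, len(depth), max(depth[start:])))
    return peaks


depth = coverage([2, 3, 4, 15], read_len=5, chrom_len=25)
print(call_peaks(depth, min_height=2))
```

The width of a peak is `end - start` and its height is the maximum depth inside it; whether that height is "enriched" still has to be judged against the noise sources discussed next.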


The main issue: the DNA sample sequenced (apart from sequencing errors/artifacts) contains a lot of “noise”
- Sample “contamination”: the DNA of the PhD student performing the experiment
- DNA shearing is not uniform: open chromatin regions tend to be fragmented more easily and thus are more likely to be sequenced
- Repetitive sequences might be artificially enriched due to inaccuracies in genome assembly
- Amplification pushed too much: you see a single DNA fragment amplified, not enriched
- As yet unknown problems, that anyway seem to produce “noisy” sequencings and screw the experiment up

ChIP-Seq histone data

Histone modifications tend to be located at preferred locations with respect to gene annotations/transcribed regions

Hence, enrichment can be assessed in two ways
- Enrichment with respect to a control experiment and peak identification
- “Local” enrichment in given regions with respect to gene annotations
  - Promoters (active/non active)
  - Upstream of transcribed/non transcribed genes
  - Within transcribed/not transcribed regions
  - Enhancers, whatever else
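One simple way to express enrichment with respect to a control is a fold change between read counts in the same region, scaled by the total number of mapped reads in each experiment. The slides do not prescribe a formula; the normalisation, the pseudocount and the function name below are assumptions chosen for illustration.

```python
def fold_enrichment(chip_count: int, control_count: int,
                    chip_total: int, control_total: int,
                    pseudocount: float = 1.0) -> float:
    """Ratio of library-size-normalised read counts in a region (ChIP over control).

    The pseudocount (an assumption) avoids division by zero in empty control regions.
    """
    chip_rate = (chip_count + pseudocount) / chip_total
    control_rate = (control_count + pseudocount) / control_total
    return chip_rate / control_rate


# Made-up numbers: 30 ChIP reads out of 1 million, 10 control reads out of 2 million.
print(fold_enrichment(30, 10, 1_000_000, 2_000_000))
```

The same function can be applied to a called peak or to an annotated region such as a promoter, which covers both ways of assessing enrichment listed above.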

Experiment

Perform a ChIP-Seq for several histone modifications, starting from the most “classic” ones

Verify:
- Whether each modification has its own “preferential” localization on the genome or with respect to genes (e.g. in the promoter, in the transcribed region, etc.)
- Whether each modification is “correlated” in some way with gene transcription/expression

Genome-wide histone modification maps through ChIP-Seq

Barski et al., Cell 129, 823-837, 2007

20 histone lysine and arginine methylations in CD4+ T cells: H3K27, H3K9, H3K36, H3K79, H3R2, H4K20, H4R3, H2BK5

Plus:
- Pol II binding
- H2A.Z (replaces H2A in some nucleosomes)
- insulator-binding protein (CTCF)

Genome-wide histone modification maps through ChIP-Seq

ChIP-Seq associated with a particular modification (e.g. H3K4me3)

Question: can the modification be “correlated” with gene transcription?

That is, does the modification “mark” particular nucleosomes with respect to the start of transcription, or to the transcribed region?

Example: there could be modifications that:
- Mark the start of transcription
- Mark all and only the transcribed region
- “Silence” particular gene loci, preventing transcription


Sequences obtained from ChIP-Seq for the modification under study

Input: genomic coordinates of the positions where each sequence maps (see example file)

Input: genomic coordinates of the annotated RefSeq genes

A nucleosome marked by the modification should correspond to a small “pile” of overlapping reads (a “peak”)

We count, nucleosome by nucleosome, how tall the “pile” is, i.e. how many reads can be associated with the nucleosome

Example: if the modification were found in the nucleosome upstream of the TSS of transcribed genes, we would find a “pile” like this

[Figure: nucleosome, modification]

Example: if the modification were found in the nucleosomes associated with transcribed regions, we would find “piles” like these

[Figure: nucleosome, modification]

Analysis: first example

Input
- Sorted list of the genomic coordinates of the TSSs associated with transcribed genes
- Sorted list of the genomic coordinates of the TSSs associated with NON-transcribed genes
- Sorted list of the genomic coordinates where each ChIP-Seq sequence maps

Output: compute the distribution (the “piles”) with respect to the TSSs of the two categories:
- Transcribed genes
- NON-transcribed genes

[Figure: window from -1000 to +1000 around the TSS]

- Given each TSS, compute how many sequences map between -1000 and +1000 bp with respect to the TSS
- Count how many sequences map at -1000, -999, -998, ..., -1, 0, +1, +2, ..., +998, +999, +1000
- Sum over all TSSs the counts at each distance (-1000, -999, -998, ..., -1, 0, +1, +2, ..., +998, +999, +1000)
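The counting steps can be sketched with a binary search on the sorted read positions, so that only the reads inside each window are visited. The sketch assumes a single chromosome and, for simplicity, that all genes are transcribed on the forward strand; `tss_profile` and its parameters are illustrative names.

```python
from bisect import bisect_left, bisect_right


def tss_profile(tss_positions, read_positions, window: int = 1000) -> list:
    """Sum, over all TSSs, the number of reads at each distance from the TSS.

    read_positions must be sorted. Index 0 of the result is distance -window,
    index `window` is distance 0, and the last index is distance +window.
    """
    profile = [0] * (2 * window + 1)
    for tss in tss_positions:
        lo = bisect_left(read_positions, tss - window)   # first read inside the window
        hi = bisect_right(read_positions, tss + window)  # one past the last read inside
        for pos in read_positions[lo:hi]:
            profile[pos - tss + window] += 1
    return profile


# Tiny made-up example with a window of 5 bp instead of 1000.
reads = [95, 100, 100, 103, 498, 503, 2000]
print(tss_profile([100, 500], reads, window=5))
```

Running it once on the TSSs of transcribed genes and once on those of non-transcribed genes gives the two distributions to compare.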

Algorithm!

[Figure: TSS with the window from -1000 to +1000]

Warning!

[Figure: TSS with the window reversed, from +1000 to -1000]

Coordinates relative to the TSS depend on the direction of transcription!!
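In code, the warning amounts to flipping the sign of the distance for genes transcribed on the reverse strand. The sketch below assumes strands are encoded as "+" and "-"; the function name is illustrative.

```python
def relative_position(read_pos: int, tss: int, strand: str) -> int:
    """Signed distance of a read from the TSS: negative is upstream, positive is downstream."""
    if strand == "+":
        return read_pos - tss  # forward strand: upstream means smaller coordinates
    if strand == "-":
        return tss - read_pos  # reverse strand: upstream means larger coordinates
    raise ValueError(f"unknown strand: {strand!r}")


# The same read, 100 bp to the left of the TSS in genome coordinates:
print(relative_position(900, 1000, "+"))
print(relative_position(900, 1000, "-"))
```

A read at a smaller genomic coordinate than the TSS is upstream for a forward-strand gene but downstream, inside the transcribed region, for a reverse-strand gene.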

Output: histone modifications at TSS

[Plot: read count (peak height) against distance from TSS, from -1000 to +1000]

The results!

PolII is found bound to DNA at the TSS of transcribed genes

H3K4me3 is found just before and after the TSS of transcribed genes

H3K4me2 (not me3!) is found just before and after the TSS of transcribed genes, but farther away than H3K4me3

H3K4me1 is found just before and after the TSS of transcribed genes, but farther away than H3K4me3 and H3K4me2

H3K27me3 covers the whole locus of “silent” genes - no transcription here

H3K27me1 (not me3!) is, conversely, found before and after the loci of transcribed genes

H3K36me3 is found within the transcribed region - a bit downstream of the TSS - as if it “lets” polymerase proceed with transcription

H3K9me1 is similar in profile to H3K4me3

Barski et al., High-Resolution Profiling of Histone Methylations in the Human Genome, Cell 129(4), 823-837, 2007

Histone modifications at transcribed regions

[Plot: read count (peak height) against expression level, from high to low]
