A method for high throughput sequencing data analysis: application for mapping genome-wide protein-DNA binding sites (ChIPseq). JC Andrau, Biostat, 15/01/2010


Page 1: A method for high throughput sequencing data analysis ...iml.univ-mrs.fr/sta/SMPGD2010/slides/SMPGD10-Andrau-slide.pdf · A method for high throughput sequencing data analysis: application

A method for high throughput sequencing data analysis: application for mapping genome-wide protein-DNA binding sites

(ChIPseq)

JC Andrau, Biostat, 15/01/2010


Page 2

High throughput sequencing applications

Epigenetic marks mapping and identification of regulatory sequences of gene expression (ChIP-seq)

Protein-DNA interaction

Genome sequencing

Human gene mapping

Qualitative (SNP) and quantitative (amplification) genetic variations

de novo sequencing of model organisms and pathogens

Transcriptome (RNAseq)

Identification and analysis of non coding RNAs (miRNA, etc.)

Monitoring gene expression, covering all the alternative messengers of a given locus, in a variety of contexts

Page 3

ChIP-seq: Solexa procedure

PCR + size exclusion (gel extraction)

Loading in flowcell and cluster amplification

Image acquisition and base calling

Page 4

Sequencing and alignment

• Sequencing extremities of DNA fragments

• RAW data files (sequences)

• Aligned against a reference genome

– MAQ

– Solexa…

Page 5

First steps of data analysis

Page 6

First steps of data analysis

Page 7

DNA fragments VS sequences

• Only extremities of DNA fragments are sequenced

• Enriched regions do not mark the exact binding site

• In-silico process to elongate the tags

[Figure: tags on the + strand and the - strand on either side of the binding site]
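The in-silico elongation mentioned above can be sketched as follows. This is a minimal illustration, not the author's implementation: the `Tag` type, the `elongate` function and the `fragment_length` parameter are hypothetical names, and coordinates are assumed to be 0-based and half-open. Each tag is extended from its sequenced extremity towards the inside of the fragment, so + strand tags grow to the right and - strand tags grow to the left.

```python
from typing import NamedTuple


class Tag(NamedTuple):
    """An aligned tag (hypothetical type; 0-based, half-open coordinates)."""
    chrom: str
    start: int   # leftmost aligned base, inclusive
    end: int     # one past the rightmost aligned base
    strand: str  # "+" or "-"


def elongate(tag: Tag, fragment_length: int, chrom_length: int) -> Tag:
    """Extend a tag to an assumed fragment length, respecting its strand.

    A + strand tag keeps its start and grows to the right; a - strand tag
    keeps its end and grows to the left. The result is clipped to the
    chromosome boundaries.
    """
    if tag.strand == "+":
        start = tag.start
        end = min(tag.start + fragment_length, chrom_length)
    else:
        end = tag.end
        start = max(tag.end - fragment_length, 0)
    return Tag(tag.chrom, start, end, tag.strand)


if __name__ == "__main__":
    print(elongate(Tag("chr1", 100, 136, "+"), 200, 1000))
    print(elongate(Tag("chr1", 500, 536, "-"), 200, 1000))
```

With this convention, elongated tags from both strands cover the region between them, which is where the binding site is expected to lie.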

Page 8

Elongation process

[Figure: overlap between + strand and - strand tags as a function of shifting (bp)]
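One way to read the plot of overlap against shifting is as a search for the shift that best aligns the two strands. The sketch below is an assumption about how such a curve could be computed, since the slide does not give the exact overlap measure: it counts, for each candidate shift, how many + strand tag positions land on a - strand tag position once shifted. The function names `overlap_by_shift` and `best_shift` are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable


def overlap_by_shift(plus_positions: Iterable[int],
                     minus_positions: Iterable[int],
                     max_shift: int) -> Dict[int, int]:
    """Overlap score for every shift from 0 to max_shift (in bp).

    The score for a shift is the number of (+ tag, - tag) pairs whose
    positions coincide after moving the + strand tags right by that shift.
    """
    plus = Counter(plus_positions)
    minus = Counter(minus_positions)
    scores = {}
    for shift in range(max_shift + 1):
        scores[shift] = sum(count * minus.get(pos + shift, 0)
                            for pos, count in plus.items())
    return scores


def best_shift(plus_positions: Iterable[int],
               minus_positions: Iterable[int],
               max_shift: int) -> int:
    """Shift with the highest overlap; the smallest shift wins ties."""
    scores = overlap_by_shift(plus_positions, minus_positions, max_shift)
    return max(scores, key=lambda s: (scores[s], -s))


if __name__ == "__main__":
    plus = [100, 200, 300]
    minus = [250, 350, 450]
    print(best_shift(plus, minus, 300))
```

Under this reading, the shift at the overlap maximum can serve as an estimate of the fragment length to use when elongating the tags.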

Page 9

Score per nucleotide

Page 10

Score per nucleotide
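A per-nucleotide score can be obtained by counting how many elongated tags cover each position. The sketch below assumes that definition, which the slides do not spell out, and uses a hypothetical function `per_nucleotide_scores` taking half-open `(start, end)` intervals with `0 <= start <= end <= chrom_length`. It uses a difference array so that each tag costs two updates rather than one per covered base.

```python
from typing import Iterable, List, Tuple


def per_nucleotide_scores(intervals: Iterable[Tuple[int, int]],
                          chrom_length: int) -> List[int]:
    """Number of elongated tags covering each position of a chromosome."""
    # diff[i] holds the change in coverage when moving from i-1 to i.
    diff = [0] * (chrom_length + 1)
    for start, end in intervals:
        diff[start] += 1
        diff[end] -= 1
    scores = []
    running = 0
    for delta in diff[:chrom_length]:
        running += delta
        scores.append(running)
    return scores


if __name__ == "__main__":
    print(per_nucleotide_scores([(0, 3), (2, 5)], 6))
```

Merging both strands then amounts to passing the elongated + and - strand intervals to the same call.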

Page 11

Further analysis

Page 12

Artefacts removal and normalisation

• An input experiment helps to localize regions that are problematic for alignment (duplications, reference genome…)

– No enrichment is expected in the input

– Regions enriched in the input were removed from all datasets

• The background (BG) level is estimated from the average score over the whole genome, and all experiments are rescaled to this level

• The last step subtracts the input from each dataset in order to reduce the effects of experimental variation and the background in the data
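The three steps above can be sketched on two score vectors of equal length, one for the ChIP and one for the input. This is a minimal illustration under stated assumptions: the function name `clean_and_normalise`, the `artefact_threshold` used to call a position enriched in the input, and the `target_background` level are all hypothetical, and the background is taken as the mean score over the positions that remain after masking.

```python
from typing import List, Optional, Sequence


def _mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def clean_and_normalise(chip: Sequence[float],
                        control: Sequence[float],
                        artefact_threshold: float,
                        target_background: float = 1.0
                        ) -> List[Optional[float]]:
    """Mask artefacts, rescale to a common background, subtract the input.

    Positions where the input exceeds artefact_threshold are treated as
    artefacts and returned as None.
    """
    if len(chip) != len(control):
        raise ValueError("chip and control must have the same length")
    # Step 1: flag positions that look enriched in the input.
    keep = [c <= artefact_threshold for c in control]
    # Step 2: estimate each background as the mean of the retained scores.
    chip_bg = _mean([x for x, k in zip(chip, keep) if k])
    control_bg = _mean([c for c, k in zip(control, keep) if k])
    chip_scale = target_background / chip_bg if chip_bg else 1.0
    control_scale = target_background / control_bg if control_bg else 1.0
    # Step 3: subtract the rescaled input from the rescaled ChIP signal.
    result: List[Optional[float]] = []
    for x, c, k in zip(chip, control, keep):
        result.append(x * chip_scale - c * control_scale if k else None)
    return result


if __name__ == "__main__":
    print(clean_and_normalise([2, 2, 10, 2, 50], [1, 1, 1, 1, 40], 5))
```

In the example, the last position is masked because the input itself is high there, and the remaining positions are expressed relative to a background of 1.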

Page 13

Pipeline for ChIPseq data Analysis

- ChIP, QCs, sequencing and original file generation

- Alignment against a reference genome (Eland)

- Conversion to gff format in R

- Artefact and multiple matches removal

- Elongation of tags, merge of both strands and data binning

- Input or mock data set subtraction, data normalisation

- Data analysis and visualisation
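Two of the pipeline steps, data binning and export to gff, can be illustrated with a short sketch. The slides state that the conversion was done in R; Python is used here only for illustration. The function names, the bin size, and the `source` and `feature` column values are hypothetical. The nine tab-separated columns and the 1-based inclusive coordinates follow the general GFF layout.

```python
from typing import List, Sequence, Tuple


def bin_scores(scores: Sequence[float],
               bin_size: int) -> List[Tuple[int, int, float]]:
    """Average per-nucleotide scores over consecutive fixed-width bins.

    Returns (start, end, mean_score) with 0-based, half-open coordinates.
    The last bin may be shorter than bin_size.
    """
    bins = []
    for start in range(0, len(scores), bin_size):
        chunk = scores[start:start + bin_size]
        bins.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    return bins


def to_gff_line(chrom: str, start: int, end: int, score: float,
                source: str = "chipseq_sketch", feature: str = "bin") -> str:
    """Format one bin as a GFF line (1-based, inclusive coordinates)."""
    columns = [chrom, source, feature, str(start + 1), str(end),
               f"{score:.3f}", ".", ".", "."]
    return "\t".join(columns)


if __name__ == "__main__":
    for start, end, score in bin_scores([1, 1, 3, 3, 5], 2):
        print(to_gff_line("chr1", start, end, score))
```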

Page 14

ChIPseq and ChIP-on-Chip

Page 15

CTD phosphorylations and transcription

The CTD is a repeated heptapeptide (YSPTSPS)n in the largest Pol II subunit, conserved from yeast (26 repeats) to human (52 repeats).

[Figure: Recruitment, Initiation, Elongation (productive)]

Page 16

Core et al, Science 2008

TSS profiling of CTD and S5P overlaps with sense/antisense transcription

[Figure: Pol II binding around TSSs; y-axis: binding level]
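A binding profile around TSSs is commonly built by averaging the score in a fixed window centred on each TSS, with windows of - strand genes reversed so that all profiles read in the direction of transcription. The sketch below assumes that approach; the slides do not describe how the profile was computed. The function `tss_profile` and its `flank` parameter are hypothetical, and scores are taken for a single chromosome.

```python
from typing import List, Sequence, Tuple


def tss_profile(scores: Sequence[float],
                tss_list: Sequence[Tuple[int, str]],
                flank: int) -> List[float]:
    """Mean score at each offset from -flank to +flank around the TSSs.

    tss_list holds (position, strand) pairs. TSSs whose window would run
    past either end of the chromosome are skipped.
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for pos, strand in tss_list:
        if pos - flank < 0 or pos + flank >= len(scores):
            continue
        window = list(scores[pos - flank:pos + flank + 1])
        if strand == "-":
            window.reverse()  # orient the window along transcription
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    return [t / used for t in totals] if used else totals


if __name__ == "__main__":
    scores = list(range(10))
    print(tss_profile(scores, [(3, "+"), (6, "-")], 2))
```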

Page 17

K-means clustering of the top 20% Pol II S5-P

[Figure: clusters 1 to 10, with groups labelled Right to TSS, Centered and Left to TSS]

Clustering indicates several populations of initiating Pol II around TSS
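The clustering step can be sketched with a plain k-means over per-gene profiles around the TSS. This is an illustration under assumptions, not the analysis shown on the slide: `top_fraction` ranks profiles by their total signal, which is only one possible reading of "top 20%", and `kmeans` takes explicit starting centroids so that the result is reproducible. Both function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Profile = Sequence[float]


def top_fraction(profiles: Sequence[Profile], fraction: float) -> List[Profile]:
    """Keep the given fraction of profiles with the highest total signal."""
    ranked = sorted(profiles, key=sum, reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]


def _squared_distance(a: Profile, b: Profile) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(profiles: Sequence[Profile],
           initial_centroids: Sequence[Profile],
           max_iterations: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Assign each profile to its nearest centroid until labels stabilise."""
    centroids = [list(c) for c in initial_centroids]
    k = len(centroids)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: _squared_distance(p, centroids[c]))
            for p in profiles
        ]
        # Move each centroid to the mean of its members (empty clusters stay).
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return new_labels, centroids


if __name__ == "__main__":
    profiles = [[5, 1, 0], [6, 1, 0], [0, 5, 0], [0, 6, 1], [0, 1, 5], [0, 1, 6]]
    labels, _ = kmeans(profiles, [profiles[0], profiles[2], profiles[4]])
    print(labels)
```

In the toy example, the three clusters correspond to signal peaking on one side of the centre, at the centre, and on the other side, which mirrors the grouping of the clusters on the slide by position relative to the TSS.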

Page 18

PF lab, CIML Marseille: Romain Fenouil, Fred Koch, Pierre Cauchy, Pierre Ferrier

CNG Evry: Ivo Gut, Marta Gut

GSF Cancer Institute, Munich: Dirk Eick, Martin Heidemann, Corinna Hintermair

Many thanks to…