analysis of massively parallel sequencing data – application of … · 2017. 1. 29. · gordon...

www.sourcebioscience.com 1 Analysis of Massively Parallel Sequencing Data – Application of Illumina Sequencing to the Genetics of Human Cancers Gordon Blackshields Senior Bioinformatician Source BioScience


Post on 09-Feb-2021

TRANSCRIPT

  • www.sourcebioscience.com1

    Analysis of Massively Parallel Sequencing Data – Application of Illumina Sequencing to the Genetics of Human Cancers

    Gordon Blackshields, Senior Bioinformatician

    Source BioScience

  • www.sourcebioscience.com2

    Next Generation Sequencing Applications To Cancer Genetics Studies

    Introduction

    “Next Generation Sequencing” (NGS) on the Illumina platform is suitable for clinical applications that require large amounts of information, accurate quantification and high-sensitivity detection:

    – Mutation detection in tumours (from biopsies / circulating tumour cells (CTC))
    – Pathogen detection, e.g. organism identification for epidemiological investigations
    – Gut microbial flora genomics
    – Detection of the presence of antibiotic resistance genes
    – Comparison of novel sequences / genes to those in public databases

  • www.sourcebioscience.com3

    Applications of NGS to Cancer Genetics

    Some Commonly Applied Techniques

    “Sequencing The Genome”
    • Reference alignment and targeted resequencing for polymorphism and mutation discovery
    • De novo assembly for characterisation of novel genes and genomes
    • Paired-end sequencing highlights larger structural variants (inherited/acquired)

    “Sequencing The Transcriptome”
    • RNA-Seq allows “absolute” quantification of gene expression across the transcriptome
    • No prior knowledge of content needed: quantify expression of ‘unknown’ genes
    • Profiling of mRNA, ncRNA, miRNA…

    “Sequencing The Cistrome”
    • ChIP-Seq allows profiling of cis-acting targets (DNA binding sites) of a trans-acting factor (transcription factor, restriction enzyme, etc.) on a genome scale
    • Determine how proteins interact with DNA to regulate gene expression
    • Determine how TFs and other proteins influence phenotype-affecting mechanisms
    • A similar approach can be used to characterise genomic methylation patterns (“the methylome”)

  • www.sourcebioscience.com4

    Applications of NGS to Cancer Genetics

    Levels of information extraction, data integration

    [Figure: levels of information extraction across four workflows (Variant Detection, ChIP-Seq, RNA-Seq Quantification, RNA-Seq Discovery)]
    • Generate: 10^8 to 10^9 short DNA fragments
    • Process: de novo assembly / reads mapped to (un)annotated reference sequence
    • Identify, Analyse (panel labels): Consensus Sequence, Targeted Resequencing, Variant Detection, Enriched regions, Binding Sources, Motif finding, Associated Genes, Density on known exons, Expression levels, Differential Expression, Identify splice-crossing reads, Novel Transcripts, Novel Isoforms, Novel gene models, Overlapping Genes
    • Integrate: associate observed variants with regulation/transcriptional changes; link to external databases

  • www.sourcebioscience.com5

    Next Generation Sequencing Applications

    Human Resequencing and Variant Detection

    Reference Assembly, Targeted Resequencing and Variant Detection

    Search for alterations at nucleotide level to explain changes in regulation/transcription

    Single-End (SE) sequencing
    • ~85% of complex genome accessible
    • Suitable for SNPs, small indels (DIPs)

    Paired-End (PE) sequencing
    • ~99% of complex genome accessible
    • Find longer DIPs
    • Find larger structural variations
    • Span repeat regions
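One way paired-end data can point to larger structural variation is through read pairs whose mapped distance differs sharply from the rest of the library. The sketch below is a simplified illustration only: the function name, the toy insert sizes and the tolerance of three median absolute deviations are all assumptions, not part of the original study or of any specific tool.

```python
from statistics import median


def flag_discordant_pairs(pairs, tolerance=3.0):
    """Return read pairs whose insert size is far from the library median.

    pairs: list of (pair_id, insert_size_bp) tuples.
    tolerance: allowed deviation in units of the median absolute deviation
               (hypothetical default chosen for illustration).
    """
    sizes = [size for _, size in pairs]
    centre = median(sizes)
    # Median absolute deviation; fall back to 1 bp if all sizes are identical.
    mad = median(abs(size - centre) for size in sizes) or 1
    return [(pid, size) for pid, size in pairs
            if abs(size - centre) > tolerance * mad]


# Toy library: most pairs span about 200 bp, one pair spans 5200 bp.
pairs = [("p1", 200), ("p2", 205), ("p3", 198),
         ("p4", 202), ("p5", 5200), ("p6", 196)]
discordant = flag_discordant_pairs(pairs)
print(discordant)
```

A pair that maps much further apart than expected is consistent with a deletion between the two reads in the sample; a real analysis would also consider read orientation and require several supporting pairs.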

  • www.sourcebioscience.com6

    Next Generation Sequencing Applications

    Human Resequencing and Variant Detection

    2009 Nature Paper

    • Cytogenetically normal AML genome sequenced (32x)
    • Comparison with matched normal tissue (14x)
    • 98 full runs on Illumina GA to achieve required depth
    • Alignment and variant discovery performed by MAQ
    • 97.7% of variants in AML genome also in normal
    • Further restricted to annotated gene-coding regions
    • Across all tumour cells: found 10 genes with acquired mutations (8 novel), present in all cells at presentation and relapse

    “Our study establishes whole genome sequencing as an unbiased method for discovering initiating mutations in cancer genomes, and for identifying novel genes that may respond to targeted therapies”
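The filtering logic described on this slide (discard variants shared with the matched normal, then keep those in annotated coding regions) can be sketched as a set difference. This is a minimal illustration on invented toy data; the function name, the variant tuples and the region coordinates are assumptions and do not reproduce the published analysis.

```python
def acquired_variants(tumour, normal, coding_regions):
    """Variants seen in the tumour but not in the matched normal,
    restricted to annotated coding regions.

    tumour, normal: sets of (chrom, pos, ref, alt) tuples.
    coding_regions: list of (chrom, start, end), 1-based inclusive.
    """
    def in_coding(variant):
        chrom, pos = variant[0], variant[1]
        return any(c == chrom and start <= pos <= end
                   for c, start, end in coding_regions)

    tumour_only = tumour - normal  # drop inherited variants
    return sorted(v for v in tumour_only if in_coding(v))


# Toy data (hypothetical coordinates).
tumour = {("chr1", 100, "A", "G"), ("chr1", 250, "C", "T"),
          ("chr2", 50, "G", "A"), ("chr2", 900, "T", "C")}
normal = {("chr1", 100, "A", "G"), ("chr2", 900, "T", "C")}
coding = [("chr1", 200, 300)]

somatic = acquired_variants(tumour, normal, coding)
shared_fraction = len(tumour & normal) / len(tumour)
print(somatic, shared_fraction)
```

In the toy data half of the tumour variants are shared with the normal sample; in the study summarised above the shared fraction was 97.7%, which is why the matched normal is so effective at narrowing the candidate list.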

  • www.sourcebioscience.com7

    Next Generation Sequencing Applications

    Polymorphism detections within P53

    P53 Variant detection study
    • “Guardian of the genome” (Lane, 1992)
    • Protects fidelity of DNA replication
    • Directs cell arrest/apoptosis when stressed
    • Mutated in more than half of human cancers

    http://p53.free.fr/

  • www.sourcebioscience.com8

    Next Generation Sequencing Applications

    Polymorphism detections within P53

    [Figure label: 17p13.1]

    P53 Variant detection study
    • Human TP53 gene located on 17p13.1
    • Region sometimes deleted in human cancer

  • www.sourcebioscience.com9

    Next Generation Sequencing Applications

    Polymorphism detections within P53

    [Figure label: PCR amplification]

    Study
    • Search for variants on P53 gene in matched tumour samples
    • Use gene-specific PCR to amplify exons only, to maximise depth of coverage

  • www.sourcebioscience.com10

    Next Generation Sequencing Applications

    Polymorphism detections within P53

    [Figure: Coverage of p53 gene. x-axis: gene position (12000 to 19000); y-axis: coverage per base position (5000 to 35000)]

    Study
    • Use MAQ for alignment, variant discovery against P53 reference gene
    • Comparison with results from 454, Sanger
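The coverage plot on this slide shows, for each base of the gene, how many reads overlap it. A minimal sketch of that computation follows, assuming alignments are available as (start, length) pairs; the function name and the toy reads are illustrative and are not taken from the MAQ output used in the study.

```python
def per_base_coverage(alignments, region_start, region_end):
    """Count how many aligned reads cover each base of a region.

    alignments: list of (start, length) with 1-based start positions.
    Returns one depth value per position from region_start to
    region_end inclusive.
    """
    size = region_end - region_start + 1
    diff = [0] * (size + 1)  # difference array: +1 at read start, -1 after read end
    for start, length in alignments:
        lo = max(start, region_start)
        hi = min(start + length - 1, region_end)
        if lo > hi:
            continue  # read lies entirely outside the region
        diff[lo - region_start] += 1
        diff[hi - region_start + 1] -= 1

    depth, running = [], 0
    for delta in diff[:size]:
        running += delta
        depth.append(running)
    return depth


# Toy example: four short reads over a 10 bp region.
reads = [(1, 4), (3, 4), (4, 2), (9, 3)]
depth = per_base_coverage(reads, 1, 10)
mean_depth = sum(depth) / len(depth)
print(depth, mean_depth)
```

The difference-array approach touches each read once rather than once per covered base, which matters when amplicon coverage reaches tens of thousands of reads per position as in the plot.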

  • www.sourcebioscience.com11

    Next Generation Sequencing Applications

    Polymorphism detections within BRCA1

    BRCA1 Variant detection study
    • Human tumour suppressor gene
    • Primarily expressed in breast tissue
    • Helps repair damaged DNA (if possible)
    • Mutations to BRCA1 allow uncontrolled replication of damaged cells

  • www.sourcebioscience.com12

    Next Generation Sequencing Applications

    Polymorphism detections within BRCA1

    [Figure: workflow. CASAVA (demultiplex 11 samples, map reads to BRCA1 reference) → SAMtools (conversion to SAM format, conversion to pileup format, consensus/indel calling) → filter for variants → comparison with known variants]

    Pilot Study
    • Search for variants on BRCA1 gene
    • Use gene-specific PCR to amplify exons only, to maximise depth of coverage
    • Multiplexed: 11 samples loaded into one lane
    • Use CASAVA for de-multiplexing, alignment
    • Use SAMtools for consensus/indel calling, filtering
    • Validation of results against known variants
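The last two workflow steps (filter for variants, then compare with known variants) can be illustrated with a simple allele-frequency filter over per-position base counts. This is a deliberately simplified sketch: it is not the consensus model SAMtools uses, and the function name, thresholds, positions and the set of known variants are all hypothetical.

```python
from collections import Counter


def call_variants(pileup, reference, min_depth=10, min_fraction=0.2):
    """Report positions where a non-reference base is frequent enough.

    pileup: dict mapping position -> string of bases observed in reads.
    reference: dict mapping position -> reference base.
    min_depth, min_fraction: hypothetical filtering thresholds.
    Returns dict position -> (ref, alt, alt_fraction).
    """
    calls = {}
    for pos, bases in sorted(pileup.items()):
        counts = Counter(bases.upper())
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to trust this position
        ref = reference[pos]
        alt_counts = {b: n for b, n in counts.items() if b != ref}
        if not alt_counts:
            continue
        alt, n = max(alt_counts.items(), key=lambda item: item[1])
        if n / depth >= min_fraction:
            calls[pos] = (ref, alt, round(n / depth, 2))
    return calls


# Toy pileup (invented positions and bases).
pileup = {101: "AAAAAAAAAAAA", 102: "CCCCCCTTTTTT", 103: "GGGGGGGGAAAA",
          104: "TTA", 105: "CCCCCCCCCCCT"}
reference = {101: "A", 102: "C", 103: "G", 104: "T", 105: "C"}
known = {102}  # positions of previously reported variants (hypothetical)

calls = call_variants(pileup, reference)
confirmed = sorted(set(calls) & known)
novel = sorted(set(calls) - known)
print(calls, confirmed, novel)
```

Position 104 is skipped for low depth and position 105 for low alternate-allele fraction, which is the kind of filtering that deep amplicon coverage makes possible.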

  • www.sourcebioscience.com13

    Next Generation Sequencing Applications

    RNA-Seq: Transcriptome Analysis

    RNA-Seq
    • Sequence RNA (reverse-transcribed to cDNA)
    • Mapped to annotated reference genome (annotated genes, known variants)
    • Expression levels deduced from total number of reads that map to exons of a gene

    RNA-Seq versus Microarray
    • More sensitive to low-abundance transcripts
    • “Absolute” gene expression levels detectable
      – can detect single molecules
    • No prior knowledge of content required
    • Greater ability to distinguish isoforms
    • Ability to determine allelic expression
    • Less biased
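Raw exon read counts depend on both gene length and sequencing depth, so they are usually normalised before comparison. One widely used measure is RPKM (reads per kilobase of exon per million mapped reads); the sketch below assumes that measure for illustration, and the slides do not state which normalisation was applied. The function name and example numbers are hypothetical.

```python
def rpkm(exon_read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exon model per million mapped reads.

    exon_read_count: reads mapping to the exons of one gene.
    exon_length_bp: summed exon length of that gene in base pairs.
    total_mapped_reads: all mapped reads in the sample.
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total reads must be positive")
    return exon_read_count * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2,000 bp of exon, 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)
print(value)
```

Two genes with the same read count but different lengths receive different RPKM values, which is what makes the measure comparable across genes within a sample.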

  • www.sourcebioscience.com14

    Next Generation Sequencing Applications

    RNA-Seq: Transcriptome Analysis

    [Figure: analysis workflow]
    • BOWTIE: maps reads to reference genome (hg19)
    • TOPHAT: identifies splice sites (known/novel)
    • CUFFLINKS: transcript assembly, quantification
    • DESeq: differential gene expression of RNA-Seq data
    • DAVID: pathways analysis of deregulated genes (http://david.abcc.ncifcrf.gov/)

    RNA-Seq Study of ovarian cancer cell lines
    • Identification of changes in gene expression in strains with acquired drug-resistance
    • Special interest in ncRNA expression data
    • Use Bowtie and TopHat to map reads, identify splice sites
    • Use Cufflinks to assemble transcripts, calculate abundances
    • ~87% of reads mapped to genome
    • Use DESeq to perform differential expression tests
    • Use DAVID (Database for Annotation, Visualisation and Integrated Discovery, http://david.abcc.ncifcrf.gov/) for pathway analysis
      – Found significant representation of cancer pathways and focal adhesion genes
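Before expression can be compared between a sensitive and a drug-resistant line, counts need to be corrected for differences in sequencing depth. The sketch below illustrates a median-of-ratios size factor, the style of normalisation described for DESeq, followed by a plain log2 fold change. It is a pure-Python approximation for illustration: it does not call DESeq, it omits the statistical test entirely, and the gene names, counts and pseudocount are invented.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factor per sample.

    counts: dict gene -> list of raw read counts, one per sample.
    Genes with a zero count in any sample are skipped.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for gene_counts in counts.values():
        if any(c == 0 for c in gene_counts):
            continue
        log_geo_mean = sum(math.log(c) for c in gene_counts) / n_samples
        for j, c in enumerate(gene_counts):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def log2_fold_change(counts, factors, control, treated):
    """log2(treated / control) on size-factor-normalised mean counts."""
    result = {}
    for gene, gene_counts in counts.items():
        norm = [c / f for c, f in zip(gene_counts, factors)]
        a = sum(norm[j] for j in control) / len(control)
        b = sum(norm[j] for j in treated) / len(treated)
        result[gene] = math.log2((b + 1) / (a + 1))  # pseudocount avoids log(0)
    return result


# Toy data: sample 1 was sequenced about twice as deeply as sample 0.
counts = {"g1": [100, 200], "g2": [50, 100], "g3": [10, 80], "g4": [30, 60]}
factors = size_factors(counts)
lfc = log2_fold_change(counts, factors, control=[0], treated=[1])
print(factors, lfc)
```

After normalisation, genes that merely track sequencing depth (g1, g2, g4) show a fold change near zero, while g3 stands out as a candidate for a proper differential expression test.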

  • www.sourcebioscience.com15

    Next Generation Sequencing Applications

    ChIP-Seq: Genome-wide protein-DNA interactions

    ChIP-Seq
    • Chromatin immunoprecipitation (ChIP) isolates protein-bound DNA
    • Followed by deep sequencing of DNA fragments (Seq)
    • Facilitates genome-wide mapping of DNA-protein interactions
    • How TFs and other chromatin-associated factors can affect phenotype
    • Regulation/Structural Analysis

    ChIP-Seq vs. ChIP-chip
    • No prior knowledge of content required
    • Similar approach can be used to map genomic methylation

  • www.sourcebioscience.com16

    Next Generation Sequencing Applications

    ChIP-Seq: Genome-wide protein-DNA interactions

    ChIP-Seq Study of Haematopoietic Stem Cells
    • Interest in haematopoiesis and genetic circuitry of blood cell development
    • Tal1: T-cell acute lymphocytic leukaemia protein 1
      – TF that controls development and differentiation of Haematopoietic Stem Cells (HSCs)
      – Very few target genes had been validated
    • ChIP-Seq approach taken to generate a genome-wide catalogue of Tal1 binding events in stem cell line
    • Use Illumina BeadStudio ChIP-Seq module to identify peaks (potential chromatin binding sites)
    • Followed by in vivo validation (foetal liver, transgenic mice)
    • Allows construction of in vivo validated network of 17 factors and respective regulatory elements
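Peak identification amounts to finding genomic intervals where ChIP reads pile up more than expected. The sketch below shows the idea with fixed-width bins and a count threshold; it is not the algorithm used by the BeadStudio ChIP-Seq module, and the function name, bin size, threshold and read positions are all assumptions made for illustration.

```python
from collections import Counter


def find_enriched_regions(read_starts, bin_size=100, min_reads=5):
    """Return merged (start, end) intervals whose bins hold many reads.

    read_starts: 0-based start positions of mapped reads on one chromosome.
    bin_size, min_reads: hypothetical parameters.
    Intervals are half-open: start inclusive, end exclusive.
    """
    bins = Counter(pos // bin_size for pos in read_starts)
    hot = sorted(b for b, n in bins.items() if n >= min_reads)

    regions = []
    for b in hot:
        start, end = b * bin_size, (b + 1) * bin_size
        if regions and regions[-1][1] == start:
            # Adjacent enriched bins are merged into one region.
            regions[-1] = (regions[-1][0], end)
        else:
            regions.append((start, end))
    return regions


# Toy read positions with two clusters and some background reads.
read_starts = [105, 110, 120, 150, 190, 210, 220, 230, 260, 299,
               640, 910, 915, 920, 930, 960, 1500]
peaks = find_enriched_regions(read_starts)
print(peaks)
```

A production peak caller would additionally compare against a control sample and model background read density, rather than use a fixed threshold.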

  • www.sourcebioscience.com17

    Applications of NGS to Cancer Genetics

    Levels of information extraction, data integration

    [Figure repeated from slide 4: levels of information extraction (Generate, Process, Identify, Analyse, Integrate) across Variant Detection, ChIP-Seq, RNA-Seq Quantification and RNA-Seq Discovery]