![Page 1: Introduction to human genomics and genome informatics · Head, Bioinformatics & Integrative Genomics Adult Cancer Program, Lowy Cancer Research Centre Prince of Wales Clinical School,](https://reader034.vdocument.in/reader034/viewer/2022043005/5f8d22ea10823c23571f8333/html5/thumbnails/1.jpg)
Introduction to human genomics and genome informatics
Prince of Wales Clinical School
Session 1
Dr Jason Wong
ARC Future Fellow
Head, Bioinformatics & Integrative Genomics
Adult Cancer Program, Lowy Cancer Research Centre
Prince of Wales Clinical School, Faculty of Medicine, UNIVERSITY OF NEW SOUTH
WALES, SYDNEY NSW 2052
What we will cover
• Structure of the human genome
• Layers of genomic information
– DNA (Sequence variation)
– RNA (Genes & gene expression)
– Epigenetics (DNA methylation)
– Epigenetics (Histone code/Transcription factors)
• Genomic data acquisition technologies
– Microarray
– Next-generation sequencing
Structure of human genome
• Consists of 23 pairs of chromosomes.
• Chromosomes come in pairs, meaning that the genome is diploid.
• Each individual chromosome is made up of double-stranded DNA.
• Approximately 3 billion bases in total.
Reference human genome
• Human genomes vary significantly between individuals (~0.1%)
• Computationally, a reference genome is used.
• Important things to note about the reference genome:
– Is haploid (i.e. only 1 sequence)
– Is a composite sequence (i.e. does not correspond to any one person’s genome)
Representation of genomic data
• Genomic data is most commonly represented in two ways:
1. Sequence data – fasta format (.fa or .fasta)
>chr1
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACAGTACTGGCGGATTATAGGGAAACACCCGGAGCATATGCTGTTTGGTC
TCAgtagactcctaaatatgggattcctgggtttaaaagtaaaaaataaa
tatgtttaatttgtgaactgattaccatcagaattgtactgttctgtatc
ccaccagcaatgtctaggaatgcctgtttctccacaaagtgtttactttt
....
2. Location data – bed format (.bed)
Columns: chromosome, start, end, name, score, strand
chr1 934343 935552 HES4 0 -
chr1 948846 949919 ISG15 0 +
...
All about genomic formats here - http://genome.ucsc.edu/FAQ/FAQformat.html
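As a rough illustration of how these two formats might be read programmatically, the sketch below parses a shortened copy of the fasta and bed examples. The function names `read_fasta` and `read_bed` are made up for this example, and the parser assumes six-column bed lines (the format itself only requires the first three columns).

```python
from io import StringIO

def read_fasta(handle):
    """Return {sequence name: sequence} from fasta-formatted lines."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the name is the first word after '>'
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}

def read_bed(handle):
    """Yield one dict per bed line (assumes six whitespace-separated columns)."""
    for line in handle:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or shorter lines in this simplified sketch
        chrom, start, end, name, score, strand = fields[:6]
        yield {"chrom": chrom, "start": int(start), "end": int(end),
               "name": name, "score": score, "strand": strand}

# Shortened versions of the examples shown on the slide
fasta_text = ">chr1\nNNNNNNNNNN\nACAGTACTGG\nTCAgtagact\n"
bed_text = "chr1 934343 935552 HES4 0 -\nchr1 948846 949919 ISG15 0 +\n"

genome = read_fasta(StringIO(fasta_text))
features = list(read_bed(StringIO(bed_text)))

# bed intervals are 0-based and half-open, so the length is end - start
lengths = {f["name"]: f["end"] - f["start"] for f in features}
```

In practice, established libraries are usually preferable to hand-written parsers; this sketch only shows the shape of the data.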
What do chromosomes contain?
Genes: ~1.2% coding, ~2% non-coding
Regulatory regions: ~2%
Repetitive elements comprise another ~50% of the human genome
Layers of genetic information
• DNA sequence variation
• Gene expression
– Coding
– Non-coding
• Epigenetic regulation
– DNA methylation
– Histone/transcription factor binding
Sequence variation
Variations in DNA sequence
• Cytological level:
– Chromosome numbers
– Segmental duplications, rearrangements, and deletions
• Sub-chromosomal level:
– Transposable elements
– Short Deletions/Insertions, Tandem repeats
• Sequence level:
– Single Nucleotide Polymorphisms (SNPs)
– Small Nucleotide Insertions and Deletions (Indels)
Sequence variation
• Single nucleotide polymorphisms (SNPs)
– DNA sequence variations that exist between members of a species.
– They are inherited and therefore present in all cells.
• Somatic mutations
– Arise during an individual’s lifetime, i.e. are only present in some cells.
– Often observed in cancer cells.
Types of SNPs/Mutations
• Most SNPs and mutations fall in intergenic regions.
• Within genes, they can either fall in the non-coding or coding regions.
• Within coding regions, they can either leave the encoded amino acid unchanged (synonymous) or change it (non-synonymous).
[Figure: gene diagram labelling the intergenic region, the TSS, non-coding and coding regions, and synonymous and non-synonymous variants]
Effects of sequence variation
• Non-synonymous variants:
– Missense (changes an amino acid, which may alter protein structure)
– Nonsense (truncates the protein)
• Synonymous or non-coding variants can:
– Alter transcriptional/translational efficiency
– Alter mRNA stability
– Alter gene regulation (e.g. alter TF binding)
– Alter RNA regulation (e.g. affect miRNA binding)
• The majority of sequence variants are neutral.
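To make the synonymous/missense/nonsense distinction concrete, the following sketch classifies a single-codon change using the standard genetic code. The function name `classify_codon_change` is invented for this example, and the sketch only considers one codon at a time in the correct reading frame; real variant annotation tools also account for transcripts, strand and splicing.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_codon_change(ref_codon, alt_codon):
    """Label a codon substitution as synonymous, missense or nonsense."""
    ref_aa = CODON_TABLE[ref_codon.upper()]
    alt_aa = CODON_TABLE[alt_codon.upper()]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"   # new stop codon truncates the protein
    # Simplification: loss of a stop codon is also reported as missense here
    return "missense"

examples = {
    ("GAA", "GAG"): classify_codon_change("GAA", "GAG"),  # Glu -> Glu
    ("GAG", "GTG"): classify_codon_change("GAG", "GTG"),  # Glu -> Val
    ("TGG", "TGA"): classify_codon_change("TGG", "TGA"),  # Trp -> stop
}
```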
Genes and gene expression
Types of genes
• A gene is a functional unit of DNA that is transcribed into RNA.
• Total genes in the human genome – 57,445
Source: GENCODE (version 18)
Coding genes
Source: http://www.news-medical.net
• Traditionally considered to be the most important functional unit of genomes.
• ~20,000 in the human genome.
• Due to alternative splicing, one gene can produce many different proteins.
Non-coding genes
microRNA
• Plays a role in post-transcriptional regulation.
• Only discovered in 1993.
• Acts by causing either RNA degradation or inhibition of translation.
• Implicated in many aspects of health and disease including:
– Development
– Cancer
– Heart disease
Long non-coding RNA (lncRNA)
• Arbitrarily defined as non-coding transcripts > 200 nt in length.
• Implicated in many functions including:
– Altering protein/DNA interaction.
– Binding mRNA.
– Acting as a sink for miRNAs.
– Etc…
• Unlike coding genes and miRNAs, lncRNAs are less conserved and the functions of many are unknown.
Prensner and Chinnaiyan (2011) Cancer Discov. 1:391
Gene expression
• Measures the level of RNA (typically mRNA) in the sample.
• Generally microarray- or sequencing-based.
• Commonly used for measuring differential expression
– between samples, or
– between genes
• Computational analysis and normalisation of expression data can be complicated.
Source: OPENbeta
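One reason normalisation matters can be shown with a deliberately simple sketch: counts-per-million scaling followed by a log2 fold change between two samples. The gene names and counts are hypothetical, and library-size scaling is only one of several approaches; dedicated differential expression tools use more elaborate models.

```python
import math

def counts_per_million(counts):
    """Scale raw counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per gene; the pseudocount avoids division by zero."""
    return {gene: math.log2((cpm_b[gene] + pseudocount) /
                            (cpm_a[gene] + pseudocount))
            for gene in cpm_a}

# Hypothetical raw counts; sample B was sequenced twice as deeply as sample A
sample_a = {"GENE1": 100, "GENE2": 300, "GENE3": 600}
sample_b = {"GENE1": 400, "GENE2": 600, "GENE3": 1000}

lfc = log2_fold_change(counts_per_million(sample_a),
                       counts_per_million(sample_b))
```

Here the raw count of GENE2 doubles between samples, yet after scaling by sequencing depth its expression is unchanged, whereas GENE1 is genuinely higher in sample B.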
Gene-set/Pathway analysis
• Differential expression of individual genes is not necessarily informative.
• Genes are often grouped into gene-sets based on ontology or biological pathways.
• Gene-set and pathway analyses are therefore a common downstream step after differential gene expression analysis.
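The slides do not name a specific method, but one widely used form of gene-set analysis is an over-representation test based on the hypergeometric distribution. The sketch below is a minimal version of that idea; the function name and all numbers are illustrative assumptions.

```python
from math import comb

def enrichment_p_value(n_background, n_in_set, n_selected, n_overlap):
    """Probability of seeing at least n_overlap gene-set members among
    n_selected genes drawn at random from n_background genes."""
    upper = min(n_in_set, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(comb(n_in_set, k) * comb(n_background - n_in_set, n_selected - k)
               for k in range(n_overlap, upper + 1))
    return tail / total

# Hypothetical numbers: 20,000 genes measured, 100 in the pathway,
# 200 differentially expressed, 10 of which belong to the pathway
p = enrichment_p_value(20000, 100, 200, 10)
```

With these made-up numbers only about one pathway gene would be expected by chance, so an overlap of ten yields a very small p-value. When many gene-sets are tested, the p-values would normally also be corrected for multiple testing.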
Gene regulation
Gene regulation/epigenetics
• Epigenetics is the study of mechanisms that alter cellular function independently of any changes in DNA sequence.
• Mechanisms include:
– DNA methylation
– Nucleosome positioning/Histone modification
– Transcription factors
– Non-coding RNA
DNA methylation
• DNA is methylated on cytosines, predominantly in CpG dinucleotides.
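As a small illustration, candidate methylation sites can be located by scanning a sequence for CpG dinucleotides, i.e. a C immediately followed by a G on the same strand. The function name `cpg_positions` is invented for this sketch, and positions are reported 0-based.

```python
def cpg_positions(sequence):
    """Return the 0-based positions of the C in every CpG dinucleotide."""
    seq = sequence.upper()  # treat soft-masked (lowercase) bases the same way
    positions = []
    start = seq.find("CG")
    while start != -1:
        positions.append(start)
        start = seq.find("CG", start + 1)
    return positions

sites = cpg_positions("ACGTTCGA")
```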
Nucleosomes & Histones
• Histones are proteins that package DNA into nucleosomes.
Histone modifications
• Acetylation
• Methylation
• Phosphorylation
• Ubiquitination
• Can enhance or repress gene expression
Transcription factors
• Proteins that bind DNA to regulate gene expression.
• Typically bind at gene promoters or enhancers.
Studying gene regulation
• Has traditionally been more difficult than studying gene expression because:
– The locations of many regulatory regions are poorly defined.
– Regulatory regions differ greatly between cell types.
– There are many modes of gene regulation.
• Next-generation sequencing technologies have enabled great progress to be made.
Genomic technologies
• Microarray-based data
– SNP profiling
– Copy number profiling
– DNA methylation profiling
– Gene expression profiling
• Next-generation sequencing
– “Swiss-army knife” of genomics
Data acquisition
• Microarrays rely on fluorescence-based detection of DNA hybridised to complementary probes on the array.
• Can be used to study DNA or any molecule that can be converted to cDNA.
– SNP array (probes for two alleles)
– Methylation array (probes for bisulfite-converted DNA)
– Expression array (probes for exonic DNA regions)
• Limited by the probes present on the array.
Microarray gene expression analysis
• Gene signatures
• Sample classification
Gene            Value
D26528_at       193
D26561_cds1_at  -70
D26561_cds2_at  144
D26561_cds3_at  33
D26579_at       318
D26598_at       1764
D26599_at       1537
D26600_at       1204
D28114_at       707
Class  Sno  D26528  D63874  D63880  …
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
…
[Figure: workflow from microarray chips, to images scanned by laser, to datasets, to data mining and analysis]
Next-generation sequencing
What is NGS?
NGS encompasses a number of different technologies. Illumina sequencing is used here as an example.
Figures provided by Illumina Inc.
Sequences are inferred from fluorescence signals during synthesis
Figures provided by Illumina Inc.
Short sequencing reads
Alignment
[Figure: short sequencing reads aligned to the genome, shown alongside a gene]
NGS file formats
• Fastq – Stores sequencing reads from NGS. Contains read sequences and quality scores.
• BAM/SAM – A BAM file (.bam) is a binary file containing the coordinates where each read has mapped in a genome. SAM is the same file in text format.
• BedGraph/Wig – For storing continuous profile information for visualisation.
• VCF – For storing information about variants.
https://powcs.med.unsw.edu.au/sites/default/files/powcs/page/example_file_formats.zip
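To give a feel for the Fastq format, the sketch below reads four-line records and decodes the quality string into Phred scores. It assumes the Phred+33 encoding used by current Illumina output and a well-formed file without blank lines; the read shown is made up, and `read_fastq` is a name chosen for this example.

```python
from io import StringIO

def read_fastq(handle):
    """Yield (read name, sequence, Phred scores) from 4-line Fastq records."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of input
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        # Phred+33: each character encodes a score as its ASCII code minus 33
        scores = [ord(ch) - 33 for ch in quality]
        yield header[1:], sequence, scores

# Made-up record: name line, bases, separator, one quality character per base
fastq_text = "@read1\nACGT\n+\nII5!\n"
records = list(read_fastq(StringIO(fastq_text)))
```

A Phred score Q corresponds to an estimated base-calling error probability of 10^(-Q/10), so a score of 40 means roughly one error in 10,000 calls.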
Pros/cons of each technology
• NGS
– Greater dynamic range (only limited by depth of sequencing)
– Coverage of genome does not need to be limited.
– Many more applications from sequencing data.
– Data analysis and management can be challenging.
• Microarrays
– Microarrays are still significantly cheaper.
– Largest public datasets are likely to be microarray based.
– Data analysis pipelines are well standardised.
Example of using public resources to tell us more about our data
http://www.powcs.med.edu.au/OncoCis
OncoCis uses public data from various sources to assign potential function to non-coding mutations
Given a non-coding mutation, what do we want to know?
1. Does the mutation fall within a cis-regulatory region (ENCODE/Human Epigenome Atlas)?
2. Is the mutation site highly conserved (UCSC)?
3. What gene might the mutation affect (FANTOM5)?
4. What transcription factor binding site might be altered (JASPAR)?
5. Does the mutation affect a gene which is druggable (DGIdb)?
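The first question above is essentially an interval lookup: does a mutation position fall inside any annotated region? The sketch below shows one simple way to do this. It is not how OncoCis itself is implemented; the region names and coordinates are hypothetical, and intervals are treated as 0-based, half-open (bed style).

```python
def build_index(regions):
    """Group (chrom, start, end, name) regions by chromosome, sorted by start."""
    index = {}
    for chrom, start, end, name in regions:
        index.setdefault(chrom, []).append((start, end, name))
    for intervals in index.values():
        intervals.sort()
    return index

def regions_containing(index, chrom, position):
    """Return names of all regions with start <= position < end."""
    hits = []
    for start, end, name in index.get(chrom, []):
        if start > position:
            break  # sorted by start, so no later region can contain position
        if position < end:
            hits.append(name)
    return hits

# Hypothetical regulatory regions
regions = [
    ("chr1", 1000, 2000, "enhancer_A"),
    ("chr1", 1500, 2500, "promoter_B"),
    ("chr2", 100, 200, "enhancer_C"),
]
index = build_index(regions)
hits = regions_containing(index, "chr1", 1600)
```

For genome-scale annotation sets, an interval tree or a dedicated tool would be a more efficient choice than this linear scan.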
Gene mapping from FANTOM5 or GREAT
Link out to UCSC genome browser
Epigenetic data from ENCODE/Epigenome project
Conservation data from UCSC
Motif data from JASPAR
FANTOM5 regulatory data
Link out to Drug-Gene interaction database (DGIdb)