1
MBG-487
Microarrays - I
2
HUMAN GENOME PROJECT
3
Knowledge about the effects of DNA variations among
individuals can lead to revolutionary new ways to
diagnose, treat, and someday prevent the thousands
of disorders that affect us. Besides providing clues to
understanding human biology, learning about
nonhuman organisms' DNA sequences can lead to an
understanding of their natural capabilities that can be
applied toward solving challenges in health care,
agriculture, energy production, environmental
remediation, and carbon sequestration.
What are some practical benefits to learning about DNA?
4
Genome
• The complete complement of an organism’s genes; an organism’s genetic material.
5
•identify all the approximately 20,000-25,000 genes in
human DNA,
•determine the sequences of the 3 billion chemical base
pairs that make up human DNA
•store this information in databases,
•improve tools for data analysis,
•transfer related technologies to the private sector, and
•address the ethical, legal, and social issues (ELSI) that
may arise from the project.
GOALS OF HUMAN GENOME PROJECT
6
Drosophila melanogaster
Caenorhabditis elegans
Arabidopsis thaliana
Saccharomyces cerevisiae
E. coli
Mus musculus
Bacteriophage
Fugu rubripes
Homo sapiens
7
Genome sequencing helps in:
• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics
These in turn lead to advances in:
• medicine
• agriculture
• biotechnology
• understanding evolution and other basic science questions
8
Some current and potential applications of genome
research include:
• Molecular medicine
• Energy sources and environmental applications
• Risk assessment
• Bioarchaeology, anthropology, evolution, and human migration
• DNA forensics (identification)
• Agriculture, livestock breeding, and bioprocessing
9
Molecular Medicine
• Improved diagnosis of disease
• Earlier detection of genetic predispositions to disease
• Rational drug design
• Gene therapy and control systems for drugs
• Pharmacogenomics "custom drugs"
10
Bioarchaeology, Anthropology, Evolution, and
Human Migration
•Study evolution through germline mutations in lineages
•Study migration of different population groups based
on female genetic inheritance
•Study mutations on the Y chromosome to trace
lineage and migration of males
•Compare breakpoints in the evolution of mutations
with ages of populations and historical events
11
• Understanding genomics will help us understand human
evolution and the common biology we share with all of
life.
• Comparative genomics between humans and other
organisms such as mice has already led to the identification
of similar genes associated with diseases and traits.
• Further comparative studies will help determine the yet-
unknown function of thousands of other genes.
12
Genes (i.e., protein coding)
But. . . only <2% of the human genome encodes proteins
Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• “junk” (including transposons, retroviral insertions, etc.)
It’s still uncertain/controversial how much of the genome is composed of any of these classes
The answers will come from experimentation and bioinformatics.
What’s in a genome?
13
Why sequence is not enough
• Identifying genes and control regions is not enough to decipher the inner workings of the cell:
• We need to determine the function of genes.
• We would like to determine which genes are activated in
which cells and under which conditions.
• We would like to know the relationships between genes
(protein-DNA, protein-protein interactions etc.).
• We would like to model the various dynamic systems in
the cell.
14
• transcription
• post transcription (RNA stability)
• post transcription (translational control)
• post translation (not considered gene regulation)
usually, when we speak of gene regulation, we are referring to transcriptional regulation
the “transcriptome”
Genes can be regulated at many levels
DNA → (TRANSCRIPTION) → RNA → (TRANSLATION) → PROTEIN
The “Central Dogma”
15
• high throughput assays
• robotics
• high speed computing
• statistics
• bioinformatics
Because of the vast amounts of data that are generated, we need new approaches
16
High-throughput Technologies and ‘OMICS’ Science
17
Terms and Definitions
Genomics
Analysis of an organism's genome – identification of single genes and their function
Functional genomics
Global and dynamic survey of gene expression; detection of functional relationship
Proteomics
Analysis of protein-sequences, expression-patterns and protein-interactions of a given organism
Bioinformatics
Computer-aided processing of biological data: detection of complex interrelations; interpretation and conclusion; structuring, saving, and searching
18
Functional genomics
The study of genome-wide patterns of gene expression and of the mechanisms by which gene expression is coordinated.
19
Functional Genomics
20
Functional Genomics
21
High-throughput analysis
22
Idea: finding which genes are expressed by measuring the mRNA amount in the cell (or other materials).
Finding gene expression
23
Microarrays can show us when
and where genes are expressed.
But what regulates this
expression?
24
One way of looking at the transcriptome is with DNA microarrays. With microarrays, the expression of thousands of genes can be assessed in a single experiment.
cDNAs or oligonucleotides representing all genes in the genome are deposited on a glass slide using a robotic arrayer:
Looking at the transcriptome: DNA
microarrays
Benfey, P. and Protopapas, A. Genomics. 2005. New Jersey: Pearson Prentice Hall. pp. 131-2
25
26
Why use microarrays?
•Each cell type expresses ~10,000-20,000 genes
•Physiological and pathophysiological responses are
linked to changes in gene expression
•Knowledge of gene expression variation at different
states may create new hypotheses about gene
function and underlying mechanisms
27
Microarray Technology
• Microarray:
– New technology (first paper: 1995)
– Allows study of thousands of genes at the same time
– Glass slide of DNA molecules
• Molecule: string of bases (25 bp – 500 bp)
• Uniquely identifies the gene or unit to be studied
http://kbrin.a-bldg.louisville.edu/CECS694/
28
Fabrication of Microarrays
• Size of a microscope slide
Images: http://www.affymetrix.com/
29
Differing Conditions
• Ultimate Goal:
– Understand expression level of genes under different conditions
• Helps to:
– Determine genes involved in a disease
– Determine pathways to a disease
– Serve as a screening tool
30
Gene Conditions
• Cell types (brain vs. liver)
• Developmental (fetal vs. adult)
• Response to stimulus
• Gene activity (wild vs. mutant)
• Disease states (healthy vs. diseased)
31
Expressed Genes
• Genes under a given condition
– mRNA extracted from cells
– mRNA labeled
– Labeled mRNA is mRNA present in a given condition
– Labeled mRNA will hybridize (base pair) with corresponding sequence on slide
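The base-pairing step above can be illustrated with a minimal sketch. It assumes DNA letters only and treats hybridization as perfect complementarity, which is a simplification; the probe sequence and the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def can_hybridize(probe, target):
    """True if target is the exact reverse complement of probe.

    Simplification: real hybridization tolerates some mismatches and
    depends on temperature and salt; here only perfect pairing counts.
    """
    return reverse_complement(probe) == target.upper()


probe = "ATGGCCTTAG"                 # hypothetical sequence on the slide
target = reverse_complement(probe)   # labeled strand that pairs with it
print(target, can_hybridize(probe, target))
```

A labeled molecule that is not complementary to any spot stays unbound and is washed away, so only expressed sequences leave a signal.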
32
33
34
35
36
37
38
39
40
DNA microarray Probes
Production of high-density DNA microarrays is complex and requires:
- sequence information of the organism
- gene transcript analysis
- gene clustering and annotation
- probe design
cDNA: reverse-transcribed from the cellular mRNA population. cDNA libraries (~10^5 clones) represent a snapshot of cellular gene expression.
PCR samples for probe generation (300-800 nt): amplified DNA needs purification from enzymes and nucleotides; such contaminants can interfere with the microarray analysis.
Oligonucleotides: 50-70mers, or multiple 25mers; less time and effort; precision.
Surface chemistry: to facilitate the attachment of probes to the slide.
41
Chip design and content
standard size: 1" x 3" (2.54 x 7.62 cm) glass slide
DNA fragments (corresponding to a particular gene) are spotted onto the array's surface along a defined grid
spot size: ~100 μm; >20,000 individual samples
Microarray platforms: full genome chips
Affymetrix GeneChips (A, B, C sets):
- in situ 25mer oligos, 16 probes/gene
- one color, biotinylated targets, post-labeling with SA-PE (streptavidin-phycoerythrin)
- closed system
Agilent (22k, 44k):
- 60mer oligos, 1 probe/gene
- two-color labeling (Cy dyes)
- open source
42
MIAME - Minimum Information About a Microarray Experiment
• to enable the interpretation of the results
• to potentially reproduce the experiment and verify the conclusions
• to make microarray data available to the scientific community
MIAME principles
Experiment design
– The goal of the experiment
– Keywords, e.g. time course, cell type comparison
– Experimental factors: parameters or conditions tested
Samples used, extract preparation and labeling
– The origin of each biological sample
– Manipulation of samples and protocols used
Hybridization procedures and parameters
Measurement data and specifications
Data extraction and processing protocols
– Image scanning hardware and software
– Processing procedures and parameters
– Normalization, transformation and data selection
Array design
– General array design, including the platform type
– Array features and annotation
43
Microarray Flow
44
Sample Preparation
45
Two major technologies
• cDNA arrays
- probes are placed on the slides
- allows comparison of different cell types
• Oligonucleotide arrays
- partial sequences are printed on the array
- measure values in one tissue type
46
Two Different Types of Microarrays
• Custom spotted arrays (up to 20,000 sequences)
– cDNA
– Oligonucleotide
• High-density (up to 100,000 sequences) synthetic oligonucleotide arrays
– Affymetrix (25 bases)
47
Custom Arrays
• Mostly cDNA arrays
• 2-dye (2-channel)
– RNA from two sources (cDNA created)
• Source 1: labeled with red dye
• Source 2: labeled with green dye
48
Two Channel Microarrays
• Microarrays measure gene expression
• Two different samples:
– Control (green label)
– Sample (red label)
• Both are washed over the microarray
– Hybridization occurs
– Each spot is one of 4 colors
49
50
cDNA microarray experiments
mRNA levels compared in many different contexts
• Different tissues, same organism (brain vs. liver)
• Same tissue, same organism (treatment vs. control, tumor vs. non-tumor)
• Same tissue, different organisms (wild type vs. knockout, transgenic, or mutant)
• Time course experiments (effect of treatment, development)
• Other special designs (e.g. to detect spatial patterns).
51
cDNA Microarray
• Measure the relative levels of expression
• Parallel analysis
• Competitive hybridization
• Need cDNA library
mRNA → (reverse transcription) → cDNA
52
[Figure: cDNA microarray workflow. Panel labels: PCR amplification, printing, samples, reverse transcription, labeling, hybridization, laser scan, expression data]
53
54
Exponential Amplification of a Gene
55
Labeling and Hybridization of
Sample cDNAs
56
57
cDNA microarrays
Compare the genetic expression in two samples of cells
PRINT: cDNA from one gene on each spot
SAMPLES: cDNA labelled red/green
e.g. treatment / control
normal / tumor tissue
58
HYBRIDIZE
Add equal amounts of labelled cDNA samples to microarray.
SCAN
Laser Detector
59
Looking at the transcriptome: DNA
microarrays
extract mRNA
make labeled cDNA
hybridize to microarray
cell type A
cell type B
more in “A ”
more in “B”
equal in A & B
60
61
Microarrays provide a means to measure gene expression
62
63
64
(Slide source: http://www.bsi.vt.edu/)
65
Microarray Image Analysis
• Microarrays detect gene interactions: 4 colors:
– Green: high in control
– Red: high in sample
– Yellow: equal
– Black: none
• Problem is to quantify image signals
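As a rough illustration of turning the four colors into numbers, the sketch below classifies a spot from its red and green intensities. The `floor` and `fold` thresholds are illustrative assumptions, not values from the slides.

```python
def spot_color(red, green, floor=100.0, fold=2.0):
    """Classify a spot by its background-corrected red and green intensities.

    floor and fold are illustrative thresholds, not values from the slides.
    """
    if red < floor and green < floor:
        return "black"    # neither sample hybridized
    if red >= fold * green:
        return "red"      # higher in the sample channel
    if green >= fold * red:
        return "green"    # higher in the control channel
    return "yellow"       # roughly equal in both channels


for r, g in [(5000, 400), (300, 4200), (2500, 2300), (20, 35)]:
    print(r, g, spot_color(r, g))
```

In practice the color is only a visual summary; the analysis works on the intensities themselves.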
66
Information Extraction
• Spot intensities
– mean (pixel intensities)
– median (pixel intensities)
• Background values
– Local
– Morphological opening
– Constant (global)
– None
• Quality information
Take the average
Speed Group Microarray Page
http://stat-www.berkeley.edu/users/terry/zarray/Html/image.html
Signal
Background
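A minimal sketch of the spot summary described on this slide, assuming the image analysis software has already assigned pixels to the spot (foreground) and to its local surroundings (background). The pixel values and the function name are hypothetical.

```python
from statistics import mean, median


def spot_signal(foreground_pixels, background_pixels, use_median=True):
    """Summarize a spot as foreground minus local background.

    The median is less sensitive than the mean to a few very bright
    pixels (dust, scratches).
    """
    summarize = median if use_median else mean
    return summarize(foreground_pixels) - summarize(background_pixels)


fg = [820, 790, 805, 2400, 810]   # hypothetical pixels, one is an outlier
bg = [100, 110, 90, 105, 95]      # hypothetical local background pixels
print(spot_signal(fg, bg))                    # median-based summary
print(spot_signal(fg, bg, use_median=False))  # mean-based summary
```

With these numbers the single bright pixel pulls the mean-based value well above the median-based one, which is the usual argument for reporting medians.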
67
Data verification
• Gene expression ratio?
[Figure: expression level (low to high) of Gene A and Gene B in Sample 1 and Sample 2]
68
Quantification of expression
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
(fg = foreground, bg = background) and
Green intensity = Gfg - Gbg
and combine them in the log (base 2) ratio
Log2( Red intensity / Green intensity)
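The calculation above can be sketched as follows. Returning `None` for spots whose background-corrected intensity is not positive is an assumption made here for illustration; the slides do not say how such spots are handled.

```python
import math


def log2_ratio(r_fg, r_bg, g_fg, g_bg):
    """Log (base 2) ratio of background-corrected red and green intensities."""
    red = r_fg - r_bg      # Red intensity = Rfg - Rbg
    green = g_fg - g_bg    # Green intensity = Gfg - Gbg
    if red <= 0 or green <= 0:
        return None        # ratio undefined (assumed handling)
    return math.log2(red / green)


print(log2_ratio(1700, 100, 500, 100))   # red is four times green
print(log2_ratio(300, 100, 900, 100))    # green is four times red
```

The log scale treats up- and down-regulation symmetrically: a four-fold increase and a four-fold decrease give values of equal size and opposite sign.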
69
Data Normalization
• Purpose
Adjust for bias arising from variation in microarray technology,
e.g. differences in labeling, scanner settings, and spatial position.
• Within-array normalization
Take the logarithm of the ratio, then subtract the mean log ratio.

Red     Green   Difference (G-R)   Ratio (G/R)   Log2 ratio   Centered log2 ratio
16500   15104   -1396              0.915         -0.128       -0.048
357     158     -199               0.443         -1.175       -1.095
8250    8025    -225               0.973         -0.039        0.040
978     836     -142               0.855         -0.226       -0.146
65      89       24                1.369          0.453        0.533
684     1368     684               2.000          1.000        1.080
13772   11209   -2563              0.814         -0.297       -0.217
856     731     -125               0.854         -0.228       -0.148
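A minimal sketch of the within-array centering, using the red and green intensities from the table and the same G/R ratio. The function name is hypothetical, and results computed at full precision can differ from the table in the last digit because the table rounds intermediate values.

```python
import math


def center_log_ratios(red, green):
    """Return (log2 ratios, mean-centered log2 ratios) for paired intensities."""
    log_ratios = [math.log2(g / r) for r, g in zip(red, green)]
    mean_log = sum(log_ratios) / len(log_ratios)
    centered = [x - mean_log for x in log_ratios]
    return log_ratios, centered


red = [16500, 357, 8250, 978, 65, 684, 13772, 856]
green = [15104, 158, 8025, 836, 89, 1368, 11209, 731]

log_ratios, centered = center_log_ratios(red, green)
for r, g, lr, c in zip(red, green, log_ratios, centered):
    print(r, g, round(lr, 3), round(c, 3))
```

After centering, the log ratios on the array average to zero, which removes a global imbalance between the two channels.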
70
Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
Rows: genes; columns: slides

Gene   slide 1   slide 2   slide 3   slide 4   slide 5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...

Gene expression level of gene 5 in slide 4 (here 1.09)
= Log2( Red intensity / Green intensity)
These values are conventionally displayed on a red (>0) yellow (0) green (<0) scale.
71
Microarray gene expression data
• Microarray data converted to n x p table
(p – gene number, n – sample number)

          Sample 1   Sample 2
Gene 1    1.04       2.08
Gene 2    3.2        10.5
Gene 3    3.34       1.05
Gene 4    1.85       0.09
72
Statistical Analysis
• Differences in ratios due to
– random variation
– meaningful changes
• Convention
– ratio >= 2 or ratio <= ½
• Analysis of variance (ANOVA)
– 4 and 10 replicates of each treatment
– statistical significance
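The two-fold convention can be written as a small rule. This is a sketch of the threshold only; it says nothing about statistical significance, which is what the replicates and ANOVA are for. The function name and the example ratios are hypothetical.

```python
def fold_change_call(ratio, threshold=2.0):
    """Label an expression ratio using the two-fold convention."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if ratio >= threshold:
        return "up"           # ratio >= 2
    if ratio <= 1.0 / threshold:
        return "down"         # ratio <= 1/2
    return "unchanged"        # could be random variation


for ratio in [2.0, 0.5, 1.3, 0.4]:
    print(ratio, fold_change_call(ratio))
```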
73
Single Color Microarrays
• Prefabricated
– Affymetrix (25mers)
• Custom
– cDNA (500 bases or so)
– Spotted oligos (70-80 bases)
74
Single Color Microarrays
• Expressed sequences washed over chips
• Expressed genes hybridize
• Light passed under to see intensity (or hybridized oligos show dark color)
75
Affymetrix GeneChip System
• Large number of genes and ESTs
• Several species
• Oligonucleotide arrays for expression monitoring are
designed and synthesized based on sequence
information alone, without the need for physical
intermediates such as clones, PCR products, cDNAs.
• Printed oligos are of the same length, allowing for
equal hybridization.
76
Affymetrix Technology
DESOKY, 2003
77
Affymetrix Technology
Biotin (one dye) instead of 2 colors
One treatment per chip
• For two conditions, need two slides
• Compare patterns of both slides to get results
11, 16, or 20 probe pairs per gene
DESOKY, 2003
78
Affymetrix Technology
DESOKY, 2003
79
Affymetrix Genechip: experimental steps
80
81
Lithography
• It is a printing technology.
• Lithography was invented by Alois Senefelder in Germany in 1798.
• The printing and non-printing areas of the plate are all at the same level, as opposed to intaglio and relief processes in which the design is cut into the printing block.
• Lithography is based on the chemical repellence of oil and water.
82
Affymetrix Technology
Light-directed synthesis of DNA chips
• Synthetic linkers modified with photochemically removable protecting groups are attached to a glass substrate, and light is directed through a photolithographic mask to specific areas on the surface to produce localized photodeprotection.
• The first of a series of chemical building blocks, hydroxyl-protected deoxynucleosides, is incubated with the surface, and chemical coupling occurs at those sites that have been illuminated in the preceding step.
• Next, light is directed to different regions of the substrate by a new mask, and the chemical cycle is repeated.
• Current technology allows for 300,000 polydeoxynucleotides in a 1.28 x 1.28 cm array.
83
Affymetrix Array Construction
STROMBERG, 2003
84
85
86
PM to maximize hybridization
MM to ascertain the degree of cross-
hybridization
Affymetrix Design of probes
87
Affy Tech – Number of Features
[Figure: multiple 25-mer oligo probes (features) placed along the gene sequence, 5’ to 3’]
– Use multiple oligos per gene
– Redundancy improves detection and quantification of the target gene
DESOKY, 2003
88
Affy Tech – Mismatches for Control
[Figure: multiple 25-mer oligo probes along the gene sequence, 5’ to 3’, each shown as a perfect match / mismatch pair]
• Each probe has a “control” – a DNA sequence which differs only slightly from the feature
• In a 25-mer, the mismatch sequence differs in the 13th position (A-T or G-C)
DESOKY, 2003
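A minimal sketch of deriving a mismatch probe from a perfect match 25-mer, assuming the 13th base is swapped for its complement (A with T, G with C). The probe sequence and the function name are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(perfect_match, position=13):
    """Return the mismatch probe for a 25-mer perfect match probe.

    position is 1-based; the base there is replaced by its complement.
    """
    if len(perfect_match) != 25:
        raise ValueError("expected a 25-mer")
    i = position - 1
    return perfect_match[:i] + COMPLEMENT[perfect_match[i]] + perfect_match[i + 1:]


pm = "ATGCATGCATGCATGCATGCATGCA"   # hypothetical perfect match probe
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

Placing the change in the middle of the probe is what makes the mismatch most disruptive to hybridization with the intended target.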
89
90
Probe Tiling Strategy
• Gene expression monitoring with oligonucleotide arrays. Expression probe and array design. Oligonucleotide probes are chosen based on uniqueness criteria and composition design rules. For eukaryotic organisms, probes are chosen typically from the 3´ end of the gene or transcript (nearer to the poly(A) tail) to reduce problems that may arise from the use of partially degraded mRNA. The use of the PM minus MM differences averaged across a set of probes greatly reduces the contribution of background and cross-hybridization and increases the quantitative accuracy and reproducibility of the measurements.
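The PM minus MM average described above can be sketched as below. The intensities are hypothetical, and this plain average omits the outlier handling that production software applies.

```python
def average_difference(pm_intensities, mm_intensities):
    """Average of (PM - MM) over the probe pairs of one probe set."""
    if not pm_intensities or len(pm_intensities) != len(mm_intensities):
        raise ValueError("need the same, non-zero number of PM and MM values")
    diffs = [pm - mm for pm, mm in zip(pm_intensities, mm_intensities)]
    return sum(diffs) / len(diffs)


pm = [1200, 950, 1100, 1300]   # hypothetical perfect match intensities
mm = [300, 250, 400, 350]      # hypothetical mismatch intensities
print(average_difference(pm, mm))
```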
91
PM / MM
Probe set
Probe pair
STROMBERG, 2003
92
Affymetrix Data
• Each gene labeled as “present”, “marginal”, or “absent.”
– Present: gene expressed and reliably detected in the RNA sample
• Label chosen based on a p-value
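A minimal sketch of a p-value based call. The two cutoffs (0.04 and 0.06) are assumptions used here for illustration; the slides do not give the thresholds, and the function name is hypothetical.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to present / marginal / absent.

    alpha1 and alpha2 are assumed cutoffs, not values from the slides.
    A small p-value means the PM signal reliably exceeds the MM signal.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p-value must be between 0 and 1")
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"


for p in [0.001, 0.05, 0.3]:
    print(p, detection_call(p))
```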
93
94
Why probe redundancy?
• Use of multiple independent detectors for the same molecule improves signal-to-noise ratios (due to averaging over the intensities of multiple array features), improves the accuracy of RNA quantification (averaging and outlier rejection), increases the dynamic range, mitigates effects due to cross-hybridization, and drastically reduces the rate of false positives and miscalls.
• An additional level of redundancy comes from the use of mismatch (MM) control probes that are identical to their perfect match (PM) partners except for a single base difference in a central position. The MM probes act as specificity controls that allow the direct subtraction of both background and cross-hybridization signals, and allow discrimination between ‘real’ signals and those due to non-specific or semi-specific hybridization (hybridization of the intended RNA molecules produces more signal for the PM probes than for the MM probes, resulting in consistent patterns that are highly unlikely to occur by chance).
95
96
97
98
Gene Expression Data
Gene expression data on p genes for n samples
Genes
mRNA samples
Gene expression level of gene i in mRNA sample j
= Log(Red intensity / Green intensity) for two-color arrays, or
= Log(Avg. PM - Avg. MM) for Affymetrix arrays

Gene   sample1   sample2   sample3   sample4   sample5   …
1       0.46      0.30      0.80      1.51      0.90     ...
2      -0.10      0.49      0.24      0.06      0.46     ...
3       0.15      0.74      0.04      0.10      0.20     ...
4      -0.45     -1.03     -0.79     -0.56     -0.32     ...
5      -0.06      1.06      1.35      1.09     -1.09     ...
99
100
101
What is gene expression?
Gene expression = degree of expression of a gene in a particular experiment (protein)
genes
Experiments (over time)
Baseline expression
Higher expression compared to baseline
Lower expression compared to baseline
Spellman et al Mol. Biol. Cell 1998
102
Looking at the transcriptome: microarrays
genes
conditions
condition 1 condition 2
condition 3
statistical processing and analysis
103
104
Microarrays yield information
Image: bioinfo.mbb.yale.edu/~mbg/ fun3/microarray-mona/
105
Are they important for clinical use?
High-throughput Technologies and ‘OMICS’ Science
106
[Figure: tumor tissue types: Adrenal Gland, Endometrium, Pancreas, Brain, Breast, Uterus, Esophagus, Gall Bladder, Kidney, Liver, Lung, Ovary, Skin, Bone, Stomach, Thyroid, Head & Neck, Prostate, Germ Cell, Soft Tissue, Lymph, Cervix, Bladder, GIST, Colon]
Why can gene expression profiles classify cancer types?
Cancers of different origins are derived from cells that pass through different developmental stages.
Expression profiles of cells coming from different developmental stages differ from each other.
107
Revolution of Breast Cancer Classification
DNA Chip Analysis
108
(Baselga and Norton, 2002)
Breast Cancer Classification
109
Sorlie et al., 2001
Breast Cancer Classification
110
Portrait of Breast Cancer
Sørlie et al. Proc Natl Acad Sci U S A. 2001 Sep 11;98(19):10869-10874.
Basal–like
HER-2
“Normal”
Luminal B
Luminal A
111
Subtypes of breast cancer identified by gene expression patterns (Sørlie et al., PNAS, 98: 10869-74, 2001)
Gene expression profiles provide classification of
the sub-types of cancers with different clinical
outcome.
Two ER-positive subgroups:
•Luminal A (best clinical outcome)
•Luminal B
Three ER-negative subgroups:
•“Normal” breast-like
•ERBB2+ (ERBB2 amplicon, high expression)
•Basal-like (worst clinical outcome)
112
Molecular Grading of Breast Cancer
Sotiriou C, et al.. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006 Feb 15;98(4):262-72.
• Gene expression profile data reveal two molecular grades in breast cancer.
• Histologic grade 2 cases are distributed between molecular grades 1 and 2.
• Molecular grading outperforms traditional prognostic factors such as ER/PR and HER2 in multivariate analyses.
113
Gene-based breast cancer test
MammaPrint Array
FDA Approved
MammaPrint is a DNA microarray-based test that measures the activity of 70 genes.
- The test measures the expression of each of these genes in a sample of a woman's breast cancer and uses a specific calculation to determine whether the risk of the cancer spreading to other sites is low or high.
- Helps guide decisions about who should be treated…
114
“MammaPrint is a DNA microarray-based test that measures the activity of 70 genes... The test measures each of these genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”
FDA Approves Gene-Based Breast Cancer Test*
115
CREATING A GENE SIGNATURE BY DNA MICROARRAY ANALYSIS - MammaPrint ARRAY
Primary breast tumors from 78 young lymph node-negative patients were used:
- The gene expression profiles of the 34 patients who developed distant metastases within 5 years were compared with those of the 44 patients who remained disease-free for 5 years.
- The analyses yielded a 70-gene expression set that allows breast tumors to be classified into good- or poor-prognosis groups.
116
Intra-operative Cancer Detection
“The rapid RT-PCR assay has found a breast cancer stem cell related mRNA signature in the sentinel
lymph node”