chapter 5 organization of human genome · 2012. 4. 13. · chapter 5 organization of human genome....

Chapter 5

Organization of Human Genome

Outline of this chapter

Organization of human genome

Gene structure

Human genome: total genetic information in human cells

Nuclear Genome

Mitochondrial Genome

Cell

Histones

DNA

Nuclear Genome: total DNA in nucleus

Mitochondrial Genome: total DNA in mitochondria

Nuclear Genome

3 × 10^9 base pairs

Distributed between 24 different types of linear double-stranded DNA molecules

130 Mb on average, but varying between 50 and 260 Mb

5.1 Organization of human genome

DNA content of human chromosomes

Chromosome  Amount of DNA (Mb)    Chromosome  Amount of DNA (Mb)
1           279 (30)              13          118 (16)
2           251 (3)               14          107 (16)
3           221 (3)               15          100 (17)
4           197 (3)               16          104 (15)
5           198 (3)               17          88 (3)
6           176 (3)               18          86 (3)
7           163 (3)               19          72 (3)
8           148 (3)               20          66 (3)
9           140 (22)              21          45 (11)
10          143 (3)               22          48 (13)
11          148 (3)               X           163 (3)
12          142 (3)               Y           51 (27)
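As a rough cross-check of the chromosome table, the per-chromosome values can be summed and averaged. This is only an arithmetic sketch: the dictionary name `dna_mb` is illustrative, the values are the main figures copied from the table, and the parenthesised values are left out.

```python
# Amount of DNA per chromosome in Mb, copied from the table above.
# The values in parentheses in the table are intentionally omitted.
dna_mb = {
    "1": 279, "2": 251, "3": 221, "4": 197, "5": 198, "6": 176,
    "7": 163, "8": 148, "9": 140, "10": 143, "11": 148, "12": 142,
    "13": 118, "14": 107, "15": 100, "16": 104, "17": 88, "18": 86,
    "19": 72, "20": 66, "21": 45, "22": 48, "X": 163, "Y": 51,
}

total_mb = sum(dna_mb.values())        # total over the 24 chromosome types
mean_mb = total_mb / len(dna_mb)       # average size of one chromosome
smallest = min(dna_mb, key=dna_mb.get)
largest = max(dna_mb, key=dna_mb.get)

print(f"{len(dna_mb)} chromosome types, total {total_mb} Mb, mean {mean_mb:.1f} Mb")
print(f"smallest: chr{smallest} ({dna_mb[smallest]} Mb), largest: chr{largest} ({dna_mb[largest]} Mb)")
```

The total and mean come out close to the rounded figures quoted for the nuclear genome (about 3 × 10^9 bp and about 130 Mb per chromosome).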

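Comparison of genome sizes across organisms (from prokaryotes to mammals).

Genome sizes of different species

Species             Genome size (Mb)
Escherichia coli    4.64
Brewer's yeast      12.1
Nematode            100
Fruit fly           140
Locust              5000
Mouse               3300
Pea                 4800
Maize               5000
Wheat               17000
Human               3000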

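C value paradox

The total amount of DNA in the haploid genome of an organism is called its C value.

Higher organisms carry out more complex life activities than lower organisms, so in theory their C values should also be higher. In fact, C values show no trend that tracks the evolutionary level of a species: a higher organism does not necessarily have a larger C value than a lower one. This contradiction in comparisons of total DNA content is called the C value paradox.

It shows up in two ways:

Genomic DNA content is far larger than the expected number of protein-coding genes would require.

Among some species whose complexity varies little, C values vary over a very wide range.

In lower eukaryotes, haploid genome DNA content correlates positively, to some extent, with morphological complexity. In higher eukaryotes it does not; their haploid genome DNA content varies unpredictably.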

According to their renaturation kinetics, eukaryotic DNA sequences can be divided into:

Unique sequences
Moderately repetitive sequences
Highly repetitive sequences

Some polyploid plants have no nonrepetitive DNA; even the slowest-renaturing fraction is present in three or more copies.

The crab genome has no moderately repetitive sequences, only unique sequences and highly repetitive sequences.

Lower eukaryotes have no highly repetitive sequences.

Denaturation and Renaturation

[Figure: double-stranded DNA (ATGAGCTGTACGATCGTG paired with TACTCGACATGCTAGCAC) is denatured into single-stranded DNA; the single strands renature to re-form double-stranded DNA.]
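Renaturation depends on the two single strands being complementary. A minimal sketch of that base-pairing rule, using the two strands shown in the denaturation figure (the helper name `complement` is an illustrative choice, not from the slides):

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
PAIRING = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand (same left-to-right order)."""
    return strand.translate(PAIRING)

top = "ATGAGCTGTACGATCGTG"      # upper strand from the figure
bottom = "TACTCGACATGCTAGCAC"   # lower strand from the figure

# The two strands can re-anneal because each is the complement of the other.
print(complement(top) == bottom)
```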


Unique sequences

the most common, 60% of human genome

including most of the protein-coding genes

copy number: single or several copies

Renaturation kinetics curves of mRNA show that most mRNA derives from nonrepetitive DNA and the rest from moderately repetitive DNA; no mRNA derives from highly repetitive DNA.


Repetitive sequences

Moderately repetitive sequences

30% of the human genome

copies: 10^2 to 10^5

contain rRNA genes, tRNA genes, histone genes, immunoglobulin heavy-chain and light-chain genes, and so on

include SINEs and LINEs



Short Interspersed Nuclear Element (SINE)

100 to 400 bp in length

copies: 10^5

for example: Alu repeat sequence



Alu repeat (moderately repetitive sequence)

the most common interspersed repeat in the human genome

300 bp in length, cut by Alu I into two fragments, 170 bp and 130 bp

Alu I recognition site:
5' ......AGCT...... 3'
3' ......TCGA...... 5'

found only in primates, while other classes of SINEs are common in other mammalian species.
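The Alu I digestion described above can be sketched in a few lines. Alu I recognises AGCT and cuts in the middle of the site (AG^CT), leaving blunt ends. The sequence below is a synthetic 300 bp placeholder with a single AGCT site, not a real Alu element, and the function name `digest` is illustrative.

```python
def digest(seq: str, site: str = "AGCT", cut_offset: int = 2) -> list[str]:
    """Cut seq at every occurrence of site, cut_offset bases into the site."""
    fragments = []
    start = 0
    pos = seq.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(seq[start:cut])
        start = cut
        pos = seq.find(site, pos + 1)
    fragments.append(seq[start:])
    return fragments

# Synthetic 300 bp stand-in (NOT a real Alu sequence): one AGCT site placed
# so that the cut falls 170 bp from the left end.
toy_alu = "A" * 168 + "AGCT" + "T" * 128

fragment_lengths = [len(f) for f in digest(toy_alu)]
print(fragment_lengths)
```

With one site positioned this way the digest yields two fragments whose lengths match the 170 bp and 130 bp quoted for the Alu repeat.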



Long Interspersed Nuclear Element (LINE)

about 6,000 bp (5,000~7,000 bp) in length

copies: 10^2 to 10^4

e.g. Kpn I family: 6.5 kb in length, cut by Kpn I into four fragments of 1.9 kb, 1.8 kb, 1.2 kb and 1.5 kb, respectively

Highly repetitive sequences

10% of human genome

length of repeat unit: <200 bp

copies: 10^6 to 10^8

types: satellite DNA and inverted repeat sequences

Satellite DNA: constitutes the centromeres and telomeres of human chromosomes and the constitutive heterochromatin regions on some chromosomes. Its function is not yet known.

minisatellite DNA: repeat unit 6 bp~25 bp in length

microsatellite DNA: repeat unit 2 bp~5 bp in length

[Figure: Satellite DNA is found in the constitutive heterochromatin region]

CsCl density gradient centrifugation usually separates mouse DNA into one main band and one satellite band.

Inverted Repeat Sequence

5' TAATCCCACAGCCGCCAGTTCCGCTGGCGGCATTT 3'
3' ATTAGGGTGTCGGCGGTCAAGGCGACCGCCGTAAA 5'
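In a repeat of this kind, one arm is the reverse complement of the other, so the same strand can fold back on itself. The sketch below searches the upper strand shown above for its longest such pair of arms. The function names and the `min_arm` and `max_loop` limits are illustrative assumptions, and the brute-force search is only meant for short sequences.

```python
PAIRING = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(PAIRING)[::-1]

def longest_inverted_repeat(seq: str, min_arm: int = 4, max_loop: int = 10):
    """Return (left_arm, loop, right_arm) for the longest arm pair, or None."""
    best = None
    n = len(seq)
    for i in range(n):
        for arm in range(min_arm, (n - i) // 2 + 1):
            left = seq[i:i + arm]
            target = reverse_complement(left)
            for loop in range(max_loop + 1):
                j = i + arm + loop
                # Keep only strictly longer arms, so the first longest hit wins.
                if seq[j:j + arm] == target and (best is None or arm > len(best[0])):
                    best = (left, seq[i + arm:j], target)
    return best

upper = "TAATCCCACAGCCGCCAGTTCCGCTGGCGGCATTT"
left_arm, loop, right_arm = longest_inverted_repeat(upper)
print(left_arm, loop, right_arm)
```

For this sequence the search finds two 8-base arms separated by a 5-base spacer.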

Multigene families

A group of functionally related genes formed by duplication and variation of an ancestral gene, repeated on the same or different chromosomes:
clustered gene family (gene cluster)
interspersed gene family

Clustered gene families
Growth hormone            5 copies (67 kb)
α-globin                  7 copies (50 kb)
Hox genes (multi)         38 copies, four clusters
Olfactory receptors       1000 copies in 25 large clusters

Interspersed gene families
Pax                       9 copies
Actin                     >20 copies
Alu elements (repeats)    1.1 million copies
LINE elements (L1)        200,000-500,000 copies

All globin genes derive from a single ancestral gene through repeated duplication, transposition and mutation.

[Figure: Formation of higher order repeat units]

Pseudogenes

• Nonfunctional copies of genes
• Formed by duplication of an ancestral gene, or by reverse transcription (and integration)
• Not expressed due to mutations that produce a stop codon (nonsense or frameshift) or prevent mRNA processing, or due to lack of regulatory sequences

Nonprocessed pseudogenes

Processed pseudogenes

Pseudogenes are common in multigene families such as the β-globin, HLA and immunoglobulin families.

Single-copy genes can also give rise to several pseudogenes; for example, the argininosuccinate synthetase (ASS) gene has four pseudogenes.

Pseudogenes are usually few, often only a small fraction of the total number of genes, but for the genes encoding mouse ribosomal components the ratio of active genes to pseudogenes is as high as 1:15.

Nuclear genome: 3000 Mb

Gene and gene-related sequences: 30%
    coding DNA: 10%
    noncoding DNA: 90% (pseudogenes, introns, flanking sequences, etc.)

Extragenic sequences: 70%
    unique: 80%
    repetitive: 20%
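The nested percentages in this breakdown are easier to compare once converted to absolute amounts. The short calculation below does only that arithmetic, starting from the 3000 Mb total and the 30/70, 10/90 and 80/20 splits; the variable names are illustrative.

```python
GENOME_MB = 3000

# First-level split of the nuclear genome.
genic_mb = GENOME_MB * 30 / 100        # gene and gene-related sequences
extragenic_mb = GENOME_MB * 70 / 100   # extragenic sequences

# Second-level splits.
coding_mb = genic_mb * 10 / 100              # coding DNA
noncoding_genic_mb = genic_mb * 90 / 100     # pseudogenes, introns, flanking, etc.
unique_extragenic_mb = extragenic_mb * 80 / 100
repetitive_extragenic_mb = extragenic_mb * 20 / 100

coding_share = coding_mb / GENOME_MB   # coding DNA as a fraction of the whole genome

print(f"coding DNA: {coding_mb:.0f} Mb ({coding_share:.0%} of the nuclear genome)")
print(f"extragenic repetitive DNA: {repetitive_extragenic_mb:.0f} Mb")
```

On these figures, coding DNA is 10% of 30%, that is about 3% of the nuclear genome.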

Mitochondrial Genome

– Small (16.5 kb) circular DNA
– rRNA, tRNA and protein encoding genes (37)
– 1 gene/0.45 kb
– Very few repeats
– No introns
– 93% coding
– No recombination
– Maternal inheritance

Mitochondrial genome: 16,569 bp, 37 genes

rRNA genes (2)
tRNA genes (22)
Polypeptide-encoding genes (13)
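The gene density quoted for the mitochondrial genome follows directly from its size and gene count. A small check, using only the numbers given on the slide (variable names are illustrative):

```python
MT_GENOME_BP = 16569
mt_genes = {"rRNA": 2, "tRNA": 22, "polypeptide-encoding": 13}

total_genes = sum(mt_genes.values())
kb_per_gene = MT_GENOME_BP / total_genes / 1000   # average spacing, in kb per gene

print(f"{total_genes} genes, about 1 gene per {kb_per_gene:.2f} kb")
```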

5.2 Gene Structure

Definition

A gene is a segment of DNA encoding a functional product: RNA or polypeptide.

Perhaps 30,000-40,000 genes in the human genome.

How can so few genes make a human?

Perhaps 50,000-60,000 genes in the rice genome.

How can so many genes make rice?

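Genome size & gene number

Gene number -> biological complexity?

• 1. Changes in gene number cannot explain the enormous variation in biological function, regulatory mechanisms, and species diversity and complexity.

• 2. Current explanation: diversity and complexity of the proteome -> diversity and complexity of species; ~10,000,000 kinds of protein molecules.

• 3. Two views:

– a. Post-transcriptional level: mRNA splicing produces splice isoforms.

– b. Protein level: post-translational modifications at one or more sites in the protein sequence.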

Genotype to Phenotype

Post-transcriptional level: mRNA splicing

[Figure: mRNA splicing produces isoform 1, isoform 2 and isoform 3]

Post-translational level: protein modification

[Figure: sumoylation, phosphorylation, palmitoylation, acetylation, ubiquitination]

Interaction network

Protein-protein interaction

Hybridization of mRNA and DNA

Eukaryotic genes are split genes, with very few exceptions such as the thrombomodulin (THBD) gene.

A “Simple” Eukaryotic Gene

Introns

Exons

Flanking sequences

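The overall size of eukaryotic genes varies widely, with a marked difference between yeast and higher eukaryotes.

Yeast genes average 1.4 kb in length, and only a few exceed 5 kb. In contrast, most genes in Drosophila and mammals are between 5 kb and 100 kb long, and only a few are shorter than 2 kb.

Yeast genes are usually small, whereas Drosophila and mammalian genes are widely scattered in size.

The overall organization of genes also differs among eukaryotes such as yeast, insects and mammals.

In Saccharomyces cerevisiae most genes (>96%) are uninterrupted, and the exon-containing portion is usually compact.

Most yeast genes are uninterrupted, but the great majority of Drosophila and mammalian genes are split (an uninterrupted gene has only one exon).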

Exons
Introns
Splicing junction
Flanking sequences
- Promoter
- Enhancer/silencer
- Terminator

Exons

Segment of a gene which is decoded to give a mature RNA product.

Individual exons may contain coding DNA or noncoding DNA (untranslated sequences, UTS).

Coding region

Nucleotides (open reading frame) encoding the amino acid sequence of a protein.

[Figure: the exons lie between the transcription start site and the transcription stop site. The coding sequence runs from the translation start site (ATG) to the translation stop site (TAA, TAG or TGA) and is flanked by UTS on both sides; AATAAA is marked near the 3' end.]
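The coding region described above, from the ATG start codon to the first in-frame TAA, TAG or TGA, can be located with a simple scan. This is a minimal sketch on a made-up sequence: it looks only at the first ATG on one strand and ignores introns, so it is not a gene-finding method. The function name `find_orf` is illustrative.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orf(dna: str):
    """Return the reading frame from the first ATG to the first in-frame stop codon.

    Returns None if there is no ATG or no in-frame stop codon after it.
    """
    start = dna.find("ATG")
    if start == -1:
        return None
    # Step through the sequence one codon (3 bases) at a time.
    for i in range(start, len(dna) - 2, 3):
        if dna[i:i + 3] in STOP_CODONS:
            return dna[start:i + 3]
    return None

# Toy sequence: 5' UTS "GGC", coding sequence ATG AAA TTT TAG, 3' UTS "CC".
toy_mrna = "GGCATGAAATTTTAGCC"
orf = find_orf(toy_mrna)
print(orf)
```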

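• Exons of human genes rarely exceed 800 bp, with a few exceptions: an exon of the factor VIII gene is about 3106 bp long, and an exon of the ApoB gene is about 7572 bp.

Protein-coding exons are usually short.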

Introns

Noncoding DNA which separates neighboring exons in a gene. During gene expression introns, like exons, are transcribed into RNA, but the transcribed intron sequences are subsequently removed by RNA splicing and are not present in mature mRNA.

[Figure: transcription of exons and introns from the transcription start site gives the primary RNA; processing removes the introns to give the mature RNA, with UTS at both ends.]

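Intron sizes differ greatly among vertebrate genes.

Exon sequences are conserved; intron sequences are variable

Exon regions are conserved because of the functional requirements of the encoded protein.

Introns evolve faster than exons. When genes from different species are compared, the exons are sometimes homologous while the introns have diverged greatly, to the point of sharing no related sequence.

The mutation rate is the same in exons and introns, but in exons purifying selection removes mutations more effectively.

Conserved exons can be used to isolate genes

Most of the main methods for identifying genes rest on the contrast between exon conservation and intron variability. A gene whose function is conserved across species should encode a protein sequence with two properties: it has an open reading frame, and it has related sequences in other species. These features can be used to isolate genes.

Zoo blot

☺ The two properties of conserved genes are used to detect the presence of a gene. First, Southern blot hybridization is carried out against genomic samples from different species; genomic DNA clones that give a positive hybridization signal may contain coding sequences that are highly conserved in evolution. The hybridizing sequences are then checked for open reading frames.

☺ With this approach, genes that are unfamiliar to us but have some function can be isolated and purified.

[Figure: result of hybridizing the zfy gene from the human Y chromosome, used as a probe, to the sex chromosomes of other animals]

The DMD gene was characterized through zoo blotting, cDNA hybridization, genomic hybridization and protein analysis.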

Splice junction (exon/intron boundary)

Splice donor site: the junction between the end of an exon and the start of the downstream intron, commencing with the dinucleotide GT.

Splice acceptor site: the junction between the end of an intron, terminating in the dinucleotide AG, and the start of the next exon.

Branch site: the third conserved intronic sequence that is known to be functionally important in splicing.

ATGAAAAGGAAAGTCCTTACTTTTCTTTTGTTTTGCAAATTAGAAAGCCGACCGAGCAAAGAGATTTGAATTTTTACTGAAGCAGACAGAACTTTTTGCACATTTCATTCAGCCTTCAGCACAGAAATCTCCAACATCTCCACTGAACATGAAATTGGGACGTCCCCGAATAAAGAAAGAACAGAAATCTCCAACATCTCCACTGAACATGAAATTGGGACGTCCCCGAATAAAGAAAGATGAAAAGCAGAGCTTAATTTCTGCTGGAGAGTATGTTGGCACCTCTCTCTCTACTTTCTT TCCTTTCTCCTACCTTTTCTTCCTCCTGTCCTCCCTTGATCCCTTCATGCACCCCTTCGC TCTTCATTGTTCAGTATACCTTCATGTGACAAAAAATATTGCCATTATAATTATGTTTTG AAGACAACTATATTTTTTTCTCACTAGAGGCTGATCAGTAAAAATGTAGGCTGGTTCTAC TGATTTCTAAGCAAGACCTTGGACAACTCATTCTTTTTCTATAAAAGATAATAGCCATGT ACACTGATTTAATTGATACCTTATCATTTAGGTCGAATATGAAGGGATTTCCTTTTTTAA TTTCAGCTACCGCCATAGGCGCACAGAGCAAGAAGAAGATGAAGAGCTACTGTCTGAGAGTTTCAGCTACCGCCATAGGCGCACAGAGCAAGAAGAAGATGAAGAGCTACTGTCTGAGAGTCGGAAAACATCTAATGTGTGTATTAGATTTGAGGTGTCACCTTCATGTAAGTACTTCAT CACATTGGTGAGTTCTTTTTCAATTTAGTTTTAGAAAAATTTTACTTGAGTATGTTAATG AAAGTATGAAATGTCCTTGCATTTTTTCACCAGATGTGAAAGGGGGGCCACTGAGAGATTATCAGATTCGAGGACTGAATTGGTTGATCTCTTTATATGAAAATGGAGTCAATGGCATTTTGGCTGATGAAATGGTAAGGAATTGGTAGCTAAAAACACATTCTCAGTTATCAATGATTT

Consensus sequences are conserved throughout eukaryotes.

Conservation of sequence is expected, since recognition of sequences is accomplished by base pairing with the RNA component of snRNPs.

[Figure: Secondary structure model of human U1 snRNP. The region where it recognizes the pre-mRNA is also shown.]
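The GT...AG rule for splice junctions can be illustrated on a toy transcript. In the sketch below the exon and intron sequences are invented, the intron coordinates are supplied by hand, and the functions `is_canonical_intron` and `splice` are illustrative helpers. The code checks only the boundary dinucleotides, not the branch site or the wider consensus recognized by snRNPs.

```python
def is_canonical_intron(intron: str) -> bool:
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

def splice(pre_mrna: str, intron_spans: list[tuple[int, int]]) -> str:
    """Remove the given (start, end) intron spans and join the exons."""
    exons = []
    pos = 0
    for start, end in sorted(intron_spans):
        intron = pre_mrna[start:end]
        if not is_canonical_intron(intron):
            raise ValueError(f"span {start}-{end} does not follow the GT...AG rule")
        exons.append(pre_mrna[pos:start])
        pos = end
    exons.append(pre_mrna[pos:])
    return "".join(exons)

# Invented example: two short exons separated by one GT...AG intron.
exon1, intron1, exon2 = "ATGGCC", "GTAAGTCTAACTTTTCAG", "GATTAA"
pre_mrna = exon1 + intron1 + exon2
mature = splice(pre_mrna, [(len(exon1), len(exon1) + len(intron1))])
print(mature)
```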

Flanking Sequences

• 5' untranscribed region. Signals for initiation and control of transcription
- Promoter

• Enhancer / Silencer
- Enhancer stimulates transcription
- Silencer inhibits transcription

• 3' untranscribed region. Signals for termination of transcription

Regulatory Sequences

Promoter/Proximal Elements
Occur within ~200 bp of the start site.
Contain up to ~20 bp.
Cell-type specific

Basal Promoter Analysis

Consensus sequence    Position    Binding factor
TATA(A/T)A(A/T)       -30         TBP
GG(C/T)CAATCT         -75         CTF/NF1
GGGCGG                -90         SP1

[Figure: TATA, CAAT and GC elements upstream of the +1 start site]

Promoter-Proximal Elements

TATA box
  Most common
  Highly transcribed genes
  25~35 base pairs upstream of start site
Initiator
  At start site
GC boxes (CpG islands)
  “Housekeeping” genes (transcribed at low rate)
  Within ~100 base pairs of start site

TATA box
  ~25 bp upstream of +1
  Only promoter element that is relatively fixed in relation to the start point
  Tends to be surrounded by GC-rich sequences
  Single base substitutions in TATA are strong promoter down mutations
  Some promoters do not contain TATA

Initiator
  Instead of a TATA box, some eukaryotic genes contain an alternative promoter element, called an initiator.
  The initiator is highly degenerate.

  5' Y Y A(+1) N T/A Y Y Y
  Y = pyrimidine (C or T), N = any
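Degenerate consensus sequences such as the initiator can be turned into regular expressions for scanning a sequence. The sketch below encodes the initiator as YYANWYYY and the TATA box as TATAWAW using standard IUPAC ambiguity letters (W = A or T). Reading the slide's TATA consensus that way is an interpretation, the test sequence is invented, and exact consensus matching is a simplification of how these elements are recognized in practice.

```python
import re

# IUPAC nucleotide codes needed for the two consensus sequences.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "Y": "[CT]", "W": "[AT]", "N": "[ACGT]"}

def consensus_to_regex(consensus: str) -> re.Pattern:
    """Translate an IUPAC consensus string into a compiled regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus))

INITIATOR = consensus_to_regex("YYANWYYY")   # the A is position +1
TATA_BOX = consensus_to_regex("TATAWAW")

def find_sites(pattern: re.Pattern, dna: str) -> list[int]:
    """Return 0-based start positions of non-overlapping matches."""
    return [m.start() for m in pattern.finditer(dna)]

# Invented sequence containing one match for each element.
toy_promoter = "GGTATAAAAGGCGCCATTCCCGG"
tata_hits = find_sites(TATA_BOX, toy_promoter)
inr_hits = find_sites(INITIATOR, toy_promoter)
print(tata_hits, inr_hits)
```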

CpG island

Genes coding for intermediary metabolism are transcribed at low rates, and do not contain a TATA box or initiator.
Most genes of this type contain a CG-rich stretch of 20-50 nt within ~100 bp upstream of the start site region.
A transcription factor called SP1 recognizes these CG-rich regions.
Gives multiple alternative mRNA start sites.

[Figure: a CpG island within ~100 bp upstream of multiple 5' mRNA start sites]

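Methods for studying the structure and function of eukaryotic promoters

Determining the position and length of the promoter: deletion experiments define the upstream boundary; deletion combined with recombination experiments defines the downstream boundary.

Once the position is known, point mutations are used to study the role of each base in the promoter.

Methods for identifying promoter DNA-binding proteins:

Yeast one-hybrid

Phage display

Electrophoretic mobility shift assay

DNase I footprinting

Electrophoretic mobility shift assay (EMSA)

A specialized gel electrophoresis technique for studying DNA-protein interactions in vitro. Principle: under the electric field in gel electrophoresis, a small DNA fragment migrates toward the anode faster than the same fragment bound to protein. A short double-stranded DNA fragment is therefore labeled and mixed with protein, and the mixture is run on a gel. If the target DNA binds a specific protein, its migration toward the anode is retarded, and autoradiography of the gel reveals the DNA-binding protein.

Because of its good specificity, EMSA is often used to verify results obtained by other screening methods.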

[Figure: EMSA (electrophoretic mobility shift assay) gels with lanes labeled Probe and NE. Source: A Single Nucleotide Polymorphism in the MDM2 Promoter Attenuates the p53 Tumor Suppressor Pathway and Accelerates Tumor Formation in Humans. Cell, Vol. 119, 591–602, November 24, 2004.]

Enhancers

Can be located several kb from the promoter
Can be present in either orientation relative to the promoter
Contain elements that bind inducible factors
Usually ~100-200 bp long, containing multiple 8- to 20-bp control elements
Targets for tissue-specific and/or temporal regulation

[Figure: Enhancer. Variable distance from promoter; either orientation; upstream or downstream of gene]

TERMINATION

• RNA polymerase meets the terminator
• Terminator sequence: AAUAAA
• RNA polymerase releases from DNA
• Prokaryotes: releases at termination signal
• Eukaryotes: releases 10-35 base pairs after termination signal

Termination

• Different mechanisms of termination
• Prokaryotes
  – rho-independent termination: formation of a hairpin structure
  – rho-dependent termination: external protein disrupts transcription
• Eukaryotes
  – cleavage of the RNA by an external protein

[Figure: Rho-independent terminator]

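[Figure: summary of gene structure. Flanking sequences (promoter, enhancer, terminator) lie outside the transcription start and stop sites. Between them, exons alternate with introns, each intron beginning with GT and ending with AG. The translation start and stop sites bound the coding sequence, with UTS on either side and AATAAA marked at the 3' end.]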

Distribution

• Different density of genes along a chromosome

• Different density of genes between chromosomes

(exon-intron-exon)n structure of various genes

histone: total = 400 bp; exon = 400 bp
β-globin: total = 1,660 bp; exons = 990 bp
HGPRT (HPRT): total = 42,830 bp; exons = 1,263 bp
factor VIII: total = ~186,000 bp; exons = ~9,000 bp

Gene product                 Size of gene (kb)   Number of exons   Average size of exon (bp)   Average size of intron (bp)
tRNAtyr                      0.1                 2                 50                          20
Insulin                      1.4                 3                 155                         480
β-Globin                     1.6                 3                 150                         490
Class I HLA                  3.5                 8                 187                         260
Serum albumin                18                  14                137                         1100
Type VII collagen            31                  118               77                          190
Complement C3                41                  29                122                         900
Phenylalanine hydroxylase    90                  26                96                          3500
Factor VIII                  186                 26                375                         7100
CFTR (cystic fibrosis)       250                 27                227                         9100
Dystrophin                   2400                79                180                         30,000
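The table makes the point that large genes are mostly intron. A rough estimate of the exonic share of each gene is (number of exons × average exon size) / gene size. The calculation below applies that estimate to a few rows of the table; because it uses the rounded averages given there, the results are approximate. The names `genes` and `exon_fraction` are illustrative.

```python
# (gene size in kb, number of exons, average exon size in bp), from the table.
genes = {
    "Insulin": (1.4, 3, 155),
    "β-Globin": (1.6, 3, 150),
    "Serum albumin": (18, 14, 137),
    "Factor VIII": (186, 26, 375),
    "Dystrophin": (2400, 79, 180),
}

def exon_fraction(name: str) -> float:
    """Approximate fraction of the gene occupied by exons."""
    size_kb, n_exons, mean_exon_bp = genes[name]
    return (n_exons * mean_exon_bp) / (size_kb * 1000)

for name in genes:
    print(f"{name:15s} {exon_fraction(name):6.1%} exonic")
```

On these numbers roughly a third of the insulin gene is exon, against well under 1% of the dystrophin gene.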

Genes

• Protein coding
• RNA genes
  – rRNA
  – tRNA
  – snRNA, ...

“Average” gene organization
• Single, unique genes consisting of exons interrupted by introns only

Other gene organizations
• Genes-within-genes
  – It is not uncommon that short genes are located inside an intron of another gene

Intron 26 of the NF1 gene contains three internal genes.

THE END
