

Proc. Natl. Acad. Sci. USA
Vol. 93, pp. 7557-7561, July 1996
Biochemistry

Complete structure of the human α-albumin gene, a new member of the serum albumin multigene family

(human chromosome 4q/α-fetoprotein/vitamin D-binding protein/tandemly linked genes)

HITOMI NISHIO AND ACHILLES DUGAICZYK*
Department of Biochemistry, University of California, Riverside, CA 92521

Communicated by Nicholas R. Cozzarelli, University of California, Berkeley, CA, April 1, 1996 (received for review January 22, 1996)

ABSTRACT The nucleotide sequence of the human α-albumin gene, including 887 bp of the 5'-flanking region and 1311 bp of the 3'-flanking region (24,454 bp in total), was determined from three overlapping λ phage clones. The sequence spans 22,256 bp from the cap site to the polyadenylylation site, revealing a gene structure of 15 exons separated by 14 introns. The methionine initiation codon ATG is within exon 1; the termination codon TGA is within exon 14. Exon 15 is entirely untranslated and contains the polyadenylylation signal AATAAA. The deduced polypeptide chain is composed of a 21-amino-acid leader peptide, followed by 578 amino acids of the mature protein. There are seven repetitive DNA elements (Alu and Kpn) in the introns and 3'-flanking region. The sizes of the 15 α-albumin exons match closely those of the albumin, α-fetoprotein, and vitamin D-binding protein genes. The exons are symmetrically placed within the three domains of the individual proteins, and they share a characteristic codon splitting pattern that is conserved among members of the gene family. The results provide strong evidence that α-albumin belongs to, and most likely completes, the serum albumin gene family. Based on structural similarity, α-albumin appears to be most closely related to α-fetoprotein. The complete structure of this family of four tandemly linked genes provides a well-characterized ~200 kb locus in the 4q subcentromeric region of the human genome.

Genes encoding the serum proteins albumin (ALB), α-fetoprotein (AFP), and vitamin D-binding protein (DBP; also known as group-specific component) have been known to constitute a gene family, based on their structural similarity at the amino acid and nucleotide levels. In addition to amino acid sequence similarity, members of the ALB family share a characteristic pattern of disulfide bridges in their polypeptide chains, resulting in a protein structure composed of three domains (1-5). At the gene level, the intron/exon splitting pattern of the human ALB, AFP, and DBP genes is practically the same (6-8), except for the loss of two exons from the DBP gene. Recently, a new serum protein, α-albumin (ALF), has been recognized from amino acid and cDNA sequence analysis to be a fourth member of this family (9, 10). ALF shares significant amino acid sequence similarity with AFP and albumin, including the characteristic pattern of potentially disulfide-bonded Cys residues. In the human, the genes for ALB, AFP, ALF, and DBP are located in the subcentromeric region of chromosome 4q, and they are tandemly linked in the order 5'-ALB-5'-AFP-5'-ALF-5'-DBP-centromere (11-14). This chromosomal region was involved in a series of pericentric inversions in primate phylogeny (15). In the chimpanzee, the ALB and AFP genes are found on the short arm of chromosome 3 (corresponding to human chromosome 4) (16). Although there were similar pericentric inversions in the gorilla and orangutan phylogenies, the ALB and AFP genes remain on the long arm of chromosome 3 of the gorilla and orangutan (17). Whether the ALF and DBP genes were included in these chromosomal inversions remains an interesting question.

The four genes are specifically expressed in the liver, with AFP and some ALB also being produced in the yolk sac. The differential expression of the four genes is developmentally regulated. The expression of the ALB gene starts in the fetal liver and is maintained in the adult at high levels. The expression of the AFP gene also starts in the fetal liver, but is turned off after birth. DBP production also starts in the fetal liver and continues in the adult. The expression of the ALF gene starts after birth, and the protein continues to be produced in the adult (9). The biological role of the newly discovered α-albumin remains to be elucidated.

In the present study, the entire nucleotide sequence of the human ALF gene was determined, and the gene structure was compared with those of the other genes belonging to the gene family. It is conceivable that ALF is the last gene of the albumin gene family; determination of its structure would thus complete the structural analysis of this important subcentromeric gene locus and provide a well-characterized anchor point for work on the complete structure of human chromosome 4.

MATERIALS AND METHODS

Screening of Genomic Library. A human chromosome 4-specific library in Charon 40 (ATCC LA04NL01) was screened by chromosome walking from the 3'-end of the λAFP 26 clone (7) and hybridization to a human ALF cDNA probe (9). Three overlapping clones (λ2, λ19, and λ15) were plaque-purified and characterized by restriction mapping and DNA sequencing.

DNA Sequence Determination. EcoRI, HindIII, XbaI, and SacI restriction fragments from clones λ2, λ9, and λ15 were subcloned into pBluescript (Stratagene) to generate multiple starting points for sequencing. Using the dideoxy thermal cycling methodology (Promega), ~30% of the sequence was determined from two external primers flanking the multiple cloning sites in pBluescript. The remaining sequence was determined from new, internal primers, extending previous sequence data. About 30% of the sequence was determined from both DNA strands; the remaining 70% of the sequence is based on unambiguous sequencing gels.

RESULTS AND DISCUSSION

DNA Sequence of the ALF Gene. The entire DNA sequence of the human ALF gene (24,454 bp), including the 5'- and 3'-flanking regions, was determined from three overlapping λ

Abbreviations: ALB, albumin; ALF, α-albumin; AFP, α-fetoprotein; DBP, vitamin D-binding protein.
Data deposition: The sequence reported in this paper has been deposited in the GenBank data base (accession no. U51243).
*To whom reprint requests should be addressed.

The publication costs of this article were defrayed in part by page charge payment. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. §1734 solely to indicate this fact.


FIG. 1. Map of the human α-albumin gene. The map shows the location of exons (boxes), introns (lines), EcoRI restriction sites, and repetitive DNA elements (Alu 1-5, Kpn 1-2). Untranslated portions of exons are indicated by open boxes, while coding sequences are indicated by filled boxes. Sizes of exons and introns are given in bp, and the position and orientation of repetitive DNA elements are indicated by arrows. EcoRI restriction maps of the three λ clones used to generate the sequence data are shown below the genomic map. The horizontal axis gives DNA length (kb), from -0.89 to 23.57.

phage clones (Figs. 1 and 2). The sequence is accessible from the GenBank data base under accession no. U51243. Overall, the sequence is A+T-rich (64.0%), similar to the other genes of this family. It starts at the EcoRI site at nucleotide position -887 and extends to the EcoRI site at position 23,567 in Fig. 1. The gene itself spans 22,256 bp from the putative transcription start point to the polyadenylylation site. The gene is composed of 15 exons, the positions of which were assigned by comparison with the ALF cDNA sequence (10). Exons 1 and 14 are partially untranslated, while the last exon is entirely untranslated. The consensus sequences of the exon/intron junction (exon/GTAAG for the donor site) and the intron/exon junction (CAG/exon for the acceptor site) conform well to known splice site junction consensus sequences (18).

The nucleotide sequence of the exons and the 3'-untranslated region is identical to the previously reported cDNA sequence (10). However, in the 5'-untranslated region, the genomic sequence we determined agrees with the cDNA sequence only up to the cytosine within the putative cap site (Fig. 2); further upstream, the genomic sequence differs completely from the cDNA data (10). This discrepancy in the 5'-untranslated region has been noted previously (19), and it could be due to difficulties in cloning the 5'-ends of mRNAs.
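As a consistency check on the lengths quoted above, the exon sizes (Fig. 4, ALF row) and the intron sizes labeled in Fig. 2 can be summed and compared with the reported 22,256-bp span from the cap site to the polyadenylylation site. The Python sketch below is illustrative only; the names `ALF_EXONS`, `ALF_INTRONS`, and `gene_span` are ours and do not come from the paper.

```python
# Exon and intron lengths (bp) of the human ALF gene, read from Figs. 2 and 4.
ALF_EXONS = [119, 49, 133, 212, 133, 98, 130, 215, 133, 98, 133, 224, 133, 61, 126]
ALF_INTRONS = [2077, 268, 1469, 893, 624, 808, 3111, 3215, 2219, 1364, 757, 1559, 1130, 765]


def gene_span(exons, introns):
    """Length from cap site to polyadenylylation site: all exons plus all introns."""
    if len(exons) != len(introns) + 1:
        raise ValueError("n exons must be separated by exactly n - 1 introns")
    return sum(exons) + sum(introns)


print(len(ALF_EXONS), "exons and", len(ALF_INTRONS), "introns")
print("summed exon length:", sum(ALF_EXONS), "bp")
print("gene span:", gene_span(ALF_EXONS, ALF_INTRONS), "bp")
```

The summed exon and intron lengths reproduce the 22,256-bp figure, which suggests that the individual sizes read from the figures are mutually consistent.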

Regulatory Motifs. In the 5'-flanking region, a TATAAAA sequence is located at positions -30 through -24, relative to the putative cap site (Fig. 2). Although a CCAAT box is found in the ALB gene and an equivalent CCAAC sequence in the AFP gene, this element is not found in either the ALF gene or the DBP gene. The promoter contains conserved sequences that correspond to the hepatocyte nuclear factor 1-binding site at positions -62 through -50 (5'-GTTACTTTTTTAC) and at positions -137 through -125 (5'-GTTAATATTTAGC). Two consensus hepatocyte nuclear factor 1-binding sites are present at orthologous positions in the AFP promoter region, while there is only one in the ALB promoter. In the AFP promoter region, the consensus glucocorticoid receptor-binding site is present upstream of the distal hepatocyte nuclear factor 1-binding site, functioning to repress the expression of the AFP gene. Although the 5'-region of the distal hepatocyte nuclear factor 1-binding site in the ALF gene is flanked by a stretch of polypyrimidines, which is also observed in the 5'-region of the glucocorticoid receptor-binding site in the AFP promoter, there is no glucocorticoid receptor site in the ALF promoter.

Intron/Exon Splicing Pattern and S-S Disulfide Bridges. The ALF gene is composed of 15 exons and has the same intron/exon splicing pattern that was found for the ALB (6), AFP (7), and DBP (8) genes (Figs. 3 and 4). Except for the first and last introns, the remaining 12 introns split the coding sequence in an alternating pattern: within a codon, between codon positions 2 and 3 (phase 2), and between two codons (phase 0). This is the most characteristic feature in the structure of all members of the albumin gene family. Although the DBP gene lost two internal exons (exons 12 and 13), the remaining DBP gene structure conforms to the same intron/exon splitting pattern that is found in the other members of this family. At the amino acid level, each protein forms a three-domain structure, and each domain is composed of several polypeptide loops. The sizes of the loops are determined by the same pattern of symmetrically placed disulfide bonds (6-8, 10). These characteristic features are the most conserved in the gene and protein structures of members of this gene family. The four genes are also tandemly linked, 5'-ALB-5'-AFP-5'-ALF-5'-DBP-centromere, in the subcentromeric region of human chromosome 4q, and the intergenic distances between them are similar in size (20-40 kb) to the sizes of the individual genes (14). This genomic arrangement seems to leave little room for yet another gene to be accommodated in this cluster. Thus, the presently determined ALF gene structure most likely completes the structural analysis of this multigene family, unless there are other members upstream of ALB or downstream of DBP. The combined data support the notion that DBP was the first gene to diverge from a common progenitor, while AFP and ALF arose from a recent gene duplication.
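The alternating phase pattern described above can be recomputed from the exon lengths alone: the phase of an intron is the number of coding bases that precede it, modulo 3. The Python sketch below is a minimal illustration under two assumptions that are inferred from the numbering in Fig. 2 rather than stated in the text: a 31-bp 5'-untranslated stretch in exon 1, and a coding sequence of 1800 bp (21 + 578 codons plus the stop codon). The function name `intron_phases` is hypothetical.

```python
# Exon lengths (bp) of the human ALF gene (Fig. 4, ALF row).
ALF_EXONS = [119, 49, 133, 212, 133, 98, 130, 215, 133, 98, 133, 224, 133, 61, 126]

# ASSUMPTIONS inferred from the numbering in Fig. 2, not stated in the text:
FIVE_PRIME_UTR = 31                  # untranslated bases of exon 1 preceding the ATG
CODING_LENGTH = 3 * (21 + 578) + 3   # leader + mature protein + stop codon = 1800 bp


def intron_phases(exons, utr5, coding_length):
    """Return the phase (0, 1 or 2) of each intron that interrupts the coding sequence.

    Phase = number of coding bases accumulated before the intron, modulo 3.
    Introns lying entirely in untranslated regions are skipped.
    """
    phases = []
    coding_so_far = -utr5              # stays <= 0 while still inside the 5'-UTR
    for exon in exons[:-1]:            # an intron follows every exon except the last
        coding_so_far += exon
        if 0 < coding_so_far < coding_length:
            phases.append(coding_so_far % 3)
    return phases


print(intron_phases(ALF_EXONS, FIVE_PRIME_UTR, CODING_LENGTH))
```

Under these assumptions, intron 1 comes out as phase 1 and introns 2 through 13 alternate between phase 2 and phase 0, while intron 14 is skipped because it lies in the 3'-untranslated region.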

Repetitive DNA Elements. The human ALF gene was invaded by multiple repetitive DNA elements (Fig. 1), sharing this fate with the other three members of the gene family (6-8). There are five Alu repeats within the introns of the gene. Alu 1 (414 bp) is located 25 bp downstream from exon 3 and is flanked by perfect 17-bp terminal repeats. Alu 2 (188


[Figure 2: annotated sequence listing, running from position -887 to position 23,567. Labeled features: cap site; leader peptide; intron 1 (2077 bp); intron 2 (268 bp); intron 3 (1469 bp); intron 4 (893 bp); intron 5 (624 bp); intron 6 (808 bp); intron 7 (3111 bp); intron 8 (3215 bp); intron 9 (2219 bp); intron 10 (1364 bp); intron 11 (757 bp); intron 12 (1559 bp); intron 13 (1130 bp); intron 14 (765 bp); exon 15 (126 bp, untranslated). The encoded polypeptide is numbered through residue 599, followed by the termination codon.]

FIG. 2. Nucleotide sequence of the human α-albumin gene. Only prominent features of the sequence are shown, such as 5'- and 3'-flanking regions, regulatory motifs, the amino acid sequence encoded by exons, and intron/exon junctions. The entire sequence of 24,454 bp, spanning map positions -0.89 kb through 23.57 kb in Fig. 1, has been deposited in the GenBank data base under accession no. U51243. Numbering of amino acids and nucleotides within the entire sequence is shown to the right of each line, with positive numbers for nucleotides starting at the putative cap site. Not all intron sequences are shown, to conserve space; gaps in their presentation (....) were introduced, and the sizes of these gaps are indicated in bp. EcoRI restriction sites (double underline) and a polyadenylylation signal (open circles) are shown below the sequence; a transcription start (cap) and a polyadenylylation site are indicated above the sequence.
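The coordinates in this caption can be cross-checked against the flanking-region lengths given in the abstract. The Python sketch below assumes a numbering scheme with no position 0 (the cap site is +1 and the base before it is -1), which is our reading of the caption and is consistent with the reported totals; `region_length` and the constant names are hypothetical.

```python
FIRST_POS = -887      # EcoRI site at the 5' end of the sequenced region
LAST_POS = 23567      # EcoRI site at the 3' end of the sequenced region
GENE_SPAN = 22256     # cap site (+1) through the polyadenylylation site


def region_length(first, last):
    """Bases from `first` to `last` inclusive, on a scale that has no position 0."""
    if first < 0 < last:
        return last - first        # e.g. -2..+2 covers -2, -1, +1, +2 = 4 bases
    return last - first + 1


total = region_length(FIRST_POS, LAST_POS)
five_prime_flank = region_length(FIRST_POS, -1)
three_prime_flank = LAST_POS - GENE_SPAN

print("total sequenced:", total, "bp")
print("5'-flank:", five_prime_flank, "bp; 3'-flank:", three_prime_flank, "bp")
```

With this convention the total comes to 24,454 bp, split into 887 bp of 5'-flank, 22,256 bp of gene, and 1311 bp of 3'-flank, matching the values quoted in the abstract.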


[Figure 3: schematic of the polypeptide chain from the NH2 terminus to the COOH terminus, drawn as three domains (Domain 1, Domain 2, Domain 3).]

FIG. 3. Positions of intervening sequences within the structure of human α-albumin. Except for intron 1, the remaining 12 introns split the polypeptide chain in an alternating way at exactly equivalent positions within each domain. The positions at which individual codons are split by introns are shown by numbers above relevant amino acids. The intron positions yield a highly symmetrical arrangement in the polypeptide chain between each domain. The predicted amino acid sequence and the disulfide bonds are displayed according to the model of Brown (1).


[Figure 4: exon lengths (bp) by exon number, with the termination codon of each gene. Each gene starts with ATG in exon 1 and ends with a poly(A) site after exon 15. The DBP row is aligned on the basis of the loss of exons 12 and 13 described in the text.]

Gene  Stop    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
ALB   TAA   118   58  133  212  133   98  130  215  133   98  139  224  133   68  163
AFP   TAA   129   52  133  212  133   98  130  215  133   98  139  224  133   55  145
ALF   TGA   119   49  133  212  133   98  130  215  133   98  133  224  133   61  126
DBP   TAG   119   70  133  212  133   95  130  203  130   98  133    -    -   55  175

FIG. 4. Comparison of the intron/exon organization of the human serum albumin (ALB), α-fetoprotein (AFP), α-albumin (ALF), and vitamin D-binding protein (DBP) genes. Exons for each gene are indicated by boxes; their lengths in bp are below the boxes. Coding exons are in black, noncoding regions in white. Introns (not drawn to scale) are shown as heavy lines; their sizes (in bp) are within the line. Phases of introns are indicated in the broken line connecting the exon boxes. Phase 1, an intron between the first and the second nucleotide within a codon; phase 2, an intron between the second and the third nucleotide within a codon; and phase 0, an intron between two codons. The translation initiation and termination codons of each gene are shown in exons 1 and 14, respectively. Exon numbers are on the bottom line.
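The statement that exon sizes match closely across the family can be made concrete by counting the exons whose lengths are identical between pairs of genes. The Python sketch below uses the exon lengths read from Fig. 4 for the three 15-exon genes; DBP is left out because aligning its 13 exons requires an extra assumption about which exons were lost. The names `EXONS` and `identical_exons` are ours.

```python
# Exon lengths (bp) read from Fig. 4 for the three genes with 15 exons.
EXONS = {
    "ALB": [118, 58, 133, 212, 133, 98, 130, 215, 133, 98, 139, 224, 133, 68, 163],
    "AFP": [129, 52, 133, 212, 133, 98, 130, 215, 133, 98, 139, 224, 133, 55, 145],
    "ALF": [119, 49, 133, 212, 133, 98, 130, 215, 133, 98, 133, 224, 133, 61, 126],
}


def identical_exons(gene_a, gene_b):
    """Return the 1-based numbers of exons that have the same length in both genes."""
    pairs = zip(EXONS[gene_a], EXONS[gene_b])
    return [number for number, (a, b) in enumerate(pairs, start=1) if a == b]


for a, b in [("ALF", "AFP"), ("ALF", "ALB"), ("ALB", "AFP")]:
    shared = identical_exons(a, b)
    print(f"{a} vs {b}: {len(shared)} of 15 exons identical in length -> {shared}")
```

In this comparison the internal exons 3 through 10, 12, and 13 have the same length in all three genes, and the differences are confined to the partly untranslated terminal exons (1, 2, 14, 15) and exon 11. Length identity alone does not rank the relationships between genes; the closer relationship of ALF to AFP is argued in the paper on structural grounds.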

bp), Alu 3 (220 bp), Alu 4 (139 bp), and Alu 5 (276 bp) are incomplete 3'-portions of the full-size Alu element. Alu 3 is flanked by 8-bp terminal repeats with one mismatch, and Alu 4 is flanked by 14-bp terminal repeats with three mismatches. The Kpn 1 (258 bp) element in intron 7 is a partial 3'-end sequence of the full-size Kpn repeat, and it is flanked by 12-bp terminal repeats with one mismatch. The Kpn 2 (248 bp) element in the 3'-flanking region of the gene also represents the 3'-portion of full-size Kpn; it is flanked by 7-bp terminal repeats with one mismatch. In addition, there is a transposon-like human element ~6 kb upstream of exon 1 of the ALF gene, which is beyond the sequence reported in the present work. Considering that these repeats are primate-specific, the timing of their emergence may provide reliable synapomorphic markers, capable of deciphering branching orders in primate phylogeny. The completion of the structural analysis of the albumin gene family should also provide a well-characterized locus in the subcentromeric region of human chromosome 4 for future genetic studies on the human genome.

This work was supported by Grant CA-R-BCH-5820-H from the University of California.

1. Brown, J. R. (1976) Fed. Proc. 35, 2141-2144.
2. Law, S. W. & Dugaiczyk, A. (1981) Nature (London) 291, 201-205.
3. Jagodzinski, L. L., Sargent, T. D., Yang, M., Glackin, C. & Bonner, J. (1981) Proc. Natl. Acad. Sci. USA 78, 3521-3525.
4. Cooke, N. E. & David, E. V. (1985) J. Clin. Invest. 76, 2420-2424.
5. Yang, F., Brune, J. L., Naylor, S. L., Cupples, R. L., Naberhaus, K. H. & Bowman, B. H. (1985) Proc. Natl. Acad. Sci. USA 82, 7994-7998.
6. Minghetti, P. P., Ruffner, D. E., Kuang, W. J., Dennison, O. E., Hawkins, J. W., Beattie, W. G. & Dugaiczyk, A. (1986) J. Biol. Chem. 261, 6747-6757.
7. Gibbs, P. E. M., Zielinski, R., Boyd, C. & Dugaiczyk, A. (1987) Biochemistry 26, 1332-1343.
8. Witke, W. F., Gibbs, P. E. M., Zielinski, R., Yang, F., Bowman, B. H. & Dugaiczyk, A. (1993) Genomics 16, 751-754.
9. Belanger, L., Roy, S. & Allard, D. (1994) J. Biol. Chem. 269, 5481-5484.
10. Lichenstein, H. S., Lyons, D. E., Wurfel, M. M., Johnson, D. A., McGinley, M. D., Leidli, J. C., Trollinger, D. B., Mayer, J. P., Wright, S. D. & Zukowski, M. M. (1994) J. Biol. Chem. 269, 18149-18154.
11. Harper, M. E. & Dugaiczyk, A. (1983) Am. J. Hum. Genet. 35, 565-572.
12. Cooke, N. E., Willard, H. F., David, E. V. & George, D. L. (1986) Hum. Genet. 73, 225-229.
13. Urano, Y., Sakai, M., Watanabe, K. & Tamaoki, T. (1984) Gene 32, 255-261.
14. Nishio, H., Heiskanen, M., Palotie, A., Belanger, L. & Dugaiczyk, A. (1996) J. Mol. Biol. 259, 113-119.
15. Yunis, J. J. & Prakash, O. (1982) Science 215, 1525-1530.
16. Magenis, R. E., Sheehy, R., Dugaiczyk, A. & Gibbs, P. E. M. (1987) Cytogenet. Cell Genet. 46, 654.
17. Magenis, R. E., Luo, X. Y., Dugaiczyk, A., Ryan, S. C. & Oesterhuis, J. E. (1989) Cytogenet. Cell Genet. 51, 1037.
18. Mount, S. M. (1982) Nucleic Acids Res. 10, 459-472.
19. Allard, D., Gilbert, S., Lamontagne, A., Hamel, D. & Belanger, L. (1995) Gene 153, 287-288.
