Human Genome Project

Atul Nag and Rohit Farmer Allahabad Agricultural Institute – Deemed University 6/11/2007 Understanding ……….

Primer

What is a Genome?

A genome is an organism's complete set of deoxyribonucleic acid (DNA), a chemical compound that contains the genetic instructions needed to develop and direct the activities of every organism. DNA molecules are made of two twisting, paired strands. Each strand is made of four chemical units, called nucleotide bases. The bases are adenine (A), thymine (T), guanine (G) and cytosine (C). Bases on opposite strands pair specifically; an A always pairs with a T, and a C always with a G.
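The pairing rule above (A with T, C with G) means that one strand fully determines the other. A minimal Python sketch of that rule follows; the function names are illustrative, and the reversal step assumes the standard convention that the two strands run in opposite directions, which the text above does not spell out.

```python
# Pairing rules described above: A pairs with T, and C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the partner base for each base, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, read in its own direction.

    Assumption: the strands are antiparallel, so the partner strand is
    conventionally written reversed.
    """
    return complement(strand)[::-1]


print(complement("ATGC"))          # TACG
print(reverse_complement("ATGC"))  # GCAT
```

Because of this redundancy, a sequence is normally written for one strand only.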

The human genome contains approximately 3 billion of these base pairs, which reside in the 23 pairs of chromosomes within the nucleus of all our cells. Each chromosome contains hundreds to thousands of genes, which carry the instructions for making proteins. Each of the estimated 30,000 genes in the human genome makes an average of three proteins.

What is genome sequencing?

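Genome sequencing is the process of determining the exact order of the nucleotide bases (A, T, G and C) in an organism's DNA. By itself, a sequence does not tell us a whole lot. Genome sequencing is often compared to "decoding," but a sequence is still very much in code. In a sense, a genome sequence is simply a very long string of letters in a mysterious language.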

When you read a sentence, the meaning is not just in the sequence of the letters. It is also in the words those letters make and in the grammar of the language. Similarly, the human genome is more than just its sequence.

Imagine the genome as a book written without capitalization or punctuation, without breaks between words, sentences, or paragraphs, and with strings of nonsense letters scattered between and even within sentences. A passage from such a book in English might look like this:

Even in a familiar language it is difficult to pick out the meaning of the passage: The quick brown fox jumped over the lazy dog. The dog lay quietly dreaming of dinner. And the genome is "written" in a far less familiar language, multiplying the difficulties involved in reading it.
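The book analogy can be reproduced with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the "nonsense" letters and their positions are arbitrary choices, not taken from any real sequence.

```python
import re


def run_together(text: str) -> str:
    """Lowercase the text and drop everything except letters."""
    return re.sub(r"[^a-z]", "", text.lower())


def scatter_junk(text: str, junk: dict) -> str:
    """Insert each junk string just before the character at its index."""
    out = []
    for i, ch in enumerate(text):
        out.append(junk.get(i, ""))
        out.append(ch)
    return "".join(out)


passage = ("The quick brown fox jumped over the lazy dog. "
           "The dog lay quietly dreaming of dinner.")
plain = run_together(passage)

# Junk letters and positions are arbitrary, chosen only for illustration:
# one run before the text, one inside a sentence, one between sentences.
print(scatter_junk(plain, {0: "xkq", 16: "zzv", 36: "pwl"}))
```

Reading the printed string back into two sentences is the small-scale version of what genome annotation has to do.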

So sequencing the genome doesn't immediately lay open the genetic secrets of an entire species. Even with a rough draft of the human genome sequence in hand, much work remains to be done. Scientists still have to translate those strings of letters into an understanding of how the genome works: what the various genes that make up the genome do, how different genes are related, and how the various parts of the genome are coordinated. That is, they have to figure out what those letters of the genome sequence mean.

Why is genome sequencing so important?

Sequencing the genome is an important step towards understanding it.

At the very least, the genome sequence will represent a valuable shortcut, helping scientists find genes much more easily and quickly. A genome sequence does contain some clues about where genes are, even though scientists are just learning to interpret these clues.

Scientists also hope that being able to study the entire genome sequence will help them understand how the genome as a whole works—how genes work together to direct the growth, development and maintenance of an entire organism.

Finally, genes account for less than 25 percent of the DNA in the genome, and so knowing the entire genome sequence will help scientists study the parts of the genome outside the genes. This includes the regulatory regions that control how genes are turned on and off, as well as long stretches of "nonsense" or "junk" DNA, so called because we don't yet know what, if anything, it does.

How do you sequence a genome?

The quick answer to this question is: in pieces. The whole genome can't be sequenced all at once because available methods of DNA sequencing can only handle short stretches of DNA at a time.

So instead, scientists must break the genome into small pieces, sequence the pieces, and then reassemble them in the proper order to arrive at the sequence of the whole genome. Much of the work involved in sequencing lies in putting together this giant biological jigsaw puzzle.

There are two approaches to the task of cutting up the genome and putting it back together again. One strategy, known as the "clone-by-clone" approach, involves first breaking the genome up into relatively large chunks, called clones, about 150,000 base pairs (bp) long. Scientists use genome mapping techniques to figure out where in the genome each clone belongs. Next they cut each clone into smaller, overlapping pieces the right size for sequencing, about 500 bp each. Finally, they sequence the pieces and use the overlaps to reconstruct the sequence of the whole clone. The other strategy, called the "whole-genome shotgun" method, involves breaking the genome up into small pieces, sequencing the pieces, and reassembling the pieces into the full genome sequence.
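To make "use the overlaps to reconstruct the sequence" concrete, here is a toy sketch of overlap-based assembly in Python. It is a simple greedy merge written for illustration, not the algorithm used by any real assembler; the function names, the minimum overlap of 3 bases, and the example reads are all assumptions. It ignores sequencing errors, repeats, reads contained inside other reads, and reads from the opposite strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the two reads that share the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # nothing overlaps any more: leave separate pieces
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical overlapping reads from an 11-base toy sequence.
print(greedy_assemble(["ATGCGT", "GCGTAC", "TACGGA"]))  # ['ATGCGTACGGA']
```

With real data the same idea has to cope with millions of reads and with repeated sequence, which is why the reassembly step dominates the effort.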

Each of these approaches has advantages and disadvantages. The clone-by-clone method is reliable but slow, and the mapping step can be especially time-consuming. By contrast, the whole-genome shotgun method is potentially very fast, but it can be extremely difficult to put together so many tiny pieces of sequence all at once.

Both approaches have already been used to sequence whole genomes. The whole-genome shotgun method was used to sequence the genome of the bacterium Haemophilus influenzae, while the genome of baker's yeast, Saccharomyces cerevisiae, was sequenced with a clone-by-clone method. Sequencing the human genome was done using both approaches.

A Brief Historical Overview

Though surprising to many, the Human Genome Project (HGP) traces its roots to an initiative in the U.S. Department of Energy (DOE). Since 1947, DOE and its predecessor agencies have been charged by Congress with developing new energy resources and technologies and pursuing a deeper understanding of potential health and environmental risks posed by their production and use. Such studies, for example, have provided the scientific basis for individual risk assessments of nuclear medicine technologies. In 1986, DOE took a bold step in announcing the Human Genome Initiative, convinced that its missions would be well served by a reference human genome sequence. Shortly thereafter, DOE joined with the National Institutes of Health to develop a plan for a joint HGP that officially began in 1990. During the early years of the HGP, the Wellcome Trust, a private charitable institution in the United Kingdom, joined the effort as a major partner. Important contributions also came from other collaborators around the world, including Japan, France, Germany, and China.

In 1976, the genome of the virus bacteriophage MS2 became the first complete genome to be determined, by Walter Fiers and his team at the University of Ghent (Ghent, Belgium). The idea for the shotgun technique came from the use of an algorithm that combined sequence information from many small fragments of DNA to reconstruct a genome. This technique was pioneered by Frederick Sanger to sequence the genome of phage Φ-X174, a bacteriophage (a tiny virus that infects bacteria) whose genome, completed in 1977, was the first DNA genome to be fully sequenced. The technique was called shotgun sequencing because the genome was broken into millions of pieces as if it had been blasted with a shotgun. In order to scale up the method, both the sequencing and genome assembly had to be automated, as they were in the 1980s.

Those techniques were shown to be applicable to sequencing the first free-living bacterial genome, that of Haemophilus influenzae (1.8 million base pairs), in 1995, and the first animal genome (~100 Mbp). This work relied on automated sequencers, longer individual reads (approximately 500 base pairs at that time), and paired reads separated by a fixed distance of around 2,000 base pairs. These were critical elements enabling the development of the first genome assembly programs for reconstructing large regions of genomes (known as 'contigs').

Three years later, in 1998, the announcement by the newly formed Celera Genomics that it would scale up the shotgun sequencing method to the human genome was greeted with skepticism in some circles. The shotgun technique breaks the DNA into fragments of various sizes, ranging from 2,000 to 300,000 base pairs in length, forming what is called a DNA "library". Using an automated DNA sequencer, the DNA is read in 800 bp lengths from both ends of each fragment. Using a complex genome assembly algorithm and a supercomputer, the pieces are combined and the genome is reconstructed from the millions of short, 800 base pair reads.

The success of both the publicly and privately funded efforts hinged upon a new, more highly automated capillary DNA sequencing machine, called the Applied Biosystems 3700, that ran the DNA sequences through an extremely fine capillary tube rather than a flat gel. Even more critical was the development of a new, larger-scale genome assembly program, which could handle the 30-50 million sequences that would be required to sequence the entire human genome with this method. At the time, such a program did not exist. One of the first major projects at Celera Genomics was the development of this assembler, which was written in parallel with the construction of a large, highly automated genome sequencing factory. The first version of this assembler was demonstrated in 2000, when the Celera team joined forces with Professor Gerald Rubin to sequence the fruit fly Drosophila melanogaster using the whole-genome shotgun method. At 130 million base pairs, it was at least 10 times larger than any genome previously shotgun assembled. One year later, the Celera team published their assembly of the three thousand million base pair human genome.
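The figures quoted above (30-50 million reads of about 800 bp against a genome of about 3 billion bp) imply that each base is read several times over. The short calculation below makes that explicit. "Fold coverage" is the usual name for this ratio, but the term and the function name are introduced here for illustration and do not come from the text.

```python
def fold_coverage(n_reads: int, read_len_bp: int, genome_bp: int) -> float:
    """Average number of times each base is read: total bases read / genome size."""
    return n_reads * read_len_bp / genome_bp


GENOME_BP = 3_000_000_000   # about 3 billion base pairs, as stated above
READ_LEN_BP = 800           # read length stated above

for n_reads in (30_000_000, 50_000_000):
    coverage = fold_coverage(n_reads, READ_LEN_BP, GENOME_BP)
    print(f"{n_reads:,} reads -> about {coverage:.1f}x coverage")
```

Under these round numbers the range works out to roughly 8x to 13x, which is the redundancy the assembler relies on to find overlaps between reads.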

Ambitious Goals

The HGP’s ultimate goal was to generate a high-quality reference DNA sequence for the human genome’s 3 billion base pairs and to identify all human genes. Other important goals included sequencing the genomes of model organisms to interpret human DNA, enhancing computational resources to support future research and commercial applications, exploring gene function through mouse-human comparisons, studying human variation, and training future scientists in genomics.

The powerful analytical technology and data arising from the HGP present complex ethical and policy issues for individuals and society. These challenges include privacy, fairness in use and access of genomic information, reproductive and clinical issues, and commercialization. Programs that identify and address these implications have been an integral part of the HGP and have become a model for bioethics programs worldwide.

A Lasting Legacy

In June 2000, to much excitement and fanfare, scientists announced the completion of the first working draft of the entire human genome. First analyses of the details appeared in the February 2001 issues of the journals Nature and Science. The high-quality reference sequence was completed in April 2003, marking the end of the Human Genome Project 2 years ahead of the original schedule. Coincidentally, it also was the 50th anniversary of Watson and Crick’s publication of DNA structure that launched the era of molecular biology. Available to researchers worldwide, the human genome reference sequence provides a magnificent and unprecedented biological resource that will serve throughout the century as a basis for research and discovery and, ultimately, myriad practical applications. The sequence already is having an impact on finding genes associated with human disease. Hundreds of other genome sequence projects (on microbes, plants, and animals) have been completed since the inception of the HGP, and these data now enable detailed comparisons among organisms, including humans.

Insights About the Human Genome

The first panoramic views of the human genetic landscape have revealed a wealth of information and some early surprises. Much remains to be deciphered in this vast trove of information; as the consortium of HGP scientists concluded in their seminal paper, “. . .the more we learn about the human genome, the more there is to explore.” A few highlights follow from the first publications analyzing the sequence.

1. The genome is the complete list of coded instructions needed to make a person.
2. The 4 letters in the DNA alphabet A, C, G and T are used to carry the instructions for making all organisms. The order (or sequence) of these letters holds the code just like the order of letters that makes words mean something. Each set of three letters corresponds to a single amino acid.
3. The information would fill a stack of paperback books 200 feet high.
4. The information would fill two hundred 500-page telephone directories.
5. Between humans, our DNA differs by only 0.2%, or 1 in 500 bases (letters). (This takes into account that human cells have two copies of the genome.)
6. If we recited the genome at one letter per second for 24 hours a day it would take a century to recite the book of life.
7. If two different people started reciting their individual books at a rate of one letter per second, it would take nearly eight and a half minutes (500 seconds) before they reached a difference.
8. A typist typing at 60 words per minute (around 360 letters) for 8 hours a day would take around 50 years to type the book of life.
9. Our DNA is 98% identical to that of chimpanzees.

10. The estimated number of genes in both humans and mice is 60,000-100,000; in the roundworm (C. elegans), the number is approximately 19,000; in yeast (S. cerevisiae) there are around 6,000 genes; and the microbe responsible for tuberculosis has around 4,000.
11. The vast majority of DNA in the human genome (97%) has no known function.
12. The first chromosome to be completely decoded was chromosome 22 at the Sanger Centre in Cambridgeshire, in December 1999.
13. There is 6 feet of DNA in each of our cells packed into a structure only 0.0004 inches across (it would easily fit on the head of a pin).
14. There are 3 billion (3,000,000,000) letters in the DNA code in every cell in your body.
15. There are 100 trillion (100,000,000,000,000) cells in the body.
16. If all the DNA in the human body were put end to end it would reach to the sun and back over 600 times (100 trillion x 6 feet divided by 93 million miles = about 1,200 one-way trips).
17. 12,000 letters of DNA are decoded by the Human Genome Project every second.
18. If all 3 billion letters were spread out 1 mm apart they would extend 3,000 km, or about 7,000 times the height of the Empire State Building.
19. If all 3 billion letters were spread out 3 mm apart they would extend 9,000 km, more than twice the length of the Mississippi River at 3,779 km.

20. The human genome contains 3.2 billion chemical nucleotide base pairs (A, C, T, and G).
21. The average gene consists of 3000 bp, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bp.
22. Functions are unknown for more than 50% of discovered genes.
23. The human genome sequence is almost exactly the same (99.9%) in all people.
24. About 2% of the genome encodes instructions for the synthesis of proteins.
25. Repeat sequences that do not code for proteins make up at least 50% of the human genome.
26. Repeat sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time these repeats reshape the genome by rearranging it, thereby creating entirely new genes or modifying and reshuffling existing genes.
27. The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%).
28. Over 40% of predicted human proteins share similarity with fruit-fly or worm proteins.
29. Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.
30. Chromosome 1 (the largest human chromosome) has the most genes (2968), and the Y chromosome has the fewest (231).
31. Genes have been pinpointed and particular sequences in those genes associated with numerous diseases and disorders including breast cancer, muscle disease, deafness, and blindness.
32. Scientists have identified millions of locations where single-base DNA differences occur in humans. This information promises to revolutionize the processes of finding DNA sequences associated with such common diseases as cardiovascular disease, diabetes, arthritis, and cancers.
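Several of the back-of-the-envelope figures in the list above (items 6, 8, 16, 18 and 19) are plain arithmetic and can be rechecked in a few lines of Python. The sketch below uses only the round numbers quoted in the list; the variable names are illustrative, and the 365-day year and 5,280 feet per mile are the only values added.

```python
GENOME_LETTERS = 3_000_000_000          # items 6, 8, 14, 18, 19
SECONDS_PER_YEAR = 365 * 24 * 60 * 60   # assumes a 365-day year
FEET_PER_MILE = 5280

# Item 6: reciting one letter per second, around the clock.
years_reciting = GENOME_LETTERS / SECONDS_PER_YEAR

# Item 8: 360 letters per minute, 8 hours a day, every day.
letters_per_day = 360 * 60 * 8
years_typing = GENOME_LETTERS / letters_per_day / 365

# Item 16: 100 trillion cells x 6 feet of DNA, against 93 million miles.
total_miles = 100e12 * 6 / FEET_PER_MILE
one_way_trips = total_miles / 93e6
round_trips = one_way_trips / 2

# Items 18 and 19: letters spaced 1 mm or 3 mm apart, converted to km.
km_at_1mm = GENOME_LETTERS * 1 / 1_000_000
km_at_3mm = GENOME_LETTERS * 3 / 1_000_000

print(f"reciting: {years_reciting:.0f} years")
print(f"typing:   {years_typing:.0f} years")
print(f"sun:      {one_way_trips:.0f} one-way trips, {round_trips:.0f} round trips")
print(f"spacing:  {km_at_1mm:.0f} km at 1 mm, {km_at_3mm:.0f} km at 3 mm")
```

The results land close to the rounded figures in the list: about 95 years of reciting, about 48 years of typing, and roughly 1,200 one-way trips (600 round trips) to the sun.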

Human Genome Project

The Human Genome Project (HGP) officially began in 1990 and was coordinated in the United States by the National Human Genome Research Institute (NHGRI) and the U.S. Department of Energy (DOE). International HGP partners included the United Kingdom, France, Germany, Japan and China. Once scientists completed the ultimate task of sequencing all 3 billion base pairs in the human genome, they created a virtual blueprint for a human being.

From 1990 to 1994, the activities of the HGP were primarily devoted to developing genetic and physical maps that allowed precise localization of genes, and exploring technologies that enabled the sequencing of very large amounts of DNA with high accuracy and low cost. Pilot projects were initiated in 1996 to explore the feasibility of such large-scale sequencing of human DNA. These projects were extremely successful and resulted in creative laboratory innovations that automated and accelerated the sequencing process. By September 1997, the pilot projects had sequenced approximately two percent of human DNA. Eventually, with current technology, HGP centers were able to sequence 1,000 base pairs per second at a very low cost.

Scientific leaders of the Human Genome Project also made an important decision in 1996 - to deposit sequence in public databases within 24 hours of its assembly, with no restrictions on its use or redistribution. This defining moment in the HGP made the sequence immediately available to anyone with an Internet connection, ensuring that the sequence would ultimately benefit the public by empowering all the world's best minds.

In June 2000, the International Human Genome Sequencing Consortium announced that a "working draft" sequence of the human genome, nearly 90 percent complete, had been produced. In February 2001, the consortium published this sequence and an initial analysis of the human genome that reported a number of discoveries. The most surprising of these was that humans have only 30,000 to 35,000 genes, whereas previous predictions had ranged from 80,000 to 150,000 genes.

The Human Genome Project's goal of producing a highly accurate "finished" sequence was met in April 2003 - under budget and two years ahead of the original schedule. With the completion of the HGP, the mission of the NHGRI has expanded to encompass a broad range of studies aimed at understanding the structure and function of the human genome and its role in health and disease.

Participants of the International Human Genome Sequencing Consortium

The Human Genome Project could not have been completed as quickly and as effectively without the strong participation of international institutions. In the United States, contributors to the effort include the National Institutes of Health (NIH), which began participation in 1988 when it created the Office for Human Genome Research, later upgraded to the National Center for Human Genome Research in 1990 and then the National Human Genome Research Institute (NHGRI) in 1997; and the U.S. Department of Energy (DOE), where HGP discussions began as early as 1984. However, almost all of the actual sequencing of the genome was conducted at numerous universities and research centers throughout the United States, the United Kingdom, France, Germany, Japan and China.

The International Human Genome Sequencing Consortium includes:

1. The Whitehead Institute/MIT Center for Genome Research, Cambridge, Mass., U.S.
2. The Wellcome Trust Sanger Institute, The Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, U.K.
3. Washington University School of Medicine Genome Sequencing Center, St. Louis, Mo., U.S.
4. United States DOE Joint Genome Institute, Walnut Creek, Calif., U.S.
5. Baylor College of Medicine Human Genome Sequencing Center, Department of Molecular and Human Genetics, Houston, Tex., U.S.
6. RIKEN Genomic Sciences Center, Yokohama, Japan
7. Genoscope and CNRS UMR-8030, Evry, France
8. GTC Sequencing Center, Genome Therapeutics Corporation, Waltham, Mass., U.S.
9. Department of Genome Analysis, Institute of Molecular Biotechnology, Jena, Germany
10. Beijing Genomics Institute/Human Genome Center, Institute of Genetics, Chinese Academy of Sciences, Beijing, China
11. Multimegabase Sequencing Center, The Institute for Systems Biology, Seattle, Wash., U.S.
12. Stanford Genome Technology Center, Stanford, Calif., U.S.
13. Stanford Human Genome Center and Department of Genetics, Stanford University School of Medicine, Stanford, Calif., U.S.
14. University of Washington Genome Center, Seattle, Wash., U.S.
15. Department of Molecular Biology, Keio University School of Medicine, Tokyo, Japan
16. University of Texas Southwestern Medical Center at Dallas, Dallas, Tex., U.S.
17. University of Oklahoma's Advanced Center for Genome Technology, Dept. of Chemistry and Biochemistry, University of Oklahoma, Norman, Okla., U.S.
18. Max Planck Institute for Molecular Genetics, Berlin, Germany
19. Cold Spring Harbor Laboratory, Lita Annenberg Hazen Genome Center, Cold Spring Harbor, N.Y., U.S.
20. GBF - German Research Centre for Biotechnology, Braunschweig, Germany

Model Organisms in HGP

In addition to the human genome, scientists are mapping and sequencing the genomes of a number of "model" organisms. Model organisms offer a great way to follow the inheritance of genes that are very similar to human genes through many generations in a relatively short time. There are several web sites which update the progress of the sequencing efforts on model organisms as well as a collection of microbial genomes. The Genome Monitoring Table gives monthly updates on the sequencing activity on the human genome as well as a few model organisms. This site also dynamically predicts the end date for these sequencing efforts based on the current rates of sequencing. The Magpie Genome Sequencing Project List is a complete list of all organisms with genome sequencing activity. This site also indicates the progress on each genome, and provides links to sites with primary sequence and mapping information.

Genome Sequencing Efforts in Model Organisms

Organism                              No. of chromosomes   DNA (Mb)   % sequenced (January 2001)   No. of genes
Escherichia coli (bacterium)          1                    4.7        100                          3,000
Saccharomyces cerevisiae (yeast)      16                   12         100                          6,000
Caenorhabditis elegans (nematode)     6                    97         100                          ?
Arabidopsis thaliana (Arabidopsis)    10                   118        100                          ~50,000
Drosophila melanogaster (fruit fly)   4                    135        100                          ?
Fugu rubripes (puffer fish)           -                    -          -                            -
Mus musculus (mouse)                  19 + X/Y             3,059      1.7                          ~80,000
Homo sapiens sapiens (human)          22 + X/Y             3,286      draft 100%                   ~80,000

Genome Mapping

A primary goal of the Human Genome Project is to make a series of descriptive maps of each human chromosome at increasingly finer resolutions. Mapping involves (1) dividing the chromosomes into smaller fragments that can be propagated and characterized and (2) ordering the fragments to correspond to their respective locations on the chromosomes. Additionally, the project researchers will be improving the instruments and techniques used in mapping and sequencing the genome, along with automating methods and optimizing techniques used to extract information from maps and sequences.

Normal Human Male Karyotype

There are two types of maps. The first is a genetic map, also called a linkage map, which shows the relative locations of specific DNA markers along the chromosome. Any inherited physical or molecular characteristic that differs among individuals and is easily detectable is a potential genetic marker. Markers can be expressed DNA regions (genes) or DNA segments that have no known coding function but whose inheritance pattern can be followed. DNA sequence differences are especially useful markers because they are plentiful and easy to characterize precisely.

The second type of map is a physical map. Different types of physical maps vary in their degree of resolution. The lowest-resolution physical map is the chromosomal or cytogenetic map, which is based on the banding patterns of stained chromosomes observed through a light microscope. A cDNA map shows the locations of expressed DNA regions (exons) on the chromosomal map. The more detailed cosmid contig map depicts the order of overlapping DNA fragments spanning the genome. A macro restriction map describes the order and distance between enzyme cleavage sites. The highest-resolution physical map is the complete DNA base pair sequence of each chromosome. Creating a cytogenetic map is like creating a map showing the 48 states; cDNA and cosmid maps would be like writing the names of the towns and cities; and sequencing the base pairs is equivalent to filling in the names of the streets.

Whose genome was sequenced?

In the IHGSC international public-sector Human Genome Project (HGP), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few of many collected samples were processed as DNA resources. Thus the donor identities were protected so neither donors nor scientists could know whose DNA was sequenced. DNA clones from many different libraries were used in the overall project, with most of those libraries being created by Dr. Pieter J. de Jong. It has been informally reported, and is well known in the genomics community, that much of the DNA for the public HGP came from a single anonymous male donor from Buffalo, New York (code name RP11).

HGP scientists used white blood cells from the blood of 2 male and 2 female donors (randomly selected from 20 of each), each donor yielding a separate DNA library. One of these libraries (RP11) was used considerably more than others, due to quality considerations. One minor technical issue is that male samples contain only half as much DNA from the X and Y chromosomes as from the other 22 chromosomes (the autosomes); this happens because each male cell contains only one X and one Y chromosome, rather than two copies as for each autosome. (This is true for nearly all male cells, not just sperm cells.)

Although the main sequencing phase of the HGP has been completed, studies of DNA variation continue in the International HapMap Project, whose goal is to identify patterns of single nucleotide polymorphism (SNP) groups (called haplotypes, or “haps”). The DNA samples for the HapMap came from a total of 270 individuals: Yoruba people in Ibadan, Nigeria; Japanese people in Tokyo; Han Chinese in Beijing; and the French Centre d’Etude du Polymorphisme Humain (CEPH) resource, which consisted of residents of the United States having ancestry from Western and Northern Europe.

In the Celera Genomics private-sector project, DNA from five different individuals was used for sequencing. The lead scientist of Celera Genomics at that time, Craig Venter, later acknowledged (in a public letter to the journal Science) that his DNA was one of those in the pool.

GENOME VARIATIONS

Genome variations are differences in the sequence of DNA from one person to the next. Just as you can look at two people and tell that they are different, you could, with the proper chemicals and laboratory equipment, look at the genomes of two people and tell that they are different, too. In fact, people are unique in large part because their genomes are unique.

How different is one human genome from another?

The more closely related two people are, the more similar their genomes. Scientists estimate that the genomes of non-related people—any two people plucked at random off the street—differ at about 1 in every 1,200 to 1,500 DNA bases, or "letters." Whether that's a little or a lot of variation depends on your perspective. There are more than three million differences between your genome and anyone else's. On the other hand, we are all 99.9 percent the same, DNA-wise. (By contrast, we are only about 99 percent the same as our closest relatives, chimpanzees.)
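The idea of two genomes being "99.9 percent the same" can be illustrated by comparing two aligned sequences position by position. The sketch below is a toy example only: the function names and the two 10-letter sequences are made up, and real comparisons must also handle insertions and deletions, which this simple count ignores.

```python
def differing_sites(a: str, b: str) -> int:
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which the two sequences agree."""
    return 100 * (len(a) - differing_sites(a, b)) / len(a)


# Two made-up 10-letter sequences that differ at a single position.
person_1 = "ACGTACGTAC"
person_2 = "ACGTTCGTAC"
print(differing_sites(person_1, person_2))   # 1
print(percent_identity(person_1, person_2))  # 90.0

# Scaling up: 1 difference per 1,000 letters (99.9% identity) across a
# genome of about 3 billion letters gives about 3 million differences.
print(3_000_000_000 // 1000)                 # 3000000
```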

Most genome variations are relatively small and simple, involving only a few bases—an A substituted for a T here, a G left out there, a short sequence such as CT added somewhere else, for example. Your genome probably doesn't contain long stretches of DNA that someone else's lacks.

If the genome were a book, every person's book would contain the same paragraphs and chapters, arranged in the same order. Each book would tell more or less the same story. But my book might contain a typo on page 303 that yours lacks, and your book might use a British spelling on page 135—"colour"—where mine uses the American spelling—"color."

If every human genome is different, what does it mean to sequence "the" human genome?

The human genome sequence announced in June 2000 was a working draft: a "representative" genome sequence based on the DNA of just a few individuals. The scientific papers describing it were published in February 2001, the public consortium's in the February 15 issue of Nature and Celera's in the February 16 issue of Science. Over the longer term, scientists will study DNA from many different people to identify where and what variations between individual genomes exist. Sequencing a genome is such a Herculean task that capturing its person-to-person variability on the first pass would be next to impossible.

But that doesn't mean that the representative sequence we have now will be useless; far from it. The vast majority of the genome's sequence is the same from one person to the next, with the same genes in the same places. In other words, my genome is a pretty good approximation of yours, and if scientists sequenced your genome they would learn a lot about mine. Moreover, since every person's genome is unique, no one person is any more or less "representative" than any other and it hardly matters whose genome is sequenced first.

A Procedure Overview

As with all systematic sequencing projects, the basic experimental problem in sequencing lies in the fact that the output of a single reaction (a ‘read’) yields about 500–800 bp [1,4]. To determine the sequence of a DNA molecule that is millions of bases long, it must first be fragmented into pieces that are within an order of magnitude of the read size. The sequence at one or both ends of many such fragments is determined, and the pieces are then ‘assembled’ back into the long linear string from which they were originally derived. A number of approaches for doing this have been suggested and tested; the most commonly used is shotgun sequencing [4]. The application of shotgun sequencing to the multimegabase- or gigabase-sized genomes of metazoans is still evolving. A small number of strategies are currently being evaluated, for example, hierarchical or map-based shotgun sequencing, whole-genome shotgun sequencing and hybrid approaches.
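The idea of assembling overlapping reads back into one string can be illustrated with a toy sketch. This is a simple greedy overlap merge on error-free, same-strand reads, written only for illustration; it is not the software used by any sequencing center, and real assemblers must also cope with sequencing errors, repeats, reverse-complement reads and base-quality scores. All names here are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b` (0 if below min_len)."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge; what remains are the contigs
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three short "reads" taken from the made-up sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]
print(greedy_assemble(reads))
```

When reads fail to overlap, the function returns more than one string, which corresponds to the separate contigs described in the pipeline steps.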

The IHGSC’s human sequencing effort began as a purely map-based strategy and evolved into a hybrid strategy [1]. The ‘pipeline’ that the IHGSC used to generate the human sequence data involved the following steps.

1. Bacterial artificial chromosome (BAC) clones were selected, and a random subclone library was constructed for each one in either an M13- or a plasmid-based vector.

2. A small number of members of the subclone library (usually 96 or 192) were sequenced to produce very-low-coverage, single-pass or ‘phase 0’ data. These data were used for quality control and can be found in the Genome Survey Sequence division of the DNA Data Bank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL) and GenBank (of the National Center for Biotechnology Information; NCBI).

3. If a BAC clone met the requisite standard, subclones were derived and sufficient sequence data generated from these to provide four- to fivefold coverage (that is, enough data to represent an average base in the BAC clone between four and five times). This is known as ‘draft-level’ coverage, and permits the assembly of sequence using computer programs that can detect overlaps between the random reads from the subclones, yielding longer ‘sequence contigs’. At this stage, the sequence of a BAC clone could typically exist on between four and ten different contigs, only some of which were ordered and oriented with respect to one another. The BAC ‘projects’ were submitted, within 24 hours of having been assembled, to the High-Throughput Genomic Sequences (HTGS) division of DDBJ/EMBL/GenBank [5], where each was given a unique accession number and identified with the keyword ‘htgs_draft’. (The DDBJ, EMBL and GenBank are members of the International Nucleotide Sequence Database Collaboration, whose members exchange data nightly and ensure that the sequence data generated by all public sequencing efforts are made available to all interested parties freely and in a timely fashion.) Less-complete high-throughput genomic (HTG) records are also known as ‘phase 1’ records. As the sequence is refined, it is designated ‘phase 2’. In the context of a BLAST search at the NCBI, these sequences would be available in the HTGS database.

4. In late 2000, the draft sequence of the entire human genome was assembled from the sequence of 30,445 clones (BAC clones and a relatively small number of other large-insert clones). This assembled draft human genome sequence was published in February 2001 and made publicly available through three primary portals: the University of California, Santa Cruz (UCSC), Ensembl (of the European Bioinformatics Institute; EBI) and the NCBI. The use of all three of these sites to obtain annotated information on the human genome sequence is the primary subject of this guide.

5. Subsequent to the generation and publication of the draft human genome sequence, work has continued towards finishing the sequencing. The final stage initially targeted draft-quality BAC clones. For each of these clones, enough additional shotgun sequence data are obtained to bring the coverage to eight- to tenfold, a stage referred to as ‘fully topped-up’. The data from each fully topped-up BAC are reassembled, typically resulting in a smaller number of contigs (often just a single contig) than at the draft level. The new assembly is again submitted to the HTGS division as an update of the existing BAC clone, now identified with the keyword ‘htgs_fulltop’. The accession number of the clone stays the same, and the version number increases by one (AC108475.2, for example, becoming AC108475.3).

6. At this stage, there are, even for clones comprising a single contig, typically some regions that are of insufficient quality for the clone to be considered finished. If this is the case, the fully topped-up sequence is analyzed by a sequence finisher (an actual person) who collects, in a directed manner, the additional data that are needed to close the few remaining gaps and to bring any regions of low quality up to the finished sequence standard. While the clone is worked on by the finisher, the HTGS entry in GenBank is identified by the keyword ‘htgs_activefin’. Once work on the clone has been completed, the keyword of the HTG record is changed to ‘htgs_phase3’, the version number is once again increased, and the record is moved from the HTGS division to the primate division of DDBJ/EMBL/GenBank. In the context of a BLAST search at NCBI, these finished BAC sequences would now be available in the nr (“non-redundant”) database.

7. The finished clone sequences are then put together into a finished chromosome sequence. As with the initial draft assemblies, there are a number of steps involved in this process that use map-based and sequence-based information in calculating the maps. The final assembly process involves identifying overlaps between the clones and then anchoring the finished sequence contigs to the map of the genome; details of the process can be found on the NCBI web site (http://www.ncbi.nlm.nih.gov/genome/guide/build.html). Initially, both the UCSC and NCBI groups generated complete assemblies of the human genome, albeit using different approaches. As noted on the UCSC web site, the NCBI assembly tended to have slightly better local order and orientation, whereas the UCSC assembly tended to track the chromosome-level maps somewhat better. Rather than having different assemblies based on the same data, IHGSC, UCSC, Ensembl and NCBI decided that it would be more productive (and obviously less confusing) to focus their efforts on a single, definitive assembly. To this end, and by agreement, the NCBI assembly will be taken as the reference human genome sequence.
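The coverage levels in the steps above (four- to fivefold for draft, eight- to tenfold for fully topped-up) translate into read counts with simple arithmetic. The sketch below assumes a hypothetical 150,000 bp BAC insert and 600 bp reads; neither figure is stated for any particular clone in this guide. The fraction-covered estimate is the standard idealized Lander-Waterman approximation, which assumes reads land uniformly at random and ignores edge effects, so it is a rough guide rather than a description of the consortium's own calculations.

```python
import math


def reads_needed(clone_length_bp: int, read_length_bp: int, coverage: float) -> int:
    """Reads required so that total read bases equal coverage x clone length."""
    return math.ceil(coverage * clone_length_bp / read_length_bp)


def expected_fraction_covered(coverage: float) -> float:
    """Idealized estimate of the fraction of bases hit by at least one read."""
    return 1.0 - math.exp(-coverage)


CLONE_LENGTH_BP = 150_000  # hypothetical BAC insert size, for illustration only
READ_LENGTH_BP = 600       # within the 500-800 bp read range quoted in the text

for label, coverage in [("draft", 4), ("draft", 5), ("topped-up", 8), ("topped-up", 10)]:
    n = reads_needed(CLONE_LENGTH_BP, READ_LENGTH_BP, coverage)
    frac = expected_fraction_covered(coverage)
    print(f"{label:>9} {coverage:>2}x: {n:>5} reads, ~{100 * frac:.2f}% of bases covered")
```

Even at fourfold coverage a small share of bases is expected to be missed under this model, which is consistent with draft clones sitting on several contigs and with gaps shrinking as coverage is topped up.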

Managing and Using Data: Bioinformatics Boom

Annotating the assemblies

Once the assemblies have been constructed, the DNA sequence undergoes a process known as annotation, in which useful sequence features and other relevant experimental data are coupled to the assembly. The most obvious annotation is that of known genes. In the case of NCBI, known genes are identified by simply aligning Reference Sequence (RefSeq) mRNAs (see box), GenBank mRNAs, or both to the assembly. If the RefSeq or GenBank mRNA aligns to more than one location, the best alignment is selected. If, however, the alignments are of the same quality, both are marked on to the contig, subject to certain rules (specifically, the transcript alignment must be at least 95% identical, with the aligned region covering 50% or more of the length, or at least 1,000 bases). Transcript models are used to refine the alignments. Ensembl identifies ‘best in genome’ positions for known genes by performing alignments between all known human proteins in the SPTREMBL database [6] and the assembly using a fast protein-to-DNA sequence matcher [7]. UCSC predicts the location of known genes and human mRNAs by aligning RefSeq and other GenBank mRNAs to the genome using the BLAST-like alignment tool (BLAT) program. In addition to identifying and placing known genes onto the assemblies, all of the major genome browser sites provide ab initio gene predictions, using a variety of prediction programs and approaches.
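The placement rule quoted for NCBI can be written as a small predicate. The sketch reads the rule as "at least 95% identical, and the aligned region is either at least half the transcript or at least 1,000 bases"; that grouping is an interpretation of the sentence above, and the class and field names are hypothetical rather than part of any NCBI tool.

```python
from dataclasses import dataclass


@dataclass
class TranscriptAlignment:
    transcript_id: str       # hypothetical identifier, for illustration only
    transcript_length: int   # length of the mRNA in bases
    aligned_length: int      # bases of the mRNA included in the alignment
    percent_identity: float  # identity over the aligned region, 0 to 100


def passes_placement_rule(aln: TranscriptAlignment) -> bool:
    """Apply the 95% identity and 50%-of-length-or-1,000-bases thresholds."""
    identical_enough = aln.percent_identity >= 95.0
    covers_half = aln.aligned_length >= 0.5 * aln.transcript_length
    long_enough = aln.aligned_length >= 1_000
    return identical_enough and (covers_half or long_enough)


candidates = [
    TranscriptAlignment("example_mrna_1", 2_400, 2_300, 99.1),  # long and near-identical
    TranscriptAlignment("example_mrna_2", 5_000, 1_200, 97.0),  # under half, but over 1,000 bases
    TranscriptAlignment("example_mrna_3", 1_800, 1_700, 91.5),  # too divergent
]
for aln in candidates:
    print(aln.transcript_id, passes_placement_rule(aln))
```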

Genome annotation goes well beyond noting where known and predicted genes are. Features found in the Ensembl, NCBI and UCSC assemblies include, for example, the location and placement of single-nucleotide polymorphisms, sequence-tagged sites, expressed sequence tags, repetitive elements and clones. Full details on the types of annotation available and the methods underlying sequence annotation for each of these different types of sequence feature can be found by accessing the URLs listed under Genome Annotation in the Web Resources section of this guide. At UCSC, many of the annotations are provided by outside groups, and there may be a significant delay between the release of the genome assembly and the annotation of certain features. Furthermore, some tracks are generated for only a limited number of assemblies. For an in-depth discussion of genome annotation, the reader is referred to an excellent review by Stein and the references cited therein. This review, along with the Commentary in this guide, also provides cautions on the possible overinterpretation of genome annotation data.

The data—and sometimes the tools—change every day

The steps outlined in the previous section should emphasize that the state of the human genome sequence will continue to be in flux, as it will be updated daily until it has actually been declared ‘finished’. (Finished sequence is properly defined as the “complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps”. A more practical definition is that of “essentially finished sequence,” meaning the complete sequence of a clone or genome, with an accuracy of at least 99.99% and no gaps, except those that cannot be closed by any current method.) The reader should be mindful of this, not just when reading this guide, but also when referring back to it over time. Similarly, the tools used to search, visualize and analyze these sequence data also undergo constant evolution, capitalizing on new knowledge and new technology in increasing the usefulness of these data to the user. Over the next year, sequence producers will continue to add finished sequence to the nucleotide sequence databases, and the NCBI will continue to update the human sequence assembly until its ultimate completion. The human genome sequence will, however, continue to improve even after April 2003, as new cloning, mapping and sequencing technologies lead to the closure of the few gaps that will remain in the euchromatic regions. It is hoped that such technological advances will also allow for the sequencing of heterochromatic regions, regions that cannot be cloned or sequenced using currently available methods. The sequence-based and functional annotations presented at the three major genome portals will certainly continue to evolve long after April 2003. Computational annotation is a highly active area of research, yielding better methods for identifying coding regions, noncoding transcribed regions and noncoding, non-transcribed functional elements contained within the human sequence.

Accessing human genome sequence data

Although each of the three portals through which users access genome data has its own distinctive features, coordination among the three ensures that the most recent version and annotations of the human genome sequence are available.

Ensembl (http://www.ensembl.org) is the product of a collaborative effort between the Wellcome Trust Sanger Institute and EMBL’s European Bioinformatics Institute and provides a bioinformatics framework to organize biology around the sequences of large genomes [7]. It contains comprehensive human genome annotation through ab initio gene prediction, as well as information on putative gene function and expression. The web site provides numerous different views of the data, which can be either map-, gene- or protein-centric. Ensembl is actively building comparative genome sequence views, and presents data from human, mouse, mosquito and zebrafish. In addition, numerous sequence-based search tools are available, and the Ensembl system itself can be downloaded for use with individual sequencing projects.

The UCSC Genome Browser (http://genome.ucsc.edu) was originally developed by a relatively small academic research group that was responsible for the first human genome assemblies. The browser displays the genome at any scale and is based on the intuitive idea of overlaying ‘tracks’ onto the human genome sequence; these annotation tracks include, for example, known genes, predicted genes and possible patterns of alternative splicing. There is also an emphasis on comparative genomics, with mouse genomic alignments being available. The browser also provides access to an interactive version of the BLAT algorithm, which UCSC uses for RNA and comparative genomic alignments.

Given its Congressional mandate to store and analyze biological data and to facilitate the use of databases by the research community, the NCBI (http://www.ncbi.nlm.nih.gov) serves as a central hub for genome-related resources. NCBI maintains GenBank, which stores sequence data, including that generated by the HGP and other systematic sequencing projects. NCBI’s Map Viewer provides a tool through which information such as experimentally verified genes, predicted genes, genomic markers, physical maps, genetic maps and sequence variation data can be visualized. The Map Viewer is linked to other NCBI tools, such as Entrez, the integrated information retrieval system that provides access to numerous component databases.
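As a hedged illustration of programmatic access, the sketch below only builds a request URL for NCBI's Entrez E-utilities `efetch` service; it does not contact the network. The endpoint, the `nuccore` database name and the `rettype`/`retmode` parameters reflect the E-utilities interface as it is commonly documented today, not anything described in this guide, and should be checked against NCBI's current documentation before use. The accession is the versioned example mentioned in the pipeline steps.

```python
from urllib.parse import urlencode

# Assumed base address of the E-utilities efetch service.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession_with_version: str) -> str:
    """Build (but do not send) a request for one nucleotide record in FASTA format."""
    params = {
        "db": "nuccore",               # nucleotide database (assumption, see lead-in)
        "id": accession_with_version,  # e.g. accession.version such as AC108475.3
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_BASE}?{urlencode(params)}"


print(efetch_fasta_url("AC108475.3"))
```

Requesting a specific version (the number after the dot) matters here because, as described above, a clone keeps its accession number while its version increases as the sequence is refined.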

Societal Concerns Arising from New Genetics

Critical policy and Ethical Issues

The ethical issues raised by the human genome project can be grouped into two general categories: genetic engineering and genetic information.

Genetic engineering

The first category consists of issues pertaining to genetic manipulation or what is sometimes called "genetic engineering." The map of the human genome provides information that will allow us to diagnose and eventually treat many diseases. This map will also enable us to determine the genetic basis of numerous physical and psychological traits, which raises the possibility of altering those traits through genetic intervention. Reflection on the ethical permissibility of genetic manipulation is typically structured around two relevant distinctions:

• the distinction between somatic cell and germline intervention, and

• the distinction between therapeutic and enhancement engineering

In germline engineering, changes are passed along in the genome of future generations.

Somatic cell manipulation alters body cells, which means that resulting changes are limited to an individual. In contrast, germline manipulation alters reproductive cells, which means that changes are passed on to future generations. Therapeutic engineering occurs when genetic interventions are used to rectify diseases or deficiencies. In contrast, enhancement engineering attempts to extend traits or capacities beyond their normal levels.

• The use of somatic cell interventions to treat disease is generally regarded as ethically acceptable, because such interventions are consistent with the purpose of medicine, and because risks are localized to a single patient.

• Germline interventions involve more significant ethical concerns, because risks will extend across generations, magnifying the impact of unforeseen consequences. While these greater risks call for added caution, most ethicists would not object to the use of germline interventions for the treatment of serious disease if we reach the point where such interventions could be performed safely and effectively. Indeed, germline interventions would be a more efficient method for treating disease, since a single intervention would render both the patient and his or her progeny disease-free, thus removing the need for repeated somatic cell treatments across future generations.

Altering one gene may not achieve the desired enhancement since many traits involve a mix of genes.

Enhancement engineering is widely regarded as both scientifically and ethically problematic. From a scientific standpoint, it is unlikely that we will soon be able to enhance normally functioning genes without risking grave side effects. For example:

• Enhancing an individual's height beyond his or her naturally ordained level may inadvertently cause stress to other parts of the organism, such as the heart.

• Moreover, many of the traits that might be targeted for enhancement (e.g., intelligence or memory) are genetically multifactorial, and have a strong environmental component. Thus, alteration of single genes would not likely achieve the desired outcome.

• These problems are magnified, and additional problems arise, when we move from somatic cell enhancements to germline enhancements.

Future generations may feel limited by choices made regarding their genetic traits.

In addition to the problem of disseminating unforeseen consequences across generations, we are faced with questions about whether future generations would share their predecessors' views about the desirability of the traits that have been bequeathed to them. Future generations are not likely to be ungrateful if we deprive them of genes associated with horrible diseases, but they may well feel limited by choices we have made regarding their physical, cognitive, or emotional traits. In short, there is a danger that social-historical trends and biases could place genetic limitations on future generations.

What rules should be set for the acquisition and use of genetic information?

Genetic screening results can create difficult situations for patients and their families.

Genetic information

The second general category consists of ethical questions pertaining to the acquisition and use of genetic information. Once we pinpoint the genetic basis for diseases and other phenotypic traits, what parameters should be set for the acquisition and use of genetic information? The key issue to be considered here is the use of genetic screening. Screening for diseases with the due consent of a patient or a legal proxy is generally viewed as ethically permissible, but even this form of screening can create some significant ethical challenges. Knowledge that one is or may be affected by a serious disease can create difficult situations for both patients and their families. Consider:

• If a test is positive, what options, medical or otherwise, will be available to ameliorate the condition?

• Will the patient's relatives be informed that they too may be affected by the ailment?

It is the job of genetic counselors to educate patients about the implications of genetic knowledge, and to help patients anticipate and deal with these challenges.

Should mandatory genetic screening be rejected in all situations?

Mandatory genetic screening of the adult population raises serious ethical questions about personal liberty and privacy, and thus is not likely to garner widespread support. Nevertheless, we are likely to hear calls for mandatory genetic testing in specific social contexts, and existing practices will no doubt be cited as justifications for such testing. For example, in the justice system, longstanding practices of fingerprinting, urine testing, and blood testing are already being supplemented by DNA testing.

Genetic testing is of particular concern when it comes to health insurance.

Of particular concern is the specter of genetic testing in the insurance industry. When individuals apply for insurance policies, they are often required to provide family medical history, as well as blood and urine samples. At present, however, insurance companies in the United States cannot require genetic testing of applicants. While this prohibition is designed to prevent genetic discrimination, insurance industry lobbyists will surely be pressing the following kind of argument in coming years:

• If it is considered fair and proper to identify applicants with high cholesterol and/or a family history of heart disease, and to charge those applicants higher premiums, why should it be considered unfair to utilize genetic testing to accomplish the same goals?

Such questions will have to be seriously considered by ethicists and lawmakers, in the attempt to achieve a fair balance between individual rights and the rights of insurance companies. Indeed, the development of genetic screening for a broad array of diseases and conditions may eventually lead us to rethink the principles that are used to determine insurability and the apportionment of payment burdens.

The genetic screening of newborns or others who are incapable of valid consent presents additional ethical questions.

Additional ethical questions arise when we consider genetic screening of newborns, young children, and others who cannot give valid consent for such procedures:

• As more genetic tests become available, which ones should be universally administered to newborns?

• What role should parental consent play in determining when children are screened?

Newborns are routinely tested for PKU without the explicit consent of parents.

Decisions about the implementation of universal genetic screening for newborns will likely follow existing policies, which perform tests for serious, early-onset diseases that are susceptible to treatment. The paradigm case for such universal screening is phenylketonuria (PKU). Newborns are routinely tested for PKU without the explicit consent of parents, under the assumption that parents want to know if their child is afflicted with this potentially devastating but easily treatable condition. Of course, the moral propriety of newborn screening becomes more complicated when we begin to deviate from this paradigm case. Determining whether screening should be pursued in cases like this will not always be easy:

• What if the disease is not easily treatable, or can only be treated at great expense that parents may not want to incur?

• What if an ailment is late onset and untreatable, as is the case with Huntington's disease? What if a test can only determine a probability, not a certainty, that a child will develop a disease?

With genetic testing, there is potential for conflict between a parent's choice and a child's welfare.

Of course, from a legal standpoint parents have broad discretion when it comes to decisions about their children's health and welfare, and this will no doubt hold true for decisions about both genetic testing and genetic engineering as these procedures become increasingly available. While this broad discretion is based on respect for parental autonomy and on a desire for minimal government intrusion into family life, we must acknowledge the potential for conflict between a parent's choice and a child's welfare.

• What if a parent refuses to consent to a test that is clearly in their child's best interest?

• What if a parent decides to pursue a genetic "enhancement" that involves significant risks for a child, or that may limit a child's life prospects?

While these questions may seem far-fetched to some, it is worth noting that current laws in most states allow parents to opt out of testing for PKU, despite the fact that this may leave their child exposed to a devastating disease.

Conclusion: As genetic engineering and information use increases, so will ethical questions.

Today, we face many important challenges pertaining to the use and distribution of genetic research and information. As our capabilities for genetic screening and genetic engineering increase, we are likely to encounter more difficult ethical questions, including questions about the limits of parental autonomy and the application of child welfare laws.

What We've Learned So Far

What Does the Draft Human Genome Sequence Tell Us?

By the Numbers

• The human genome contains 3164.7 million chemical nucleotide bases (A, C, T, and G).

• The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.

• The total number of genes is estimated at 30,000 —much lower than previous estimates of 80,000 to 140,000 that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor areas.

• Almost all (99.9%) nucleotide bases are exactly the same in all people.

• The functions are unknown for over 50% of discovered genes.

The Wheat from the Chaff

• Less than 2% of the genome codes for proteins.

• Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome.

• Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes.

• During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.

How It's Arranged

• The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C.

• In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes.

• Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.

• Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity.

• Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
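The contrast between GC-rich and AT-rich regions can be made concrete by measuring GC content in windows along a sequence. The sketch below uses a made-up 20-base string and a tiny window purely for illustration; real analyses use far larger windows on actual chromosome sequence, and the function names are hypothetical.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of bases in the sequence that are G or C."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(sequence: str, window: int, step: int) -> list[tuple[int, float]]:
    """GC fraction for each full window, reported with the window's start position."""
    return [
        (start, gc_fraction(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


# Made-up sequence: a GC-rich stretch followed by an AT-rich stretch.
toy_sequence = "GCGCGGCCAT" + "ATATTAATGC"
for start, gc in windowed_gc(toy_sequence, window=10, step=10):
    print(f"bases {start}-{start + 9}: GC fraction {gc:.1f}")
```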

How the Human Compares with Other Organisms

• Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout.

• Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. This process can yield different protein products from the same gene.

• Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity.

• The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%).

• Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.

Variations and Mutations

• Scientists have identified about 1.4 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history.

• The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females. Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.

Applications, Future Challenges

Deriving meaningful knowledge from the DNA sequence will define research through the coming decades to inform our understanding of biological systems. This enormous task will require the expertise and creativity of tens of thousands of scientists from varied disciplines in both the public and private sectors worldwide.

The draft sequence already is having an impact on finding genes associated with disease. A number of genes have been pinpointed and associated with breast cancer, muscle disease, deafness, and blindness. Additionally, finding the DNA sequences underlying such common diseases as cardiovascular disease, diabetes, arthritis, and cancers is being aided by the human variation maps (SNPs) generated in the HGP in cooperation with the private sector. These genes and SNPs provide focused targets for the development of effective new therapies.

One of the greatest impacts of having the sequence may well be in enabling an entirely new approach to biological research. In the past, researchers studied one or a few genes at a time.

With whole-genome sequences and new high-throughput technologies, they can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example, or all the transcripts in a particular tissue or organ or tumor, or how tens of thousands of genes and proteins work together in interconnected networks to orchestrate the chemistry of life.

The Next Step: Functional Genomics

The words of Winston Churchill, spoken in 1942 after 3 years of war, capture well the HGP era: "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning."

The avalanche of genome data grows daily. The new challenge will be to use this vast reservoir of data to explore how DNA and proteins work with each other and the environment to create complex, dynamic living systems. Systematic studies of function on a grand scale (functional genomics) will be the focus of biological explorations in this century and beyond. These explorations will encompass studies in transcriptomics, proteomics, structural genomics, new experimental methodologies, and comparative genomics.

• Transcriptomics involves large-scale analysis of messenger RNAs transcribed from active genes to follow when, where, and under what conditions genes are expressed.

• Studying protein expression and function, or proteomics, can bring researchers closer to what's actually happening in the cell than gene-expression studies. This capability has applications to drug design.

• Structural genomics initiatives are being launched worldwide to generate the 3-D structures of one or more proteins from each protein family, thus offering clues to function and biological targets for drug design.

• Experimental methods for understanding the function of DNA sequences and the proteins they encode include knockout studies to inactivate genes in living organisms and monitor any changes that could reveal their functions.

• Comparative genomics, the side-by-side analysis of DNA sequence patterns of humans and well-studied model organisms, has become one of the most powerful strategies for identifying human genes and interpreting their function.