The unveiling of the human genome draft sequences was a seminal achievement that ushered in the new era of post-genomic science. As a scientific milestone, it is worthy of mention in the same breath as Watson and Crick's paper describing the structure of DNA. The human genome sequence has already given us new insight into our own relative biological complexity as well as the mechanisms of our evolution. However, it is impossible to predict the magnitude of the future scientific and social implications of this knowledge.
The sequencing of the genome was proposed in 1985 and endorsed in 1988. Though it began cautiously, it became a large coordinated effort between 20 government-sponsored research teams involving hundreds of people organized under the banner of the International Human Genome Sequencing Consortium. This government-funded group's efforts were loosely termed the public project.
In 1998, Craig Venter founded a private company called Celera Genomics and made the stunning announcement that his company planned to complete the sequence of the genome within 3 years, well ahead of the public effort. By automating the entire sequencing process with robotics, a tremendous amount of computing power, and the latest capillary sequencers, Celera began churning out sequence data at a prodigious rate. Once Celera entered the game, the competition between this private venture and the public project became fierce.
The race to finish sequencing the human genome was on. The public project's head start was not nearly as large as one might think, for much of the technology that was needed for this large sequencing effort was not available until the late 1990s, and Celera had access to all of the public project's openly available data.
The race to complete the sequence of the human genome, however, appears to have ended in a tie. In February of 2001, both groups separately published the draft sequences in Science and Nature.
The two camps, public and private, had different methodologies for accomplishing
the same goal. Their objective was to decipher the human genome, but they had
to wait for technology to advance and match the immense scale of the proposed
project. During this wait, a few other genomes were sequenced, in part to validate
# of Genes
|H. influenzae (bacteria)
|S. cerevisiae (brewer's yeast)
|C. elegans (nematode)
|A. thaliana (flowering plant)
|D. melanogaster (fly)
|M. musculus (mouse)
|H. sapiens (human)
|Adapted from Pennisi, E., Science 291,
The public project's approach was conservative and methodical. The team set out to first produce a physical map of the genome that would later serve as a scaffold for the sequence data. Specifically, their plan was to
- Map chromosomes first. Clone-based physical map is scaffold for sequence.
- Break genome into chunks of DNA whose positions on the chromosome were known from maps, then clone into bacteria using BACs.
- Digest BAC-inserted clonal chunks of DNA into small fragments.
- Sequence small fragments.
- Stitch together BAC clones to assemble sequence.
- Assemble genome sequence from BAC clone sequences, using clone-based physical map.
Celera's private venture used a different approach, termed shotgun sequencing. This method required no organized map. Instead it involved shredding the genome into small DNA fragments, then using computing power to reconstruct the genome from the multitude of overlapping sequences present in the individual pieces. Celera's detailed program was to
- Shred genome randomly into small fragments without regard to their original physical location.
- Clone and sequence fragments.
- Use computer to stitch together genome by matching overlapping ends of sequenced fragments.
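The overlap-matching idea behind shotgun assembly can be sketched in a few lines of Python. This is a toy greedy merger, not Celera's actual assembler: the function names, the made-up fragments, and the minimum-overlap threshold are all illustrative assumptions, and the sketch ignores sequencing errors, repeats, and reads from the opposite strand.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def drop_contained(frags: list[str]) -> list[str]:
    """Remove duplicates and fragments that lie wholly inside another fragment."""
    unique = list(dict.fromkeys(frags))
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the two fragments with the largest end overlap.

    Returns the remaining contigs once no pair overlaps by at least min_len bases.
    """
    frags = drop_contained(fragments)
    while True:
        best_n, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            return frags
        # Join the pair, keeping the shared bases only once.
        merged = frags[best_i] + frags[best_j][best_n:]
        rest = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags = drop_contained(rest + [merged])


# Hypothetical reads, shredded from an invented 14-base "genome".
reads = ["GTTAGC", "ATGCGTAC", "GTACGTTA"]
contigs = greedy_assemble(reads)
print(contigs)
```

A greedy strategy is the simplest choice to show here; it can misassemble repeated sequence, which is one reason repeat-rich regions are hard to reconstruct reliably.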
Genome sequencing was driven by technological progress. We have recognized
DNA as the carrier molecule for genetic information only relatively recently,
beginning with Watson and Crick's 1953 landmark paper on DNA structure. Recombinant
DNA was first made in 1972, yet it was not until 1977 that scientists could readily
sequence an entire gene. When the human genome project was first proposed in 1985,
a laboratory could only sequence 500 base pairs per day by hand. At this rate,
sequencing the 3,000,000,000 base human genome would have taken over 16,000 years!
The development of PCR in 1985 and the first automated DNA sequencing machine
in 1986, along with the use of bacterial artificial chromosomes (BACs) in 1992,
revolutionized large-scale cloning and brought the idea of sequencing the human
genome out of the realm of the unthinkable. A few genomes were sequenced with
these tools, yet their size was a far cry from that of the human genome. Genome
scientists waited impatiently for the available technology to improve. Indeed,
once automated sequencers and sufficient computing power could be implemented,
the newly formed Celera was able to gather sequence data at the rate of 1,000
bases per second. Quite impressively, they completed their draft of the human
genome in only one year of data gathering.
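The throughput figures quoted above can be checked with simple arithmetic. The sketch below uses only the numbers given in the text (500 base pairs per day by hand, 1,000 bases per second automated, a 3,000,000,000-base genome); the `coverage` parameter is an assumption added for illustration, since a shotgun project must read each base several times over, and the text does not state how many.

```python
GENOME_BASES = 3_000_000_000      # genome size used in the text
MANUAL_BASES_PER_DAY = 500        # one laboratory sequencing by hand, 1985
CELERA_BASES_PER_SECOND = 1_000   # automated rate quoted for Celera

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def years_by_hand(genome: int = GENOME_BASES,
                  bases_per_day: int = MANUAL_BASES_PER_DAY) -> float:
    """Years needed to read every base once at the manual rate."""
    return genome / bases_per_day / DAYS_PER_YEAR


def days_automated(genome: int = GENOME_BASES,
                   bases_per_second: int = CELERA_BASES_PER_SECOND,
                   coverage: float = 1.0) -> float:
    """Days needed at the automated rate; `coverage` is a hypothetical redundancy factor."""
    return genome * coverage / bases_per_second / SECONDS_PER_DAY


print(f"By hand:   {years_by_hand():,.0f} years")
print(f"Automated: {days_automated():.1f} days for a single pass")
```

The manual estimate comes out just above 16,000 years, matching the figure in the text, while a single pass at the automated rate takes on the order of weeks.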
The two competing groups have simultaneously released rough draft sequences of the human genome that are fairly equal in quality and quantity. The term "draft" is used because the sequence is largely complete, but does have gaps and imperfections.
- The human genome is ~3.2 Gb.
- 90% of the 2.5 Gb of gene-rich (euchromatic) DNA has been sequenced.
- Only 28% is transcribed into RNA.
- Only 1.1%–1.4% of the genome actually encodes protein (= 5% of transcribed
- A slim majority of DNA is repeats (junk DNA).
What is considered finished? One standard is that fewer than 1
base in 10,000 is incorrectly assigned, more than 95% of the euchromatic regions
are sequenced, and each gap is smaller than 150 kb. By these criteria, over one-quarter
of the public project's sequence is already considered finished.
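The three-part standard just described can be written as a simple check. This is a minimal sketch, assuming one has an error estimate, a coverage fraction, and a list of gap sizes for a stretch of sequence; the function and parameter names are hypothetical, and only the thresholds come from the text.

```python
def is_finished(bases_sequenced: int,
                estimated_errors: int,
                euchromatic_fraction_sequenced: float,
                gap_sizes_bp: list[int]) -> bool:
    """Apply the 'finished' standard quoted in the text.

    - fewer than 1 incorrectly assigned base in 10,000
    - more than 95% of the euchromatic region sequenced
    - every gap smaller than 150 kb
    """
    error_rate_ok = estimated_errors * 10_000 < bases_sequenced
    coverage_ok = euchromatic_fraction_sequenced > 0.95
    gaps_ok = all(gap < 150_000 for gap in gap_sizes_bp)
    return error_rate_ok and coverage_ok and gaps_ok


# Invented numbers, for illustration only.
print(is_finished(1_000_000, 50, 0.97, [20_000, 80_000]))   # meets all three criteria
print(is_finished(1_000_000, 50, 0.90, [20_000, 200_000]))  # fails coverage and gap size
```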
WHERE IS THIS INFORMATION?
All of the data from the public project, as well as sequence and mapping tools, are available on the Internet from the public Human Genome Database at the NCBI.
The genome database has many tools to locate a gene of interest or search for potential traits associated with the gene.
Below is a chromosomal map search result for the "breast cancer gene" BRCA2. Mutations in this gene greatly increase a woman's chance of developing breast cancer. As shown below, BRCA2 is located on chromosome 13:
WHAT WE HAVE DISCOVERED
Although the tools for analyzing raw sequence data are not yet fully developed, we have learned a few things and have stumbled upon a few surprises. Some of the major findings and analyses are presented below. Unexpected findings include the large amount of junk DNA present and the lower than expected number of genes. As for analysis of the sequence information, a fair amount of work has been done looking at gene variation between individuals.
Junk and parasitic DNA
Junk DNA sequences have, as yet, no apparent biological function. Junk sequences commonly take the form of long stretches of repeated sequence, and their purpose is a hot area of investigation.
- The human genome has far more repeat DNA than any other sequenced genome (over
- 45% of this repeat DNA is from selfish, parasitic DNA such as transposable elements.
- Junk DNA may play a role in evolution.
The number of genes in our genome is less than half of what was expected, but this surprise is mitigated by the fact that human genes are more complex than those of other species and are more difficult to detect than anticipated.
- Many fewer genes than expected: only 35,000–45,000 genes vs. previously predicted 100,000.
- Only twice the number in a nematode or a fruit fly, but more than twice as complex.
- Alternative splicing: even though the coding regions (exons) for each gene are the same average size in human, worm, and fly, vertebrate genes are more innovative in their assembly of exons. Instead of specifying one protein, a human gene on average codes for three different proteins by utilizing different combinations of exons.
- Protein domains are mixed more creatively and in larger numbers by vertebrates.
- Genes are elusive: difficult to find/predict in human genome using computational methods.
The International Single Nucleotide Polymorphism (SNP) Map compiles 1.4 million SNPs, providing a map of the single-base pair differences between individuals.
- Disease resistance. Our individual genomic differences are related to our susceptibilities to disease: for example, sickle cell hemoglobin and malaria resistance.
- Response to therapeutics. SNPs can also be used to predict responses to particular drugs, in order to avoid adverse drug reactions.
- Evolution. We can try to reconstruct the history of human evolution and migration by analyzing patterns of molecular genetic variation.
- Natural selection. We can study the genetic effects of natural selection in humans by looking for frequent and similar genetic variations in groups of people exposed to similar environmental pressures.
- Individual traits. We can explain how one person is different from another (appearance, intelligence, physical prowess, emotional
The SNP database can be used to investigate mutations in the "breast cancer gene" BRCA2. Shown below is the location on chromosome 13 and the initial sequence of BRCA2 with one of the mapped variations, obtained from the public genome database at the NCBI. Note the location of the variation in the mRNA coding portion of the gene.
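The notion of a SNP, a single-base difference between individuals, can be illustrated with a short Python sketch that compares two aligned sequences position by position. The sequences here are invented for illustration and are not taken from BRCA2 or from the SNP database, and the function name is hypothetical; real SNP discovery also has to deal with alignment, sequencing errors, and population frequencies.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """List single-base differences between two already-aligned sequences.

    Each entry is (1-based position, reference base, sample base).
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Made-up sequences from two hypothetical individuals.
individual_1 = "ATGCCTATTGG"
individual_2 = "ATGCCTGTTGG"
print(find_snps(individual_1, individual_2))
```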
Filling in the blanks and analysis
With the draft sequence complete, scientists must press on and obtain the rest of the human genome sequence. Finishing the sequence is more difficult than obtaining a draft sequence, because the majority of the remaining unsequenced DNA consists of blocks of repeats that are hard to clone, sequence, and reassemble reliably.
Making sense of the available sequence information and putting it to work is another immediate focus for the scientific community. The human genome project has opened tremendous opportunities in the area of bioinformatics, which combines computer technology with molecular biology.
Of special interest to scientists, especially comparative biologists, is the
sequencing of the genomes of other vertebrate species for comparison. Most valuable
would be obtaining the genome sequence information for our closest evolutionary
neighbor, the chimpanzee. Because mice are commonly used as animal models
in the study of human disease, the mouse genome sequence is also eagerly awaited. Some
pending vertebrate genome projects include the following:
Proteomics is genomics' successor to the "omics" throne. The genome sequence is a simple two-dimensional look at our biology. It is the defining, yet still cryptic, text of the manual explaining human biology. However, because biochemical processes are performed by proteins, the molecular workhorses of cells, proteomics sets out to define the structures of and relationships between the proteins found in the genome. This represents an obvious yet much more difficult next step in understanding human biology at the molecular level. In a sense, proteomics can provide the pictures to illustrate our defining manual and clarify the text provided by the human genome sequence.
Gene and protein chips (microarrays)
Genomic sequence information is invaluable to the production of gene
chips. These are simply microarrays of sequences that mark a large population
of an organism's genes. Gene chips are created by placing small amounts of
unique DNA segments in a large array onto a solid support (chip), often a simple
glass slide. The DNAs immobilized on the chip are frequently designed to be complementary
and therefore capable of binding specifically to the genes or expressed messenger
RNA of an organism. Cellular samples of DNA or RNA (e.g., from diseased tissue)
can be added to the chips to screen for genetic mutations. Questions about which
genes are actively being transcribed, or "turned on," can be answered
by quantifying the expression of mRNAs corresponding to specific gene sequences.
Alternatively, if the immobilized DNA is an array of gene promoters, then a sample
can also be screened for transcriptional regulators present in a cellular sample.
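The complementary binding that gene chips rely on can be modeled very roughly in Python. In this sketch a probe "binds" a sample sequence only if the probe's reverse complement appears exactly within it; the probe names, sequences, and function names are invented, and real hybridization tolerates mismatches, depends on temperature and probe length, and is read out as signal intensity rather than a simple count.

```python
# Watson-Crick pairing for DNA: A with T, C with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the strand that would pair with `seq`, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def hybridizes(probe: str, target: str) -> bool:
    """True if the probe is perfectly complementary to some stretch of the target."""
    return reverse_complement(probe) in target


def read_chip(probes: dict[str, str], sample: list[str]) -> dict[str, int]:
    """Count, for each immobilized probe, how many sample molecules it captures."""
    return {
        name: sum(hybridizes(probe, target) for target in sample)
        for name, probe in probes.items()
    }


# Hypothetical chip with two probes and a sample of three sequences
# (written in the DNA alphabet, as for cDNA copies of mRNA).
chip = {"geneA": "TTGCA", "geneB": "CCATTAG"}
sample = ["ATGGCCTGCAAGT", "GGTGCAATTC", "ACGTACGT"]
print(read_chip(chip, sample))
```

In this toy readout, a higher count for a probe stands in for a more strongly expressed gene; a probe with a count of zero corresponds to a gene that is not detectably "turned on" in the sample.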
Protein chips are similar arrays of immobilized proteins that can
probe for protein–protein interactions, protein–nucleic acid interactions,
or even for binding and activity when exposed to drugs or enzyme substrates.