To understand the history and diversity of life is a preeminent goal of modern science. From a biological sciences perspective, the field that claims this pursuit is called systematics, which is defined as the study of biological diversity in an evolutionary context. Within the study of systematics, scientists trace the phylogeny, or evolutionary history, of a species or group of related species.
The ideal of a systematist is to account for the evolutionary history of all
species, dating back to the very origin of life. The modern systematist employs
techniques that classify organisms based on anatomical and molecular characteristics.
Anatomically characterizing an organism involves two main approaches: studying
the morphology of living organisms and analyzing the fossil record. Molecularly characterizing
an organism uses various sequencing techniques to identify similarities in genetic
information between organisms as expressed in nucleic acids or proteins.
Data generated from various techniques are used to develop hypotheses that
ultimately classify an organism or species based on its characteristics. This
involves employing a system of record keeping that allows the systematist to
organize vast and diverse sets of comparative information. Phylogenetic trees,
or diagrams that trace evolutionary relationships, serve this purpose. Phylogenetic
trees are constructed to record the hypothesized classifications of organisms.
If a group of organisms is hypothesized to consist of a common ancestor and all
of its descendants, the group is referred to as monophyletic. If members of a group
are instead derived from two or more separate ancestral lineages, the group is referred to as polyphyletic.
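The distinction can be checked mechanically once a tree hypothesis is written down. The following is a minimal sketch, assuming a hypothetical tree stored as nested Python tuples; the tree shape, the taxon names, and the function names are illustrative and do not come from any dataset. It treats a group as monophyletic when some node of the tree has exactly that group as its set of descendant tips.

```python
def leaves(tree):
    """Return the set of tip names under a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def is_monophyletic(tree, group):
    """True if some node's descendant tips are exactly `group`."""
    group = set(group)
    if leaves(tree) == group:
        return True
    if isinstance(tree, str):
        return False
    return any(is_monophyletic(child, group) for child in tree)


# Hypothetical tree, for illustration only.
example_tree = ((("human", "rat"), ("yeast", "mold")), "bacterium")

print(is_monophyletic(example_tree, {"human", "rat"}))    # True
print(is_monophyletic(example_tree, {"human", "yeast"}))  # False
```

In this sketch the group {human, yeast} fails the test because the smallest node containing both also contains rat and mold.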
UNDERSTANDING RELATEDNESS USING ANATOMICAL CHARACTERIZATION
A traditional approach to classifying organisms relies upon the simple hypothesis
that the greater the anatomical similarity between organisms, the more closely
related they are likely to be in evolutionary history. This is based on two major approaches:
characterizing the morphology of living animals, plants, and microorganisms; and
studying fossils, the preserved impressions left by organisms.
The morphology of an organism is simply a description of its physical characteristics. If the organism under study is extinct or too small to resolve with modern microscopy techniques, observing its morphology directly is infeasible.
The fossil record
The fossil record contains fossilized remains and imprints whose age is estimated
from the age of the surrounding rock. The oldest known fossils are believed to be
approximately 3.5 billion years old and represent the existence of bacterium-like
life. An example of a dating technique used to determine the ages of rocks and
fossils on a scale of absolute time is radiometric dating.
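As a rough illustration of how radiometric dating yields an absolute age, the sketch below applies the standard exponential decay relationship. It assumes a closed system with no daughter isotope present when the rock formed; the function name and the half-life used in the example are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import math


def radiometric_age(daughter_to_parent_ratio, half_life_years):
    """Estimate age from the measured daughter/parent isotope ratio.

    Assumes no daughter isotope was present initially and that no
    parent or daughter atoms entered or left the sample.
    """
    decay_constant = math.log(2) / half_life_years
    return math.log(1.0 + daughter_to_parent_ratio) / decay_constant


# Hypothetical isotope with a half-life of one billion years.
HALF_LIFE = 1.0e9

# Equal amounts of daughter and parent: one half-life has elapsed.
print(radiometric_age(1.0, HALF_LIFE))
# Three daughter atoms per parent atom: two half-lives have elapsed.
print(radiometric_age(3.0, HALF_LIFE))
```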
UNDERSTANDING RELATEDNESS USING MOLECULAR TECHNIQUES
Recent advances in biochemistry have yielded powerful molecular tools for constructing phylogenetic trees. Protein, mRNA, rRNA, and genomic sequencing are molecular techniques commonly used in identifying and characterizing organisms.
Molecular sequence data can be used to calibrate molecular clocks
based on the rate of mutations in protein and nucleic acid sequences. The rationale
of molecular clocks is that the observed number of changes in an amino acid
or nucleotide sequence may be approximately proportional to the evolutionary time.
The National Center for Biotechnology Information houses a public database
of sequenced nucleotides and proteins and develops sequence analysis software.
The amino acid sequence of cytochrome c has been analyzed in over 100
eukaryotic species, and the molecular data support the notion that cytochrome
c is an evolutionarily conservative protein. Shown below are two protein
sequence comparisons for cytochrome c: between human and either rat or yeast.
Substitutions within the primary structure of cytochrome c accumulate at a relatively
constant rate over time, which makes cytochrome c a potentially
useful molecular clock. For example, note that the cytochrome c sequence similarity
between humans and rats (91%) is much higher than between humans and yeast (64%).
This indicates that the evolutionary path leading to the human species diverged
from that of yeast long before it diverged from that of the rat. A comparison of cytochrome
c sequences from different species has been used to order the divergence
of species in relative time.
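The similarity percentages used in such comparisons can be computed directly from an alignment. The following is a minimal sketch, assuming the sequences have already been aligned to equal length with "-" marking gaps. The short sequences and species labels are made up for illustration and are not real cytochrome c data.

```python
def percent_identity(seq_a, seq_b):
    """Percent of aligned, ungapped positions at which two sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":  # skip alignment gaps
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped positions to compare")
    return 100.0 * matches / compared


# Made-up aligned fragments, NOT real cytochrome c sequences.
reference = "MKTAYIAKQR"
others = {
    "species_x": "MKTAYIAKQK",
    "species_y": "MRSAYLSKHR",
}

identity = {name: percent_identity(reference, seq) for name, seq in others.items()}
# Lower identity to the reference suggests an earlier divergence.
divergence_order = sorted(identity, key=identity.get)
print(identity)          # {'species_x': 90.0, 'species_y': 50.0}
print(divergence_order)  # ['species_y', 'species_x']
```

Sorting by identity gives only a relative order of divergence; it says nothing about absolute dates.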
Determining the absolute timing of events using molecular clocks attracts skepticism
for several reasons. The fact that different proteins evolve at different rates
makes determining the divergence rate of a particular protein problematic. Also,
changes in generation time or metabolic rate may affect the mutation rate, making
molecular clocks less predictable.
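The sensitivity of absolute dates to the assumed rate is easy to see numerically. The sketch below assumes the simplest possible clock, in which two lineages each accumulate substitutions at the same constant rate after they split, so the time since divergence is the distance divided by twice the rate. The distance and both rates are hypothetical values, not measurements.

```python
def divergence_time(distance_per_site, rate_per_site_per_year):
    """Years since two lineages split under a strict molecular clock.

    The factor of 2 reflects that both lineages have been accumulating
    substitutions independently since their common ancestor.
    """
    return distance_per_site / (2.0 * rate_per_site_per_year)


# Hypothetical observed distance: 0.1 substitutions per site.
observed_distance = 0.1

# Two hypothetical rate assumptions for the same protein.
slow_estimate = divergence_time(observed_distance, 1.0e-9)
fast_estimate = divergence_time(observed_distance, 2.0e-9)

# Doubling the assumed rate halves the estimated age.
print(slow_estimate, fast_estimate)
```

The same observed distance yields roughly 50 million years under one rate and 25 million under the other, which is why an uncertain or variable rate undermines absolute dating.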
Method of protein sequencing and alignment
A common approach to protein sequencing involves repeatedly cleaving a polypeptide
(using a protein-digesting enzyme), separating the fragments using chromatography,
and sequencing the resulting small fragments. The entire protein sequence is
then reconstructed by matching regions of sequence overlap seen in the small fragments.
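The overlap-matching step can be sketched as a greedy merge. This is an illustration under simplifying assumptions (error-free fragments, unambiguous overlaps, a made-up peptide), not a description of any particular laboratory pipeline. The function names and the minimum overlap of two residues are hypothetical choices.

```python
def overlap(left, right, min_len=2):
    """Length of the longest suffix of `left` that is a prefix of `right`."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments, min_len=2):
    """Greedily merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates
    # Drop fragments fully contained in another fragment.
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]
    while len(frags) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_size:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Fragments from two hypothetical digests of a made-up peptide.
digest_one = ["MKTAY", "IAKQRQ", "ISFVK"]
digest_two = ["MKT", "AYIAK", "QRQISF", "VK"]
print(assemble(digest_one + digest_two))  # ['MKTAYIAKQRQISFVK']
```

Two digests with different cleavage points are needed because fragments from a single digest do not overlap one another.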
Algorithms may be used to compare protein sequence data against a database.
These algorithms provide local information (within a defined region) or global
information (which accounts for gaps or other aberrations that may mask multiple
regions of similarity). This technology has greatly facilitated the grouping
of related proteins into families (an example is the serine protease family
of enzymes, which includes trypsin, chymotrypsin, and elastase).
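The difference between local and global comparison can be illustrated with the classic dynamic-programming formulation. The sketch below scores, but does not print, the best alignment. It assumes a deliberately simple scoring scheme (+1 match, -1 mismatch, -1 per gap position), whereas database search tools use substitution matrices and faster heuristics. The function name and the test sequences are hypothetical.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score: global (end to end) or local (best region)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # Global alignment pays for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(score[i - 1][j - 1] + pair,  # align two residues
                       score[i - 1][j] + gap,       # gap in b
                       score[i][j - 1] + gap)       # gap in a
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Made-up sequences that share only the region "ACGT".
seq_one = "GGGGACGT"
seq_two = "ACGTCCCC"
print(alignment_score(seq_one, seq_two, local=True))   # 4
print(alignment_score(seq_one, seq_two, local=False))  # -4
```

The local score rewards the shared region alone, while the global score is dragged down by the unrelated flanking residues.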
The significance of protein sequencing
Direct protein sequencing is an indispensable tool for several reasons. For
example, the position of disulfide bonds can be resolved using protein sequencing
techniques. Also, protein modifications, such as the excision of residues or
the covalent attachment of other groups, can be detected. These changes in protein
sequence and structure would evade detection by nucleic acid sequencing efforts.
One drawback to protein sequencing is that only those genes that code for proteins are under scrutiny. This is often a small fraction of an organism's total amount of molecular information.
Student activity 1 relates to protein sequencing
Nucleic acid data
The sequencing of nuclear, mitochondrial, and genomic DNA, along with mRNA,
tRNA, and rRNA, provides the systematist with a potential wealth of phylogenetic
data. Because of the large amount of data generated from nucleic acid sequencing,
especially on the genomic level, the complexity of the algorithms used in sequence
analysis is far greater than what is needed to analyze protein sequence data.
Method of DNA sequencing and alignment
The most commonly used method for high-throughput DNA sequencing is the chain-termination method, which makes identifying 10,000 bases per day easily achievable.
Complex algorithms are used to quantify sequence differences between copies
of a given gene in the same or different species. Genes are said to be homologous
if they share a common ancestor. Genes that are similar in sequence may or may
not be homologous.
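One simple way to quantify sequence differences is the proportion of differing sites, optionally corrected for multiple substitutions at the same site. The sketch below uses the Jukes-Cantor correction, which assumes all nucleotide substitutions are equally likely. The sequences and names are made up. Note that a small distance measures similarity only; it does not by itself establish homology.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    differences = sum(1 for a, b in pairs if a != b)
    return differences / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance; only defined for p < 0.75."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(named_sequences):
    """Corrected distance for every ordered pair of named sequences."""
    return {
        (x, y): jukes_cantor(p_distance(named_sequences[x], named_sequences[y]))
        for x in named_sequences
        for y in named_sequences
    }


# Made-up aligned nucleotide sequences.
genes = {
    "copy_1": "ACGTACGTAC",
    "copy_2": "ACGTACGTAA",
    "copy_3": "ACGAACCTAA",
}
matrix = distance_matrix(genes)
print(matrix[("copy_1", "copy_2")])
```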
Nucleic acid sequencing and advancements in systematics
- The use of ribosomal RNA (rRNA) sequencing led to the identification of
the Archaea lineage. Current data support the notion that organisms have
been evolving in three independent lineages (Bacteria, Archaea, and Eukarya)
for over 1.5 billion years.
- Multiple strains of HIV were discovered from sequencing the viral genome
(which consists of a single-stranded RNA molecule). Two known types of HIV
exist, HIV-1 and HIV-2. The two types are further broken down into groups
and subgroups. The different strains display varying degrees of virulence
and possibly differing modes of transmission, as well.
Below is a small portion of the nucleic acid alignment for two different HIV
strains, obtained directly from the BLAST program at the NCBI website. Alignments
such as this help characterize the different strains of HIV and their evolutionary
relationship to one another.
- Human HIV-1 complete genome: 328902 (gi number); M62320 (accession number)
- Human HIV-2 complete genome: 9628880 (gi number); NC_001722 (accession number)
Limitations of nucleic acid sequencing
Not all of an organism's genome is composed of genes, and distinguishing
genes from extra DNA, or junk DNA, is far from trivial. Adding to
the complexity, introns must be distinguished from exons within a gene.
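To see why gene identification is not trivial, consider the most naive approach: scanning for an open reading frame that begins with a start codon and ends at an in-frame stop codon. The sketch below does only that, on the forward strand of a made-up sequence. It is an assumption-laden toy: it ignores the reverse strand, and because introns interrupt coding sequence, it would miss or truncate genes that contain them.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of forward-strand ORFs.

    An ORF here is ATG followed by in-frame codons up to and including
    the first stop codon; `min_codons` counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(dna) - 2, 3):
            codon = dna[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos
            elif start is not None and codon in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                start = None
    return orfs


# Made-up sequence containing one short reading frame.
sequence = "CCATGAAACCCGGGTAACC"
for begin, end in find_orfs(sequence):
    print(begin, end, sequence[begin:end])  # 2 17 ATGAAACCCGGGTAA
```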
The impact of genomic sequencing on phylogeny has been minor to date. The main reason
for this is that the genomes of relatively few species (around 100) have been
sequenced, and these organisms are not all closely related.
Student activity 2 relates to nucleic acid sequencing
CONSTRUCTING A PHYLOGENETIC TREE
Phylogenetic trees are constructed in a variety of ways to summarize the evolutionary
relatedness of different organisms. The data represented in a phylogenetic tree
may come from (but are not limited to) observations of an organism's anatomical
features and/or molecular sequence information.
Constructing a simple phylogenetic tree
Phylogenetic trees may be constructed using a variety of methods and datasets.
In the hypothetical example described below, four study groups (A, B, C, and D)
are divided among the evolutionary branches of a phylogenetic tree (the study
groups are analogous to different species), serving to illustrate the historical
pattern of change among groups and to postulate relationships.
- First, the study groups are compared to an outgroup. An outgroup
is related to the study groups, but not as closely related as the study groups
are to each other. Using an outgroup as a reference point provides an objective
way to distinguish more primitive groups from more recent groups.
- Second, a node is drawn (illustrated by an intersection point in the tree), symbolizing a common ancestor to all groups shown above the node.
- Third, additional branch points are added to symbolize the origin of groups possessing novel homologies. The distance of the branch from the outgroup represents the relative time of origin.
- Between nodes are the points in time when the ancestor changes from the primitive to the derived, or more recent, condition.
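The steps above can be sketched in code for the simplest case, in which the derived characters are perfectly nested. The character table below is hypothetical: each study group is listed with the derived characters it shows, and the outgroup is assumed to show none of them. The function names are illustrative, and the output is written in the parenthesized Newick notation that tree-drawing programs commonly read.

```python
def build_tree(derived, outgroup="outgroup"):
    """Nest study groups by shared derived characters (nested case only)."""

    def cluster(groups, used):
        if len(groups) == 1:
            return groups[0]
        # Unused characters, with the groups that show each of them.
        holders = {}
        for group in groups:
            for character in derived[group] - used:
                holders.setdefault(character, []).append(group)
        # Only characters shared by some, but not all, groups split them.
        informative = {c: m for c, m in holders.items()
                       if 1 < len(m) < len(groups)}
        if not informative:
            return tuple(groups)  # cannot be resolved further
        character, members = max(informative.items(),
                                 key=lambda item: len(item[1]))
        rest = [g for g in groups if g not in members]
        return tuple(rest) + (cluster(members, used | {character}),)

    return (outgroup, cluster(sorted(derived), set()))


def to_newick(tree):
    """Write a nested-tuple tree in Newick notation (without the final ';')."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Hypothetical derived characters for study groups A to D.
characters = {
    "A": {"c1"},
    "B": {"c1", "c2"},
    "C": {"c1", "c2", "c3"},
    "D": {"c1", "c2", "c3"},
}
tree = build_tree(characters)
print(to_newick(tree) + ";")  # (outgroup,(A,(B,(C,D))));
```

Each nested pair of parentheses corresponds to a node, and each character that defines a split corresponds to a change from the primitive to the derived condition along the branch leading to that node. Conflicting characters, which real data usually contain, are not handled by this sketch.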
Student activity 3 relates to phylogenetic tree creation
Methods of classifying organisms
Phenetics and cladistics are two modern analytical approaches used in determining taxonomic relationships.
Phenetics bases the classification of an organism entirely on measurable similarities and differences; no assumptions of homology are made. Phenetics compares as many anatomical characteristics as possible to determine relatedness. Skeptics of this approach claim that phenotypic similarity alone is not sufficient to judge phylogenetic relationships.
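A phenetic comparison reduces to counting agreements across a table of traits. The sketch below uses the simple matching coefficient, the fraction of traits on which two organisms agree, as one possible similarity measure. The organisms and their trait scores (1 = present, 0 = absent) are hypothetical.

```python
from itertools import combinations


def simple_matching(traits_a, traits_b):
    """Fraction of scored traits on which two organisms agree."""
    if len(traits_a) != len(traits_b):
        raise ValueError("both organisms must be scored for the same traits")
    matches = sum(1 for x, y in zip(traits_a, traits_b) if x == y)
    return matches / len(traits_a)


# Hypothetical presence/absence scores for five anatomical traits.
traits = {
    "X": [1, 1, 0, 1, 0],
    "Y": [1, 1, 0, 0, 0],
    "Z": [0, 0, 1, 0, 1],
}

similarity = {
    (a, b): simple_matching(traits[a], traits[b])
    for a, b in combinations(traits, 2)
}
closest_pair = max(similarity, key=similarity.get)
print(closest_pair)  # ('X', 'Y')
```

Every trait counts equally here, whether or not it reflects shared ancestry, which is the basis of the skeptics' objection.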
Cladistic analysis orders organisms along a phylogenetic tree in branches, and describes the extent of divergence between the branches. If the analysis is based on a molecular sequence alignment, two major computational approaches may be employed in order to construct a tree:
- Distance-based: The overall distance between all sequence pairs is calculated to construct a tree.
  - Neighbor Joining (NJ) is one of the most widely used distance-based methods for building phylogenetic trees.
  - One limitation of this method is that under some conditions, it systematically constructs a wrong tree (this is called a bias).
- Character-based: Individual substitutions along sequence pairs are used to derive ancestral relationships.
  - Maximum Parsimony (MP) is a character-based method that is one of the most widely used and accepted in systematics.
  - MP seeks the phylogenetic tree that minimizes the total number of changes (or branches) to illustrate evolutionary relationships.
  - MP is based on the theory that the simplest explanation for observed phenomena is the most probable.
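The parsimony criterion can be made concrete by counting the minimum number of changes a given tree requires. The sketch below uses Fitch's counting procedure on binary trees written as nested tuples. It only scores candidate trees supplied by hand; searching over all possible trees, which programs for MP analysis must also do, is not shown. The four taxa and their aligned sequences are made up.

```python
def fitch(tree, states):
    """Return (possible ancestral states, changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_changes = fitch(tree[0], states)
    right_set, right_changes = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_changes + right_changes
    # No shared state: at least one change is needed at this node.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_length(tree, alignment):
    """Total minimum number of changes over all aligned sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, states)[1]
    return total


# Made-up aligned sequences for four hypothetical taxa.
alignment = {"A": "AAGT", "B": "AAGC", "C": "CTGC", "D": "CTAC"}

tree_one = (("A", "B"), ("C", "D"))
tree_two = (("A", "C"), ("B", "D"))
print(parsimony_length(tree_one, alignment))  # 4
print(parsimony_length(tree_two, alignment))  # 6
```

Under the parsimony criterion the first tree is preferred because it explains the same data with fewer changes.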
PHYLIP and MEGA are two freely available programs that use NJ and MP methods to build
phylogenetic trees. Visit the links
for this resource to obtain more information about them and to download them.
Data of two main types are used to determine evolutionary relatedness: anatomical
and molecular. Systematists construct phylogenetic trees to represent inferred
relationships between organisms. The strongest support for any phylogenetic hypothesis
is agreement between molecular data and anatomical evidence (from living or fossilized
organisms). However, representing the history and diversity of life in one classification
system is a goal far from realization.