Essential Biochemistry

Phylogenetic Trees



To understand the history and diversity of life is a preeminent goal of modern science. From a biological sciences perspective, the field that claims this pursuit is called systematics, which is defined as the study of biological diversity in an evolutionary context. Within the study of systematics, scientists trace the phylogeny, or evolutionary history, of a species or group of related species.

The ideal of a systematist is to account for the evolutionary history of all species, dating back to the very origin of life. The modern systematist employs techniques that classify organisms based on anatomical and molecular characteristics. Anatomical characterization involves two main approaches: studying the morphology of living organisms and analyzing the fossil record. Molecular characterization uses various sequencing techniques to identify similarities in the genetic information of organisms, as expressed in nucleic acids or proteins.

Data generated from these techniques are used to develop hypotheses that ultimately classify an organism or species based on its characteristics. This requires a system of record keeping that allows the systematist to organize vast and diverse sets of comparative information. Phylogenetic trees, or diagrams that trace evolutionary relationships, serve this purpose by recording the hypothesized classifications of organisms. If a group is hypothesized to consist of a common ancestor and all of its descendants, the group is referred to as monophyletic. If the members of a group derive from more than one ancestral lineage, so that the group does not include their most recent common ancestor, the group is referred to as polyphyletic.


A traditional approach to classifying organisms relies upon the simple hypothesis that the greater the anatomical similarity between organisms, the more closely related they are likely to be in evolutionary history. This approach draws on two main sources of data: the morphology of living animals, plants, and microorganisms; and fossils, the preserved remains or impressions left by organisms.


The morphology of an organism is simply a description of its physical characteristics. If the organism under study is extinct, or is too small to resolve with modern microscopy techniques, its morphology cannot be observed directly.

The fossil record

The fossil record contains fossilized remains and imprints whose age is estimated from the age of the surrounding rock. The oldest known fossils are believed to be approximately 3.5 billion years old and represent bacterium-like life. Radiometric dating is one example of a technique used to determine the ages of rocks and fossils on an absolute time scale.


Recent advances in biochemistry have provided powerful molecular tools for constructing phylogenetic trees. Protein, mRNA, rRNA, and genomic sequencing are molecular techniques commonly used to identify and characterize organisms.

Molecular sequence data can be used to calibrate “molecular clocks” based on the rate of mutations in protein and nucleic acid sequences. The rationale of molecular clocks is that the observed number of changes in an amino acid or nucleotide sequence may be approximately proportional to the evolutionary time elapsed.
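The proportionality described above can be illustrated with a minimal Python sketch. It assumes a strict clock, meaning a single constant substitution rate; the function name, the rate, and the example numbers are hypothetical and do not come from any real dataset.

```python
def divergence_time(fraction_changed: float, rate_per_site_per_year: float) -> float:
    """Estimate the time since two lineages split, assuming a strict molecular clock.

    Changes accumulate independently along both lineages after the split,
    so the observed difference between two sequences reflects twice the
    elapsed time.
    """
    if rate_per_site_per_year <= 0:
        raise ValueError("rate must be positive")
    return fraction_changed / (2.0 * rate_per_site_per_year)


# Hypothetical numbers: 10% of sites differ between two sequences, and the
# assumed rate is 1e-9 changes per site per year.
estimate = divergence_time(0.10, 1e-9)
print(f"Estimated divergence: {estimate:.2e} years ago")
```

The sketch ignores multiple changes at the same site and variation in rate between lineages, both of which make real estimates less certain.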

The National Center for Biotechnology Information (NCBI) houses a public database of nucleotide and protein sequences and develops freely available sequence analysis software.

Protein data

The amino acid sequence of cytochrome c has been analyzed in over 100 eukaryotic species, and the molecular data support the notion that cytochrome c is an evolutionarily conservative protein. Shown below are two protein sequence comparisons for cytochrome c: between human and either rat or yeast.

Substitutions within the primary structure of cytochrome c accumulate at a relatively constant rate over time, which makes cytochrome c a potentially useful molecular clock. For example, note that the cytochrome c sequence similarity between humans and rats (91%) is much higher than that between humans and yeast (64%). This indicates that the evolutionary path leading to the human species diverged from that of yeast long before it diverged from that of rat. Comparisons of cytochrome c sequences from different species have been used to order the divergence of species in relative time.
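A similarity percentage of this kind can be computed from two aligned sequences, as in the Python sketch below. The function name and the short sequences are made up for illustration; they are not actual cytochrome c sequences, and skipping gap positions is only one of several conventions.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Return the percentage of aligned positions that hold identical residues.

    Both sequences must already be aligned (equal length, '-' marking gaps).
    Positions where either sequence has a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return 100.0 * matches / compared


# Toy aligned fragments (hypothetical, one-letter amino acid codes).
print(percent_identity("ACDEFGHIKL", "ACDEYGHIRL"))
```

Here 8 of the 10 compared positions match, so the toy fragments are 80% identical.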

Determining the absolute timing of events using molecular clocks attracts skepticism for several reasons. Different proteins evolve at different rates, which makes it difficult to determine the divergence rate of any particular protein. Also, changes in generation time or metabolic rate may affect the mutation rate, making molecular clocks less predictable.

Method of protein sequencing and alignment

A common approach to protein sequencing involves repeatedly cleaving a polypeptide (using a protein-digesting enzyme), separating the fragments using chromatography, and sequencing the resulting small fragments. The entire protein sequence is then reconstructed by matching regions of sequence overlap seen in the small fragments.
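The reconstruction step can be sketched in Python as a greedy merge of overlapping fragments. This is a simplified illustration, not a description of any particular laboratory protocol or software: the function names and the toy peptide are hypothetical, and a greedy merge can fail or mislead when fragments are repetitive.

```python
def overlap(left: str, right: str) -> int:
    """Length of the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def assemble(fragments: list[str]) -> str:
    """Greedily merge the pair of fragments with the largest overlap until one remains."""
    pieces = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    # Drop fragments that are wholly contained in another fragment.
    pieces = [p for p in pieces if not any(p != q and p in q for q in pieces)]
    while len(pieces) > 1:
        best_size, best_i, best_j = 0, -1, -1
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    size = overlap(a, b)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            raise ValueError("fragments do not overlap; cannot reconstruct")
        merged = pieces[best_i] + pieces[best_j][best_size:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces[0]


# Two hypothetical digests of the same toy peptide, cut at different positions.
digest_1 = ["MKTAYI", "AKQRQ", "ISFVK"]
digest_2 = ["MKT", "AYIAKQ", "RQISFVK"]
print(assemble(digest_1 + digest_2))
```

Because the two digests cut the toy peptide at different positions, fragments from one digest bridge the cut sites of the other, which is what makes the reconstruction possible.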

Algorithms may be used to compare protein sequence data against a database. These algorithms report local similarity (within a defined region) or global similarity (across the full length of the sequences, accounting for gaps or other aberrations that may mask multiple regions of similarity). This technology has greatly facilitated the grouping of related proteins into families (an example is the serine protease family of enzymes, which includes trypsin, chymotrypsin, and elastase).
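The difference between local and global comparison can be sketched with two classic dynamic-programming schemes, Needleman-Wunsch (global) and Smith-Waterman (local). The source does not name these algorithms, so they are used here only as well-known examples; the scoring values (+1 match, -1 mismatch, -1 gap) and the toy sequences are arbitrary assumptions.

```python
def alignment_score(a: str, b: str, local: bool = False,
                    match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Best alignment score of `a` and `b` under a simple scoring scheme.

    local=False: global alignment (Needleman-Wunsch), end to end with gaps.
    local=True:  local alignment (Smith-Waterman), best-scoring region only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps are penalized.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# Toy sequences that share one region but differ at their ends.
x, y = "AAAGATTACA", "GATTACATTT"
print("global:", alignment_score(x, y))
print("local: ", alignment_score(x, y, local=True))
```

The local score rewards only the shared region, while the global score is lowered by the unmatched ends. Database search tools typically use faster heuristics and substitution matrices rather than this exhaustive scheme.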

The significance of protein sequencing

Direct protein sequencing is an indispensable tool for several reasons. For example, the position of disulfide bonds can be resolved using protein sequencing techniques. Also, protein modifications, such as the excision of residues or the covalent attachment of other groups, can be detected. These changes in protein sequence and structure would evade detection by nucleic acid sequencing efforts.

One drawback to protein sequencing is that only those genes that code for proteins are under scrutiny. This is often a small fraction of an organism’s total amount of molecular information.

Student activity 1 relates to protein sequencing

Nucleic acid data

The sequencing of nuclear, mitochondrial, and genomic DNA, along with mRNA, tRNA, and rRNA, provides the systematist with a potential wealth of phylogenetic data. Because of the large amount of data generated from nucleic acid sequencing, especially on the genomic level, the complexity of the algorithms used in sequence analysis is far greater than what is needed to analyze protein sequence data.

Method of DNA sequencing and alignment

A long-established method for DNA sequencing is the chain-termination (Sanger) method, which makes identifying 10,000 bases per day easily achievable.

Complex algorithms are used to quantify sequence differences between copies of a given gene in the same or different species. Genes are said to be homologous if they descend from a common ancestral gene. Genes that are similar in sequence may or may not be homologous.

Nucleic acid sequencing and advancements in systematics

  • The use of ribosomal RNA (rRNA) sequencing led to the identification of the Archaea lineage. Current data support the notion that organisms have been evolving in three independent lineages (Bacteria, Archaea, and Eukarya) for over 1.5 billion years.
  • Multiple strains of HIV were discovered by sequencing the viral genome (which consists of a single-stranded RNA molecule). Two types of HIV are known, HIV-1 and HIV-2, and each is further divided into groups and subgroups. The different strains display varying degrees of virulence and possibly differing modes of transmission.

Below is a small portion of the nucleic acid alignment for two different HIV strains, obtained directly from the BLAST program at the NCBI website. Alignments such as this help characterize the different strains of HIV and their evolutionary relationships to one another.

  1. Human HIV-1 complete genome: 328902 (gi number); M62320 (accession number)
  2. Human HIV-2 complete genome: 9628880 (gi number); NC_001722 (accession number)

Limitations of nucleic acid sequencing

Not all of an organism’s genome is composed of genes, and distinguishing genes from extra DNA, or “junk” DNA, is far from trivial. Adding to the complexity, introns must be distinguished from exons within a gene.

The impact of genomic sequencing on phylogeny is minor to date. The main reason for this is that the genomes of relatively few species (around 100) have been sequenced, and these organisms are not all closely related.

Student activity 2 relates to nucleic acid sequencing


Phylogenetic trees are constructed in a variety of ways to summarize the evolutionary relatedness of different organisms. The data represented in a phylogenetic tree may come from (but are not limited to) observations of an organism’s anatomical features and/or molecular sequence information.

Constructing a simple phylogenetic tree

Phylogenetic trees may be constructed using a variety of methods and datasets. In the hypothetical example described below, four study groups (A, B, C, and D), which are analogous to different species, are divided among the evolutionary branches of a phylogenetic tree. The tree illustrates the historical pattern of change among the groups and postulates relationships between them.

  • First, the study groups are compared to an “outgroup.” An outgroup is related to the study groups, but not as closely related as the study groups are to each other. Using an outgroup as a reference point provides an objective way to distinguish more primitive groups from more recent groups.
  • Second, a node is drawn (illustrated by an intersection point in the tree), symbolizing a common ancestor to all groups shown above the node.
  • Third, additional branch points are added to symbolize the origin of groups possessing novel homologies. The distance of the branch from the outgroup represents the relative time of origin.
  • The branch segments between nodes represent the points in time at which a characteristic changes from the primitive (ancestral) condition to the derived, or more recent, condition.
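The steps above can be sketched in Python for the simplest possible case. The sketch assumes that every study group is described by a set of derived characteristics (those absent from the outgroup) and that these sets are perfectly nested, so the tree is a simple ladder. The function name, the character names, and the Newick-style text output are illustrative choices, not part of the original example.

```python
def ladder_tree(derived: dict[str, set[str]], outgroup: str) -> str:
    """Build a ladder-shaped tree, written in Newick-style notation.

    `derived` maps each group to the derived characteristics it possesses.
    Groups with fewer derived characteristics branch off closer to the outgroup.
    Groups with identical sets are resolved in their input order.
    """
    ingroups = sorted((name for name in derived if name != outgroup),
                      key=lambda name: len(derived[name]))
    # The sketch only handles nested character sets; stop otherwise.
    for earlier, later in zip(ingroups, ingroups[1:]):
        if not derived[earlier] <= derived[later]:
            raise ValueError("derived characteristics are not nested")
    # Start from the most derived group and add branch points outward.
    tree = ingroups[-1]
    for name in reversed(ingroups[:-1]):
        tree = f"({name},{tree})"
    return f"({outgroup},{tree});"


# Hypothetical characteristics x, y, z; the outgroup has none of them.
groups = {
    "Outgroup": set(),
    "A": set(),
    "B": {"x"},
    "C": {"x", "y"},
    "D": {"x", "y", "z"},
}
print(ladder_tree(groups, "Outgroup"))
```

Each pair of parentheses in the output corresponds to a node, and groups that sit deeper inside the nesting share more derived characteristics. Real datasets are rarely this tidy, which is why the computational methods described later are needed.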

Student activity 3 relates to phylogenetic tree creation

Methods of classifying organisms

Phenetics and cladistics are two modern analytical approaches used in determining taxonomic relationships.


Phenetics bases the classification of an organism entirely on measurable similarities and differences; no assumptions of homology are made. Phenetics compares as many anatomical characteristics as possible to determine relatedness. Skeptics of this approach claim that phenotypic similarity alone is not sufficient to judge phylogenetic relationships.


Cladistic analysis orders organisms along a phylogenetic tree in branches, and describes the extent of divergence between the branches. If the analysis is based on a molecular sequence alignment, two major computational approaches may be employed in order to construct a tree:

  • Distance-based: The overall distance between all sequence pairs is calculated to construct a tree.
    • Neighbor Joining (NJ) is one of the most widely used distance-based methods for building phylogenetic trees.
    • One limitation of this method is that under some conditions, it systematically constructs a “wrong” tree (this is called a bias).
  • Character-based: Individual substitutions along sequence pairs are used to derive ancestral relationships.
    • Maximum Parsimony (MP) is a character-based method that is one of the most widely used and accepted in systematics.
    • MP seeks the phylogenetic tree that requires the smallest total number of evolutionary changes to explain the observed data.
    • MP is based on the principle that the simplest explanation for observed phenomena is the most probable.
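The parsimony criterion can be illustrated with a short Python sketch that counts the minimum number of changes a given tree requires, using Fitch's algorithm for rooted binary trees. The source does not specify a counting method, so this choice is an assumption; the function names, the four toy sequences, and the two candidate trees are hypothetical. A full MP analysis would also have to search over many possible trees, which the sketch does not attempt.

```python
def fitch_changes(tree, states: dict[str, str]) -> int:
    """Minimum number of state changes for one aligned site on a rooted binary tree.

    `tree` is a nested 2-tuple with leaf names as strings;
    `states` maps each leaf name to its character at this site.
    """
    changes = 0

    def visit(node) -> set[str]:
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (visit(child) for child in node)
        shared = left & right
        if shared:
            return shared
        changes += 1  # the two subtrees disagree, so at least one change is needed
        return left | right

    visit(tree)
    return changes


def parsimony_score(tree, alignment: dict[str, str]) -> int:
    """Sum the minimum number of changes over every site of an alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_changes(tree, {name: seq[site] for name, seq in alignment.items()})
        for site in range(length)
    )


# Toy aligned sequences for four groups (hypothetical data).
alignment = {"A": "AAGT", "B": "AAGC", "C": "TCGC", "D": "TCGC"}
tree_1 = (("A", "B"), ("C", "D"))
tree_2 = (("A", "C"), ("B", "D"))
print("tree 1:", parsimony_score(tree_1, alignment))
print("tree 2:", parsimony_score(tree_2, alignment))
```

For this toy alignment, the tree that pairs A with B and C with D needs 3 changes, while the alternative needs 5, so MP would prefer the first tree.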

PHYLIP and MEGA are two freely available programs that use NJ and MP methods to build phylogenetic trees. Visit the links for this resource to learn more about the programs and to download them.


Data of two main types are used to determine evolutionary relatedness: anatomical and molecular. Systematists construct phylogenetic trees to represent inferred relationships between organisms. The strongest support for any phylogenetic hypothesis is agreement between molecular data and anatomical evidence (from living or fossilized organisms). However, representing the history and diversity of life in one classification system is a goal far from realization.


© 2004 John Wiley & Sons, Inc. All Rights Reserved.