Understanding Phylogenetic Analysis

ENTM201L - General Entomology Laboratory | UC Riverside


From DNA Sequences to Evolutionary Trees

ENTM201L - Lab Theory


Why Build Phylogenetic Trees?

After Sanger sequencing returns your mosquito COI sequences, you have genetic barcodes that identify species. But DNA sequences contain far more information than species names: they record evolutionary history, the branching pattern of descent from common ancestors.

Phylogenetic trees (also called phylogenies or evolutionary trees) visualize these relationships, allowing you to:

In ENTM201L, you will build a phylogenetic tree from your class COI sequences plus reference sequences from GenBank, using industry-standard software (MAFFT for alignment, IQ-TREE for tree building). This connects your lab work to real-world applications in vector surveillance, invasion biology, and evolutionary research.


What is a Phylogenetic Tree?

Tree Anatomy and Terminology

A phylogenetic tree is a branching diagram showing evolutionary relationships among organisms or genes.

Basic components:
                  ┌─── Aedes albopictus (USA)
            ┌─────┤
            │     └─── Aedes albopictus (Japan)
      ┌─────┤
      │     │     ┌─── Aedes aegypti (Brazil)
      │     └─────┤
  ────┤           └─── Aedes aegypti (Florida)
      │
      │     ┌───────── Culex pipiens
      └─────┤
            └───────── Culex tarsalis
Key terms:
Term | Definition | Example
Tips (leaves) | Terminal nodes representing species/samples | Aedes albopictus (USA)
Branches | Lines connecting nodes; represent evolutionary change | Horizontal/diagonal lines
Nodes (internal) | Branching points representing common ancestors | Where Ae. albopictus USA & Japan connect
Root | Base of tree; most recent common ancestor of all tips | Leftmost point where all branches connect
Clade | Group of organisms sharing a common ancestor | All Aedes species = one clade
Sister taxa | Two closest relatives sharing an immediate common ancestor | Ae. albopictus USA & Japan
Outgroup | Distantly related species used to root the tree | Culex (if analyzing Aedes)

Rooted vs. Unrooted Trees

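Rooted tree: Has a root representing the most recent common ancestor of all tips, so the direction of evolution can be read from root to tips.

Unrooted tree: Shows relationships among tips but not the direction of evolution.

For ENTM201L: We will build rooted trees, using Culex or Anopheles as the outgroup to root Aedes trees.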

Branch Lengths: Evolutionary Distance

Branch length represents the amount of evolutionary change (genetic divergence), measured in substitutions per site:

- 0.01 = 1% divergence (1 substitution per 100 bp on average)

- 0.10 = 10% divergence

Typical ranges:

- Within species: 0.001-0.02 (0.1-2% divergence)

- Between species (same genus): 0.05-0.15 (5-15%)

- Between genera: 0.20-0.30 (20-30%)

Example interpretation:

Multiple Sequence Alignment: The Foundation

Why Alignment is Essential

Problem: You have 20 COI sequences of varying lengths (650-700 bp each). To compare them, you need to know which positions are homologous (descended from same ancestral base). Without alignment:
Sequence 1: ATCGATCGAA...
Sequence 2: ATCGTCGAA... (deleted one A)
Sequence 3: ATCGAATCGAA... (inserted one A)

Without gaps, you cannot tell that position 6 in Seq 1 (T) corresponds to position 5 in Seq 2 and position 7 in Seq 3.

With alignment (inserts gaps to maintain positional homology):
Sequence 1: ATCGA-TCGAA
Sequence 2: ATCG--TCGAA (extra gap shows the deleted A)
Sequence 3: ATCGAATCGAA (insertion)
            ||||**|||||
            Positions now comparable (| = identical column, * = column containing a gap)
Alignment principle: Insert gaps (indels) to maximize similarity across sequences, ensuring each column represents one ancestral site.
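
A quick programmatic check of positional homology is to confirm that every aligned sequence has the same length and then flag the columns that vary. The sketch below does this for a gapped version of the three example sequences. The function name `column_marks` and the marker symbols are illustrative choices, not part of MAFFT.

```python
def column_marks(aligned):
    """Return one marker per alignment column: '|' if all rows agree, '*' otherwise."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned.values()):
        marks.append("|" if len(set(column)) == 1 else "*")
    return "".join(marks)

alignment = {
    "Sequence 1": "ATCGA-TCGAA",
    "Sequence 2": "ATCG--TCGAA",
    "Sequence 3": "ATCGAATCGAA",
}
print(column_marks(alignment))  # ||||**|||||
```

If the sequences had different lengths, the function would raise an error, which is the simplest sign that a file is not actually aligned.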

How MAFFT Works

MAFFT (Multiple Alignment using Fast Fourier Transform) is one of the most widely used alignment programs for nucleotide and protein sequences. Algorithm overview:

1. Pairwise distance calculation: Compare all sequences to each other

2. Guide tree construction: Build rough tree showing approximate relationships

3. Progressive alignment: Align most similar sequences first, gradually add more distant sequences

4. Refinement: Iteratively improve alignment by realigning subgroups

Why MAFFT for COI:

Command-line usage (if using terminal):
mafft --auto input.fasta > aligned.fasta
Alignment strategies:

Alignment Quality: What to Check

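High-quality alignment:

Poor-quality alignment:

Inspecting alignment (in Geneious or MEGA):

1. Translate to protein: COI encodes cytochrome c oxidase subunit I. Translate with the invertebrate mitochondrial genetic code (NCBI translation table 5), not the standard code.

- Should have no internal stop codons (a full-length gene may end in a terminal TAA)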

- Conserved amino acids (e.g., active site residues) should align

2. Check for NUMTs: Nuclear mitochondrial pseudogenes often have stop codons or frameshifts

- If present, exclude from phylogeny

3. Trim poorly aligned regions: Remove first/last 10-20 bp if ambiguous
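
The stop-codon check in step 1 can be sketched in a few lines. This example assumes the invertebrate mitochondrial code (NCBI translation table 5), in which only TAA and TAG are stops and TGA encodes tryptophan. The function name and the test sequences are hypothetical.

```python
# Stop codons under the invertebrate mitochondrial code (NCBI translation table 5).
# TGA is NOT a stop here; it encodes tryptophan.
STOP_CODONS = {"TAA", "TAG"}

def stop_codon_positions(seq, frame=0):
    """Return 0-based start positions of in-frame stop codons, ignoring alignment gaps."""
    seq = seq.upper().replace("-", "")
    stops = []
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            stops.append(i)
    return stops

clean = "ATGGCATGAACT"      # hypothetical fragment: ATG GCA TGA ACT
suspect = "ATGTAAGCA"       # hypothetical fragment: ATG TAA GCA

print(stop_codon_positions(clean))    # []
print(stop_codon_positions(suspect))  # [3]
```

A sequence that shows stop codons in every reading frame is a candidate NUMT and worth inspecting before it goes into the tree.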


Phylogenetic Inference Methods

Distance-Based Methods (UPGMA, Neighbor-Joining)

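Principle: Calculate the genetic distance between all pairs of sequences, then cluster the most similar pairs iteratively.

UPGMA (Unweighted Pair Group Method with Arithmetic Mean): Merges the two closest clusters at each step and assumes a constant rate of evolution (a molecular clock).

Neighbor-Joining: Also builds the tree from a distance matrix but does not assume a constant rate.

Limitations: Both methods reduce the alignment to pairwise distances, so site-by-site information is lost.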
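
To make the clustering idea concrete, here is a minimal UPGMA sketch using uncorrected p-distances. It returns only the topology as nested tuples and omits branch lengths. The four short sequences and their names are made up for illustration.

```python
from itertools import combinations

def p_distance(a, b):
    """Proportion of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)

def upgma(seqs):
    """Repeatedly merge the two closest clusters; return the tree as nested tuples."""
    trees = {name: name for name in seqs}
    sizes = {name: 1 for name in seqs}
    dist = {frozenset(p): p_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    merged = 0
    while len(trees) > 1:
        a, b = sorted(min(dist, key=dist.get))
        new = f"cluster{merged}"
        merged += 1
        for other in trees:
            if other in (a, b):
                continue
            # Size-weighted average of the distances to the two merged clusters.
            dist[frozenset((new, other))] = (
                sizes[a] * dist[frozenset((a, other))]
                + sizes[b] * dist[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
        trees[new] = (trees.pop(a), trees.pop(b))
        sizes[new] = sizes.pop(a) + sizes.pop(b)
    return next(iter(trees.values()))

# Hypothetical 10-bp sequences, for illustration only.
seqs = {
    "alb_USA": "ATCGATCGAA",
    "alb_JPN": "ATCGATCGAT",
    "aeg_BRA": "ATGGTTCGAA",
    "culex":   "TTGGTACCAA",
}
tree = upgma(seqs)
print(tree)
```

The two albopictus sequences merge first because they differ at only one site, then the aegypti sequence joins them, and the Culex sequence joins last.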

Maximum Likelihood (ML): The Gold Standard

Principle: Find the tree (topology + branch lengths) that maximizes the probability of observing your sequence data, given a substitution model.

How it works:

1. Propose a tree topology

2. Calculate likelihood: P(data | tree, model)

3. Optimize branch lengths to maximize likelihood

4. Try different topologies, keep best one

5. Repeat until convergence
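
Steps 2 and 3 can be illustrated on the smallest possible case: two aligned sequences joined by a single branch, under the Jukes-Cantor (JC69) model. This is only a sketch of the likelihood calculation, not how IQ-TREE is implemented, and it skips the topology search entirely. The site counts (650 sites, 65 differences) are invented.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """log P(data | branch length d) for two aligned sequences under JC69."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * e   # probability that a site shows the same base
    p_diff = 0.25 - 0.25 * e   # probability of one specific different base
    return (n_sites * math.log(0.25)
            + (n_sites - n_diff) * math.log(p_same)
            + n_diff * math.log(p_diff))

def ml_branch_length(n_sites, n_diff, lo=1e-6, hi=2.0, steps=100):
    """Ternary search for the branch length that maximizes the log-likelihood."""
    for _ in range(steps):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if jc69_log_likelihood(m1, n_sites, n_diff) < jc69_log_likelihood(m2, n_sites, n_diff):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

n_sites, n_diff = 650, 65   # hypothetical counts: 10% of sites differ
estimate = ml_branch_length(n_sites, n_diff)

# For two sequences, the JC69 optimum has a closed form that serves as a check.
analytic = -0.75 * math.log(1 - 4 * (n_diff / n_sites) / 3)
print(round(estimate, 4), round(analytic, 4))
```

The numerical optimum agrees with the closed-form JC69 distance, and it is slightly larger than the raw 10% difference because the model accounts for multiple substitutions at the same site.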

Why ML is superior:

Software:

Bayesian Inference: Posterior Probabilities

Principle: Calculate the posterior probability of each tree given the data, using Bayesian statistics.

Advantages over ML:

Disadvantages:

Software: MrBayes, BEAST (for divergence dating)

For ENTM201L: We use ML (IQ-TREE) because it is fast, accurate, and easier to run than Bayesian methods.

Substitution Models: How DNA Evolves

Why Models Matter

Not all nucleotide changes are equally likely. Transitions (A↔G, C↔T) occur more frequently than transversions (A↔C, A↔T, G↔C, G↔T) due to:

Substitution models describe how nucleotides change over time, allowing accurate estimation of evolutionary distances and tree topologies.

Common Models

Jukes-Cantor (JC69) - Simplest model:

Kimura 2-parameter (K2P) - Transition/transversion distinction:

General Time-Reversible (GTR) - Most complex:

Rate heterogeneity (+I+G):
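
The two simplest models have closed-form distance corrections, which makes the effect of the transition/transversion distinction easy to see. The sketch below implements the standard JC69 and K2P distance formulas. The helper names and the toy sequences are illustrative.

```python
import math

TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def count_changes(a, b):
    """Return (compared sites, transitions, transversions) for two aligned sequences."""
    n = ts = tv = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguity codes
        n += 1
        if x != y:
            if frozenset((x, y)) in TRANSITIONS:
                ts += 1
            else:
                tv += 1
    return n, ts, tv

def jc69_distance(a, b):
    """JC69: all substitutions equally likely."""
    n, ts, tv = count_changes(a, b)
    p = (ts + tv) / n
    return -0.75 * math.log(1 - 4 * p / 3)

def k2p_distance(a, b):
    """K2P: transitions (P) and transversions (Q) treated separately."""
    n, ts, tv = count_changes(a, b)
    P, Q = ts / n, tv / n
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))

seq_a = "AAAAAAAAAA"
seq_b = "GAAAAAAAAA"   # one A->G transition in 10 sites
print(jc69_distance(seq_a, seq_b), k2p_distance(seq_a, seq_b))
```

With a single transition and no transversions, K2P gives a slightly larger distance than JC69, because it attributes the change to the faster substitution class.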

Model Selection

IQ-TREE automatic model selection:

Typical result for COI:
Best-fit model: TN+F+I+G4
 TN = Tamura-Nei (different transition rates for A↔G vs. C↔T)
 +F = Unequal base frequencies
 +I = Invariant sites
 +G4 = Gamma distribution of rates (4 categories)

Base frequencies:
 A=29.8%, C=15.4%, G=16.1%, T=38.7% (AT-rich, typical for insect mtDNA)
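
Model selection tools rank candidate models with an information criterion that rewards fit but penalizes extra parameters. ModelFinder reports BIC by default, to the best of our knowledge. The sketch below shows the BIC arithmetic only. The log-likelihoods are invented for illustration, and the parameter counts cover the substitution model only (branch lengths are the same for every candidate, so they do not change the ranking).

```python
import math

def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion: lower is better."""
    return -2.0 * log_likelihood + n_params * math.log(n_sites)

n_sites = 650

# HYPOTHETICAL log-likelihoods and model-parameter counts, for illustration only.
candidates = {
    "JC":    (-3500.0, 0),
    "K2P":   (-3420.0, 1),
    "GTR+G": (-3400.0, 9),
}

scores = {name: bic(lnl, k, n_sites) for name, (lnl, k) in candidates.items()}
best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} BIC = {score:.2f}")
print("Best-fit model:", best)
```

In this made-up example GTR+G has the highest likelihood, but its gain over K2P is too small to justify eight extra parameters, so K2P has the lowest BIC.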

Bootstrap Values: Measuring Confidence

What Bootstrap Measures

Problem: How confident are we in each node (clade) on the tree?

Bootstrap resampling (introduced by Felsenstein, 1985) tests tree robustness:

1. Take your alignment (e.g., 650 bp)

2. Resample with replacement: Randomly select 650 positions (some sampled multiple times, some not at all)

3. Build tree from resampled alignment

4. Repeat 1,000 times

5. Count how often each clade appears in the 1,000 bootstrap trees

6. Bootstrap value = % of trees containing that clade
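
Steps 2, 5, and 6 above can be sketched as follows. Tree building (step 3) is left out, so the clade sets for the replicate trees are hypothetical placeholders. The function names are illustrative.

```python
import random

def resample_columns(alignment, rng):
    """Draw alignment columns with replacement (one bootstrap pseudo-replicate)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}

def support(clade, replicate_clades):
    """Percentage of replicate trees that contain the given clade."""
    clade = frozenset(clade)
    hits = sum(clade in clades for clades in replicate_clades)
    return 100.0 * hits / len(replicate_clades)

alignment = {"A": "ATCGATCGAA", "B": "ATCGATCGAT", "C": "ATGGTTCGAA"}
replicate = resample_columns(alignment, random.Random(1))
print(replicate)

# HYPOTHETICAL clade sets from four replicate trees (tree building is omitted here).
replicate_clades = [
    {frozenset("AB")},
    {frozenset("AB")},
    {frozenset("AC")},
    {frozenset("AB")},
]
print(support("AB", replicate_clades))  # 75.0
```

Every row is resampled with the same column picks, so each pseudo-replicate column is still a real column of the original alignment.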

Interpretation:

Example:
                  ┌─── Ae. albopictus (USA)
            ┌─────┤ 99%
            │     └─── Ae. albopictus (Japan)
      ┌─────┤
      │     │     ┌─── Ae. aegypti (Brazil)
  ────┤     └─────┤ 95%
      │ 85%       └─── Ae. aegypti (Florida)
      │
      └───────────── Culex pipiens (outgroup)
Interpretation:

UFBoot: Ultrafast Bootstrap

Standard bootstrap: Slow (rebuilds tree 1,000 times)

Ultrafast bootstrap (UFBoot) in IQ-TREE:

Result: 100-sequence tree with 1,000 bootstrap replicates in 10-20 minutes instead of days.

Reading and Interpreting Trees

Topology: Who is Related to Whom?

Monophyletic group (clade): All descendants of a common ancestor

Paraphyletic group: Some but not all descendants of a common ancestor

Polyphyletic group: Members do not share a recent common ancestor

Tree reading principle: Only monophyletic groups are evolutionarily meaningful. Paraphyletic and polyphyletic groups are artificial.

Rotating Branches: Trees Have No Intrinsic Order

Critical concept: You can rotate branches around any node without changing relationships.

These three trees are identical:

Tree 1:             Tree 2:             Tree 3:
    ┌─── C              ┌─── C                  ┌─── A
────┤               ────┤                   ┌───┤
    │   ┌─── A          │   ┌─── B          │   └─── B
    └───┤               └───┤           ────┤
        └─── B              └─── A          └─── C

All three show: (A,B) are sisters, and C is outgroup. Order of tips does not matter.

Implication: Do not assume species listed next to each other are more closely related than distant ones. Only branching pattern matters.
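
One way to convince yourself that rotation does not change a tree is to compare trees after sorting the children of every node. The sketch below represents rooted trees as nested tuples. The function name `canonical` is an illustrative choice.

```python
def canonical(tree):
    """Sort the children of every node so that rotated drawings compare equal."""
    if isinstance(tree, str):
        return tree
    return tuple(sorted((canonical(child) for child in tree), key=repr))

drawing_1 = ("C", ("A", "B"))
drawing_2 = ("C", ("B", "A"))   # A and B rotated around their shared node
drawing_3 = (("A", "B"), "C")   # rotated around the root
other = ("A", ("B", "C"))       # a genuinely different branching pattern

print(canonical(drawing_1) == canonical(drawing_2) == canonical(drawing_3))  # True
print(canonical(drawing_1) == canonical(other))                              # False
```

The first three drawings all say that A and B are sisters, so they reduce to the same canonical form. The last one pairs B with C and therefore differs.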

Rooting: Determining Ancestral Direction

Unrooted tree cannot tell direction of evolution:
      A
      |
  ────┼────
  |       |
  B       C

Did A diverge first, or B, or C? Cannot tell.

Rooting with outgroup: Add distantly related species known to be outside your group of interest.
Outgroup (Culex) ────┬────┬─── A (Ae. albopictus USA)
                     │    └─── B (Ae. albopictus Japan)
                     │
                     └──────── C (Ae. aegypti)

Now clear: Outgroup diverged first, then A&B lineage split from C, then A and B split.

Choosing outgroup:

Building Trees with IQ-TREE

Why IQ-TREE?

IQ-TREE (Nguyen et al., 2015) is a fast and accurate ML program for phylogenetics; in its authors' benchmarks it often found trees with higher likelihoods than RAxML and PhyML.

Features:

Compared to alternatives:

Basic IQ-TREE Workflow

Command-line usage (recommended for reproducibility):
# Align sequences with MAFFT
mafft --auto mosquito_sequences.fasta > aligned.fasta

# Build ML tree with IQ-TREE
iqtree -s aligned.fasta -m MFP -bb 1000 -nt AUTO

# Parameters explained:
# -s aligned.fasta: Input alignment file
# -m MFP: Auto model selection (ModelFinder Plus)
# -bb 1000: Ultrafast bootstrap with 1000 replicates
# -nt AUTO: Let IQ-TREE choose the number of CPU threads automatically
# (In IQ-TREE 2 the same options are written -B 1000 and -T AUTO.)
Output files:

Typical runtime:

Using Container Tools (Docker/Apptainer)

For ENTM201L: We provide containerized IQ-TREE for ease of use.

Why containers?

Example with Docker:
docker run -v $(pwd):/data iqtree/iqtree -s /data/aligned.fasta -m MFP -bb 1000
Example with Apptainer (common on HPC clusters):
apptainer exec iqtree.sif iqtree -s aligned.fasta -m MFP -bb 1000

Visualizing Trees

Newick Format: Text Representation

Newick is a text format for trees:
((A:0.1,B:0.15)85:0.05,C:0.25);
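Interpretation: A and B are sister taxa with branch lengths 0.1 and 0.15. Their common ancestor carries the support value 85 and sits on a branch of length 0.05. C joins at the root with branch length 0.25.

Newick is human-readable but not visually intuitive, so visualization software is needed.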
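
To see how the parentheses map onto a tree, here is a minimal recursive parser for the simple Newick subset used in the example (names, branch lengths, internal labels; no quoted names or comments). It is a teaching sketch rather than a replacement for a real tree library, and the function names are illustrative.

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts with name, length, children."""
    text = text.strip().rstrip(";")
    pos = 0

    def node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1                      # consume "("
            children.append(node())
            while text[pos] == ",":
                pos += 1                  # consume ","
                children.append(node())
            pos += 1                      # consume ")"
        start = pos
        while pos < len(text) and text[pos] not in ",():":
            pos += 1
        name = text[start:pos]            # tip name or internal label (e.g. support)
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    return node()

def tip_depths(tree, depth=0.0):
    """Return the root-to-tip distance for every tip."""
    depth += tree["length"] or 0.0
    if not tree["children"]:
        return {tree["name"]: depth}
    depths = {}
    for child in tree["children"]:
        depths.update(tip_depths(child, depth))
    return depths

tree = parse_newick("((A:0.1,B:0.15)85:0.05,C:0.25);")
depths = tip_depths(tree)
print({tip: round(d, 3) for tip, d in depths.items()})
```

Summing branch lengths from the root gives 0.15 for A (0.05 + 0.1), 0.2 for B, and 0.25 for C.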

Visualization Tools

FigTree (desktop app):

iTOL (Interactive Tree of Life, web-based):

ggtree (R package):

For ENTM201L: Use FigTree for quick visualization, iTOL for publication-quality figures.

Annotating Trees with Metadata

Example: Color Aedes albopictus samples by geographic origin. iTOL annotation file (tab-delimited):
TREE_COLORS
SEPARATOR TAB
DATA
Ae_albopictus_USA	range	#FF0000	USA
Ae_albopictus_Japan	range	#0000FF	Japan
Result: USA samples are red and Japan samples are blue, which immediately shows geographic clustering.
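
With more than a handful of samples, it is easier to generate the annotation file from a metadata table than to type it by hand. The sketch below builds a TREE_COLORS file as a string. The sample IDs must match the tip labels in your tree, and the function name, sample dictionary, and palette are illustrative assumptions.

```python
def itol_tree_colors(samples, palette):
    """Build a tab-delimited iTOL TREE_COLORS annotation from {tip: origin}."""
    lines = ["TREE_COLORS", "SEPARATOR TAB", "DATA"]
    for tip, origin in samples.items():
        # Columns: tip label, annotation type, color, legend label
        lines.append("\t".join([tip, "range", palette[origin], origin]))
    return "\n".join(lines) + "\n"

samples = {
    "Ae_albopictus_USA": "USA",
    "Ae_albopictus_Japan": "Japan",
}
palette = {"USA": "#FF0000", "Japan": "#0000FF"}

annotation = itol_tree_colors(samples, palette)
print(annotation)
```

Writing the returned string to a text file gives something you can drag onto the tree in iTOL.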

Real-World Application: Tracking Aedes albopictus Invasion

Background: A Global Invader

Aedes albopictus (Asian tiger mosquito):

Phylogenetic Approach

Research question: Where did USA populations originate?

Methodology:

1. Sample collection: Collect Ae. albopictus from 50 sites across USA (California, Texas, Florida, New York)

2. DNA extraction + sequencing: COI barcoding (712 bp)

3. Reference sequences: Download 100 Ae. albopictus COI from GenBank (Asia, Europe, South America)

4. Alignment: MAFFT with 150 sequences

5. Phylogenetic tree: IQ-TREE with GTR+I+G model, 1,000 bootstrap replicates

6. Visualization: iTOL, color by geographic origin

Results (hypothetical but based on real studies):
                  ┌─── USA West Coast (CA, WA)
            ┌─────┤ 99%
            │     └─── Japan (Tokyo, Osaka)
            │
      ┌─────┤     ┌─── USA East Coast (FL, NY)
      │     └─────┤ 95%
      │ 78%       └─── Southern China (Guangzhou)
  ────┤
      │           ┌─── Europe (Italy, France)
      │     ┌─────┤ 97%
      └─────┤     └─── Northern China (Beijing)
            │ 88%
            └───────── South America (Brazil, Argentina)
Interpretation:

1. Multiple introductions: USA populations are not monophyletic (West Coast clusters with Japan, East Coast with Southern China)

2. Likely source populations: West Coast samples cluster with Japan (99% bootstrap) and East Coast samples cluster with Southern China (95% bootstrap), consistent with separate introductions from those regions

3. Invasion routes: Likely via used tire shipments from specific Asian ports

4. No evidence of USA-Europe connection: European populations cluster with Northern China, independent from USA

Public health implications:

Case Study 2: Cryptic Species in Culex pipiens Complex

Culex pipiens complex includes multiple morphologically identical forms with different biology:

COI phylogeny (simplified):
            ┌─── Cx. p. pipiens (Europe, outdoor)
      ┌─────┤ 65%
      │     └─── Cx. p. molestus (Europe, basements)
  ────┤
      │     ┌─── Cx. p. quinquefasciatus (USA, tropics)
      └─────┤ 85%
        78% └─── Cx. p. pipiens (USA, outdoor)
Problem: Bootstrap support is weak at some nodes (65% and 78%), so the forms are not clearly separated by COI.

Solution: Use additional markers:

Lesson: COI alone is insufficient for some species complexes. Phylogenetics guides further investigation.

Literature and Software Resources

Key Papers on Phylogenetic Methods

1. Phylogenetic Inference:

- Felsenstein, J. (1981). Evolutionary trees from DNA sequences: A maximum likelihood approach. Journal of Molecular Evolution 17(6): 368-376. https://doi.org/10.1007/BF01734359

- Yang, Z., & Rannala, B. (2012). Molecular phylogenetics: Principles and practice. Nature Reviews Genetics 13(5): 303-314. https://doi.org/10.1038/nrg3186

2. IQ-TREE Software:

- Nguyen, L. T., et al. (2015). IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution 32(1): 268-274. https://doi.org/10.1093/molbev/msu300

- Kalyaanamoorthy, S., et al. (2017). ModelFinder: Fast model selection for accurate phylogenetic estimates. Nature Methods 14: 587-589. https://doi.org/10.1038/nmeth.4285

3. Bootstrap Methods:

- Felsenstein, J. (1985). Confidence limits on phylogenies: An approach using the bootstrap. Evolution 39(4): 783-791. https://doi.org/10.2307/2408678

- Hoang, D. T., et al. (2018). UFBoot2: Improving the ultrafast bootstrap approximation. Molecular Biology and Evolution 35(2): 518-522. https://doi.org/10.1093/molbev/msx281

Multiple Sequence Alignment

4. MAFFT:

- Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: Improvements in performance and usability. Molecular Biology and Evolution 30(4): 772-780. https://doi.org/10.1093/molbev/mst010

5. Alignment Quality:

- Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science 27(1): 135-145. https://doi.org/10.1002/pro.3290

Mosquito Phylogenetics Applications

6. DNA Barcoding and Phylogeography:

- Kamgang, B., et al. (2011). Geographic and ecological distribution of the dengue and chikungunya virus vectors Aedes aegypti and Aedes albopictus in three major Cameroon towns. Medical and Veterinary Entomology 25(2): 132-141.

- Paupy, C., et al. (2009). Comparative role of Aedes albopictus and Aedes aegypti in the emergence of Dengue and Chikungunya in Central Africa. Vector-Borne and Zoonotic Diseases 9(6): 585-595.

7. Aedes albopictus Invasion Genetics:

- Goubert, C., et al. (2016). Population genetics of Aedes albopictus invading populations. PLoS ONE 11(1): e0147673. https://doi.org/10.1371/journal.pone.0147673

- Kotsakiozi, P., et al. (2017). Population genomics of the Asian tiger mosquito, Aedes albopictus: Insights into the recent worldwide invasion. Ecology and Evolution 7(23): 10143-10157.

8. Cryptic Species and COI Barcoding Limitations:

- Cywinska, A., et al. (2006). Identifying Canadian mosquito species through DNA barcodes. Medical and Veterinary Entomology 20(4): 413-424.

- Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-based registry for all animal species: The Barcode Index Number (BIN) system. PLoS ONE 8(7): e66213.

Visualization Tools

9. FigTree: http://tree.bio.ed.ac.uk/software/figtree/

10. iTOL: Letunic, I., & Bork, P. (2021). Interactive Tree Of Life (iTOL) v5: An online tool for phylogenetic tree display and annotation. Nucleic Acids Research 49(W1): W293-W296. https://doi.org/10.1093/nar/gkab301

11. ggtree: Yu, G., et al. (2017). ggtree: An R package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 8(1): 28-36. https://doi.org/10.1111/2041-210X.12628


Key Takeaways

Phylogenetic Trees Reveal Evolutionary History

DNA sequences contain information about ancestry. Trees visualize this, showing:

Multiple Sequence Alignment is Critical

Alignment ensures positional homology. Poor alignment = incorrect tree. Always:

Maximum Likelihood is the Gold Standard

ML finds the tree that maximizes P(data | tree, model). IQ-TREE makes this:

Bootstrap Values Quantify Confidence

Bootstrap ≥95% = strong support. Lower values mean uncertainty. Never interpret trees without bootstrap values - topology alone is insufficient.

Outgroup Rooting Shows Direction

Unrooted trees show relationships but not the direction of evolution. An outgroup (a distantly related species) roots the tree, revealing the order in which lineages diverged.

Real-World Impact

Phylogenetics informs:


Connection to Lab Activities

In lab, you will:

1. Collect sequences:

- Your trimmed COI sequences from Sanger sequencing

- Classmates' sequences (different Aedes samples)

- Reference sequences from GenBank (outgroups, known species)

2. Multiple sequence alignment:

- Use MAFFT (via web server or command line)

- Export aligned FASTA file

3. Build phylogenetic tree:

- IQ-TREE with automatic model selection (-m MFP)

- Ultrafast bootstrap (-bb 1000)

- Examine .iqtree file (best model, log-likelihood)

4. Visualize tree:

- Load .treefile into FigTree

- Display bootstrap values

- Root with outgroup (Culex or Anopheles)

- Color tips by species or geographic origin

5. Interpret results:

- Are your samples monophyletic (cluster together)?

- What is % divergence within species? Between species?

- Do bootstrap values support major clades?

- Can you infer invasion routes or population structure?

6. Report findings:

- Include tree figure in lab report

- Discuss bootstrap support for key nodes

- Compare your results to published phylogenies

Remember: Phylogenetics is the capstone of your molecular workflow. It transforms individual DNA sequences into evolutionary insights, connecting your lab work to global scientific questions about mosquito evolution, invasion biology, and disease transmission.
Document prepared for ENTM201L - General Entomology Laboratory UC Riverside, Department of Entomology Fall 2025

References

Nguyen et al. (2015) - IQ-TREE Foundation

Citation: Nguyen, L.-T., Schmidt, H. A., von Haeseler, A., & Minh, B. Q. (2015). IQ-TREE: A fast and effective stochastic algorithm for estimating maximum-likelihood phylogenies. Molecular Biology and Evolution, 32(1), 268-274.

DOI: 10.1093/molbev/msu300 | PMID: 25371430

Introduced IQ-TREE, a fast stochastic maximum likelihood phylogenetic inference tool. Features novel tree search algorithm using stochastic perturbations, often finding trees with higher likelihoods compared to RAxML and PhyML. Significantly faster than competing ML methods while maintaining or improving accuracy. Built-in ModelFinder for automatic substitution model selection.

Trifinopoulos et al. (2016) - W-IQ-TREE Web Server

Citation: Trifinopoulos, J., Nguyen, L.-T., von Haeseler, A., & Minh, B. Q. (2016). W-IQ-TREE: A fast online phylogenetic tool for maximum likelihood analysis. Nucleic Acids Research, 44(W1), W232-W235.

DOI: 10.1093/nar/gkw256 | PMID: 27084950

Web-based interface making IQ-TREE accessible without command-line experience. Automated workflow handles sequence alignment, model selection, tree inference, and bootstrapping. Ideal for teaching environments and researchers without bioinformatics background.

Minh et al. (2020) - IQ-TREE 2

Citation: Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A., & Lanfear, R. (2020). IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Molecular Biology and Evolution, 37(5), 1530-1534.

DOI: 10.1093/molbev/msaa015 | PMID: 32011700

Major software update with enhanced features for phylogenomics. New models include mixture models, heterotachy, and rate heterogeneity across sites. Introduced concordance factors for assessing phylogenetic signal. Better scalability for genome-scale datasets.

Why IQ-TREE for Teaching?

IQ-TREE is well suited to DNA barcoding phylogenetics because: (1) it is fast enough for classroom use (minutes, not hours), (2) automatic model selection via ModelFinder removes the need to choose a substitution model by hand, (3) ultrafast bootstrap runs 1,000 replicates in reasonable time, (4) its command-line syntax is relatively simple, (5) it is well documented with extensive tutorials, and (6) it is actively maintained with regular updates. Reported performance benchmarks show IQ-TREE finding higher-likelihood trees than RAxML and PhyML in up to 87.1% of the alignments studied while being 2-10x faster than PhyML.