ENTM201L - General Entomology Laboratory | UC Riverside
After successful PCR amplification and gel verification of your mosquito COI gene in lab, you have a 712 bp DNA fragment. But what does that fragment actually say? What is the sequence of A's, T's, G's, and C's that makes up this genetic barcode?
DNA sequencing reveals the precise nucleotide order.
In lab, you will submit your purified COI PCR products for Sanger sequencing, the gold standard method for reading DNA sequences with 99.9% accuracy. Understanding how this technology works transforms you from a passive consumer of sequence data into an informed scientist who can critically evaluate data quality and troubleshoot problems.
In the 1970s, molecular biologists could manipulate DNA (cutting with restriction enzymes, ligating into plasmids, amplifying genes) but could not read its sequence. It was like being able to photocopy a book in a foreign language without being able to read the words.
1. DNA polymerase binds to primer-template complex
2. Adds dNTPs (deoxynucleotide triphosphates: dATP, dTTP, dGTP, dCTP)
3. Forms phosphodiester bond: 3'-OH of growing strand + 5'-phosphate of incoming dNTP
4. Releases pyrophosphate (PPi), reaction continues
5. Polymerase extends chain indefinitely
Key requirement: The 3' carbon of the sugar (deoxyribose) must have a hydroxyl group (-OH) for the next nucleotide to attach.
Growing strand: 5'-ATCGATCG-[3'-OH]
↓
Incoming dNTP: [5'-P]-C-[3'-OH]
↓
Extended strand: 5'-ATCGATCGC-[3'-OH] (ready for next nucleotide)
When a ddNTP is incorporated:
1. Polymerase adds ddNTP to growing strand (forms phosphodiester bond normally)
2. Next incoming nucleotide cannot attach (no 3'-OH to react with)
3. Chain terminates permanently
4. Fragment is "frozen" at that specific length
The mixture: The sequencing reaction contains template, primer, DNA polymerase, dNTPs, and a small amount of fluorescently labeled ddNTPs.
After synthesis, you have a population of fragments of every possible length, each terminated by a ddNTP.
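- ddATP → green fluorescence (520 nm)
- ddTTP → red fluorescence (580 nm)
- ddGTP → blue fluorescence (608 nm)
- ddCTP → yellow fluorescence (640 nm)
Note: The color names above are nominal channel labels. The actual dyes and emission maxima vary with the sequencing chemistry, and analysis software typically draws traces in its own display colors.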
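The chain-termination logic above can be sketched as a small simulation. This is a toy model, not how a sequencing facility processes data: the function names, the 5% ddNTP probability, and the molecule count are illustrative assumptions. It extends many copies of a strand, stops each copy at a random ddNTP incorporation, and then reads the sequence by ordering the fragments from shortest to longest.

```python
import random

def sanger_ladder(new_strand, ddntp_fraction=0.05, n_molecules=5000, seed=0):
    """Simulate chain termination; return {fragment_length: terminating_base}."""
    rng = random.Random(seed)
    ladder = {}
    for _ in range(n_molecules):
        for length, base in enumerate(new_strand, start=1):
            # With small probability the polymerase adds a ddNTP instead of a dNTP.
            if rng.random() < ddntp_fraction:
                ladder[length] = base  # chain is frozen at this length
                break
        # Molecules that never pick up a ddNTP run off the template and are ignored.
    return ladder

def read_ladder(ladder):
    """Shortest fragment first: its dye color gives base 1, the next gives base 2, etc."""
    return "".join(ladder[length] for length in sorted(ladder))

ladder = sanger_ladder("ATCGATCGC")
print(read_ladder(ladder))
```

Because ddNTPs are rare relative to dNTPs, most molecules extend past any given position, which is why fragments of every length appear in the mixture.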
1. Sample loading:
- DNA mixed with formamide (denatures to single strands)
- Heat to 95°C (ensures complete denaturation)
- Cool on ice (prevents reannealing)
2. Electrokinetic injection:
- Capillary tip placed in sample well
- Voltage pulse (2 kV, 15 seconds) pulls DNA into capillary
- Amount loaded depends on concentration (why 20-50 ng/µL is critical)
3. Electrophoresis:
- High voltage (10-12 kV) applied across capillary
- DNA migrates toward positive electrode
- Polymer matrix separates by size (shorter fragments faster)
4. Laser detection:
- Laser (488 nm or 532 nm) excites fluorescent dyes near end of capillary
- Each fragment emits colored light based on terminating ddNTP
- Detector measures fluorescence intensity and wavelength
5. Base calling:
- Software converts fluorescence peaks into sequence
- Time of detection → position in sequence
- Color of fluorescence → which base (A, T, G, or C)
Output: A chromatogram showing colored peaks representing each base.

| Feature | Gel Slab | Capillary |
|---|---|---|
| Length | 30-50 cm | 36-80 cm (more separation) |
| Voltage | 1,500 V | 10,000 V (faster, sharper peaks) |
| Heat dissipation | Poor (gel heats unevenly) | Excellent (capillary wall cools efficiently) |
| Resolution | ~500 bp readable | 800-1,000 bp readable |
| Throughput | 1-4 samples/gel | 96 capillaries in parallel |
| Automation | Manual loading | Fully automated |
A chromatogram (also called an electropherogram or trace file) is a graph of fluorescence intensity against detection time, with each base drawn in its own color:
- Green (A)
- Blue (C)
- Black (G)
- Red (T)
Example:
High-quality region (bases 50-500):
╱╲ ╱╲ ╱╲ ╱╲
╱ ╲ ╱ ╲╱ ╲ ╱ ╲
G A T C A G T
↑ ↑ ↑ ↑ ↑ ↑ ↑
Sharp peaks, clear separation, minimal noise
Low-quality region (bases 1-30, 700-750):
╱╲╱╲╱╲ ╱╲
╱ ╳ ╲ ╱╲╱╲ ╱ ╲╱╲
? N ? N N T ? ?
↑ ↑ ↑
Overlapping peaks, "N" calls, high background noise
Phred score = -10 × log₁₀(Probability of error)
| Phred Score | Error Rate | Accuracy | Meaning |
|---|---|---|---|
| 10 | 1 in 10 | 90% | Low quality (unreliable) |
| 20 | 1 in 100 | 99% | Acceptable for some applications |
| 30 | 1 in 1,000 | 99.9% | High quality |
| 40 | 1 in 10,000 | 99.99% | Very high quality |
| 50 | 1 in 100,000 | 99.999% | Exceptional (rare) |
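The Phred formula and the table above can be reproduced with two one-line conversions. This is a minimal sketch; the function names are illustrative and not part of any base-calling package.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

for q in (10, 20, 30, 40, 50):
    p = phred_to_error(q)
    print(f"Q{q}: 1 error in {round(1 / p):,} bases, accuracy {100 * (1 - p):.3f}%")
```

Each 10-point increase in Phred score corresponds to a tenfold drop in the error rate.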
Clean base call:
╱╲
╱ ╲
╱ ╲
G
↑
Single sharp peak, no background noise
Mixed base (heterozygote or contamination):
╱╲ ╱╲
╱ ╲╱ ╲
╱ ╲
G/A
↑
Two peaks at same position (sequence contains both G and A)
Interpretation: Either specimen is heterozygous at this site, or you sequenced a mixture of two individuals.
Ambiguous ("N" call):
╱╲ ╱╲╱╲
╱ ╳ ╲
╱ ╲ ╲
N
↑
Overlapping peaks, basecaller cannot decide
Interpretation: Low quality, trim this region from analysis.
Dye blob:
███████████
All colors
███████████
↑
Fluorescence from all dyes simultaneously
Interpretation: Unincorporated dye-labeled ddNTPs, usually at beginning of read. Ignore this region.
5'-ATCGATCGATCG...-3' (forward strand)
3'-TAGCTAGCTAGC...-5' (reverse strand, complementary)
You can sequence from either direction:
1. Confirm accuracy: Errors are random; if both reads agree, confidence is very high
2. Resolve ambiguities: If forward read has "N" at position 200, reverse read may have clear base
3. Full coverage: Forward read may be poor quality at 3' end; reverse read covers that region well
4. Detect heterozygotes: Mixed peaks in both reads confirm true heterozygosity
1. Sequence PCR product with forward primer → 700 bp readable sequence
2. Sequence same PCR product with reverse primer → 700 bp readable sequence (complementary)
3. Reverse-complement the reverse read (software does this automatically)
4. Align forward and reverse reads
5. Create consensus sequence: Where reads agree, call that base; where they differ, flag as ambiguity
Example:
Position: 100 110 120
Forward read: ATCGATCGANTCGATCGA (N = ambiguous)
Reverse read: ATCGATCGATTCGATCGA (T = clear)
Consensus: ATCGATCGATTCGATCGA (T selected)
↑
Reverse read resolves ambiguity in forward read
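The reverse-complement and consensus steps can be sketched in a few lines. This is a simplified illustration that assumes the two reads are already aligned and of equal length; real assembly software (such as Geneious) also performs the alignment and weighs quality scores. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")

def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(COMPLEMENT)[::-1]

def consensus(forward, reverse_rc):
    """Merge two pre-aligned, equal-length reads base by base."""
    if len(forward) != len(reverse_rc):
        raise ValueError("reads must be aligned to the same length")
    merged = []
    for f, r in zip(forward, reverse_rc):
        if f == r:
            merged.append(f)    # reads agree
        elif f == "N":
            merged.append(r)    # reverse read resolves the ambiguity
        elif r == "N":
            merged.append(f)    # forward read resolves the ambiguity
        else:
            merged.append("N")  # true disagreement: flag it
    return "".join(merged)

forward = "ATCGATCGANTCGATCGA"
# The reverse read comes off the sequencer as the complementary strand.
raw_reverse = reverse_complement("ATCGATCGATTCGATCGA")
print(consensus(forward, reverse_complement(raw_reverse)))  # ATCGATCGATTCGATCGA
```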
Quality improvement:
Quality: ██░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░██
Bases: 1 50 400 700 800
↑ ↑
Trim Trim
Final: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
Bases: 50 400 700
└─────────────────┬─────────────────────┘
650 bp high-quality sequence
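End trimming like that shown in the diagram can be sketched as follows. This is one simple rule chosen for illustration, not the algorithm used by any particular program: keep everything between the first and last runs of `window` consecutive bases that all reach `min_q`. The function name and default values are assumptions.

```python
def trim_low_quality_ends(seq, quals, min_q=20, window=10):
    """Trim both ends back to the first/last run of `window` good bases."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must be the same length")

    def good_run(i):
        return all(q >= min_q for q in quals[i:i + window])

    starts = [i for i in range(len(quals) - window + 1) if good_run(i)]
    if not starts:
        return ""  # no usable region at all
    return seq[starts[0]:starts[-1] + window]

# Mock read: 50 noisy bases, 650 clean bases, 100 noisy bases.
quals = [8] * 50 + [40] * 650 + [8] * 100
seq = "A" * len(quals)
print(len(trim_low_quality_ends(seq, quals)))  # 650
```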
1. Break your query sequence into small "words" (11 bp default)
2. Search database for exact matches to these words
3. Extend matches in both directions to find longer alignments
4. Score alignments based on similarity
5. Report best matches with statistics
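The seed-and-extend idea in the steps above can be illustrated with a toy search. This sketch is deliberately simplified: it extends only through exact matches, whereas real BLAST scores mismatches and gaps and reports statistics. The function name and return format are assumptions.

```python
def seed_and_extend(query, subject, word_size=11):
    """Return (query_start, subject_start, length) of the longest exact hit, or None."""
    # Index every word of the subject (steps 1-2: words and exact matches).
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)

    best = None
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            # Step 3: extend the seed to the left, then to the right.
            q_start, s_start = i, j
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            q_end, s_end = i + word_size, j + word_size
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            # Step 4: here the "score" is simply the match length.
            hit = (q_start, s_start, q_end - q_start)
            if best is None or hit[2] > best[2]:
                best = hit
    return best

core = "ATCGATCGATTCGATCGA"
print(seed_and_extend("TT" + core + "AA", "GGGG" + core + "CCCC"))  # (2, 4, 18)
```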
1. Go to NCBI BLAST: https://blast.ncbi.nlm.nih.gov/Blast.cgi
2. Select nucleotide BLAST (BLASTn for DNA)
3. Paste your trimmed sequence into query box
4. Choose database:
- Nucleotide collection (nr/nt): All GenBank sequences (~500 million)
- Barcode of Life (BOLD): Curated COI sequences (~9 million, better for species ID); searched through the BOLD identification tool rather than the NCBI BLAST page
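5. Under "Optimize for", select "Highly similar sequences (megablast)" for close matches; "Somewhat similar sequences (blastn)" is the slower option for more divergent sequences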
6. Run BLAST (takes 15-60 seconds)
7. Interpret results
BLAST output shows:
Top hit:
Description: Aedes albopictus voucher UCR12345 cytochrome oxidase subunit I (COI) gene
Scientific name: Aedes albopictus
Max score: 1310
Total score: 1310
Query coverage: 98%
E-value: 0.0
% identity: 99.2%
Accession: MH123456
Key metrics:
| Metric | Meaning | Threshold for Species ID |
|---|---|---|
| % Identity | Percent of bases that match | >97% = same species |
| Query coverage | How much of your sequence matched | >80% (should be ~100% for COI) |
| E-value | Expected number of matches this good arising by chance | <1e-100 (lower is better) |
| Max score | Alignment quality score | Higher is better (no fixed threshold) |
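The thresholds in the table can be applied to a hit programmatically. The sketch below is only a rule-of-thumb filter: the `BlastHit` class, its field names, and the returned messages are hypothetical, and the cutoffs are the ones listed in the table rather than universal standards.

```python
from dataclasses import dataclass

@dataclass
class BlastHit:
    """One row of BLAST output (field names are illustrative)."""
    species: str
    percent_identity: float
    query_coverage: float
    e_value: float

def interpret_hit(hit, min_identity=97.0, min_coverage=80.0, max_e_value=1e-100):
    """Apply the table's thresholds to a single hit."""
    if hit.e_value >= max_e_value:
        return "weak match: E-value too high"
    if hit.query_coverage <= min_coverage:
        return "partial match: low query coverage"
    if hit.percent_identity > min_identity:
        return f"species-level match: {hit.species}"
    return f"no species-level match: closest is {hit.species}"

top_hit = BlastHit("Aedes albopictus", 99.2, 98.0, 0.0)
print(interpret_hit(top_hit))  # species-level match: Aedes albopictus
```

A hit that falls below 97% identity is worth a closer look rather than automatic rejection, since it may point to a cryptic species or a misidentified reference.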
1. Sequence length: >200 bp (after trimming)
- COI barcode: Prefer >500 bp
- Full gene: Prefer full-length
2. Quality: Phred >20 for >98% of bases
- Best practice: Phred >30 for entire sequence
3. Ambiguous bases: <1% "N" calls
- Example: 600 bp sequence should have <6 Ns
4. Vector/adapter removal: No primer sequences included
- Trim primers before submission
- Check for adapter contamination (if using NGS)
5. Metadata (required):
- Species name (or "uncultured organism" if unknown)
- Collection location (GPS coordinates preferred)
- Collection date
- Voucher specimen ID
- Gene name (e.g., "cytochrome c oxidase subunit I")
6. No contamination:
- Must be single-organism sequence (not mixture)
- BLAST should match expected organism (not bacteria, human, etc.)
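The numeric criteria in the checklist (length, Phred scores, ambiguous bases) can be checked before submission with a short script. This is a sketch under the thresholds stated above; the function name and messages are hypothetical, and it does not cover primer removal, metadata, or contamination checks.

```python
def submission_problems(seq, quals, min_length=200, min_q=20,
                        min_fraction_good=0.98, max_fraction_n=0.01):
    """Return a list of checklist failures; an empty list means all checks pass."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality list must be the same length")
    if not seq:
        return ["sequence is empty"]

    problems = []
    if len(seq) <= min_length:
        problems.append(f"length {len(seq)} bp is not above {min_length} bp")

    fraction_good = sum(q > min_q for q in quals) / len(quals)
    if fraction_good <= min_fraction_good:
        problems.append(f"only {fraction_good:.1%} of bases exceed Phred {min_q}")

    fraction_n = seq.upper().count("N") / len(seq)
    if fraction_n >= max_fraction_n:
        problems.append(f"{fraction_n:.1%} of bases are N calls")
    return problems

clean = "ACGT" * 150                       # 600 bp mock sequence
print(submission_problems(clean, [40] * 600))  # []
```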
1. Collect specimens from multiple populations/species
2. Extract DNA, PCR amplify COI
3. Sanger sequence all samples (forward + reverse)
4. Trim and assemble consensus sequences
5. Download reference sequences from GenBank (outgroups, related species)
6. Multiple sequence alignment (align all COI sequences, see phylogenetics module)
7. Build phylogenetic tree (maximum likelihood or Bayesian methods)
8. Interpret evolutionary relationships
1. No DNA: Template concentration too low (<5 ng/µL)
- Solution: Re-quantify with Qubit, concentrate sample if needed
2. Wrong primer: Sequencing primer doesn't match PCR primers
- Solution: Verify primer name/sequence, re-submit with correct primer
3. Primer degradation: Old primer stock (freeze-thaw damage)
- Solution: Order fresh primer
4. Template too dilute after cleanup:
- Solution: Concentrate by SpeedVac or ethanol precipitation
5. Salt contamination: Inhibits sequencing reaction
- Solution: Column cleanup before submission
1. Excess template: >100 ng/µL overwhelms sequencer
- Solution: Dilute to 20-50 ng/µL
2. Unincorporated dye-terminators: Cleanup incomplete
- Solution: Use column cleanup (removes free dyes)
3. Multiple templates: Sequenced mixture of PCR products
- Solution: Gel-extract correct band
1. Heterozygous specimen: Diploid locus, two alleles
- Note: COI is mitochondrial (haploid), shouldn't be heterozygous
- If in COI: Likely nuclear mitochondrial pseudogene (NUMT) co-amplified
2. Two individuals sequenced: Mixed DNA extraction
- Solution: Re-extract from single individual
3. Contamination: PCR product contaminated with other sample
- Solution: Re-amplify with fresh reagents, clean workspace
4. Multiple PCR products: Non-specific amplification
- Solution: Gel-extract 712 bp band specifically
1. Normal for long reads: Capillary resolution decreases over distance
- Solution: This is expected; trim after base 650-700
2. DNA degradation: Fragmented template produces short reads
- Solution: Use fresher PCR product, avoid freeze-thaw
3. Secondary structure: GC-rich regions cause polymerase stalling
- Solution: Add DMSO or betaine to sequencing reaction (facility does this)
1. Specimen mis-identified morphologically: Taxonomic error
- Solution: Re-examine morphology with keys, accept molecular ID
2. Contamination: Amplified DNA from different organism
- Solution: Check negative controls, re-extract
3. Primer non-specificity: Amplified wrong gene
- Solution: Verify PCR product size on gel (should be 712 bp)
4. NUMT: Nuclear mitochondrial pseudogene instead of mtDNA
- Solution: Look for stop codons (translate sequence to protein; NUMTs often have premature stops)
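The stop-codon check for NUMTs can be sketched without a full translation. The example below assumes the invertebrate mitochondrial genetic code (NCBI translation table 5), in which TAA and TAG are stops while TGA encodes tryptophan, and it assumes the read is in the coding orientation. The function names are hypothetical.

```python
MITO_STOP_CODONS = {"TAA", "TAG"}  # invertebrate mitochondrial code (table 5)

def stop_codons_by_frame(seq):
    """Map each forward reading frame (0, 1, 2) to positions of in-frame stops."""
    seq = seq.upper()
    return {
        frame: [pos for pos in range(frame, len(seq) - 2, 3)
                if seq[pos:pos + 3] in MITO_STOP_CODONS]
        for frame in range(3)
    }

def has_open_frame(seq):
    """A genuine COI fragment should have at least one frame with no stops."""
    return any(not stops for stops in stop_codons_by_frame(seq).values())

print(stop_codons_by_frame("ATGGCATGATTAGGC"))  # {0: [], 1: [10], 2: []}
```

If every frame contains stops, a NUMT becomes a plausible explanation, though sequencing errors in low-quality regions can also introduce apparent stop codons.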
1. Trap mosquitoes regularly (CO₂ traps, gravid traps)
2. Morphological ID (many Culex species look similar)
3. Extract DNA from 1 leg (preserve body for pathogen testing)
4. PCR + Sanger sequence COI
5. BLAST for species confirmation
6. Update surveillance database (species distribution maps)
Impact: Accurate species ID is critical.
Many "species" are actually species complexes (multiple cryptic species).
Anopheles gambiae complex:
1. Sequence COI from susceptible vs. resistant populations
2. Build phylogenetic tree (shows population relationships)
3. Genotype kdr mutations (separate assay)
4. Map kdr onto phylogeny: Reveals whether resistance evolved once (single origin) or multiple times (convergent evolution)
Example: Aedes aegypti pyrethroid resistance in Caribbean
1. Original Chain Termination Method:
- Sanger, F., et al. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences 74(12): 5463-5467. https://doi.org/10.1073/pnas.74.12.5463
2. Automated Fluorescent Sequencing:
- Smith, L. M., et al. (1986). Fluorescence detection in automated DNA sequence analysis. Nature 321: 674-679. https://doi.org/10.1038/321674a0
- Prober, J. M., et al. (1987). A system for rapid DNA sequencing with fluorescent chain-terminating dideoxynucleotides. Science 238(4825): 336-341.
3. Quality Assessment:
- Ewing, B., & Green, P. (1998). Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Research 8(3): 186-194. https://doi.org/10.1101/gr.8.3.186
- Ewing, B., et al. (1998). Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Research 8(3): 175-185.
4. COI Barcoding Foundation:
- Hebert, P. D. N., et al. (2003). Biological identifications through DNA barcodes. Proceedings of the Royal Society B 270(1512): 313-321. https://doi.org/10.1098/rspb.2002.2218
- Ratnasingham, S., & Hebert, P. D. N. (2007). BOLD: The Barcode of Life Data System. Molecular Ecology Notes 7(3): 355-364. https://doi.org/10.1111/j.1471-8286.2007.01678.x
5. BLAST Algorithm:
- Altschul, S. F., et al. (1990). Basic local alignment search tool. Journal of Molecular Biology 215(3): 403-410. https://doi.org/10.1016/S0022-2836(05)80360-2
6. Species Identification:
- Kumar, N. P., et al. (2007). DNA barcodes can distinguish species of Indian mosquitoes (Diptera: Culicidae). Journal of Medical Entomology 44(1): 1-7. https://doi.org/10.1093/jmedent/41.5.01
- Chan, A., et al. (2014). DNA barcoding: Complementing morphological identification of mosquito species in Singapore. Parasites & Vectors 7: 569.
7. Cryptic Species Detection:
- Cywinska, A., et al. (2006). Identifying Canadian mosquito species through DNA barcodes. Medical and Veterinary Entomology 20(4): 413-424. https://doi.org/10.1111/j.1365-2915.2006.00653.x
8. Invasive Species Tracking:
- Goubert, C., et al. (2016). Population genetics of Aedes albopictus (Diptera: Culicidae) invading populations in USA and Europe. PLoS ONE 11(1): e0147673. https://doi.org/10.1371/journal.pone.0147673
9. Sequencing Quality Standards:
- Richterich, P. (1998). Estimation of errors in "raw" DNA sequences: A validation study. Genome Research 8(3): 251-259.
- NCBI GenBank Submission Guidelines: https://www.ncbi.nlm.nih.gov/genbank/samplesubmit/
DNA sequencing transforms invisible molecular information into readable text (ATCGATCG...).
Sanger's insight: ddNTPs lack 3'-OH → terminate chain extension. By incorporating small amounts of fluorescently labeled ddNTPs into DNA synthesis reactions, we create a ladder of terminated fragments. Capillary electrophoresis separates these fragments by single nucleotides, and laser detection reads the terminating base color. The result is 99.9% accurate sequence data.
Comparing your sequence to millions of references answers "What species?" in seconds. >97% identity to reference COI = same species (for most organisms). Lower identity may indicate cryptic species, misidentification, or novel species.
In lab, you will:
1. Submit PCR products for Sanger sequencing:
- Prepare cleanup samples (ExoSAP or column)
- Quantify with Qubit (target 20-50 ng/µL)
- Submit with primers to sequencing core
2. Analyze chromatograms (when results return):
- View in Geneious or BioEdit
- Assess quality (Phred scores)
- Trim low-quality ends
3. Assemble consensus (if sequenced both strands):
- Align forward and reverse reads
- Generate consensus sequence
4. BLAST for species identification:
- Submit to NCBI nucleotide BLAST
- Compare to BOLD database
- Interpret % identity (>97% confirms species)
5. Prepare for phylogenetic analysis:
- Export trimmed sequences as FASTA
- Download reference sequences from GenBank
- Ready for multiple sequence alignment (see phylogenetics module)
Remember: Sequencing reveals the genetic information that defines species, populations, and individuals. Understanding how Sanger sequencing works, and how to evaluate data quality, empowers you to critically interpret results and troubleshoot problems.
Citation: Hebert, P. D. N., Cywinska, A., Ball, S. L., & deWaard, J. R. (2003). Biological identifications through DNA barcodes. Proceedings of the Royal Society B: Biological Sciences, 270(1512), 313-321.
The foundational paper proposing COI as a universal barcode for animal species identification. Demonstrated that intraspecific variation is typically much lower than interspecific variation, establishing the "barcode gap" concept. This paper catalyzed creation of the BOLD database and established Sanger sequencing as the primary method for generating barcode sequences.
Citation: Ratnasingham, S., & Hebert, P. D. N. (2007). BOLD: The Barcode of Life Data System. Molecular Ecology Notes, 7(3), 355-364.
DOI: 10.1111/j.1471-8286.2007.01678.x
Introduced BOLD as the primary platform for curating, analyzing, and sharing COI barcode sequences. Established protocols for quality control, sequence validation, and taxonomy linking. Understanding BOLD's quality control standards helps explain why sequence quality matters. As of 2025, BOLD contains over 10 million barcode sequences.
Citation: Hajibabaei, M., Singer, G. A. C., Hebert, P. D. N., & Hickey, D. A. (2007). DNA barcoding: How it complements taxonomy, molecular phylogenetics and population genetics. Trends in Genetics, 23(4), 167-172.
DOI: 10.1016/j.tig.2007.02.001 | PMID: 17316840
Explained how DNA barcoding complements (not replaces) traditional taxonomy. Demonstrated how barcode sequences can be used for both identification and phylogenetic inference. Shows that DNA barcoding is one tool among many in the molecular biologist's toolkit.
Citation: Janzen, D. H., Hajibabaei, M., Burns, J. M., Hallwachs, W., Remigio, E., & Hebert, P. D. N. (2005). Integration of DNA barcoding into an ongoing inventory of complex tropical biodiversity. Molecular Phylogenetics and Evolution, 35(1), 96-103.
DOI: 10.1016/j.ympev.2005.05.028 | PMID: 16055377
Demonstrated that Sanger-based COI barcoding could be applied to large-scale biodiversity inventories. Showed the exact workflow students follow - from insect specimen through Sanger sequencing to phylogenetic analysis. Validated that undergraduate-level techniques can produce research-quality data.
Sanger sequencing remains the gold standard for DNA barcoding because: (1) appropriate read length (650-800 bp covers the 658 bp barcode region), (2) high accuracy (>99.9% base-calling accuracy), (3) cost-effective for single genes, (4) established workflows validated across millions of specimens, (5) BOLD platform designed around Sanger chromatograms, (6) quality control through visual chromatogram inspection, and (7) bidirectional sequencing provides error checking.