| name | description |
|---|---|
| biopython | Molecular biology toolkit. Sequence manipulation, FASTA/GenBank I/O, NCBI Entrez, BLAST, alignments, phylogenetic trees, PDB structures, for bioinformatics workflows. |
BioPython
Overview
BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.
When to Use This Skill
This skill should be used when:
- Working with biological sequences (DNA, RNA, protein)
- Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
- Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
- Running or parsing BLAST searches
- Performing sequence alignments (pairwise or multiple)
- Building or analyzing phylogenetic trees
- Analyzing protein structures (PDB files)
- Calculating sequence properties (GC content, melting temp, molecular weight)
- Converting between sequence file formats
- Performing population genetics analysis
- Any bioinformatics task requiring BioPython
Core Capabilities
1. Sequence Manipulation
Create and manipulate biological sequences using Bio.Seq:
from Bio.Seq import Seq
dna_seq = Seq("ATGGTGCATCTGACT")
rna_seq = dna_seq.transcribe() # DNA → RNA
protein = dna_seq.translate() # DNA → Protein
rev_comp = dna_seq.reverse_complement() # Reverse complement
Common operations:
- Transcription and back-transcription
- Translation with custom genetic codes
- Complement and reverse complement
- Sequence slicing and concatenation
- Pattern searching and counting
Reference: See references/core_modules.md (section: Bio.Seq) for detailed operations and examples.
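As a minimal sketch of translation with a non-default genetic code, the example below translates one short sequence with the standard table and with NCBI table 2 (vertebrate mitochondrial). The sequence ATGATATGA is an illustrative assumption chosen because two of its codons are read differently by the two tables.

```python
from Bio.Seq import Seq

# Illustrative sequence (an assumption for this sketch): ATG ATA TGA
seq = Seq("ATGATATGA")

# Standard table: ATA codes for Ile and TGA is a stop codon.
standard = str(seq.translate())
# to_stop=True truncates the protein at the first in-frame stop codon.
standard_to_stop = str(seq.translate(to_stop=True))
# NCBI table 2 (vertebrate mitochondrial): ATA -> Met, TGA -> Trp.
mito = str(seq.translate(table=2))

print(standard, standard_to_stop, mito)  # MI* MI MMW
```

Passing the table number keeps the choice of genetic code explicit at the call site, which helps when nuclear and organellar sequences are mixed in one pipeline.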
2. File Input/Output
Read and write sequence files in multiple formats using Bio.SeqIO:
from Bio import SeqIO
# Read sequences
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id, len(record.seq))
# Write sequences
SeqIO.write(records, "output.gb", "genbank")
# Convert formats
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")
Supported formats: FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.
Common workflows:
- Format conversion (FASTA ↔ GenBank ↔ FASTQ)
- Filtering sequences by length, ID, or content
- Batch processing large files with iterators
- Random access with SeqIO.index() for large files
Script: Use scripts/file_io.py for file I/O examples and patterns.
Reference: See references/core_modules.md (section: Bio.SeqIO) for comprehensive format details and workflows.
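The filtering workflow listed above can be sketched as follows. This is a minimal example that assumes a tiny in-memory FASTA file; the record names, the helper filter_by_length, and the length threshold are all made up for illustration.

```python
from io import StringIO
from Bio import SeqIO

# Tiny in-memory FASTA file; the records are made up for illustration.
fasta_text = ">short\nATGC\n>long\nATGCATGCATGC\n>medium\nATGCATGC\n"

def filter_by_length(records, min_length):
    """Yield only records whose sequence has at least min_length residues."""
    return (rec for rec in records if len(rec.seq) >= min_length)

# SeqIO.parse is lazy, so records stream through the filter one at a time.
records = SeqIO.parse(StringIO(fasta_text), "fasta")
out = StringIO()
n_written = SeqIO.write(filter_by_length(records, 8), out, "fasta")

print(n_written)  # number of records written
print(out.getvalue())
```

Because both the parser and the filter are iterators, the same pattern should work on large files without holding every record in memory; replace the StringIO objects with file names to use it on disk.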
3. NCBI Database Access
Access NCBI databases (GenBank, PubMed, Protein, etc.) using Bio.Entrez:
from Bio import Entrez, SeqIO
Entrez.email = "your.email@example.com" # Required!
# Search database
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
record = Entrez.read(handle)
id_list = record["IdList"]
# Fetch sequences
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
Key Entrez functions:
- esearch(): Search databases, retrieve IDs
- efetch(): Download full records
- esummary(): Get document summaries
- elink(): Find related records across databases
- einfo(): Get database information
- epost(): Upload ID lists for large queries
Important: Always set Entrez.email before using Entrez functions.
Script: Use scripts/ncbi_entrez.py for complete Entrez workflows including batch downloads and WebEnv usage.
Reference: See references/database_tools.md (section: Bio.Entrez) for detailed function documentation and parameters.
4. BLAST Searches
Run BLAST searches and parse results using Bio.Blast:
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
# Save results
with open("blast_results.xml", "w") as out:
    out.write(result_handle.read())
# Parse results
with open("blast_results.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 0.001:
            print(f"Hit: {alignment.title}")
            print(f"E-value: {hsp.expect}")
            print(f"Identity: {hsp.identities}/{hsp.align_length}")
BLAST programs: blastn, blastp, blastx, tblastn, tblastx
Key result attributes:
- alignment.title: Hit description
- hsp.expect: E-value
- hsp.identities: Number of identical residues
- hsp.query, hsp.match, hsp.sbjct: Aligned sequences
Script: Use scripts/blast_search.py for complete BLAST workflows including result filtering and extraction.
Reference: See references/database_tools.md (section: Bio.Blast) for detailed parsing and filtering strategies.
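One possible way to combine the E-value and identity checks into a reusable filter is sketched below. The function name hsp_passes and its default thresholds are assumptions for illustration, not recommended cutoffs. It relies only on the expect, identities, and align_length attributes used above, and it is exercised here with stand-in objects so it runs without a BLAST result file.

```python
from types import SimpleNamespace

def hsp_passes(hsp, max_evalue=1e-3, min_identity=0.9):
    """Return True if the HSP is significant and sufficiently identical.

    Uses only the attributes expect, identities, and align_length.
    The default thresholds are illustrative assumptions.
    """
    if hsp.align_length == 0:
        return False
    identity = hsp.identities / hsp.align_length
    return hsp.expect < max_evalue and identity >= min_identity

# Stand-in objects with the same attribute names as parsed HSPs,
# so the sketch runs without a BLAST result file.
good = SimpleNamespace(expect=1e-20, identities=95, align_length=100)
weak = SimpleNamespace(expect=1e-20, identities=60, align_length=100)
noisy = SimpleNamespace(expect=0.5, identities=100, align_length=100)

print([hsp_passes(h) for h in (good, weak, noisy)])  # [True, False, False]
```

With real results, the same function could be called on each hsp inside the nested loop over blast_record.alignments.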
5. Sequence Alignment
Perform pairwise and multiple sequence alignments using Bio.Align:
Pairwise alignment:
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global' # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2
alignments = aligner.align(seq1, seq2)
print(alignments[0])
print(f"Score: {alignments.score}")
Multiple sequence alignment I/O:
from Bio import AlignIO
# Read alignment
alignment = AlignIO.read("alignment.clustal", "clustal")
# Write alignment
AlignIO.write(alignment, "output.phylip", "phylip")
# Convert formats
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")
Supported formats: Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF
Script: Use scripts/alignment_phylogeny.py for alignment examples and workflows.
Reference: See references/core_modules.md (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.
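To make the scoring parameters concrete, here is a small worked sketch with the same aligner settings. The two sequences are made up and short enough to check by hand: the best global alignment has three matches (3 x 2) and two gap positions (2 x -2), for a score of 2.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2

# Short made-up sequences, chosen so the score is easy to check by hand.
seq1, seq2 = "GAACT", "GAT"

# score() computes only the optimal score, without building alignments.
score = aligner.score(seq1, seq2)

alignments = aligner.align(seq1, seq2)
best = alignments[0]
print(best)
print(f"Score: {best.score}")
```

Calling aligner.score() alone is usually enough when only the score is needed, for example when ranking many candidate pairs.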
6. Phylogenetic Analysis
Build and analyze phylogenetic trees using Bio.Phylo:
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
# Read alignment
alignment = AlignIO.read("sequences.fasta", "fasta")
# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
# Build tree (UPGMA or Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm) # or constructor.nj(dm)
# Visualize tree
Phylo.draw_ascii(tree)
Phylo.draw(tree) # matplotlib visualization
# Save tree
Phylo.write(tree, "tree.nwk", "newick")
Tree manipulation:
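- tree.ladderize(): Sort clades by number of terminals
- tree.root_at_midpoint(): Root at midpoint
- tree.prune(): Remove a terminal taxon
- tree.collapse_all(): Collapse internal clades (all of them, or only those matching a given condition)
- tree.distance(): Calculate the branch-length distance between two clades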
Supported formats: Newick, NEXUS, PhyloXML, NeXML
Script: Use scripts/alignment_phylogeny.py for tree construction and manipulation examples.
Reference: See references/specialized_modules.md (section: Bio.Phylo) for comprehensive tree analysis capabilities.
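The tree manipulation methods can be tried without an alignment by reading a tree from a Newick string. The following is a minimal sketch using a made-up three-taxon tree; the taxon names and branch lengths are assumptions for illustration.

```python
from io import StringIO
from Bio import Phylo

# A small made-up tree in Newick format (branch lengths follow the colons).
newick = "((A:1,B:2):1,C:4);"
tree = Phylo.read(StringIO(newick), "newick")

tree.ladderize()                  # sort clades by number of terminals
n_tips = tree.count_terminals()   # leaves A, B, C
d_ab = tree.distance("A", "B")    # path through their parent: 1 + 2
d_ac = tree.distance("A", "C")    # path through the root: 1 + 1 + 4

Phylo.draw_ascii(tree)
print(n_tips, d_ab, d_ac)
```

Here tree.distance() sums branch lengths along the path between the two named terminals, so it reflects patristic distance rather than the number of edges.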
7. Structural Bioinformatics
Analyze protein structures using Bio.PDB:
from Bio.PDB import PDBParser, PDBList
# Download structure (retrieve_pdb_file returns the path of the saved file)
pdbl = PDBList()
pdb_path = pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")
# Parse structure
parser = PDBParser()
structure = parser.get_structure("protein", pdb_path)
# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.name, atom.coord)
# Secondary structure with DSSP (requires the external DSSP program to be installed)
from Bio.PDB import DSSP
dssp = DSSP(structure[0], "structure.pdb")  # first model and the PDB file it came from
# Structural alignment (ref_atoms and alt_atoms are equal-length lists of Atom objects)
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)
print(f"RMSD: {sup.rms}")
Key capabilities:
- Parse PDB, mmCIF, MMTF formats
- Secondary structure analysis (DSSP)
- Solvent accessibility calculations
- Structural superimposition
- Distance and angle calculations
- Structure quality validation
Reference: See references/specialized_modules.md (section: Bio.PDB) for complete structural analysis capabilities.
8. Sequence Analysis Utilities
Calculate sequence properties using Bio.SeqUtils:
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis
# DNA analysis
gc = gc_fraction(dna_seq) * 100
tm = mt.Tm_NN(dna_seq) # Melting temperature
# Protein analysis
protein_analysis = ProteinAnalysis(str(protein_seq))
mw = protein_analysis.molecular_weight()
pi = protein_analysis.isoelectric_point()
aromaticity = protein_analysis.aromaticity()
instability = protein_analysis.instability_index()
Available analyses:
- GC content and GC skew
- Melting temperature (multiple methods)
- Molecular weight
- Isoelectric point
- Aromaticity
- Instability index
- Secondary structure fraction (helix, turn, sheet)
- Sequence checksums
Script: Use scripts/sequence_operations.py for sequence analysis examples.
Reference: See references/core_modules.md (section: Bio.SeqUtils) for all available utilities.
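A self-contained version of these calculations might look like the sketch below. It reuses the short DNA sequence from the Sequence Manipulation section; the resulting five-residue peptide is far too short for meaningful protein statistics and serves only to show the calls.

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
from Bio.SeqUtils.ProtParam import ProteinAnalysis

dna_seq = Seq("ATGGTGCATCTGACT")   # example sequence from the first section
protein_seq = dna_seq.translate()  # translates to MVHLT

# gc_fraction returns a value between 0 and 1; multiply by 100 for percent.
gc_percent = gc_fraction(dna_seq) * 100  # 7 of 15 bases are G or C

# ProteinAnalysis expects a plain string of one-letter amino acid codes.
analysis = ProteinAnalysis(str(protein_seq))
mw = analysis.molecular_weight()
pi = analysis.isoelectric_point()

print(f"GC: {gc_percent:.1f}%  MW: {mw:.1f} Da  pI: {pi:.2f}")
```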
9. Specialized Modules
Restriction enzymes:
from Bio import Restriction
enzyme = Restriction.EcoRI
sites = enzyme.search(seq)
Motif analysis:
from Bio import motifs
m = motifs.create([seq1, seq2, seq3])
pwm = m.counts.normalize(pseudocounts=0.5)
Population genetics:
Use Bio.PopGen for allele frequencies, Hardy-Weinberg equilibrium, FST calculations.
Clustering:
Use Bio.Cluster for hierarchical clustering, k-means, PCA on biological data.
Reference: See references/core_modules.md and references/specialized_modules.md for specialized module documentation.
Common Workflows
Workflow 1: Download and Analyze NCBI Sequences
- Search NCBI database with Entrez.esearch()
- Fetch sequences with Entrez.efetch()
- Parse with SeqIO.parse()
- Analyze sequences (GC content, translation, etc.)
- Save results to file
Script: Use scripts/ncbi_entrez.py for complete implementation.
Workflow 2: Sequence Similarity Search
- Run BLAST with NCBIWWW.qblast() or parse existing results
- Parse XML results with NCBIXML.read()
- Filter hits by E-value, identity, coverage
- Extract and save significant hits
- Perform downstream analysis
Script: Use scripts/blast_search.py for complete implementation.
Workflow 3: Phylogenetic Tree Construction
- Read multiple sequence alignment with AlignIO.read()
- Calculate distance matrix with DistanceCalculator
- Build tree with DistanceTreeConstructor (UPGMA or NJ)
- Manipulate tree (ladderize, root, prune)
- Visualize with Phylo.draw() or Phylo.draw_ascii()
- Save tree with Phylo.write()
Script: Use scripts/alignment_phylogeny.py for complete implementation.
Workflow 4: Format Conversion Pipeline
- Read sequences in original format with SeqIO.parse()
- Filter or modify sequences as needed
- Write to new format with SeqIO.write()
- Or use SeqIO.convert() for direct conversion
Script: Use scripts/file_io.py for format conversion examples.
Best Practices
Email Configuration
Always set Entrez.email before using NCBI services:
Entrez.email = "your.email@example.com"
Rate Limiting
Be polite to NCBI servers:
- Use time.sleep() between requests
- Use WebEnv for large queries
- Batch downloads in reasonable chunks (100-500 sequences)
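The batching and pausing advice can be sketched with a small helper that is independent of any network call. The names chunked and fetch_in_batches, the batch size of 200, and the delay are assumptions for illustration; fetch is a caller-supplied function that in practice might wrap Entrez.efetch.

```python
import time

def chunked(ids, size):
    """Yield successive lists of at most `size` IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

def fetch_in_batches(ids, fetch, batch_size=200, delay=0.4):
    """Call fetch(batch) for each batch, pausing between requests.

    `fetch` is a caller-supplied function (hypothetical here); in practice
    it might wrap Entrez.efetch. batch_size and delay are illustrative.
    """
    results = []
    for i, batch in enumerate(chunked(ids, batch_size)):
        if i > 0:
            time.sleep(delay)  # pause between consecutive requests
        results.extend(fetch(batch))
    return results

# Offline demonstration with a stand-in fetch function.
calls = []

def fake_fetch(batch):
    calls.append(len(batch))
    return [f"record-{x}" for x in batch]

records = fetch_in_batches([str(n) for n in range(450)], fake_fetch, delay=0.01)
print(calls, len(records))  # [200, 200, 50] 450
```

Injecting the fetch function keeps the batching logic testable offline and lets the same helper serve efetch, esummary, or any other per-batch call.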
Memory Management
For large files:
- Use iterators (SeqIO.parse()) instead of lists
- Use SeqIO.index() for random access without loading entire file
- Process in batches when possible
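As a minimal sketch of random access, the example below writes a small made-up FASTA file to a temporary location and looks up one record by ID. The record names and sequences are assumptions for illustration.

```python
import os
import tempfile
from Bio import SeqIO

# Write a small made-up FASTA file to disk; SeqIO.index needs a real filename.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as tmp:
    tmp.write(">seq1\nATGC\n>seq2\nGGGCCC\n>seq3\nTTTT\n")
    path = tmp.name

# The index records where each ID starts; sequences are parsed on access.
index = SeqIO.index(path, "fasta")
try:
    n_records = len(index)
    seq2 = str(index["seq2"].seq)  # only this record is read and parsed
finally:
    index.close()
    os.remove(path)

print(n_records, seq2)  # 3 GGGCCC
```

The index behaves like a read-only dictionary keyed by record ID, so lookups stay cheap even when the underlying file is too large to load at once.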
Error Handling
Always handle potential errors:
try:
    # SeqIO.read expects exactly one record and raises ValueError otherwise
    record = SeqIO.read(handle, "fasta")
except ValueError as e:
    print(f"Error: {e}")
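A slightly fuller sketch of this pattern wraps the call in a helper that returns None when the input does not contain exactly one record. The helper name read_single_record is an assumption for illustration; the example relies on SeqIO.read raising ValueError for both empty input and input with more than one record.

```python
from io import StringIO
from Bio import SeqIO

def read_single_record(handle, file_format):
    """Return the only record in handle, or None if there is not exactly one.

    SeqIO.read raises ValueError for zero records or more than one record.
    """
    try:
        return SeqIO.read(handle, file_format)
    except ValueError as err:
        print(f"Could not read a single record: {err}")
        return None

# Made-up inputs: one record, two records, and an empty file.
one = read_single_record(StringIO(">a\nATGC\n"), "fasta")
two = read_single_record(StringIO(">a\nATGC\n>b\nGGCC\n"), "fasta")
empty = read_single_record(StringIO(""), "fasta")

print(one.id, two, empty)  # a None None
```

Catching ValueError rather than a bare Exception keeps unrelated problems, such as a missing file or a typo in a variable name, visible instead of silently printed.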
File Format Selection
Choose appropriate formats:
- FASTA: Simple sequences, no annotations
- GenBank: Rich annotations, features, references
- FASTQ: Sequences with quality scores
- PDB: 3D structural data
Resources
scripts/
Executable Python scripts demonstrating common BioPython workflows:
- sequence_operations.py: Basic sequence manipulation (transcription, translation, complement, GC content, melting temp)
- file_io.py: Reading, writing, and converting sequence files; filtering; indexing large files
- ncbi_entrez.py: Searching and downloading from NCBI databases; batch processing with WebEnv
- blast_search.py: Running BLAST searches online; parsing and filtering results
- alignment_phylogeny.py: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation
Run any script with python3 scripts/<script_name>.py to see examples.
references/
Comprehensive reference documentation for BioPython modules:
- core_modules.md: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
- database_tools.md: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
- specialized_modules.md: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)
Reference these files when:
- Learning about specific module capabilities
- Looking up function parameters and options
- Understanding supported file formats
- Finding example code patterns
Use grep to search references for specific topics:
grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md
Additional Resources
Official Documentation: https://biopython.org/docs/latest/
Tutorial: https://biopython.org/docs/latest/Tutorial/index.html
API Reference: https://biopython.org/docs/latest/api/index.html