| name | description |
|---|---|
| biopython | Use BioPython for computational molecular biology and bioinformatics tasks. Essential for: sequence manipulation (DNA/RNA/protein transcription, translation, complement, reverse complement), reading/writing biological file formats (FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS), NCBI database access (Entrez searches, downloads from GenBank/PubMed/Protein databases), BLAST sequence similarity searches and result parsing, pairwise and multiple sequence alignments, phylogenetic tree construction and analysis (UPGMA, Neighbor-Joining), protein structure analysis (PDB parsing, secondary structure, structural alignment), sequence property calculations (GC content, melting temperature, molecular weight, isoelectric point), format conversion between biological file types, restriction enzyme analysis, motif discovery, population genetics calculations, and any task requiring Bio.Seq, Bio.SeqIO, Bio.Entrez, Bio.Blast, Bio.Align, Bio.Phylo, Bio.PDB, Bio.SeqUtils, or other BioPython modules. |
BioPython
Overview
BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.
When to Use This Skill
Use this skill when:
- Working with biological sequences (DNA, RNA, protein)
- Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
- Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
- Running or parsing BLAST searches
- Performing sequence alignments (pairwise or multiple)
- Building or analyzing phylogenetic trees
- Analyzing protein structures (PDB files)
- Calculating sequence properties (GC content, melting temp, molecular weight)
- Converting between sequence file formats
- Performing population genetics analysis
- Any bioinformatics task requiring BioPython
Core Capabilities
1. Sequence Manipulation
Create and manipulate biological sequences using Bio.Seq:
from Bio.Seq import Seq
dna_seq = Seq("ATGGTGCATCTGACT")
rna_seq = dna_seq.transcribe() # DNA → RNA
protein = dna_seq.translate() # DNA → Protein
rev_comp = dna_seq.reverse_complement() # Reverse complement
Common operations:
- Transcription and back-transcription
- Translation with custom genetic codes
- Complement and reverse complement
- Sequence slicing and concatenation
- Pattern searching and counting
Reference: See references/core_modules.md (section: Bio.Seq) for detailed operations and examples.
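As a small illustration of the operations listed above, the following sketch translates a short sequence under two NCBI genetic code tables and counts a motif. The 9-nucleotide sequence is made up for the example (chosen so the two codes disagree); table 2 is the vertebrate mitochondrial code.

```python
from Bio.Seq import Seq

# Hypothetical sequence, chosen so the two genetic codes give different results.
dna = Seq("ATGATATGA")

standard = dna.translate()              # NCBI table 1 (default); '*' marks a stop
mito = dna.translate(table=2)           # NCBI table 2: vertebrate mitochondrial
trimmed = dna.translate(to_stop=True)   # stop at the first in-frame stop codon

rna = dna.transcribe()                  # DNA -> RNA
round_trip = rna.back_transcribe()      # RNA -> DNA

print(standard, mito, trimmed)          # MI* MMW MI
print(rna, round_trip == dna)           # AUGAUAUGA True
print(dna.count("ATG"))                 # non-overlapping occurrences: 2
print(dna[:3] + dna[-3:])               # slicing and concatenation: ATGTGA
```

Passing `table=` by number keeps the example short; a table name such as "Vertebrate Mitochondrial" should work as well.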
2. File Input/Output
Read and write sequence files in multiple formats using Bio.SeqIO:
from Bio import SeqIO
# Read sequences
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id, len(record.seq))
# Write sequences
SeqIO.write(records, "output.gb", "genbank")
# Convert formats
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")
Supported formats: FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.
Common workflows:
- Format conversion (FASTA ↔ GenBank ↔ FASTQ)
- Filtering sequences by length, ID, or content
- Batch processing large files with iterators
- Random access with SeqIO.index() for large files
Script: Use scripts/file_io.py for file I/O examples and patterns.
Reference: See references/core_modules.md (section: Bio.SeqIO) for comprehensive format details and workflows.
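The filtering workflow can be sketched without touching the file system by parsing from an in-memory handle. The FASTA text and the `min_length` threshold below are invented for illustration; with real data, a filename would replace the `StringIO` handles.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical three-record FASTA input.
fasta_text = ">seq1\nATGC\n>seq2\nATGCATGCATGC\n>seq3\nAT\n"
min_length = 4  # assumed threshold for this example

# A generator keeps only one record in memory at a time.
kept = (
    rec
    for rec in SeqIO.parse(StringIO(fasta_text), "fasta")
    if len(rec.seq) >= min_length
)

out = StringIO()
n_written = SeqIO.write(kept, out, "fasta")  # returns the number of records written

print(n_written)       # 2
print(out.getvalue())
```

Because `SeqIO.write()` accepts any iterable of records, the filter never needs to build a full list.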
3. NCBI Database Access
Access NCBI databases (GenBank, PubMed, Protein, etc.) using Bio.Entrez:
from Bio import Entrez, SeqIO
Entrez.email = "your.email@example.com" # Required!
# Search database
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
record = Entrez.read(handle)
id_list = record["IdList"]
# Fetch sequences
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
Key Entrez functions:
- esearch(): Search databases, retrieve IDs
- efetch(): Download full records
- esummary(): Get document summaries
- elink(): Find related records across databases
- einfo(): Get database information
- epost(): Upload ID lists for large queries
Important: Always set Entrez.email before using Entrez functions.
Script: Use scripts/ncbi_entrez.py for complete Entrez workflows including batch downloads and WebEnv usage.
Reference: See references/database_tools.md (section: Bio.Entrez) for detailed function documentation and parameters.
4. BLAST Searches
Run BLAST searches and parse results using Bio.Blast:
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
# Save results
with open("blast_results.xml", "w") as out:
    out.write(result_handle.read())
# Parse results
with open("blast_results.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        if hsp.expect < 0.001:
            print(f"Hit: {alignment.title}")
            print(f"E-value: {hsp.expect}")
            print(f"Identity: {hsp.identities}/{hsp.align_length}")
BLAST programs: blastn, blastp, blastx, tblastn, tblastx
Key result attributes:
- alignment.title: Hit description
- hsp.expect: E-value
- hsp.identities: Number of identical residues
- hsp.query, hsp.match, hsp.sbjct: Aligned sequences
Script: Use scripts/blast_search.py for complete BLAST workflows including result filtering and extraction.
Reference: See references/database_tools.md (section: Bio.Blast) for detailed parsing and filtering strategies.
5. Sequence Alignment
Perform pairwise and multiple sequence alignments using Bio.Align:
Pairwise alignment:
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global' # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2
alignments = aligner.align(seq1, seq2)
print(alignments[0])
print(f"Score: {alignments.score}")
Multiple sequence alignment I/O:
from Bio import AlignIO
# Read alignment
alignment = AlignIO.read("alignment.clustal", "clustal")
# Write alignment
AlignIO.write(alignment, "output.phylip", "phylip")
# Convert formats
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")
Supported formats: Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF
Script: Use scripts/alignment_phylogeny.py for alignment examples and workflows.
Reference: See references/core_modules.md (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.
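To make the pairwise scoring parameters concrete, here is a minimal runnable sketch using the same settings as above on two short, made-up sequences. With three matches (+2 each) and two gap positions (-2 each), the optimal global score works out to 2.

```python
from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = "global"
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2

# Hypothetical toy sequences.
seq1, seq2 = "GAACT", "GAT"

score = aligner.score(seq1, seq2)       # score only, no alignment objects built
alignments = aligner.align(seq1, seq2)  # all optimal alignments
best = alignments[0]

print(score, best.score)                # 2.0 2.0
print(best)
```

`aligner.score()` is the cheaper call when only the number is needed, for example when ranking many sequence pairs.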
6. Phylogenetic Analysis
Build and analyze phylogenetic trees using Bio.Phylo:
from Bio import AlignIO, Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
# Read alignment
alignment = AlignIO.read("sequences.fasta", "fasta")
# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
# Build tree (UPGMA or Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm) # or constructor.nj(dm)
# Visualize tree
Phylo.draw_ascii(tree)
Phylo.draw(tree) # matplotlib visualization
# Save tree
Phylo.write(tree, "tree.nwk", "newick")
Tree manipulation:
- tree.ladderize(): Sort branches
- tree.root_at_midpoint(): Root at midpoint
- tree.prune(): Remove taxa
- tree.collapse_all(): Collapse internal clades (all of them, or only those matching a criterion)
- tree.distance(): Calculate distances between clades
Supported formats: Newick, NEXUS, PhyloXML, NeXML
Script: Use scripts/alignment_phylogeny.py for tree construction and manipulation examples.
Reference: See references/specialized_modules.md (section: Bio.Phylo) for comprehensive tree analysis capabilities.
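The tree-building steps can also be tried without an alignment file. The sketch below builds a tiny alignment in memory, computes identity distances, and constructs a Neighbor-Joining tree; the four sequences and taxon names are invented for illustration.

```python
from io import StringIO

from Bio import Phylo
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical pre-aligned sequences (all the same length).
alignment = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGT"), id="taxonA"),
    SeqRecord(Seq("ACGTACGA"), id="taxonB"),
    SeqRecord(Seq("TCGTTCGA"), id="taxonC"),
    SeqRecord(Seq("TCGATCGA"), id="taxonD"),
])

# Identity distance = fraction of mismatching columns.
dm = DistanceCalculator("identity").get_distance(alignment)
tree = DistanceTreeConstructor().nj(dm)
tree.ladderize()

leaf_names = sorted(leaf.name for leaf in tree.get_terminals())

# Serialize to Newick in memory instead of writing a file.
newick = StringIO()
Phylo.write(tree, newick, "newick")

print(dm["taxonA", "taxonB"])   # 1 mismatch in 8 columns: 0.125
print(leaf_names)
print(newick.getvalue().strip())
```

Swapping `.nj(dm)` for `.upgma(dm)` gives a rooted UPGMA tree from the same distance matrix.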
7. Structural Bioinformatics
Analyze protein structures using Bio.PDB:
from Bio.PDB import PDBParser, PDBList
# Download structure (returns the path of the saved file, e.g. ./pdb1abc.ent)
pdbl = PDBList()
pdb_path = pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")
# Parse structure
parser = PDBParser(QUIET=True)  # QUIET=True suppresses parser warnings
structure = parser.get_structure("protein", pdb_path)
# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.name, atom.coord)
# Secondary structure with DSSP (requires the external DSSP program to be installed)
from Bio.PDB import DSSP
model = structure[0]  # first model
dssp = DSSP(model, pdb_path)
# Structural alignment (ref_atoms and alt_atoms are equal-length lists of Atom objects)
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)
print(f"RMSD: {sup.rms}")
Key capabilities:
- Parse PDB, mmCIF, MMTF formats
- Secondary structure analysis (DSSP)
- Solvent accessibility calculations
- Structural superimposition
- Distance and angle calculations
- Structure quality validation
Reference: See references/specialized_modules.md (section: Bio.PDB) for complete structural analysis capabilities.
8. Sequence Analysis Utilities
Calculate sequence properties using Bio.SeqUtils:
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis
# DNA analysis
gc = gc_fraction(dna_seq) * 100
tm = mt.Tm_NN(dna_seq) # Melting temperature
# Protein analysis
protein_analysis = ProteinAnalysis(str(protein_seq))
mw = protein_analysis.molecular_weight()
pi = protein_analysis.isoelectric_point()
aromaticity = protein_analysis.aromaticity()
instability = protein_analysis.instability_index()
Available analyses:
- GC content and GC skew
- Melting temperature (multiple methods)
- Molecular weight
- Isoelectric point
- Aromaticity
- Instability index
- Secondary structure fraction (helix, turn, sheet)
- Sequence checksums
Script: Use scripts/sequence_operations.py for sequence analysis examples.
Reference: See references/core_modules.md (section: Bio.SeqUtils) for all available utilities.
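For a self-contained check of these utilities, the following sketch defines the inputs explicitly, reusing the short DNA sequence from the sequence manipulation section. It assumes a Biopython release that provides `gc_fraction` (older releases exposed a `GC` function instead).

```python
from Bio.Seq import Seq
from Bio.SeqUtils import gc_fraction
from Bio.SeqUtils.ProtParam import ProteinAnalysis

dna_seq = Seq("ATGGTGCATCTGACT")

# gc_fraction returns a value between 0 and 1; 7 of the 15 bases are G or C.
gc_percent = gc_fraction(dna_seq) * 100

protein_seq = dna_seq.translate()             # MVHLT
analysis = ProteinAnalysis(str(protein_seq))  # ProteinAnalysis expects a plain string
composition = analysis.count_amino_acids()    # dict: residue letter -> count

print(round(gc_percent, 1))                   # 46.7
print(protein_seq, composition["M"])          # MVHLT 1
print(analysis.molecular_weight(), analysis.isoelectric_point())
```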
9. Specialized Modules
Restriction enzymes:
from Bio import Restriction
enzyme = Restriction.EcoRI
sites = enzyme.search(seq)
Motif analysis:
from Bio import motifs
m = motifs.create([seq1, seq2, seq3])
pwm = m.counts.normalize(pseudocounts=0.5)
Population genetics:
Use Bio.PopGen, which works with GenePop-format data and wraps the external GenePop program, for allele frequencies, Hardy-Weinberg equilibrium tests, and FST calculations.
Clustering:
Use Bio.Cluster for hierarchical clustering, k-means, PCA on biological data.
Reference: See references/core_modules.md and references/specialized_modules.md for specialized module documentation.
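The motif snippet above leaves `seq1`, `seq2`, and `seq3` undefined. A minimal runnable version, using three made-up 5-mers, might look like this:

```python
from Bio import motifs
from Bio.Seq import Seq

# Hypothetical motif instances; all must have the same length.
instances = [Seq("AACGT"), Seq("AACGA"), Seq("TACGT")]

m = motifs.create(instances)
pwm = m.counts.normalize(pseudocounts=0.5)

print(m.consensus)          # most frequent base per column: AACGT
print(m.counts["A"][0])     # 'A' occurs twice in the first column
print(pwm["A"][0])          # (2 + 0.5) / (3 + 4 * 0.5) = 0.5
```

The pseudocount is added to every base in every column, which keeps unseen bases from getting a probability of exactly zero.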
Common Workflows
Workflow 1: Download and Analyze NCBI Sequences
- Search NCBI database with Entrez.esearch()
- Fetch sequences with Entrez.efetch()
- Parse with SeqIO.parse()
- Analyze sequences (GC content, translation, etc.)
- Save results to file
Script: Use scripts/ncbi_entrez.py for complete implementation.
Workflow 2: Sequence Similarity Search
- Run BLAST with NCBIWWW.qblast() or parse existing results
- Parse XML results with NCBIXML.read()
- Filter hits by E-value, identity, coverage
- Extract and save significant hits
- Perform downstream analysis
Script: Use scripts/blast_search.py for complete implementation.
Workflow 3: Phylogenetic Tree Construction
- Read multiple sequence alignment with AlignIO.read()
- Calculate distance matrix with DistanceCalculator
- Build tree with DistanceTreeConstructor (UPGMA or NJ)
- Manipulate tree (ladderize, root, prune)
- Visualize with Phylo.draw() or Phylo.draw_ascii()
- Save tree with Phylo.write()
Script: Use scripts/alignment_phylogeny.py for complete implementation.
Workflow 4: Format Conversion Pipeline
- Read sequences in original format with SeqIO.parse()
- Filter or modify sequences as needed
- Write to new format with SeqIO.write()
- Or use SeqIO.convert() for direct conversion
Script: Use scripts/file_io.py for format conversion examples.
Best Practices
Email Configuration
Always set Entrez.email before using NCBI services:
Entrez.email = "your.email@example.com"
Rate Limiting
Be polite to NCBI servers:
- Use time.sleep() between requests
- Use WebEnv for large queries
- Batch downloads in reasonable chunks (100-500 sequences)
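One way to combine batching with a pause between requests is sketched below. The helper names `chunked` and `fetch_in_batches` are hypothetical (not part of Biopython), and the actual network call is passed in as a function so the pattern can be tried offline. The default delay is an assumption chosen to stay under NCBI's limit of three requests per second for clients without an API key.

```python
import time
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def fetch_in_batches(
    ids: Sequence[str],
    fetch_batch: Callable[[List[str]], object],
    batch_size: int = 200,
    delay: float = 0.4,
) -> list:
    """Call `fetch_batch` once per chunk, sleeping `delay` seconds between calls."""
    results = []
    for i, batch in enumerate(chunked(ids, batch_size)):
        if i:  # no need to wait before the first request
            time.sleep(delay)
        results.append(fetch_batch(batch))
    return results


# Offline demonstration with a stand-in for the real download function.
demo = fetch_in_batches(["1", "2", "3", "4", "5"], len, batch_size=2, delay=0)
print(demo)  # [2, 2, 1]
```

With real data, `fetch_batch` could be a small function that calls `Entrez.efetch(db="nucleotide", id=batch, rettype="fasta", retmode="text")` and reads the returned handle.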
Memory Management
For large files:
- Use iterators (SeqIO.parse()) instead of lists
- Use SeqIO.index() for random access without loading entire file
- Process in batches when possible
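The difference between streaming and indexed access can be seen on a small temporary file. The file name and contents below are invented for the example; `SeqIO.index()` needs a real file path rather than an open handle, which is why a temporary directory is used.

```python
import os
import tempfile

from Bio import SeqIO

# Hypothetical two-record FASTA file.
fasta_text = ">seq1\nATGC\n>seq2\nATGCATGCATGC\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.fasta")
    with open(path, "w") as fh:
        fh.write(fasta_text)

    # Streaming: records are produced one at a time and then discarded.
    streamed_lengths = [len(rec.seq) for rec in SeqIO.parse(path, "fasta")]

    # Indexing: only file offsets are held in memory; records are parsed on lookup.
    index = SeqIO.index(path, "fasta")
    indexed_ids = sorted(index)
    seq2 = str(index["seq2"].seq)
    index.close()  # release the file before the directory is removed

print(streamed_lengths)  # [4, 12]
print(indexed_ids)       # ['seq1', 'seq2']
print(seq2)              # ATGCATGCATGC
```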
Error Handling
Always handle potential errors:
try:
    record = SeqIO.read(handle, file_format)
except ValueError as e:  # raised when the handle holds zero or more than one record
    print(f"Error: {e}")
File Format Selection
Choose appropriate formats:
- FASTA: Simple sequences, no annotations
- GenBank: Rich annotations, features, references
- FASTQ: Sequences with quality scores
- PDB: 3D structural data
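The practical difference between FASTQ and FASTA shows up when converting: quality scores are available on the parsed record but are dropped once it is written as FASTA. A small sketch with a made-up read, assuming the standard Sanger/Illumina 1.8+ encoding (Phred+33):

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical single-read FASTQ; 'I' encodes Phred 40 and '5' encodes Phred 20.
fastq_text = "@read1\nACGT\n+\nII5I\n"

record = SeqIO.read(StringIO(fastq_text), "fastq")
quals = record.letter_annotations["phred_quality"]

fasta_out = StringIO()
SeqIO.write(record, fasta_out, "fasta")  # sequence only; qualities are not written

print(quals)                 # [40, 40, 20, 40]
print(fasta_out.getvalue())
```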
Resources
scripts/
Executable Python scripts demonstrating common BioPython workflows:
- sequence_operations.py: Basic sequence manipulation (transcription, translation, complement, GC content, melting temp)
- file_io.py: Reading, writing, and converting sequence files; filtering; indexing large files
- ncbi_entrez.py: Searching and downloading from NCBI databases; batch processing with WebEnv
- blast_search.py: Running BLAST searches online; parsing and filtering results
- alignment_phylogeny.py: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation
Run any script with python3 scripts/<script_name>.py to see examples.
references/
Comprehensive reference documentation for BioPython modules:
- core_modules.md: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
- database_tools.md: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
- specialized_modules.md: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)
Reference these files when:
- Learning about specific module capabilities
- Looking up function parameters and options
- Understanding supported file formats
- Finding example code patterns
Use grep to search references for specific topics:
grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md
Additional Resources
Official Documentation: https://biopython.org/docs/latest/
Tutorial: https://biopython.org/docs/latest/Tutorial/index.html
API Reference: https://biopython.org/docs/latest/api/index.html