Files
claude-scientific-skills/scientific-packages/biopython/SKILL.md
2025-10-19 14:12:02 -07:00

451 lines
14 KiB
Markdown

---
name: biopython
description: Comprehensive toolkit for computational molecular biology using BioPython. Use this skill when working with biological sequences (DNA, RNA, protein), parsing sequence files (FASTA, GenBank, FASTQ), accessing NCBI databases (Entrez, BLAST), performing sequence alignments, building phylogenetic trees, analyzing protein structures (PDB), or any bioinformatics task requiring BioPython modules.
---
# BioPython
## Overview
BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.
## When to Use This Skill
Use this skill when:
- Working with biological sequences (DNA, RNA, protein)
- Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
- Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
- Running or parsing BLAST searches
- Performing sequence alignments (pairwise or multiple)
- Building or analyzing phylogenetic trees
- Analyzing protein structures (PDB files)
- Calculating sequence properties (GC content, melting temp, molecular weight)
- Converting between sequence file formats
- Performing population genetics analysis
- Any bioinformatics task requiring BioPython
## Core Capabilities
### 1. Sequence Manipulation
Create and manipulate biological sequences using `Bio.Seq`:
```python
from Bio.Seq import Seq
dna_seq = Seq("ATGGTGCATCTGACT")
rna_seq = dna_seq.transcribe() # DNA → RNA
protein = dna_seq.translate() # DNA → Protein
rev_comp = dna_seq.reverse_complement() # Reverse complement
```
**Common operations:**
- Transcription and back-transcription
- Translation with custom genetic codes
- Complement and reverse complement
- Sequence slicing and concatenation
- Pattern searching and counting
**Reference:** See `references/core_modules.md` (section: Bio.Seq) for detailed operations and examples.
### 2. File Input/Output
Read and write sequence files in multiple formats using `Bio.SeqIO`:
```python
from Bio import SeqIO
# Read sequences
for record in SeqIO.parse("sequences.fasta", "fasta"):
print(record.id, len(record.seq))
# Write sequences
SeqIO.write(records, "output.gb", "genbank")
# Convert formats
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")
```
**Supported formats:** FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.
**Common workflows:**
- Format conversion (FASTA ↔ GenBank ↔ FASTQ)
- Filtering sequences by length, ID, or content
- Batch processing large files with iterators
- Random access with `SeqIO.index()` for large files
**Script:** Use `scripts/file_io.py` for file I/O examples and patterns.
**Reference:** See `references/core_modules.md` (section: Bio.SeqIO) for comprehensive format details and workflows.
### 3. NCBI Database Access
Access NCBI databases (GenBank, PubMed, Protein, etc.) using `Bio.Entrez`:
```python
from Bio import Entrez
Entrez.email = "your.email@example.com" # Required!
# Search database
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
record = Entrez.read(handle)
id_list = record["IdList"]
# Fetch sequences
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")
```
**Key Entrez functions:**
- `esearch()`: Search databases, retrieve IDs
- `efetch()`: Download full records
- `esummary()`: Get document summaries
- `elink()`: Find related records across databases
- `einfo()`: Get database information
- `epost()`: Upload ID lists for large queries
**Important:** Always set `Entrez.email` before using Entrez functions.
**Script:** Use `scripts/ncbi_entrez.py` for complete Entrez workflows including batch downloads and WebEnv usage.
**Reference:** See `references/database_tools.md` (section: Bio.Entrez) for detailed function documentation and parameters.
### 4. BLAST Searches
Run BLAST searches and parse results using `Bio.Blast`:
```python
from Bio.Blast import NCBIWWW, NCBIXML
# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
# Save results
with open("blast_results.xml", "w") as out:
out.write(result_handle.read())
# Parse results
with open("blast_results.xml") as result_handle:
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 0.001:
print(f"Hit: {alignment.title}")
print(f"E-value: {hsp.expect}")
print(f"Identity: {hsp.identities}/{hsp.align_length}")
```
**BLAST programs:** blastn, blastp, blastx, tblastn, tblastx
**Key result attributes:**
- `alignment.title`: Hit description
- `hsp.expect`: E-value
- `hsp.identities`: Number of identical residues
- `hsp.query`, `hsp.match`, `hsp.sbjct`: Aligned sequences
**Script:** Use `scripts/blast_search.py` for complete BLAST workflows including result filtering and extraction.
**Reference:** See `references/database_tools.md` (section: Bio.Blast) for detailed parsing and filtering strategies.
### 5. Sequence Alignment
Perform pairwise and multiple sequence alignments using `Bio.Align`:
**Pairwise alignment:**
```python
from Bio import Align
aligner = Align.PairwiseAligner()
aligner.mode = 'global' # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2
alignments = aligner.align(seq1, seq2)
print(alignments[0])
print(f"Score: {alignments.score}")
```
**Multiple sequence alignment I/O:**
```python
from Bio import AlignIO
# Read alignment
alignment = AlignIO.read("alignment.clustal", "clustal")
# Write alignment
AlignIO.write(alignment, "output.phylip", "phylip")
# Convert formats
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")
```
**Supported formats:** Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF
**Script:** Use `scripts/alignment_phylogeny.py` for alignment examples and workflows.
**Reference:** See `references/core_modules.md` (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.
### 6. Phylogenetic Analysis
Build and analyze phylogenetic trees using `Bio.Phylo`:
```python
from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
# Read alignment
alignment = AlignIO.read("sequences.fasta", "fasta")
# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)
# Build tree (UPGMA or Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm) # or constructor.nj(dm)
# Visualize tree
Phylo.draw_ascii(tree)
Phylo.draw(tree) # matplotlib visualization
# Save tree
Phylo.write(tree, "tree.nwk", "newick")
```
**Tree manipulation:**
- `tree.ladderize()`: Sort branches
- `tree.root_at_midpoint()`: Root at midpoint
- `tree.prune()`: Remove taxa
- `tree.collapse_all()`: Collapse short branches
- `tree.distance()`: Calculate distances between clades
**Supported formats:** Newick, NEXUS, PhyloXML, NeXML
**Script:** Use `scripts/alignment_phylogeny.py` for tree construction and manipulation examples.
**Reference:** See `references/specialized_modules.md` (section: Bio.Phylo) for comprehensive tree analysis capabilities.
### 7. Structural Bioinformatics
Analyze protein structures using `Bio.PDB`:
```python
from Bio.PDB import PDBParser, PDBList
# Download structure
pdbl = PDBList()
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")
# Parse structure
parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")
# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
for chain in model:
for residue in chain:
for atom in residue:
print(atom.name, atom.coord)
# Secondary structure with DSSP
from Bio.PDB import DSSP
dssp = DSSP(model, "structure.pdb")
# Structural alignment
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)
print(f"RMSD: {sup.rms}")
```
**Key capabilities:**
- Parse PDB, mmCIF, MMTF formats
- Secondary structure analysis (DSSP)
- Solvent accessibility calculations
- Structural superimposition
- Distance and angle calculations
- Structure quality validation
**Reference:** See `references/specialized_modules.md` (section: Bio.PDB) for complete structural analysis capabilities.
### 8. Sequence Analysis Utilities
Calculate sequence properties using `Bio.SeqUtils`:
```python
from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis
# DNA analysis
gc = gc_fraction(dna_seq) * 100
tm = mt.Tm_NN(dna_seq) # Melting temperature
# Protein analysis
protein_analysis = ProteinAnalysis(str(protein_seq))
mw = protein_analysis.molecular_weight()
pi = protein_analysis.isoelectric_point()
aromaticity = protein_analysis.aromaticity()
instability = protein_analysis.instability_index()
```
**Available analyses:**
- GC content and GC skew
- Melting temperature (multiple methods)
- Molecular weight
- Isoelectric point
- Aromaticity
- Instability index
- Secondary structure prediction
- Sequence checksums
**Script:** Use `scripts/sequence_operations.py` for sequence analysis examples.
**Reference:** See `references/core_modules.md` (section: Bio.SeqUtils) for all available utilities.
### 9. Specialized Modules
**Restriction enzymes:**
```python
from Bio import Restriction
enzyme = Restriction.EcoRI
sites = enzyme.search(seq)
```
**Motif analysis:**
```python
from Bio import motifs
m = motifs.create([seq1, seq2, seq3])
pwm = m.counts.normalize(pseudocounts=0.5)
```
**Population genetics:**
Use `Bio.PopGen` for allele frequencies, Hardy-Weinberg equilibrium, FST calculations.
**Clustering:**
Use `Bio.Cluster` for hierarchical clustering, k-means, PCA on biological data.
**Reference:** See `references/core_modules.md` and `references/specialized_modules.md` for specialized module documentation.
## Common Workflows
### Workflow 1: Download and Analyze NCBI Sequences
1. Search NCBI database with `Entrez.esearch()`
2. Fetch sequences with `Entrez.efetch()`
3. Parse with `SeqIO.parse()`
4. Analyze sequences (GC content, translation, etc.)
5. Save results to file
**Script:** Use `scripts/ncbi_entrez.py` for complete implementation.
### Workflow 2: Sequence Similarity Search
1. Run BLAST with `NCBIWWW.qblast()` or parse existing results
2. Parse XML results with `NCBIXML.read()`
3. Filter hits by E-value, identity, coverage
4. Extract and save significant hits
5. Perform downstream analysis
**Script:** Use `scripts/blast_search.py` for complete implementation.
### Workflow 3: Phylogenetic Tree Construction
1. Read multiple sequence alignment with `AlignIO.read()`
2. Calculate distance matrix with `DistanceCalculator`
3. Build tree with `DistanceTreeConstructor` (UPGMA or NJ)
4. Manipulate tree (ladderize, root, prune)
5. Visualize with `Phylo.draw()` or `Phylo.draw_ascii()`
6. Save tree with `Phylo.write()`
**Script:** Use `scripts/alignment_phylogeny.py` for complete implementation.
### Workflow 4: Format Conversion Pipeline
1. Read sequences in original format with `SeqIO.parse()`
2. Filter or modify sequences as needed
3. Write to new format with `SeqIO.write()`
4. Or use `SeqIO.convert()` for direct conversion
**Script:** Use `scripts/file_io.py` for format conversion examples.
## Best Practices
### Email Configuration
Always set `Entrez.email` before using NCBI services:
```python
Entrez.email = "your.email@example.com"
```
### Rate Limiting
Be polite to NCBI servers:
- Use `time.sleep()` between requests
- Use WebEnv for large queries
- Batch downloads in reasonable chunks (100-500 sequences)
### Memory Management
For large files:
- Use iterators (`SeqIO.parse()`) instead of lists
- Use `SeqIO.index()` for random access without loading entire file
- Process in batches when possible
### Error Handling
Always handle potential errors:
```python
try:
record = SeqIO.read(handle, format)
except Exception as e:
print(f"Error: {e}")
```
### File Format Selection
Choose appropriate formats:
- FASTA: Simple sequences, no annotations
- GenBank: Rich annotations, features, references
- FASTQ: Sequences with quality scores
- PDB: 3D structural data
## Resources
### scripts/
Executable Python scripts demonstrating common BioPython workflows:
- `sequence_operations.py`: Basic sequence manipulation (transcription, translation, complement, GC content, melting temp)
- `file_io.py`: Reading, writing, and converting sequence files; filtering; indexing large files
- `ncbi_entrez.py`: Searching and downloading from NCBI databases; batch processing with WebEnv
- `blast_search.py`: Running BLAST searches online; parsing and filtering results
- `alignment_phylogeny.py`: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation
Run any script with `python3 scripts/<script_name>.py` to see examples.
### references/
Comprehensive reference documentation for BioPython modules:
- `core_modules.md`: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
- `database_tools.md`: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
- `specialized_modules.md`: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)
Reference these files when:
- Learning about specific module capabilities
- Looking up function parameters and options
- Understanding supported file formats
- Finding example code patterns
Use `grep` to search references for specific topics:
```bash
grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md
```
## Additional Resources
**Official Documentation:** https://biopython.org/docs/latest/
**Tutorial:** https://biopython.org/docs/latest/Tutorial/index.html
**API Reference:** https://biopython.org/docs/latest/api/index.html
**Cookbook:** https://biopython.org/wiki/Category:Cookbook