mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git synced 2026-01-26 16:58:56 +08:00

Files

Timothy Kassis 71e1c91f12 Better adherence to recommendations

2025-10-21 09:43:34 -07:00

14 KiB

Raw Blame History

name, description

name	description
biopython	Molecular biology toolkit. Sequence manipulation, FASTA/GenBank I/O, NCBI Entrez, BLAST, alignments, phylogenetic trees, PDB structures, for bioinformatics workflows.

BioPython

Overview

BioPython is a comprehensive Python library for computational molecular biology and bioinformatics. This skill provides guidance on using BioPython's extensive modules for sequence manipulation, file I/O, database access, sequence similarity searches, alignments, phylogenetics, structural biology, and population genetics.

When to Use This Skill

This skill should be used when:

Working with biological sequences (DNA, RNA, protein)
Reading or writing sequence files (FASTA, GenBank, FASTQ, etc.)
Accessing NCBI databases (GenBank, PubMed, Protein, Nucleotide)
Running or parsing BLAST searches
Performing sequence alignments (pairwise or multiple)
Building or analyzing phylogenetic trees
Analyzing protein structures (PDB files)
Calculating sequence properties (GC content, melting temp, molecular weight)
Converting between sequence file formats
Performing population genetics analysis
Any bioinformatics task requiring BioPython

Core Capabilities

1. Sequence Manipulation

Create and manipulate biological sequences using Bio.Seq:

from Bio.Seq import Seq

dna_seq = Seq("ATGGTGCATCTGACT")
rna_seq = dna_seq.transcribe()           # DNA → RNA
protein = dna_seq.translate()             # DNA → Protein
rev_comp = dna_seq.reverse_complement()   # Reverse complement

Common operations:

Transcription and back-transcription
Translation with custom genetic codes
Complement and reverse complement
Sequence slicing and concatenation
Pattern searching and counting

Reference: See references/core_modules.md (section: Bio.Seq) for detailed operations and examples.

2. File Input/Output

Read and write sequence files in multiple formats using Bio.SeqIO:

from Bio import SeqIO

# Read sequences
for record in SeqIO.parse("sequences.fasta", "fasta"):
    print(record.id, len(record.seq))

# Write sequences
SeqIO.write(records, "output.gb", "genbank")

# Convert formats
SeqIO.convert("input.fasta", "fasta", "output.gb", "genbank")

Supported formats: FASTA, FASTQ, GenBank, EMBL, Swiss-Prot, PDB, Clustal, PHYLIP, NEXUS, Stockholm, and many more.

Common workflows:

Format conversion (FASTA ↔ GenBank ↔ FASTQ)
Filtering sequences by length, ID, or content
Batch processing large files with iterators
Random access with SeqIO.index() for large files

Script: Use scripts/file_io.py for file I/O examples and patterns.

Reference: See references/core_modules.md (section: Bio.SeqIO) for comprehensive format details and workflows.

3. NCBI Database Access

Access NCBI databases (GenBank, PubMed, Protein, etc.) using Bio.Entrez:

from Bio import Entrez

Entrez.email = "your.email@example.com"  # Required!

# Search database
handle = Entrez.esearch(db="nucleotide", term="human kinase", retmax=100)
record = Entrez.read(handle)
id_list = record["IdList"]

# Fetch sequences
handle = Entrez.efetch(db="nucleotide", id=id_list, rettype="fasta", retmode="text")
records = SeqIO.parse(handle, "fasta")

Key Entrez functions:

esearch(): Search databases, retrieve IDs
efetch(): Download full records
esummary(): Get document summaries
elink(): Find related records across databases
einfo(): Get database information
epost(): Upload ID lists for large queries

Important: Always set Entrez.email before using Entrez functions.

Script: Use scripts/ncbi_entrez.py for complete Entrez workflows including batch downloads and WebEnv usage.

Reference: See references/database_tools.md (section: Bio.Entrez) for detailed function documentation and parameters.

4. BLAST Searches

Run BLAST searches and parse results using Bio.Blast:

from Bio.Blast import NCBIWWW, NCBIXML

# Run BLAST online
result_handle = NCBIWWW.qblast("blastn", "nt", sequence)

# Save results
with open("blast_results.xml", "w") as out:
    out.write(result_handle.read())

# Parse results
with open("blast_results.xml") as result_handle:
    blast_record = NCBIXML.read(result_handle)

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < 0.001:
                print(f"Hit: {alignment.title}")
                print(f"E-value: {hsp.expect}")
                print(f"Identity: {hsp.identities}/{hsp.align_length}")

BLAST programs: blastn, blastp, blastx, tblastn, tblastx

Key result attributes:

alignment.title: Hit description
hsp.expect: E-value
hsp.identities: Number of identical residues
hsp.query, hsp.match, hsp.sbjct: Aligned sequences

Script: Use scripts/blast_search.py for complete BLAST workflows including result filtering and extraction.

Reference: See references/database_tools.md (section: Bio.Blast) for detailed parsing and filtering strategies.

5. Sequence Alignment

Perform pairwise and multiple sequence alignments using Bio.Align:

Pairwise alignment:

from Bio import Align

aligner = Align.PairwiseAligner()
aligner.mode = 'global'  # or 'local'
aligner.match_score = 2
aligner.mismatch_score = -1
aligner.gap_score = -2

alignments = aligner.align(seq1, seq2)
print(alignments[0])
print(f"Score: {alignments.score}")

Multiple sequence alignment I/O:

from Bio import AlignIO

# Read alignment
alignment = AlignIO.read("alignment.clustal", "clustal")

# Write alignment
AlignIO.write(alignment, "output.phylip", "phylip")

# Convert formats
AlignIO.convert("input.clustal", "clustal", "output.fasta", "fasta")

Supported formats: Clustal, PHYLIP, Stockholm, NEXUS, FASTA, MAF

Script: Use scripts/alignment_phylogeny.py for alignment examples and workflows.

Reference: See references/core_modules.md (sections: Bio.Align, Bio.AlignIO) for detailed alignment capabilities.

6. Phylogenetic Analysis

Build and analyze phylogenetic trees using Bio.Phylo:

from Bio import Phylo
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor

# Read alignment
alignment = AlignIO.read("sequences.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(alignment)

# Build tree (UPGMA or Neighbor-Joining)
constructor = DistanceTreeConstructor(calculator)
tree = constructor.upgma(dm)  # or constructor.nj(dm)

# Visualize tree
Phylo.draw_ascii(tree)
Phylo.draw(tree)  # matplotlib visualization

# Save tree
Phylo.write(tree, "tree.nwk", "newick")

Tree manipulation:

tree.ladderize(): Sort branches
tree.root_at_midpoint(): Root at midpoint
tree.prune(): Remove taxa
tree.collapse_all(): Collapse short branches
tree.distance(): Calculate distances between clades

Supported formats: Newick, NEXUS, PhyloXML, NeXML

Script: Use scripts/alignment_phylogeny.py for tree construction and manipulation examples.

Reference: See references/specialized_modules.md (section: Bio.Phylo) for comprehensive tree analysis capabilities.

7. Structural Bioinformatics

Analyze protein structures using Bio.PDB:

from Bio.PDB import PDBParser, PDBList

# Download structure
pdbl = PDBList()
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir=".")

# Parse structure
parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")

# Navigate hierarchy: Structure → Model → Chain → Residue → Atom
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.name, atom.coord)

# Secondary structure with DSSP
from Bio.PDB import DSSP
dssp = DSSP(model, "structure.pdb")

# Structural alignment
from Bio.PDB import Superimposer
sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)
print(f"RMSD: {sup.rms}")

Key capabilities:

Parse PDB, mmCIF, MMTF formats
Secondary structure analysis (DSSP)
Solvent accessibility calculations
Structural superimposition
Distance and angle calculations
Structure quality validation

Reference: See references/specialized_modules.md (section: Bio.PDB) for complete structural analysis capabilities.

8. Sequence Analysis Utilities

Calculate sequence properties using Bio.SeqUtils:

from Bio.SeqUtils import gc_fraction, MeltingTemp as mt
from Bio.SeqUtils.ProtParam import ProteinAnalysis

# DNA analysis
gc = gc_fraction(dna_seq) * 100
tm = mt.Tm_NN(dna_seq)  # Melting temperature

# Protein analysis
protein_analysis = ProteinAnalysis(str(protein_seq))
mw = protein_analysis.molecular_weight()
pi = protein_analysis.isoelectric_point()
aromaticity = protein_analysis.aromaticity()
instability = protein_analysis.instability_index()

Available analyses:

GC content and GC skew
Melting temperature (multiple methods)
Molecular weight
Isoelectric point
Aromaticity
Instability index
Secondary structure prediction
Sequence checksums

Script: Use scripts/sequence_operations.py for sequence analysis examples.

Reference: See references/core_modules.md (section: Bio.SeqUtils) for all available utilities.

9. Specialized Modules

Restriction enzymes:

from Bio import Restriction
enzyme = Restriction.EcoRI
sites = enzyme.search(seq)

Motif analysis:

from Bio import motifs
m = motifs.create([seq1, seq2, seq3])
pwm = m.counts.normalize(pseudocounts=0.5)

Population genetics: Use Bio.PopGen for allele frequencies, Hardy-Weinberg equilibrium, FST calculations.

Clustering: Use Bio.Cluster for hierarchical clustering, k-means, PCA on biological data.

Reference: See references/core_modules.md and references/specialized_modules.md for specialized module documentation.

Common Workflows

Workflow 1: Download and Analyze NCBI Sequences

Search NCBI database with Entrez.esearch()
Fetch sequences with Entrez.efetch()
Parse with SeqIO.parse()
Analyze sequences (GC content, translation, etc.)
Save results to file

Script: Use scripts/ncbi_entrez.py for complete implementation.

Workflow 2: Sequence Similarity Search

Run BLAST with NCBIWWW.qblast() or parse existing results
Parse XML results with NCBIXML.read()
Filter hits by E-value, identity, coverage
Extract and save significant hits
Perform downstream analysis

Script: Use scripts/blast_search.py for complete implementation.

Workflow 3: Phylogenetic Tree Construction

Read multiple sequence alignment with AlignIO.read()
Calculate distance matrix with DistanceCalculator
Build tree with DistanceTreeConstructor (UPGMA or NJ)
Manipulate tree (ladderize, root, prune)
Visualize with Phylo.draw() or Phylo.draw_ascii()
Save tree with Phylo.write()

Script: Use scripts/alignment_phylogeny.py for complete implementation.

Workflow 4: Format Conversion Pipeline

Read sequences in original format with SeqIO.parse()
Filter or modify sequences as needed
Write to new format with SeqIO.write()
Or use SeqIO.convert() for direct conversion

Script: Use scripts/file_io.py for format conversion examples.

Best Practices

Email Configuration

Always set Entrez.email before using NCBI services:

Entrez.email = "your.email@example.com"

Rate Limiting

Be polite to NCBI servers:

Use time.sleep() between requests
Use WebEnv for large queries
Batch downloads in reasonable chunks (100-500 sequences)

Memory Management

For large files:

Use iterators (SeqIO.parse()) instead of lists
Use SeqIO.index() for random access without loading entire file
Process in batches when possible

Error Handling

Always handle potential errors:

try:
    record = SeqIO.read(handle, format)
except Exception as e:
    print(f"Error: {e}")

File Format Selection

Choose appropriate formats:

FASTA: Simple sequences, no annotations
GenBank: Rich annotations, features, references
FASTQ: Sequences with quality scores
PDB: 3D structural data

Resources

scripts/

Executable Python scripts demonstrating common BioPython workflows:

sequence_operations.py: Basic sequence manipulation (transcription, translation, complement, GC content, melting temp)
file_io.py: Reading, writing, and converting sequence files; filtering; indexing large files
ncbi_entrez.py: Searching and downloading from NCBI databases; batch processing with WebEnv
blast_search.py: Running BLAST searches online; parsing and filtering results
alignment_phylogeny.py: Pairwise and multiple sequence alignment; phylogenetic tree construction and manipulation

Run any script with python3 scripts/<script_name>.py to see examples.

references/

Comprehensive reference documentation for BioPython modules:

core_modules.md: Core sequence handling (Seq, SeqRecord, SeqIO, AlignIO, Align, SeqUtils, CodonTable, motifs, Restriction)
database_tools.md: Database access and searches (Entrez, BLAST, SearchIO, BioSQL)
specialized_modules.md: Advanced analyses (PDB, Phylo, PAML, PopGen, Cluster, Graphics)

Reference these files when:

Learning about specific module capabilities
Looking up function parameters and options
Understanding supported file formats
Finding example code patterns

Use grep to search references for specific topics:

grep -n "secondary structure" references/specialized_modules.md
grep -n "efetch" references/database_tools.md

Additional Resources

Official Documentation: https://biopython.org/docs/latest/

Tutorial: https://biopython.org/docs/latest/Tutorial/index.html

API Reference: https://biopython.org/docs/latest/api/index.html

Cookbook: https://biopython.org/wiki/Category:Cookbook

14 KiB Raw Blame History