# BioPython Specialized Analysis Modules

This document covers BioPython's specialized modules for structural biology, phylogenetics, population genetics, and other advanced analyses.

## Structural Bioinformatics

### Bio.PDB - Protein Structure Analysis

Comprehensive tools for handling macromolecular crystal structures.

#### Structure Hierarchy

PDB structures are organized hierarchically:
- **Structure** → Models → Chains → Residues → Atoms

```python
from Bio.PDB import PDBParser

parser = PDBParser()
structure = parser.get_structure("protein", "1abc.pdb")

# Navigate hierarchy
for model in structure:
    for chain in model:
        for residue in chain:
            for atom in residue:
                print(atom.coord)  # xyz coordinates
```
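The same hierarchy can be explored without a file on disk. The sketch below assembles a tiny structure by hand; the two-residue fragment, its coordinates, and the `coords` layout are made up purely for illustration. It then shows that each level is indexed by the id of its children.

```python
import numpy as np
from Bio.PDB.Atom import Atom
from Bio.PDB.Chain import Chain
from Bio.PDB.Model import Model
from Bio.PDB.Residue import Residue
from Bio.PDB.Structure import Structure

# Hypothetical two-residue fragment with made-up coordinates
structure = Structure("demo")
model = Model(0)
chain = Chain("A")
structure.add(model)
model.add(chain)

coords = {
    1: {"N": (0.0, 0.0, 0.0), "CA": (1.5, 0.0, 0.0)},
    2: {"N": (3.0, 1.0, 0.0), "CA": (4.5, 1.0, 0.0)},
}
serial = 1
for resseq, atoms in coords.items():
    # Residue ids are (hetero flag, sequence number, insertion code)
    residue = Residue((" ", resseq, " "), "ALA", " ")
    chain.add(residue)
    for name, xyz in atoms.items():
        atom = Atom(name, np.array(xyz, dtype=float), 0.0, 1.0, " ",
                    f" {name:<3}", serial, element=name[0])
        residue.add(atom)
        serial += 1

# Index by model id, chain id, residue number, then atom name
ca = structure[0]["A"][2]["CA"]
print(ca.get_parent().get_resname(), ca.coord)
print(ca.get_full_id())
n_atoms = len(list(structure.get_atoms()))
```

In practice the parser builds these objects for you; constructing them manually is mainly useful for tests and small experiments.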

#### Parsing Structure Files

**PDB format:**
```python
from Bio.PDB import PDBParser
parser = PDBParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.pdb")
```

**mmCIF format:**
```python
from Bio.PDB import MMCIFParser
parser = MMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```

**Fast mmCIF parser:**
```python
from Bio.PDB import FastMMCIFParser
parser = FastMMCIFParser(QUIET=True)
structure = parser.get_structure("myprotein", "structure.cif")
```

**MMTF format:**
```python
from Bio.PDB.mmtf import MMTFParser  # requires the mmtf-python package
# get_structure is a static method that takes only the file path
structure = MMTFParser.get_structure("structure.mmtf")
```

**Binary CIF:**
```python
from Bio.PDB.binary_cif import BinaryCIFParser
parser = BinaryCIFParser()
structure = parser.get_structure("myprotein", "structure.bcif")
```

#### Downloading Structures

```python
from Bio.PDB import PDBList
pdbl = PDBList()

# Download specific structure
pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir="structures/")

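# Download all obsolete entries; they are stored in the directory passed as
# PDBList(obsolete_pdb=...), by default an "obsolete" subdirectory of the local PDB directory
pdbl.download_obsolete_entries()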

# Update local PDB mirror
pdbl.update_pdb()
```

#### Structure Selection and Filtering

```python
# Select specific chains
chain_A = structure[0]['A']

# Select specific residues
residue_10 = chain_A[10]

# Select specific atoms
ca_atom = residue_10['CA']

# Iterate over specific atom types
for atom in structure.get_atoms():
    if atom.name == 'CA':  # Alpha carbons only
        print(atom.coord)
```

**Structure selectors:**
```python
from Bio.PDB.Polypeptide import is_aa

# Filter by residue type
for residue in structure.get_residues():
    if is_aa(residue):
        print(f"Amino acid: {residue.resname}")
```
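`is_aa` also accepts a three-letter residue name, which makes its behavior easy to check in isolation. A minimal sketch, with the residue names chosen only as examples:

```python
from Bio.PDB.Polypeptide import is_aa

# Waters (HOH) and other hetero groups are not amino acids
for name in ["ALA", "GLY", "HOH"]:
    print(name, is_aa(name))

# standard=True restricts the check to the 20 standard amino acids
flags = {name: is_aa(name, standard=True) for name in ["ALA", "GLY", "HOH"]}
```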

#### Secondary Structure Analysis

**DSSP integration:**
```python
from Bio.PDB import DSSP

# Requires DSSP program installed
model = structure[0]
dssp = DSSP(model, "structure.pdb")

# Access secondary structure; keys are (chain_id, residue_id) tuples
for key in dssp.keys():
    secondary_structure = dssp[key][2]
    accessibility = dssp[key][3]  # relative solvent accessibility
    print(f"Residue {key}: {secondary_structure}, accessible: {accessibility}")
```

DSSP codes:
- H: Alpha helix
- B: Beta bridge
- E: Extended strand (beta sheet)
- G: 3-10 helix
- I: Pi helix
- T: Turn
- S: Bend
- -: Coil

#### Solvent Accessibility

**Shrake-Rupley algorithm:**
```python
from Bio.PDB import ShrakeRupley

sr = ShrakeRupley()
sr.compute(structure, level="R")  # R=residue, A=atom, C=chain, M=model, S=structure

for residue in structure.get_residues():
    print(f"{residue.resname} {residue.id[1]}: {residue.sasa} Ų")
```

**NACCESS wrapper:**
```python
from Bio.PDB.NACCESS import NACCESS

# Requires the external NACCESS program
model = structure[0]
naccess = NACCESS(model, pdb_file="structure.pdb")
for key in naccess.keys():  # keys are (chain_id, residue_id) tuples
    data = naccess[key]
    print(f"Residue {key}: {data['all_atoms_abs']} Å²")
```

**Half-sphere exposure:**
```python
from Bio.PDB import HSExposureCA

# Computed directly from coordinates; results are stored in residue.xtra
model = structure[0]
hse = HSExposureCA(model)

for chain in model:
    for residue in chain:
        if 'EXP_HSE_A_U' in residue.xtra:
            hse_up = residue.xtra['EXP_HSE_A_U']
            hse_down = residue.xtra['EXP_HSE_A_D']
```

#### Structural Alignment and Superimposition

**Standard superimposition:**
```python
from Bio.PDB import Superimposer

sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)  # Lists of atoms to align
sup.apply(structure2.get_atoms())  # Apply transformation

print(f"RMSD: {sup.rms}")
print(f"Rotation matrix: {sup.rotran[0]}")
print(f"Translation vector: {sup.rotran[1]}")
```
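The snippet above assumes `ref_atoms` and `alt_atoms` already exist. As a self-contained sketch, the example below wraps synthetic coordinates in bare `Atom` objects (the `make_atoms` helper and all coordinates are hypothetical), applies a known rotation and translation, and checks that `Superimposer` recovers the fit.

```python
import numpy as np
from Bio.PDB import Superimposer
from Bio.PDB.Atom import Atom

def make_atoms(coords):
    """Wrap raw coordinates in minimal carbon Atom objects (illustrative helper)."""
    return [
        Atom("CA", np.array(xyz, dtype=float), 0.0, 1.0, " ", " CA ", i, element="C")
        for i, xyz in enumerate(coords, start=1)
    ]

ref_coords = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [1.5, 1.5, 0.0],
    [0.0, 1.5, 1.0],
])

# Rotate 90 degrees about the z axis and shift to create the "moving" copy
rot_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
alt_coords = ref_coords @ rot_z + np.array([5.0, -2.0, 3.0])

ref_atoms = make_atoms(ref_coords)
alt_atoms = make_atoms(alt_coords)

sup = Superimposer()
sup.set_atoms(ref_atoms, alt_atoms)  # fixed atoms first, moving atoms second
sup.apply(alt_atoms)                 # moves alt_atoms onto ref_atoms in place

print(f"RMSD after fitting: {sup.rms:.4f}")
# Subtracting two atoms gives their distance; largest residual after the fit
max_shift = max(a - b for a, b in zip(ref_atoms, alt_atoms))
```

Both atom lists must have the same length and be in matching order, since atoms are paired by position.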

**QCP (Quaternion Characteristic Polynomial) method:**
```python
from Bio.PDB.qcprot import QCPSuperimposer

qcp = QCPSuperimposer()
qcp.set(ref_coords, alt_coords)  # N x 3 NumPy arrays of matched coordinates
qcp.run()
print(f"RMSD: {qcp.get_rms()}")
```

#### Geometric Calculations

**Distances and angles:**
```python
# Distance between atoms (subtracting two Atom objects returns the distance)
dist = atom1 - atom2

# Angle between three atoms (takes Vector objects, returns radians)
from Bio.PDB import calc_angle
angle = calc_angle(atom1.get_vector(), atom2.get_vector(), atom3.get_vector())

# Dihedral angle (takes Vector objects, returns radians)
from Bio.PDB import calc_dihedral
dihedral = calc_dihedral(
    atom1.get_vector(), atom2.get_vector(), atom3.get_vector(), atom4.get_vector()
)
```
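To see the units and conventions without loading a structure, the functions can be called on hand-built `Vector` objects. The points below are arbitrary and chosen so that both results are right angles.

```python
import math
from Bio.PDB.vectors import Vector, calc_angle, calc_dihedral

# Right angle at the origin between the x and y axes
angle = calc_angle(Vector(1, 0, 0), Vector(0, 0, 0), Vector(0, 1, 0))

# Four points whose outer bonds are perpendicular when viewed along the z axis
dihedral = calc_dihedral(
    Vector(1, 0, 0), Vector(0, 0, 0), Vector(0, 0, 1), Vector(0, 1, 1)
)

# Both functions return radians; convert for display
print(f"angle: {math.degrees(angle):.1f} degrees")
print(f"dihedral magnitude: {abs(math.degrees(dihedral)):.1f} degrees")
```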

**Vector operations:**
```python
from Bio.PDB.vectors import Vector

v1 = Vector(atom1.coord)
v2 = Vector(atom2.coord)

# Vector operations
v3 = v1 + v2
v4 = v1 - v2
dot_product = v1 * v2
cross_product = v1 ** v2
magnitude = v1.norm()
normalized = v1.normalized()
```

#### Internal Coordinates

Advanced residue geometry representation:
```python
from Bio.PDB import internal_coords

# Enable internal coordinates
structure.atom_to_internal_coordinates()

# Access phi, psi angles
for residue in structure.get_residues():
    if residue.internal_coord:
        print(f"Phi: {residue.internal_coord.get_angle('phi')}")
        print(f"Psi: {residue.internal_coord.get_angle('psi')}")
```

#### Writing Structures

```python
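from Bio.PDB import PDBIO, Select

io = PDBIO()
io.set_structure(structure)
io.save("output.pdb")

# Save a specific selection by subclassing Select
class ChainSelector(Select):
    def __init__(self, chain_id):
        self.chain_id = chain_id

    def accept_chain(self, chain):
        return chain.id == self.chain_id

io.save("chain_A.pdb", select=ChainSelector("A"))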
```

### Bio.SCOP - SCOP Database

Access to the Structural Classification of Proteins (SCOP) database.

### Bio.KEGG - Pathway Analysis

Interface to KEGG (Kyoto Encyclopedia of Genes and Genomes) databases:

**Capabilities:**
- Access pathway maps
- Retrieve enzyme data
- Get compound information
- Query orthology relationships

## Phylogenetics

### Bio.Phylo - Phylogenetic Tree Analysis

Comprehensive phylogenetic tree manipulation and analysis.

#### Reading and Writing Trees

**Supported formats:**
- Newick: Simple, widely-used format
- NEXUS: Rich metadata format
- PhyloXML: XML-based with extensive annotations
- NeXML: Modern XML standard

```python
from Bio import Phylo

# Read tree
tree = Phylo.read("tree.nwk", "newick")

# Read multiple trees
trees = list(Phylo.parse("trees.nex", "nexus"))

# Write tree
Phylo.write(tree, "output.nwk", "newick")
```
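`Phylo.read` and `Phylo.write` accept open handles as well as file names, so a tree can be round-tripped entirely in memory. A small sketch with an invented three-taxon Newick string:

```python
from io import StringIO
from Bio import Phylo

newick = "((A:1.0,B:2.0):1.0,C:3.0);"
tree = Phylo.read(StringIO(newick), "newick")
names = [leaf.name for leaf in tree.get_terminals()]
print(names)

# Write to an in-memory buffer, then parse the result again
buffer = StringIO()
Phylo.write(tree, buffer, "newick")
round_trip = Phylo.read(StringIO(buffer.getvalue()), "newick")
print(buffer.getvalue().strip())
```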

#### Tree Visualization

**ASCII visualization:**
```python
Phylo.draw_ascii(tree)
```

**Matplotlib plotting:**
```python
import matplotlib.pyplot as plt

# Phylo.draw displays the figure itself unless do_show=False
Phylo.draw(tree)

# With customization
fig, ax = plt.subplots(figsize=(10, 8))
Phylo.draw(tree, axes=ax, do_show=False)
ax.set_title("My Phylogenetic Tree")
plt.show()
```

#### Tree Navigation and Manipulation

**Find clades:**
```python
# Get all terminal nodes (leaves)
terminals = tree.get_terminals()

# Get all nonterminal nodes
nonterminals = tree.get_nonterminals()

# Find specific clade
target = tree.find_any(name="Species_A")

# Find all matching clades (find_clades returns an iterator)
matches = list(tree.find_clades(terminal=True))
```

**Tree properties:**
```python
# Count terminals
num_species = tree.count_terminals()

# Get total branch length
total_length = tree.total_branch_length()

# Check if tree is bifurcating
is_bifurcating = tree.is_bifurcating()

# Get maximum distance from root to any clade
max_dist = max(tree.depths().values())
```

**Tree modification:**
```python
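# Prune tree to specific taxa (prune() removes one terminal at a time)
keep_taxa = {"Species_A", "Species_B", "Species_C"}
for leaf in tree.get_terminals():
    if leaf.name not in keep_taxa:
        tree.prune(leaf)

# Collapse short branches (guard against clades with no branch length)
tree.collapse_all(lambda c: c.branch_length is not None and c.branch_length < 0.01)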

# Ladderize (sort branches)
tree.ladderize()

# Root tree at midpoint
tree.root_at_midpoint()

# Root at specific clade
outgroup = tree.find_any(name="Outgroup_species")
tree.root_with_outgroup(outgroup)
```
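A runnable sketch of pruning on a toy tree, assuming a hypothetical four-taxon Newick string. Because `prune` removes a single terminal per call, restricting a tree to a set of taxa means looping over the leaves to drop.

```python
from io import StringIO
from Bio import Phylo

tree = Phylo.read(StringIO("((A:1,B:2):1,(C:1,D:1):2);"), "newick")

keep_taxa = {"A", "C", "D"}
# get_terminals() returns a list, so pruning inside the loop is safe
for leaf in tree.get_terminals():
    if leaf.name not in keep_taxa:
        tree.prune(leaf)

tree.ladderize()
remaining = sorted(leaf.name for leaf in tree.get_terminals())
print(remaining)
```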

**Calculate distances:**
```python
# Distance between two clades
dist = tree.distance(clade1, clade2)

# Distance from root
root_dist = tree.distance(tree.root, terminal_clade)
```
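Distances are sums of branch lengths along the path between two clades. The sketch below uses a small invented tree so the values can be verified by hand.

```python
from io import StringIO
from Bio import Phylo

tree = Phylo.read(StringIO("((A:1.0,B:2.0):1.0,C:3.0);"), "newick")
a = tree.find_any(name="A")
b = tree.find_any(name="B")
c = tree.find_any(name="C")

dist_ab = tree.distance(a, b)   # 1.0 + 2.0 through their shared parent
dist_ac = tree.distance(a, c)   # 1.0 + 1.0 up to the root, then 3.0 down
total = tree.total_branch_length()
deepest = max(tree.depths().values())  # largest root-to-clade distance

print(dist_ab, dist_ac, total, deepest)
```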

#### Tree Construction

**Distance-based methods:**
```python
from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator
from Bio import AlignIO

# Load alignment
aln = AlignIO.read("alignment.fasta", "fasta")

# Calculate distance matrix
calculator = DistanceCalculator('identity')
dm = calculator.get_distance(aln)

# Construct tree using UPGMA
constructor = DistanceTreeConstructor()
tree_upgma = constructor.upgma(dm)

# Or using Neighbor-Joining
tree_nj = constructor.nj(dm)
```
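The same workflow can be tried without an alignment file. This sketch builds a toy alignment in memory (the four sequences and their ids are invented) and runs the identity distance plus Neighbor-Joining steps shown above.

```python
from Bio.Align import MultipleSeqAlignment
from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical toy alignment; all sequences must have equal length
aln = MultipleSeqAlignment([
    SeqRecord(Seq("ACGTACGT"), id="seq1"),
    SeqRecord(Seq("ACGTACGA"), id="seq2"),
    SeqRecord(Seq("ACGAACCA"), id="seq3"),
    SeqRecord(Seq("TCGAACCA"), id="seq4"),
])

# Identity distance is the fraction of mismatched columns
dm = DistanceCalculator("identity").get_distance(aln)
print(dm)

tree_nj = DistanceTreeConstructor().nj(dm)
leaf_names = sorted(leaf.name for leaf in tree_nj.get_terminals())
d12 = dm["seq1", "seq2"]  # one mismatch out of eight columns
```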

**Parsimony method:**
```python
from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher

scorer = ParsimonyScorer()
searcher = NNITreeSearcher(scorer)
tree = searcher.search(starting_tree, alignment)
```

**Distance calculators** (the full list for your installation is in `DistanceCalculator.models`):
- 'identity': Simple identity scoring
- 'blastn': BLAST nucleotide scoring
- 'blastp': BLAST protein scoring
- 'trans': Transition/transversion nucleotide scoring
- 'blosum62': BLOSUM62 protein matrix
- 'pam250': PAM250 protein matrix

#### Consensus Trees

```python
from Bio.Phylo.Consensus import majority_consensus, strict_consensus

# Strict consensus
consensus_strict = strict_consensus(trees)

# Majority rule consensus
consensus_majority = majority_consensus(trees, cutoff=0.5)

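# Bootstrap consensus: resamples an alignment, builds one tree per replicate,
# then summarizes them. Here aln is a MultipleSeqAlignment and constructor is
# a DistanceTreeConstructor created with a DistanceCalculator and a method.
from Bio.Phylo.Consensus import bootstrap_consensus
bootstrap_tree = bootstrap_consensus(aln, 100, constructor, majority_consensus)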
```
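As a minimal sketch of how the two consensus functions differ, the example below uses three invented topologies over the same four taxa. Two of them agree, so the majority-rule tree can keep their groupings, while the strict tree keeps only clades present in every input.

```python
from io import StringIO
from Bio import Phylo
from Bio.Phylo.Consensus import majority_consensus, strict_consensus

# Hypothetical input trees; all must contain the same set of taxa
newicks = [
    "((A,B),(C,D));",
    "((A,B),(C,D));",
    "((A,C),(B,D));",
]
trees = [Phylo.read(StringIO(nwk), "newick") for nwk in newicks]

strict = strict_consensus(trees)
majority = majority_consensus(trees, cutoff=0.5)

Phylo.draw_ascii(majority)
strict_leaves = sorted(leaf.name for leaf in strict.get_terminals())
majority_leaves = sorted(leaf.name for leaf in majority.get_terminals())
```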

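#### External Tool Wrappers

These wrappers build on `Bio.Application`, which recent Biopython releases have deprecated in favor of calling the programs directly with `subprocess`; check that your installed version still provides them.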

**PhyML:**
```python
from Bio.Phylo.Applications import PhymlCommandline

cmd = PhymlCommandline(input="alignment.phy", datatype="nt", model="HKY85", alpha="e", bootstrap=100)
stdout, stderr = cmd()
tree = Phylo.read("alignment.phy_phyml_tree.txt", "newick")
```

**RAxML:**
```python
from Bio.Phylo.Applications import RaxmlCommandline

cmd = RaxmlCommandline(
    sequences="alignment.phy",
    model="GTRGAMMA",
    name="mytree",
    parsimony_seed=12345
)
stdout, stderr = cmd()
```

**FastTree:**
```python
from Bio.Phylo.Applications import FastTreeCommandline

# nt=True marks the input as nucleotide data, which the GTR model requires
cmd = FastTreeCommandline(input="alignment.fasta", out="tree.nwk", nt=True, gtr=True, gamma=True)
stdout, stderr = cmd()
```

### Bio.Phylo.PAML - Evolutionary Analysis

Interface to PAML (Phylogenetic Analysis by Maximum Likelihood):

**CODEML - Codon-based analysis:**
```python
from Bio.Phylo.PAML import codeml

cml = codeml.Codeml()
cml.alignment = "alignment.phy"
cml.tree = "tree.nwk"
cml.out_file = "results.out"
cml.working_dir = "./paml_wd"

# Set parameters
cml.set_options(
    seqtype=1,      # Codon sequences
    model=0,        # One omega ratio
    NSsites=[0, 1, 2],  # Test different models
    CodonFreq=2,    # F3x4 codon frequencies
)

results = cml.run()
```

**BaseML - Nucleotide-based analysis:**
```python
from Bio.Phylo.PAML import baseml

bml = baseml.Baseml()
bml.alignment = "alignment.phy"
bml.tree = "tree.nwk"
bml.out_file = "baseml_results.out"  # an output file must be set before run()
results = bml.run()
```

**YN00 - Yang-Nielsen method:**
```python
from Bio.Phylo.PAML import yn00

yn = yn00.Yn00()
yn.alignment = "alignment.phy"
yn.out_file = "yn00_results.out"  # an output file must be set before run()
results = yn.run()
```

## Population Genetics

### Bio.PopGen - Population Genetics Analysis

Tools for population-level genetic analysis, built around the GenePop file format and the external GenePop program (`Bio.PopGen.GenePop`).

**Capabilities:**
- Allele frequency calculations
- Hardy-Weinberg equilibrium testing
- Linkage disequilibrium analysis
- F-statistics (FST, FIS, FIT)
- Tajima's D
- Population structure analysis
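To make the first two items concrete, here is a plain-Python sketch of allele frequencies and a Hardy-Weinberg check for a single biallelic locus. It does not use the `Bio.PopGen` API, and the genotype counts are hypothetical.

```python
# Hypothetical genotype counts for one biallelic locus (alleles A and a)
counts = {"AA": 36, "Aa": 48, "aa": 16}
n = sum(counts.values())

# Allele frequencies: each homozygote carries two copies, each heterozygote one
p = (2 * counts["AA"] + counts["Aa"]) / (2 * n)
q = 1 - p

# Expected genotype counts under Hardy-Weinberg proportions
expected = {"AA": p * p * n, "Aa": 2 * p * q * n, "aa": q * q * n}

# Pearson chi-square statistic (1 degree of freedom for two alleles)
chi_square = sum((counts[g] - expected[g]) ** 2 / expected[g] for g in counts)

print(f"p = {p:.2f}, q = {q:.2f}, chi-square = {chi_square:.3f}")
```

A large chi-square value relative to the 1-degree-of-freedom distribution would suggest a departure from equilibrium; these example counts were chosen to match the expected proportions.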

## Clustering and Machine Learning

### Bio.Cluster - Clustering Algorithms

Statistical clustering for gene expression and other biological data:

**Hierarchical clustering:**
```python
from Bio.Cluster import treecluster

tree = treecluster(data, method='a', dist='e')
# method: 'a'=average, 's'=single, 'm'=maximum, 'c'=centroid
# dist: 'e'=Euclidean, 'c'=correlation, 'a'=absolute correlation
```

**k-means clustering:**
```python
from Bio.Cluster import kcluster

clusterid, error, nfound = kcluster(data, nclusters=5, npass=100)
```
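Both calls expect a 2D numeric array with items in rows. The sketch below runs them on a tiny invented matrix with two obvious groups; `tree.cut(2)` is used to turn the hierarchical tree into two flat clusters for comparison.

```python
import numpy as np
from Bio.Cluster import kcluster, treecluster

# Hypothetical expression matrix: rows are genes, columns are conditions
data = np.array([
    [1.0, 1.1, 0.9],
    [1.2, 0.9, 1.0],
    [8.0, 8.2, 7.9],
    [7.8, 8.1, 8.3],
])

# k-means: clusterid[i] is the cluster assigned to row i
clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)

# Average-linkage hierarchical clustering, cut into two flat clusters
tree = treecluster(data, method="a", dist="e")
hier_ids = tree.cut(2)

print("k-means:", list(clusterid))
print("hierarchical:", list(hier_ids))
```

Cluster numbers are arbitrary labels, so compare groupings rather than the numbers themselves.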

**Self-Organizing Maps (SOM):**
```python
from Bio.Cluster import somcluster

clusterid, celldata = somcluster(data, nxgrid=3, nygrid=3)
```

**Principal Component Analysis:**
```python
from Bio.Cluster import pca

columnmean, coordinates, components, eigenvalues = pca(data)
```

## Visualization

### Bio.Graphics - Genomic Visualization

Tools for creating publication-quality biological graphics.

**GenomeDiagram - Circular and linear genome maps:**
```python
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO

record = SeqIO.read("genome.gb", "genbank")

gd_diagram = GenomeDiagram.Diagram("Genome Map")
gd_track = gd_diagram.new_track(1, greytrack=True)
gd_feature_set = gd_track.new_set()

# Add features
for feature in record.features:
    if feature.type == "gene":
        gd_feature_set.add_feature(feature, color="blue", label=True)

gd_diagram.draw(format="linear", pagesize='A4', fragments=1)
gd_diagram.write("genome_map.pdf", "PDF")
```

**Chromosomes - Chromosome visualization:**
```python
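from Bio.Graphics import BasicChromosome

# Features are SeqFeature objects or (start, end, strand, label, color) tuples
features = [(1000, 2000, None, "gene1", "red"), (3000, 4500, None, "gene2", "blue")]

# A chromosome is built from segments; annotated segments carry the features
chromosome = BasicChromosome.Chromosome("Chromosome 1")
body = BasicChromosome.AnnotatedChromosomeSegment(5000, features)
body.scale = 5000  # relative drawing length of this segment
chromosome.add(body)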
```

## Phenotype Analysis

### Bio.phenotype - Phenotypic Microarray Analysis

Tools for analyzing phenotypic microarray data (e.g., Biolog plates):

**Capabilities:**
- Parse PM plate data
- Growth curve analysis
- Compare phenotypic profiles
- Calculate similarity metrics