Add more scientific skills

2026-03-28 07:33:45 +08:00 · 2025-10-19 14:12:02 -07:00
parent 78d5ac2b56
commit 660c8574d0
210 changed files with 88957 additions and 1 deletions
--- a/scientific-packages/biopython/references/specialized_modules.md
+++ b/scientific-packages/biopython/references/specialized_modules.md
@@ -0,0 +1,612 @@
+# BioPython Specialized Analysis Modules
+
+This document covers BioPython's specialized modules for structural biology, phylogenetics, population genetics, and other advanced analyses.
+
+## Structural Bioinformatics
+
+### Bio.PDB - Protein Structure Analysis
+
+Comprehensive tools for handling macromolecular crystal structures.
+
+#### Structure Hierarchy
+
+PDB structures are organized hierarchically:
+- **Structure** → Models → Chains → Residues → Atoms
+
+```python
+from Bio.PDB import PDBParser
+
+parser = PDBParser()
+structure = parser.get_structure("protein", "1abc.pdb")
+
+# Navigate hierarchy
+for model in structure:
+    for chain in model:
+        for residue in chain:
+            for atom in residue:
+                print(atom.coord)  # xyz coordinates
+```
+
+#### Parsing Structure Files
+
+**PDB format:**
+```python
+from Bio.PDB import PDBParser
+parser = PDBParser(QUIET=True)
+structure = parser.get_structure("myprotein", "structure.pdb")
+```
+
+**mmCIF format:**
+```python
+from Bio.PDB import MMCIFParser
+parser = MMCIFParser(QUIET=True)
+structure = parser.get_structure("myprotein", "structure.cif")
+```
+
+**Fast mmCIF parser:**
+```python
+from Bio.PDB import FastMMCIFParser
+parser = FastMMCIFParser(QUIET=True)
+structure = parser.get_structure("myprotein", "structure.cif")
+```
+
+**MMTF format:**
+```python
+# Requires the optional mmtf-python package
+from Bio.PDB.mmtf import MMTFParser
+structure = MMTFParser.get_structure("structure.mmtf")
+```
+
+**Binary CIF:**
+```python
+from Bio.PDB.binary_cif import BinaryCIFParser
+parser = BinaryCIFParser()
+structure = parser.get_structure("myprotein", "structure.bcif")
+```
+
+#### Downloading Structures
+
+```python
+from Bio.PDB import PDBList
+pdbl = PDBList(obsolete_pdb="obsolete/")  # directory for obsolete entries
+
+# Download specific structure
+pdbl.retrieve_pdb_file("1ABC", file_format="pdb", pdir="structures/")
+
+# Download all obsolete entries (saved under obsolete_pdb)
+pdbl.download_obsolete_entries()
+
+# Update local PDB mirror
+pdbl.update_pdb()
+```
+
+#### Structure Selection and Filtering
+
+```python
+# Select specific chains
+chain_A = structure[0]['A']
+
+# Select specific residues
+residue_10 = chain_A[10]
+
+# Select specific atoms
+ca_atom = residue_10['CA']
+
+# Iterate over specific atom types
+for atom in structure.get_atoms():
+    if atom.name == 'CA':  # Alpha carbons only
+        print(atom.coord)
+```
+
+**Structure selectors:**
+```python
+from Bio.PDB.Polypeptide import is_aa
+
+# Filter by residue type
+for residue in structure.get_residues():
+    if is_aa(residue):
+        print(f"Amino acid: {residue.resname}")
+```
+
+#### Secondary Structure Analysis
+
+**DSSP integration:**
+```python
+from Bio.PDB import DSSP
+
+# Requires DSSP program installed
+model = structure[0]
+dssp = DSSP(model, "structure.pdb")
+
+# Access secondary structure and relative solvent accessibility
+# Keys are (chain_id, residue_id) tuples
+for key in dssp.keys():
+    secondary_structure = dssp[key][2]
+    accessibility = dssp[key][3]
+    print(f"Residue {key}: {secondary_structure}, accessible: {accessibility}")
+```
+
+DSSP codes:
+- H: Alpha helix
+- B: Beta bridge
+- E: Extended strand (beta sheet)
+- G: 3-10 helix
+- I: Pi helix
+- T: Turn
+- S: Bend
+- -: Coil
+
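The eight DSSP codes listed above are often collapsed to three classes (helix, strand, coil) before computing summary statistics. A minimal sketch, assuming one common reduction (H/G/I to helix, E/B to strand, everything else to coil); the mapping and the helper names `reduce_dssp` and `state_fractions` are illustrative choices, not part of BioPython:

```python
from collections import Counter

# Assumed reduction: helices (H, G, I) -> H, strands (E, B) -> E, rest -> C
THREE_STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(codes):
    """Collapse a string of 8-state DSSP codes into 3 states (H, E, C)."""
    return "".join(THREE_STATE.get(code, "C") for code in codes)


def state_fractions(codes):
    """Return the fraction of residues in each reduced state."""
    reduced = reduce_dssp(codes)
    counts = Counter(reduced)
    return {state: counts[state] / len(reduced) for state in "HEC"}


reduced = reduce_dssp("-HHHHGGT-EEEB-S")
fractions = state_fractions("HHHHEE--")
print(reduced, fractions)
```

The input string could be built by joining `dssp[key][2]` over all keys of a `DSSP` object.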
+#### Solvent Accessibility
+
+**Shrake-Rupley algorithm:**
+```python
+from Bio.PDB import ShrakeRupley
+
+sr = ShrakeRupley()
+sr.compute(structure, level="R")  # R=residue, A=atom, C=chain, M=model, S=structure
+
+for residue in structure.get_residues():
+    print(f"{residue.resname} {residue.id[1]}: {residue.sasa} Ų")
+```
+
+**NACCESS wrapper:**
+```python
+from Bio.PDB.NACCESS import NACCESS
+
+# Requires the external NACCESS program
+model = structure[0]
+naccess = NACCESS(model, pdb_file="structure.pdb")
+for residue, data in naccess:
+    print(f"Residue {residue.id[1]}: {data['all_atoms_abs']} Ų")
+```
+
+**Half-sphere exposure:**
+```python
+from Bio.PDB.HSExposure import HSExposureCA
+
+# Computed within BioPython; no external program required
+model = structure[0]
+hse = HSExposureCA(model)  # stores results in residue.xtra
+
+for chain in model:
+    for residue in chain:
+        if 'EXP_HSE_A_U' in residue.xtra:
+            hse_up = residue.xtra['EXP_HSE_A_U']
+            hse_down = residue.xtra['EXP_HSE_A_D']
+```
+
+#### Structural Alignment and Superimposition
+
+**Standard superimposition:**
+```python
+from Bio.PDB import Superimposer
+
+sup = Superimposer()
+sup.set_atoms(ref_atoms, alt_atoms)  # Lists of atoms to align
+sup.apply(structure2.get_atoms())  # Apply transformation
+
+print(f"RMSD: {sup.rms}")
+print(f"Rotation matrix: {sup.rotran[0]}")
+print(f"Translation vector: {sup.rotran[1]}")
+```
+
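The `sup.rms` value is the root-mean-square deviation between the paired atoms after the fit. The formula itself can be sketched in plain Python; this assumes the two coordinate lists are already paired and superimposed, and `rmsd` is an illustrative helper rather than a BioPython function:

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Made-up coordinates: the second set is the first shifted by 2 units along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shifted = [(x, y, z + 2.0) for x, y, z in reference]
print(rmsd(reference, shifted))
```

A superimposer first finds the rotation and translation that minimize this quantity; the sketch only evaluates it for fixed coordinates.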
+**QCP (Quaternion Characteristic Polynomial) method:**
+```python
+from Bio.PDB.qcprot import QCPSuperimposer
+
+qcp = QCPSuperimposer()
+qcp.set(ref_coords, alt_coords)  # N x 3 NumPy coordinate arrays
+qcp.run()
+print(f"RMSD: {qcp.get_rms()}")
+```
+
+#### Geometric Calculations
+
+**Distances and angles:**
+```python
+# Distance between atoms (subtracting Atom objects returns the distance)
+dist = atom1 - atom2
+
+# Angle between three atoms (radians); the functions take Vector objects
+from Bio.PDB import calc_angle
+angle = calc_angle(atom1.get_vector(), atom2.get_vector(), atom3.get_vector())
+
+# Dihedral angle (radians)
+from Bio.PDB import calc_dihedral
+dihedral = calc_dihedral(atom1.get_vector(), atom2.get_vector(),
+                         atom3.get_vector(), atom4.get_vector())
+```
+
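For readers who want to see what a dihedral calculation involves, the following is a minimal plain-Python sketch using the standard `atan2` formulation on four points. The helper names are hypothetical and the result is converted to degrees for readability; it is not BioPython's implementation:

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def dihedral_degrees(p0, p1, p2, p3):
    """Signed torsion angle (degrees) about the p1-p2 bond."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# Made-up points: the central bond lies along the z axis
cis = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
quarter = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
trans = dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(cis, quarter, trans)
```

Using `atan2` rather than `acos` keeps the sign of the angle and avoids loss of precision near 0° and 180°.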
+**Vector operations:**
+```python
+from Bio.PDB.vectors import Vector
+
+v1 = Vector(atom1.coord)
+v2 = Vector(atom2.coord)
+
+# Vector operations
+v3 = v1 + v2
+v4 = v1 - v2
+dot_product = v1 * v2
+cross_product = v1 ** v2
+magnitude = v1.norm()
+normalized = v1.normalized()
+```
+
+#### Internal Coordinates
+
+Advanced residue geometry representation:
+```python
+from Bio.PDB import internal_coords
+
+# Enable internal coordinates
+structure.atom_to_internal_coordinates()
+
+# Access phi, psi angles
+for residue in structure.get_residues():
+    if residue.internal_coord:
+        print(f"Phi: {residue.internal_coord.get_angle('phi')}")
+        print(f"Psi: {residue.internal_coord.get_angle('psi')}")
+```
+
+#### Writing Structures
+
+```python
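+from Bio.PDB import PDBIO, Select
+
+class ChainSelector(Select):
+    """Accept only the chain with the given ID."""
+    def __init__(self, chain_id):
+        self.chain_id = chain_id
+
+    def accept_chain(self, chain):
+        return chain.id == self.chain_id
+
+io = PDBIO()
+io.set_structure(structure)
+io.save("output.pdb")
+
+# Save specific selection
+io.save("chain_A.pdb", select=ChainSelector("A"))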
+```
+
+### Bio.SCOP - SCOP Database
+
+Access to the Structural Classification of Proteins (SCOP) database.
+
+### Bio.KEGG - Pathway Analysis
+
+Interface to KEGG (Kyoto Encyclopedia of Genes and Genomes) databases:
+
+**Capabilities:**
+- Access pathway maps
+- Retrieve enzyme data
+- Get compound information
+- Query orthology relationships
+
+## Phylogenetics
+
+### Bio.Phylo - Phylogenetic Tree Analysis
+
+Comprehensive phylogenetic tree manipulation and analysis.
+
+#### Reading and Writing Trees
+
+**Supported formats:**
+- Newick: Simple, widely-used format
+- NEXUS: Rich metadata format
+- PhyloXML: XML-based with extensive annotations
+- NeXML: Modern XML standard
+
+```python
+from Bio import Phylo
+
+# Read tree
+tree = Phylo.read("tree.nwk", "newick")
+
+# Read multiple trees
+trees = list(Phylo.parse("trees.nex", "nexus"))
+
+# Write tree
+Phylo.write(tree, "output.nwk", "newick")
+```
+
+#### Tree Visualization
+
+**ASCII visualization:**
+```python
+Phylo.draw_ascii(tree)
+```
+
+**Matplotlib plotting:**
+```python
+import matplotlib.pyplot as plt
+Phylo.draw(tree)
+plt.show()
+
+# With customization
+fig, ax = plt.subplots(figsize=(10, 8))
+Phylo.draw(tree, axes=ax, do_show=False)
+ax.set_title("My Phylogenetic Tree")
+plt.show()
+```
+
+#### Tree Navigation and Manipulation
+
+**Find clades:**
+```python
+# Get all terminal nodes (leaves)
+terminals = tree.get_terminals()
+
+# Get all nonterminal nodes
+nonterminals = tree.get_nonterminals()
+
+# Find specific clade
+target = tree.find_any(name="Species_A")
+
+# Find all matching clades
+matches = tree.find_clades(terminal=True)
+```
+
+**Tree properties:**
+```python
+# Count terminals
+num_species = tree.count_terminals()
+
+# Get total branch length
+total_length = tree.total_branch_length()
+
+# Check if tree is bifurcating
+is_bifurcating = tree.is_bifurcating()
+
+# Get maximum root-to-tip distance
+max_dist = max(tree.depths().values())
+```
+
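A root-to-tip distance is the sum of branch lengths along the path from the root to a leaf. As a rough illustration that does not depend on BioPython, the sketch below uses a hypothetical nested-tuple tree `(name, branch_length, children)` with made-up branch lengths:

```python
def tip_depths(node, depth=0.0):
    """Map each leaf name to its summed branch length from the root."""
    name, length, children = node
    depth += length
    if not children:
        return {name: depth}
    depths = {}
    for child in children:
        depths.update(tip_depths(child, depth))
    return depths


# Hypothetical tree: ((B:0.5, C:1.0):0.25, A:0.5)
tree = (
    "root", 0.0, [
        ("A", 0.5, []),
        ("inner", 0.25, [("B", 0.5, []), ("C", 1.0, [])]),
    ],
)
depths = tip_depths(tree)
deepest = max(depths, key=depths.get)
print(depths, deepest)
```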
+**Tree modification:**
+```python
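+# Prune tree to specific taxa (prune() removes one terminal at a time)
+keep_taxa = {"Species_A", "Species_B", "Species_C"}
+for leaf in tree.get_terminals():
+    if leaf.name not in keep_taxa:
+        tree.prune(leaf)
+
+# Collapse short branches (branch_length may be None)
+tree.collapse_all(lambda c: c.branch_length is not None and c.branch_length < 0.01)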
+
+# Ladderize (sort branches)
+tree.ladderize()
+
+# Root tree at midpoint
+tree.root_at_midpoint()
+
+# Root at specific clade
+outgroup = tree.find_any(name="Outgroup_species")
+tree.root_with_outgroup(outgroup)
+```
+
+**Calculate distances:**
+```python
+# Distance between two clades
+dist = tree.distance(clade1, clade2)
+
+# Distance from root
+root_dist = tree.distance(tree.root, terminal_clade)
+```
+
+#### Tree Construction
+
+**Distance-based methods:**
+```python
+from Bio.Phylo.TreeConstruction import DistanceTreeConstructor, DistanceCalculator
+from Bio import AlignIO
+
+# Load alignment
+aln = AlignIO.read("alignment.fasta", "fasta")
+
+# Calculate distance matrix
+calculator = DistanceCalculator('identity')
+dm = calculator.get_distance(aln)
+
+# Construct tree using UPGMA
+constructor = DistanceTreeConstructor()
+tree_upgma = constructor.upgma(dm)
+
+# Or using Neighbor-Joining
+tree_nj = constructor.nj(dm)
+```
+
+**Parsimony method:**
+```python
+from Bio.Phylo.TreeConstruction import ParsimonyScorer, NNITreeSearcher
+
+scorer = ParsimonyScorer()
+searcher = NNITreeSearcher(scorer)
+tree = searcher.search(starting_tree, alignment)
+```
+
+**Distance calculator models** (the names accepted by the installed version are listed in `DistanceCalculator.models`):
+- 'identity': Simple identity scoring
+- 'blastn': BLAST nucleotide scoring
+- 'blastp': BLAST protein scoring
+- 'dnafull': EMBOSS DNA scoring matrix
+- 'blosum62': BLOSUM62 protein matrix
+- 'pam250': PAM250 protein matrix
+
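To make the simplest of these models concrete: an identity distance is the fraction of aligned columns at which two sequences differ. The sketch below is a simplified stand-in with hypothetical helper names and a made-up alignment; it treats gap characters like any other symbol, which may differ from how BioPython handles them:

```python
from itertools import combinations


def identity_distance(seq_a, seq_b):
    """Fraction of aligned columns at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned (equal, non-zero length)")
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


def distance_matrix(named_seqs):
    """Pairwise distances keyed by (name_i, name_j)."""
    return {
        (n1, n2): identity_distance(s1, s2)
        for (n1, s1), (n2, s2) in combinations(named_seqs.items(), 2)
    }


# Made-up aligned sequences
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TCGTTCGA"}
matrix = distance_matrix(alignment)
print(matrix)
```

A matrix of this kind is what UPGMA and Neighbor-Joining take as input.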
+#### Consensus Trees
+
+```python
+from Bio.Phylo.Consensus import majority_consensus, strict_consensus
+
+# Strict consensus
+consensus_strict = strict_consensus(trees)
+
+# Majority rule consensus
+consensus_majority = majority_consensus(trees, cutoff=0.5)
+
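+# Bootstrap consensus: resample the alignment (here `aln`, a loaded
+# MultipleSeqAlignment), rebuild a tree per replicate, then summarize
+from Bio.Phylo.Consensus import bootstrap_consensus
+from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
+constructor = DistanceTreeConstructor(DistanceCalculator("identity"), "nj")
+bootstrap_tree = bootstrap_consensus(aln, 100, constructor, majority_consensus)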
+```
+
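The idea behind strict and majority-rule consensus can be sketched without BioPython by representing each tree as a set of clades, where a clade is the set of taxa below an internal node. The representation and the helper `majority_clades` are assumptions made for illustration; the library works on tree objects and may treat the cutoff boundary differently:

```python
from collections import Counter


def majority_clades(trees, cutoff=0.5):
    """Clades found in more than `cutoff` of the trees, with their support."""
    counts = Counter(clade for tree in trees for clade in tree)
    return {
        clade: n / len(trees)
        for clade, n in counts.items()
        if n / len(trees) > cutoff
    }


# Three made-up trees on taxa A-D, each given as a set of clades
trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
]
support = majority_clades(trees)
strict = set.intersection(*trees)  # clades present in every tree
print(support, strict)
```

Here the third tree disagrees with the other two, so the majority-rule result keeps the AB and CD clades while the strict consensus keeps none.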
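+#### External Tool Wrappers
+
+These wrappers build on `Bio.Application`, which recent BioPython releases deprecate in favor of calling the tools directly with `subprocess`; check the installed version before relying on them.
+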
+**PhyML:**
+```python
+from Bio.Phylo.Applications import PhymlCommandline
+
+cmd = PhymlCommandline(input="alignment.phy", datatype="nt", model="HKY85", alpha="e", bootstrap=100)
+stdout, stderr = cmd()
+tree = Phylo.read("alignment.phy_phyml_tree.txt", "newick")
+```
+
+**RAxML:**
+```python
+from Bio.Phylo.Applications import RaxmlCommandline
+
+cmd = RaxmlCommandline(
+    sequences="alignment.phy",
+    model="GTRGAMMA",
+    name="mytree",
+    parsimony_seed=12345
+)
+stdout, stderr = cmd()
+```
+
+**FastTree:**
+```python
+from Bio.Phylo.Applications import FasttreeCommandline
+
+cmd = FasttreeCommandline(input="alignment.fasta", out="tree.nwk", nt=True, gtr=True, gamma=True)
+stdout, stderr = cmd()
+```
+
+### Bio.Phylo.PAML - Evolutionary Analysis
+
+Interface to PAML (Phylogenetic Analysis by Maximum Likelihood):
+
+**CODEML - Codon-based analysis:**
+```python
+from Bio.Phylo.PAML import codeml
+
+cml = codeml.Codeml()
+cml.alignment = "alignment.phy"
+cml.tree = "tree.nwk"
+cml.out_file = "results.out"
+cml.working_dir = "./paml_wd"
+
+# Set parameters
+cml.set_options(
+    seqtype=1,      # Codon sequences
+    model=0,        # One omega ratio
+    NSsites=[0, 1, 2],  # Test different models
+    CodonFreq=2,    # F3x4 codon frequencies
+)
+
+results = cml.run()
+```
+
+**BaseML - Nucleotide-based analysis:**
+```python
+from Bio.Phylo.PAML import baseml
+
+bml = baseml.Baseml()
+bml.alignment = "alignment.phy"
+bml.tree = "tree.nwk"
+bml.out_file = "baseml.out"
+results = bml.run()
+```
+
+**YN00 - Yang-Nielsen method:**
+```python
+from Bio.Phylo.PAML import yn00
+
+yn = yn00.Yn00()
+yn.alignment = "alignment.phy"
+yn.out_file = "yn00.out"
+results = yn.run()
+```
+
+## Population Genetics
+
+### Bio.PopGen - Population Genetics Analysis
+
+Tools for population-level genetic analysis, built mainly around parsing GenePop files and driving the external GenePop program.
+
+**Capabilities:**
+- Allele frequency calculations
+- Hardy-Weinberg equilibrium testing
+- Linkage disequilibrium analysis
+- F-statistics (FST, FIS, FIT)
+- Tajima's D
+- Population structure analysis
+
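Two of the capabilities above, allele frequency calculation and Hardy-Weinberg testing, reduce to short formulas for a biallelic locus. The sketch below works directly on genotype counts; the function names and counts are made up for illustration and do not correspond to a `Bio.PopGen` API. It assumes both alleles are present, so no expected count is zero:

```python
def allele_frequencies(n_AA, n_Aa, n_aa):
    """Return (p, q), the frequencies of alleles A and a."""
    total = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * total)
    return p, 1 - p


def hardy_weinberg_chi2(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed and expected genotype counts."""
    total = n_AA + n_Aa + n_aa
    p, q = allele_frequencies(n_AA, n_Aa, n_aa)
    expected = (p * p * total, 2 * p * q * total, q * q * total)
    observed = (n_AA, n_Aa, n_aa)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Made-up genotype counts
p, q = allele_frequencies(30, 40, 30)
chi2 = hardy_weinberg_chi2(30, 40, 30)
print(p, q, chi2)
```

With one degree of freedom, the statistic is compared against the chi-square distribution to decide whether the locus departs from equilibrium.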
+## Clustering and Machine Learning
+
+### Bio.Cluster - Clustering Algorithms
+
+Statistical clustering for gene expression and other biological data:
+
+**Hierarchical clustering:**
+```python
+from Bio.Cluster import treecluster
+
+tree = treecluster(data, method='a', dist='e')
+# method: 'a'=average, 's'=single, 'm'=maximum, 'c'=centroid
+# dist: 'e'=Euclidean, 'c'=correlation, 'a'=absolute correlation
+```
+
+**k-means clustering:**
+```python
+from Bio.Cluster import kcluster
+
+clusterid, error, nfound = kcluster(data, nclusters=5, npass=100)
+```
+
+**Self-Organizing Maps (SOM):**
+```python
+from Bio.Cluster import somcluster
+
+clusterid, celldata = somcluster(data, nxgrid=3, nygrid=3)
+```
+
+**Principal Component Analysis:**
+```python
+from Bio.Cluster import pca
+
+columnmean, coordinates, components, eigenvalues = pca(data)
+```
+
+## Visualization
+
+### Bio.Graphics - Genomic Visualization
+
+Tools for creating publication-quality biological graphics.
+
+**GenomeDiagram - Circular and linear genome maps:**
+```python
+from Bio.Graphics import GenomeDiagram
+from Bio import SeqIO
+
+record = SeqIO.read("genome.gb", "genbank")
+
+gd_diagram = GenomeDiagram.Diagram("Genome Map")
+gd_track = gd_diagram.new_track(1, greytrack=True)
+gd_feature_set = gd_track.new_set()
+
+# Add features
+for feature in record.features:
+    if feature.type == "gene":
+        gd_feature_set.add_feature(feature, color="blue", label=True)
+
+gd_diagram.draw(format="linear", pagesize='A4', fragments=1)
+gd_diagram.write("genome_map.pdf", "PDF")
+```
+
+**Chromosomes - Chromosome visualization:**
+```python
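+from Bio.Graphics.BasicChromosome import Chromosome, ChromosomeSegment
+from reportlab.lib import colors
+
+chromosome = Chromosome("Chromosome 1")
+# Chromosome.add() takes segment objects; scale sets the relative size
+for name, length, color in [("gene1", 1000, colors.red), ("gene2", 1500, colors.blue)]:
+    segment = ChromosomeSegment()
+    segment.scale = length
+    segment.fill_color = color
+    segment.label = name
+    chromosome.add(segment)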
+```
+
+## Phenotype Analysis
+
+### Bio.phenotype - Phenotypic Microarray Analysis
+
+Tools for analyzing phenotypic microarray data (e.g., Biolog plates):
+
+**Capabilities:**
+- Parse PM plate data
+- Growth curve analysis
+- Compare phenotypic profiles
+- Calculate similarity metrics
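Growth curve analysis often starts from a single summary number per well, such as the area under the signal curve. A minimal sketch using the trapezoidal rule on made-up readings; `area_under_curve` is a hypothetical helper and not part of `Bio.phenotype`:

```python
def area_under_curve(times, signals):
    """Trapezoidal area under a growth curve sampled at the given times."""
    if len(times) != len(signals) or len(times) < 2:
        raise ValueError("need at least two matching time/signal points")
    points = list(zip(times, signals))
    return sum(
        (t1 - t0) * (s0 + s1) / 2
        for (t0, s0), (t1, s1) in zip(points, points[1:])
    )


# Made-up readings for one well: hours and signal intensity
hours = [0, 1, 2, 3]
signal = [0, 10, 30, 40]
auc = area_under_curve(hours, signal)
print(auc)
```

Comparing such per-well values between plates is one simple way to contrast phenotypic profiles.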