Add scVelo RNA velocity analysis workflow and IQ-TREE reference documentation

- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization.
- Added a script for running RNA velocity analysis with customizable parameters and output options.
- Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation.
- Included references for velocity models and their mathematical framework, along with a comparison of different models.
- Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
2026-03-27 07:09:27 +08:00 · 2026-03-03 07:15:36 -05:00
parent b271271df4
commit 7f94783fab
27 changed files with 6961 additions and 0 deletions
--- a/scientific-skills/molecular-dynamics/SKILL.md
+++ b/scientific-skills/molecular-dynamics/SKILL.md
@@ -0,0 +1,457 @@
+---
+name: molecular-dynamics
+description: Run and analyze molecular dynamics simulations with OpenMM and MDAnalysis. Set up protein/small molecule systems, define force fields, run energy minimization and production MD, analyze trajectories (RMSD, RMSF, contact maps, free energy surfaces). For structural biology, drug binding, and biophysics.
+license: MIT
+metadata:
+    skill-author: Kuan-lin Huang
+---
+
+# Molecular Dynamics
+
+## Overview
+
+Molecular dynamics (MD) simulation computationally models the time evolution of molecular systems by integrating Newton's equations of motion. This skill covers two complementary tools:
+
+- **OpenMM** (https://openmm.org/): High-performance MD simulation engine with GPU support, Python API, and flexible force field support
+- **MDAnalysis** (https://mdanalysis.org/): Python library for reading, writing, and analyzing MD trajectories from all major simulation packages
+
+**Installation:**
+```bash
+conda install -c conda-forge openmm mdanalysis nglview
+# or
+pip install openmm mdanalysis
+```
+
+## When to Use This Skill
+
+Use molecular dynamics when:
+
+- **Protein stability analysis**: How does a mutation affect protein dynamics?
+- **Drug binding simulations**: Characterize binding mode and residence time of a ligand
+- **Conformational sampling**: Explore protein flexibility and conformational changes
+- **Protein-protein interaction**: Model interface dynamics and binding energetics
+- **RMSD/RMSF analysis**: Quantify structural fluctuations from a reference structure
+- **Free energy estimation**: Compute binding free energy or conformational free energy
+- **Membrane simulations**: Model proteins in lipid bilayers
+- **Intrinsically disordered proteins**: Study IDR conformational ensembles
+
+## Core Workflow: OpenMM Simulation
+
+### 1. System Preparation
+
+```python
+from openmm.app import *
+from openmm import *
+from openmm.unit import *
+
+def prepare_system_from_pdb(pdb_file, forcefield_name="amber14-all.xml",
+                              water_model="amber14/tip3pfb.xml"):
+    """
+    Prepare an OpenMM system from a PDB file.
+
+    Args:
+        pdb_file: Path to cleaned PDB file (use PDBFixer for raw PDB files)
+        forcefield_name: Force field XML file
+        water_model: Water model XML file
+
+    Returns:
+        modeller, system: the solvated Modeller (topology and positions) and the OpenMM System
+    """
+    # Load PDB
+    pdb = PDBFile(pdb_file)
+
+    # Load force field
+    forcefield = ForceField(forcefield_name, water_model)
+
+    # Add hydrogens and solvate
+    modeller = Modeller(pdb.topology, pdb.positions)
+    modeller.addHydrogens(forcefield)
+
+    # Add solvent box (10 Å padding, 150 mM NaCl)
+    modeller.addSolvent(
+        forcefield,
+        model='tip3p',
+        padding=10*angstroms,
+        ionicStrength=0.15*molar
+    )
+
+    print(f"System: {modeller.topology.getNumAtoms()} atoms, "
+          f"{modeller.topology.getNumResidues()} residues")
+
+    # Create system
+    system = forcefield.createSystem(
+        modeller.topology,
+        nonbondedMethod=PME,         # Particle Mesh Ewald for long-range electrostatics
+        nonbondedCutoff=1.0*nanometer,
+        constraints=HBonds,           # Constrain bonds involving hydrogen (allows 2 fs timestep)
+        rigidWater=True,
+        ewaldErrorTolerance=0.0005
+    )
+
+    return modeller, system
+```
+
+### 2. Energy Minimization
+
+```python
+from openmm.app import *
+from openmm import *
+from openmm.unit import *
+
+def minimize_energy(modeller, system, output_pdb="minimized.pdb",
+                     max_iterations=1000, tolerance=10.0):
+    """
+    Energy minimize the system to remove steric clashes.
+
+    Args:
+        modeller: Modeller object with topology and positions
+        system: OpenMM System
+        output_pdb: Path to save minimized structure
+        max_iterations: Maximum minimization steps
+        tolerance: Convergence criterion in kJ/mol/nm
+
+    Returns:
+        simulation object with minimized positions
+    """
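+    # Set up integrator. The minimizer ignores it, but the returned Simulation
+    # reuses it for equilibration and production, so use the 2 fs timestep
+    # that the step counts in the later sections assume.
+    integrator = LangevinMiddleIntegrator(300*kelvin, 1/picosecond, 0.002*picoseconds)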
+
+    # Create simulation
+    # Use GPU if available (CUDA or OpenCL), fall back to CPU
+    try:
+        platform = Platform.getPlatformByName('CUDA')
+        properties = {'DeviceIndex': '0', 'Precision': 'mixed'}
+    except Exception:
+        try:
+            platform = Platform.getPlatformByName('OpenCL')
+            properties = {}
+        except Exception:
+            platform = Platform.getPlatformByName('CPU')
+            properties = {}
+
+    simulation = Simulation(
+        modeller.topology, system, integrator,
+        platform, properties
+    )
+    simulation.context.setPositions(modeller.positions)
+
+    # Check initial energy
+    state = simulation.context.getState(getEnergy=True)
+    print(f"Initial energy: {state.getPotentialEnergy()}")
+
+    # Minimize
+    simulation.minimizeEnergy(
+        tolerance=tolerance*kilojoules_per_mole/nanometer,
+        maxIterations=max_iterations
+    )
+
+    state = simulation.context.getState(getEnergy=True, getPositions=True)
+    print(f"Minimized energy: {state.getPotentialEnergy()}")
+
+    # Save minimized structure
+    with open(output_pdb, 'w') as f:
+        PDBFile.writeFile(simulation.topology, state.getPositions(), f)
+
+    return simulation
+```
+
+### 3. NVT Equilibration
+
+```python
+from openmm.app import *
+from openmm import *
+from openmm.unit import *
+
+def run_nvt_equilibration(simulation, n_steps=50000, temperature=300,
+                            report_interval=1000, output_prefix="nvt"):
+    """
+    NVT equilibration: constant N, V, T.
+    Equilibrate velocities to target temperature.
+
+    Args:
+        simulation: OpenMM Simulation (after minimization)
+        n_steps: Number of MD steps (50000 × 2fs = 100 ps)
+        temperature: Temperature in Kelvin
+        report_interval: Steps between data reports
+        output_prefix: File prefix for trajectory and log
+    """
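+    # No position restraints are applied here; for a gentler equilibration,
+    # optionally restrain backbone or heavy atoms before this step.
+
+    # Set thermostat temperature and draw initial velocities
+    simulation.integrator.setTemperature(temperature*kelvin)
+    simulation.context.setVelocitiesToTemperature(temperature*kelvin)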
+
+    # Add reporters
+    simulation.reporters = []
+
+    # Log file
+    simulation.reporters.append(
+        StateDataReporter(
+            f"{output_prefix}_log.txt",
+            report_interval,
+            step=True,
+            potentialEnergy=True,
+            kineticEnergy=True,
+            temperature=True,
+            volume=True,
+            speed=True
+        )
+    )
+
+    # DCD trajectory (compact binary format)
+    simulation.reporters.append(
+        DCDReporter(f"{output_prefix}_traj.dcd", report_interval)
+    )
+
+    print(f"Running NVT equilibration: {n_steps} steps ({n_steps*2/1000:.1f} ps)")
+    simulation.step(n_steps)
+    print("NVT equilibration complete")
+
+    return simulation
+```
+
+### 4. NPT Equilibration and Production
+
+```python
+from openmm.app import *
+from openmm import *
+from openmm.unit import *
+
+def run_npt_production(simulation, n_steps=500000, temperature=300, pressure=1.0,
+                        report_interval=5000, output_prefix="npt"):
+    """
+    NPT production run: constant N, P, T.
+
+    Args:
+        n_steps: Production steps (500000 × 2fs = 1 ns)
+        temperature: Temperature in Kelvin
+        pressure: Pressure in bar
+        report_interval: Steps between reports
+    """
+    # Add Monte Carlo barostat for pressure control
+    system = simulation.context.getSystem()
+    system.addForce(MonteCarloBarostat(pressure*bar, temperature*kelvin, 25))
+    simulation.context.reinitialize(preserveState=True)
+
+    # Update reporters
+    simulation.reporters = []
+    simulation.reporters.append(
+        StateDataReporter(
+            f"{output_prefix}_log.txt",
+            report_interval,
+            step=True,
+            potentialEnergy=True,
+            temperature=True,
+            density=True,
+            speed=True
+        )
+    )
+    simulation.reporters.append(
+        DCDReporter(f"{output_prefix}_traj.dcd", report_interval)
+    )
+
+    # Save checkpoints
+    simulation.reporters.append(
+        CheckpointReporter(f"{output_prefix}_checkpoint.chk", 50000)
+    )
+
+    print(f"Running NPT production: {n_steps} steps ({n_steps*2/1000000:.2f} ns)")
+    simulation.step(n_steps)
+    print("Production MD complete")
+    return simulation
+```
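+
+The step counts above follow from simulation length divided by timestep. A minimal sketch of that arithmetic, assuming the 2 fs timestep used in this workflow (`steps_for_duration` is a hypothetical helper, not part of OpenMM):
+
+```python
+def steps_for_duration(duration_ps, timestep_fs=2.0):
+    """Return the number of MD steps needed to cover duration_ps picoseconds."""
+    if duration_ps < 0 or timestep_fs <= 0:
+        raise ValueError("duration must be non-negative and timestep positive")
+    # 1 ps = 1000 fs; round to the nearest whole step
+    return round(duration_ps * 1000.0 / timestep_fs)
+
+
+nvt_steps = steps_for_duration(100)      # 100 ps NVT equilibration
+prod_steps = steps_for_duration(1000)    # 1 ns production
+print(nvt_steps, prod_steps)
+```
+
+Passing the result as `n_steps` keeps the run length explicit if the timestep is later changed, for example to 4 fs with hydrogen mass repartitioning.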
+
+## Trajectory Analysis with MDAnalysis
+
+### 1. Load Trajectory
+
+```python
+import MDAnalysis as mda
+from MDAnalysis.analysis import rms, align, contacts
+import numpy as np
+import matplotlib.pyplot as plt
+
+def load_trajectory(topology_file, trajectory_file):
+    """
+    Load an MD trajectory with MDAnalysis.
+
+    Args:
+        topology_file: PDB, PSF, or other topology file
+        trajectory_file: DCD, XTC, TRR, or other trajectory
+    """
+    u = mda.Universe(topology_file, trajectory_file)
+    print(f"Universe: {u.atoms.n_atoms} atoms, {u.trajectory.n_frames} frames")
+    print(f"Time range: 0 to {u.trajectory.totaltime:.0f} ps")
+    return u
+```
+
+### 2. RMSD Analysis
+
+```python
+def compute_rmsd(u, selection="backbone", reference_frame=0):
+    """
+    Compute RMSD of selected atoms relative to reference frame.
+
+    Args:
+        u: MDAnalysis Universe
+        selection: Atom selection string (MDAnalysis syntax)
+        reference_frame: Frame index for reference structure
+
+    Returns:
+        numpy array of shape (n_frames, 3) with columns frame, time (ps), RMSD (Å)
+    """
+    # Align trajectory to minimize RMSD
+    aligner = align.AlignTraj(u, u, select=selection, in_memory=True)
+    aligner.run()
+
+    # Compute RMSD
+    R = rms.RMSD(u, select=selection, ref_frame=reference_frame)
+    R.run()
+
+    rmsd_data = R.results.rmsd  # columns: frame, time, RMSD
+    return rmsd_data
+
+def plot_rmsd(rmsd_data, title="RMSD over time", output_file="rmsd.png"):
+    """Plot RMSD over simulation time."""
+    fig, ax = plt.subplots(figsize=(10, 4))
+    ax.plot(rmsd_data[:, 1] / 1000, rmsd_data[:, 2], 'b-', linewidth=0.5)
+    ax.set_xlabel("Time (ns)")
+    ax.set_ylabel("RMSD (Å)")
+    ax.set_title(title)
+    ax.axhline(rmsd_data[:, 2].mean(), color='r', linestyle='--',
+               label=f'Mean: {rmsd_data[:, 2].mean():.2f} Å')
+    ax.legend()
+    plt.tight_layout()
+    plt.savefig(output_file, dpi=150)
+    return fig
+```
+
+### 3. RMSF Analysis (Per-Residue Flexibility)
+
+```python
+def compute_rmsf(u, selection="backbone", start_frame=0):
+    """
+    Compute per-residue RMSF (flexibility).
+
+    Assumes the trajectory has already been aligned (e.g. with align.AlignTraj);
+    otherwise overall rotation and translation inflate the values.
+
+    Returns:
+        resids, rmsf_values arrays
+    """
+    # Select atoms
+    atoms = u.select_atoms(selection)
+
+    # Compute per-atom RMSF (one value per atom, in the order of `atoms`)
+    R = rms.RMSF(atoms)
+    R.run(start=start_frame)
+    rmsf = R.results.rmsf
+
+    # Average by residue. Index with a mask over `atoms`, not with global
+    # atom indices, because `rmsf` is ordered like `atoms`.
+    resids = []
+    rmsf_per_res = []
+    for res in atoms.residues:
+        mask = atoms.resindices == res.resindex
+        resids.append(res.resid)
+        rmsf_per_res.append(rmsf[mask].mean())
+
+    return np.array(resids), np.array(rmsf_per_res)
+```
+
+### 4. Protein-Ligand Contacts
+
+```python
+from MDAnalysis.lib.distances import distance_array
+
+def analyze_contacts(u, protein_sel="protein", ligand_sel="resname LIG",
+                      radius=4.5, start_frame=0):
+    """
+    Track protein-ligand contacts over trajectory.
+
+    Args:
+        radius: Contact distance cutoff in Angstroms
+
+    Returns:
+        list with one set of contacting protein resids per frame
+    """
+    protein = u.select_atoms(protein_sel)
+    ligand = u.select_atoms(ligand_sel)
+
+    contact_frames = []
+    for ts in u.trajectory[start_frame:]:
+        # Protein-ligand distance matrix of shape (n_protein, n_ligand);
+        # minimum-image aware when the frame carries box dimensions
+        dist = distance_array(protein.positions, ligand.positions, box=ts.dimensions)
+        # Protein atoms with at least one ligand atom within the cutoff
+        in_contact = (dist < radius).any(axis=1)
+        contact_frames.append({int(r) for r in protein.resids[in_contact]})
+
+    return contact_frames
+```
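+
+The per-frame sets returned above can be condensed into a per-residue occupancy, i.e. the fraction of frames in which each residue touches the ligand. A minimal sketch in plain Python (`contact_occupancy` is a hypothetical helper, and the example sets are made up for illustration):
+
+```python
+from collections import Counter
+
+
+def contact_occupancy(contact_frames):
+    """Map each resid to the fraction of frames in which it is in contact."""
+    n_frames = len(contact_frames)
+    if n_frames == 0:
+        return {}
+    counts = Counter()
+    for residues in contact_frames:
+        counts.update(residues)  # a set contributes each resid at most once
+    return {resid: n / n_frames for resid, n in counts.items()}
+
+
+# Illustrative input: three frames of contacting resids
+frames = [{10, 45}, {45}, {45, 102}]
+occupancy = contact_occupancy(frames)
+print(sorted(occupancy.items()))
+```
+
+Residues with occupancy close to 1 are persistent contacts, while low values point to transient interactions.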
+
+## Force Field Selection Guide
+
+| System | Recommended Force Field | Water Model |
+|--------|------------------------|-------------|
+| Standard proteins | AMBER14 (`amber14-all.xml`) | TIP3P-FB |
+| Proteins + small molecules | AMBER14 + GAFF2 | TIP3P-FB |
+| Membrane proteins | CHARMM36m | TIP3P |
+| Nucleic acids | AMBER99-bsc1 or AMBER14 | TIP3P |
+| Disordered proteins | ff19SB or CHARMM36m | OPC (ff19SB) or TIP3P (CHARMM36m) |
+
+## System Preparation Tools
+
+### PDBFixer (for raw PDB files)
+
+```python
+from pdbfixer import PDBFixer
+from openmm.app import PDBFile
+
+def fix_pdb(input_pdb, output_pdb, ph=7.0):
+    """Fix common PDB issues: missing residues, atoms, add H, standardize."""
+    fixer = PDBFixer(filename=input_pdb)
+    fixer.findMissingResidues()
+    fixer.findNonstandardResidues()
+    fixer.replaceNonstandardResidues()
+    fixer.removeHeterogens(keepWater=True)    # Remove ligands/ions; keepWater=False also drops crystallographic water
+    fixer.findMissingAtoms()
+    fixer.addMissingAtoms()
+    fixer.addMissingHydrogens(ph)
+
+    with open(output_pdb, 'w') as f:
+        PDBFile.writeFile(fixer.topology, fixer.positions, f)
+
+    return output_pdb
+```
+
+### Small Molecule Parameterization (via OpenFF Toolkit)
+
+```python
+# The OpenFF Toolkit assigns OpenFF parameters, which are distinct from GAFF2.
+# For GAFF2 itself, use an AmberTools-based route such as ACPYPE.
+# pip install openff-toolkit
+from openff.toolkit import Molecule, ForceField as OFFForceField
+from openff.interchange import Interchange
+
+def parameterize_ligand(smiles, ff_name="openff-2.0.0.offxml"):
+    """Generate OpenFF parameters for a small molecule given as SMILES."""
+    mol = Molecule.from_smiles(smiles)
+    mol.generate_conformers(n_conformers=1)
+
+    off_ff = OFFForceField(ff_name)
+    interchange = off_ff.create_interchange(mol.to_topology())
+    return interchange
+```
+
+## Best Practices
+
+- **Always minimize before MD**: Raw PDB structures often contain steric clashes
+- **Equilibrate before production**: NVT (50–100 ps) → NPT (100–500 ps) → Production
+- **Use GPU**: Simulations are 10–100× faster on GPU (CUDA/OpenCL)
+- **2 fs timestep with HBonds constraints**: Standard; a 4 fs timestep requires HMR (hydrogen mass repartitioning)
+- **Analyze only equilibrated trajectory**: Discard first 20–50% as equilibration
+- **Save checkpoints**: MD runs can fail; checkpoints allow restart
+- **Periodic boundary conditions**: Required for solvated systems
+- **PME for electrostatics**: More accurate than cutoff methods for charged systems
+
+## Additional Resources
+
+- **OpenMM documentation**: https://openmm.org/documentation.html
+- **MDAnalysis user guide**: https://docs.mdanalysis.org/
+- **GROMACS** (alternative MD engine): https://manual.gromacs.org/
+- **NAMD** (alternative): https://www.ks.uiuc.edu/Research/namd/
+- **CHARMM-GUI** (web-based system builder): https://charmm-gui.org/
+- **AmberTools** (free Amber tools): https://ambermd.org/AmberTools.php
+- **OpenMM paper**: Eastman P et al. (2017) PLOS Computational Biology. PMID: 28278240
+- **MDAnalysis paper**: Michaud-Agrawal N et al. (2011) J Computational Chemistry. PMID: 21500218
--- a/scientific-skills/molecular-dynamics/references/mdanalysis_analysis.md
+++ b/scientific-skills/molecular-dynamics/references/mdanalysis_analysis.md
@@ -0,0 +1,208 @@
+# MDAnalysis Analysis Reference
+
+## MDAnalysis Universe and AtomGroup
+
+```python
+import MDAnalysis as mda
+
+# Load Universe
+u = mda.Universe("topology.pdb", "trajectory.dcd")
+# or for single structure:
+u = mda.Universe("structure.pdb")
+
+# Key attributes
+print(u.atoms.n_atoms)          # Total atoms
+print(u.residues.n_residues)    # Total residues
+print(u.trajectory.n_frames)   # Number of frames
+print(u.trajectory.dt)         # Time step in ps
+print(u.trajectory.totaltime)  # Total simulation time in ps
+```
+
+## Atom Selection Language
+
+MDAnalysis uses a rich selection language:
+
+```python
+# Basic selections
+protein = u.select_atoms("protein")
+backbone = u.select_atoms("backbone")  # CA, N, C, O
+calpha = u.select_atoms("name CA")
+water = u.select_atoms("resname WAT or resname HOH or resname TIP3")
+ligand = u.select_atoms("resname LIG")
+
+# By residue number
+region = u.select_atoms("resid 10:50")
+specific = u.select_atoms("resid 45 and name CA")
+
+# By proximity
+near_ligand = u.select_atoms("protein and around 5.0 resname LIG")
+
+# By property
+charged = u.select_atoms("resname ARG LYS ASP GLU")
+hydrophobic = u.select_atoms("resname ALA VAL LEU ILE PRO PHE TRP MET")
+
+# Boolean combinations
+active_site = u.select_atoms("(resid 100 102 145 200) and protein")
+
+# Inverse
+not_water = u.select_atoms("not (resname WAT HOH)")
+```
+
+## Common Analysis Modules
+
+### RMSD and RMSF
+
+```python
+from MDAnalysis.analysis import rms, align
+
+# Align trajectory to first frame
+align.AlignTraj(u, u, select='backbone', in_memory=True).run()
+
+# RMSD
+R = rms.RMSD(u, u, select='backbone', groupselections=['name CA'])
+R.run()
+# R.results.rmsd: shape (n_frames, 4) = [frame, time, backbone RMSD, CA RMSD]
+# (one extra column per entry in groupselections)
+
+# RMSF (per-atom fluctuations)
+from MDAnalysis.analysis.rms import RMSF
+rmsf = RMSF(u.select_atoms('backbone')).run()
+# rmsf.results.rmsf: per-atom RMSF values in Angstroms
+```
+
+### Radius of Gyration
+
+```python
+import numpy as np
+
+protein = u.select_atoms("protein")  # select once, outside the frame loop
+rg = []
+for ts in u.trajectory:
+    rg.append(protein.radius_of_gyration())
+print(f"Mean Rg: {np.mean(rg):.2f} Å")
+```
+
+### Secondary Structure Analysis
+
+```python
+from MDAnalysis.analysis.dssp import DSSP
+
+# DSSP secondary structure assignment per frame
+# (needs an MDAnalysis release that ships the analysis.dssp module)
+dssp = DSSP(u).run()
+# dssp.results.dssp: array of shape (n_frames, n_residues) of one-letter codes
+# H = alpha-helix, E = beta-strand, - = loop
+```
+
+### Hydrogen Bonds
+
+```python
+from MDAnalysis.analysis.hydrogenbonds import HydrogenBondAnalysis
+
+hbonds = HydrogenBondAnalysis(
+    u,
+    donors_sel="protein and name N",
+    acceptors_sel="protein and name O",
+    d_h_cutoff=1.2,          # donor-H distance (Å)
+    d_a_cutoff=3.0,          # donor-acceptor distance (Å)
+    d_h_a_angle_cutoff=150   # D-H-A angle (degrees)
+)
+hbonds.run()
+
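+# Tabulate individual H-bonds, then count them per frame
+import pandas as pd
+df = pd.DataFrame(hbonds.results.hbonds,
+                  columns=['frame', 'donor_ix', 'hydrogen_ix', 'acceptor_ix',
+                           'DA_dist', 'DHA_angle'])
+hbonds_per_frame = hbonds.count_by_time()  # one count per analyzed frame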
+```
+
+### Principal Component Analysis (PCA)
+
+```python
+from MDAnalysis.analysis import pca
+
+pca_analysis = pca.PCA(u, select='backbone', align=True).run()
+
+# PC variances (raw eigenvalues); results.cumulated_variance holds the cumulative fraction
+print(pca_analysis.results.variance[:5])  # variance along the first 5 PCs
+
+# Project trajectory onto PCs
+projected = pca_analysis.transform(u.select_atoms('backbone'), n_components=3)
+# Shape: (n_frames, n_components)
+```
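+
+The values in `results.variance` are absolute variances, so the share of total motion captured by each component has to be computed from them. A minimal sketch with plain Python lists, assuming all eigenvalues are passed in (`explained_fraction` is a hypothetical helper and the numbers are made up):
+
+```python
+def explained_fraction(variances):
+    """Return per-component and cumulative fractions of the total variance."""
+    total = sum(variances)
+    if total <= 0:
+        raise ValueError("variances must sum to a positive value")
+    fractions = [v / total for v in variances]
+    cumulative = []
+    running = 0.0
+    for f in fractions:
+        running += f
+        cumulative.append(running)
+    return fractions, cumulative
+
+
+# Illustrative eigenvalues, largest first
+fractions, cumulative = explained_fraction([6.0, 3.0, 1.0])
+print(fractions, cumulative)
+```
+
+A common use is to keep the smallest number of components whose cumulative fraction passes a chosen threshold before projecting the trajectory.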
+
+### Free Energy Surface (FES)
+
+```python
+import numpy as np
+import matplotlib.pyplot as plt
+
+def plot_free_energy_surface(x, y, bins=50, T=300, xlabel="PC1", ylabel="PC2",
+                              output="fes.png"):
+    """
+    Compute 2D free energy surface from two order parameters.
+    FES = -kT * ln(P(x,y))
+    """
+    kB = 0.0083144621  # kJ/mol/K
+    kT = kB * T
+
+    # 2D histogram
+    H, xedges, yedges = np.histogram2d(x, y, bins=bins, density=True)
+    H = H.T
+
+    # Free energy
+    H_safe = np.where(H > 0, H, np.nan)
+    fes = -kT * np.log(H_safe)
+    fes -= np.nanmin(fes)  # Shift minimum to 0
+
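+    # Plot at bin centers (the edge arrays have one more entry than there are bins)
+    xcenters = 0.5 * (xedges[:-1] + xedges[1:])
+    ycenters = 0.5 * (yedges[:-1] + yedges[1:])
+    fig, ax = plt.subplots(figsize=(8, 6))
+    im = ax.contourf(xcenters, ycenters, fes, levels=20, cmap='RdYlBu_r')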
+    plt.colorbar(im, ax=ax, label='Free Energy (kJ/mol)')
+    ax.set_xlabel(xlabel)
+    ax.set_ylabel(ylabel)
+    plt.savefig(output, dpi=150, bbox_inches='tight')
+    return fig
+```
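+
+The same Boltzmann relation, applied to two populations, gives the free energy difference between states: G_B - G_A = -kT ln(p_B / p_A). A minimal sketch, assuming the populations come from counting frames in each basin of an equilibrated trajectory (`free_energy_difference` is a hypothetical helper and the counts are made up):
+
+```python
+import math
+
+KB_KJ_PER_MOL_K = 0.0083144621  # Boltzmann constant in kJ/mol/K, as used above
+
+
+def free_energy_difference(p_a, p_b, temperature=300.0):
+    """Return G_B - G_A in kJ/mol from state populations p_a and p_b."""
+    if p_a <= 0 or p_b <= 0:
+        raise ValueError("populations must be positive")
+    return -KB_KJ_PER_MOL_K * temperature * math.log(p_b / p_a)
+
+
+# Illustrative counts: state A seen in 900 frames, state B in 100 frames
+dg = free_energy_difference(900, 100)
+print(f"G_B - G_A = {dg:.2f} kJ/mol")
+```
+
+Only the ratio of populations matters, so raw frame counts work as well as normalized probabilities.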
+
+## Trajectory Formats Supported
+
+| Format | Extension | Notes |
+|--------|-----------|-------|
+| DCD | `.dcd` | CHARMM/NAMD binary; widely used |
+| XTC | `.xtc` | GROMACS compressed |
+| TRR | `.trr` | GROMACS full precision |
+| NetCDF | `.nc` | AMBER format |
+| LAMMPS | `.lammpstrj` | LAMMPS dump |
+| HDF5 | `.h5md` | H5MD standard |
+| PDB | `.pdb` | Multi-model PDB |
+
+## MDAnalysis Interoperability
+
+```python
+# Convert to numpy
+positions = u.atoms.positions  # Current frame: shape (N, 3)
+
+# Write to PDB
+with mda.Writer("frame_10.pdb", u.atoms.n_atoms) as W:
+    u.trajectory[10]  # Move to frame 10
+    W.write(u.atoms)
+
+# Write trajectory subset (select once, outside the frame loop)
+protein = u.select_atoms("protein")
+with mda.Writer("protein_traj.dcd", protein.n_atoms) as W:
+    for ts in u.trajectory:
+        W.write(protein)
+
+# MDTraj alternative: load the same files directly (for compatibility)
+# import mdtraj as md
+# traj = md.load("trajectory.dcd", top="topology.pdb")
+```
+
+## Performance Tips
+
+- **Use `in_memory=True`** for AlignTraj when RAM allows (much faster iteration)
+- **Select minimal atoms** before analysis to reduce memory/compute
+- **Use multiprocessing** for independent frame analyses
+- **Process in chunks** for very long trajectories using `start`/`stop`/`step` parameters:
+
+```python
+# Analyze every 10th frame from frame 100 to 1000
+R.run(start=100, stop=1000, step=10)
+```
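+
+For a very long trajectory, the frame range can be split into consecutive windows and each window passed to `run(start=..., stop=...)`. A minimal sketch of the window arithmetic (`frame_chunks` is a hypothetical helper, not an MDAnalysis function):
+
+```python
+def frame_chunks(n_frames, chunk_size):
+    """Yield (start, stop) pairs covering range(n_frames); stop is exclusive."""
+    if chunk_size <= 0:
+        raise ValueError("chunk_size must be positive")
+    for start in range(0, n_frames, chunk_size):
+        yield start, min(start + chunk_size, n_frames)
+
+
+for start, stop in frame_chunks(2500, 1000):
+    # Each pair could be used as R.run(start=start, stop=stop)
+    print(start, stop)
+```
+
+Results from the individual windows then need to be concatenated or averaged, depending on the analysis.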