mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization. - Added a script for running RNA velocity analysis with customizable parameters and output options. - Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation. - Included references for velocity models and their mathematical framework, along with a comparison of different models. - Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
405 lines
14 KiB
Markdown
405 lines
14 KiB
Markdown
---
|
||
name: phylogenetics
|
||
description: Build and analyze phylogenetic trees using MAFFT (multiple alignment), IQ-TREE 2 (maximum likelihood), and FastTree (fast NJ/ML). Visualize with ETE3 or FigTree. For evolutionary analysis, microbial genomics, viral phylodynamics, protein family analysis, and molecular clock studies.
|
||
license: Unknown
|
||
metadata:
|
||
skill-author: Kuan-lin Huang
|
||
---
|
||
|
||
# Phylogenetics
|
||
|
||
## Overview
|
||
|
||
Phylogenetic analysis reconstructs the evolutionary history of biological sequences (genes, proteins, genomes) by inferring the branching pattern of descent. This skill covers the standard pipeline:
|
||
|
||
1. **MAFFT** — Multiple sequence alignment
|
||
2. **IQ-TREE 2** — Maximum likelihood tree inference with model selection
|
||
3. **FastTree** — Fast approximate maximum likelihood (for large datasets)
|
||
4. **ETE3** — Python library for tree manipulation and visualization
|
||
|
||
**Installation:**
|
||
```bash
|
||
# Conda (recommended for CLI tools)
|
||
conda install -c bioconda mafft iqtree fasttree
|
||
pip install ete3
|
||
```
|
||
|
||
## When to Use This Skill
|
||
|
||
Use phylogenetics when:
|
||
|
||
- **Evolutionary relationships**: Which organism/gene is most closely related to my sequence?
|
||
- **Viral phylodynamics**: Trace outbreak spread and estimate transmission dates
|
||
- **Protein family analysis**: Infer evolutionary relationships within a gene family
|
||
- **Horizontal gene transfer detection**: Identify genes with discordant species/gene trees
|
||
- **Ancestral sequence reconstruction**: Infer ancestral protein sequences
|
||
- **Molecular clock analysis**: Estimate divergence dates using temporal sampling
|
||
- **GWAS companion**: Place variants in evolutionary context (e.g., SARS-CoV-2 variants)
|
||
- **Microbiology**: Species phylogeny from 16S rRNA or core genome phylogeny
|
||
|
||
## Standard Workflow
|
||
|
||
### 1. Multiple Sequence Alignment with MAFFT
|
||
|
||
```python
|
||
import subprocess
|
||
import os
|
||
|
||
def run_mafft(input_fasta: str, output_fasta: str, method: str = "auto",
|
||
n_threads: int = 4) -> str:
|
||
"""
|
||
Align sequences with MAFFT.
|
||
|
||
Args:
|
||
input_fasta: Path to unaligned FASTA file
|
||
output_fasta: Path for aligned output
|
||
method: 'auto' (auto-select), 'einsi' (accurate), 'linsi' (accurate, slow),
|
||
'fftnsi' (medium), 'fftns' (fast), 'retree2' (fast)
|
||
n_threads: Number of CPU threads
|
||
|
||
Returns:
|
||
Path to aligned FASTA file
|
||
"""
|
||
methods = {
|
||
"auto": ["mafft", "--auto"],
|
||
"einsi": ["mafft", "--genafpair", "--maxiterate", "1000"],
|
||
"linsi": ["mafft", "--localpair", "--maxiterate", "1000"],
|
||
"fftnsi": ["mafft", "--fftnsi"],
|
||
"fftns": ["mafft", "--fftns"],
|
||
"retree2": ["mafft", "--retree", "2"],
|
||
}
|
||
|
||
cmd = methods.get(method, methods["auto"])
|
||
cmd += ["--thread", str(n_threads), "--inputorder", input_fasta]
|
||
|
||
with open(output_fasta, 'w') as out:
|
||
result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
|
||
|
||
if result.returncode != 0:
|
||
raise RuntimeError(f"MAFFT failed:\n{result.stderr}")
|
||
|
||
# Count aligned sequences
|
||
with open(output_fasta) as f:
|
||
n_seqs = sum(1 for line in f if line.startswith('>'))
|
||
print(f"MAFFT: aligned {n_seqs} sequences → {output_fasta}")
|
||
|
||
return output_fasta
|
||
|
||
# MAFFT method selection guide:
|
||
# Few sequences (<200), accurate: linsi or einsi
|
||
# Many sequences (<1000), moderate: fftnsi
|
||
# Large datasets (>1000): fftns or auto
|
||
# Ultra-fast (>10000): mafft --retree 1
|
||
```
|
||
|
||
### 2. Trim Alignment (Optional but Recommended)
|
||
|
||
```python
|
||
def trim_alignment_trimal(aligned_fasta: str, output_fasta: str,
|
||
method: str = "automated1") -> str:
|
||
"""
|
||
Trim poorly aligned columns with TrimAl.
|
||
|
||
Methods:
|
||
- 'automated1': Automatic heuristic (recommended)
|
||
- 'gappyout': Remove gappy columns
|
||
- 'strict': Strict gap threshold
|
||
"""
|
||
cmd = ["trimal", f"-{method}", "-in", aligned_fasta, "-out", output_fasta, "-fasta"]
|
||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||
if result.returncode != 0:
|
||
print(f"TrimAl warning: {result.stderr}")
|
||
# Fall back to using the untrimmed alignment
|
||
import shutil
|
||
shutil.copy(aligned_fasta, output_fasta)
|
||
return output_fasta
|
||
```
|
||
|
||
### 3. IQ-TREE 2 — Maximum Likelihood Tree
|
||
|
||
```python
|
||
def run_iqtree(aligned_fasta: str, output_prefix: str,
|
||
model: str = "TEST", bootstrap: int = 1000,
|
||
n_threads: int = 4, extra_args: list = None) -> dict:
|
||
"""
|
||
Build a maximum likelihood tree with IQ-TREE 2.
|
||
|
||
Args:
|
||
aligned_fasta: Aligned FASTA file
|
||
output_prefix: Prefix for output files
|
||
model: 'TEST' for automatic model selection, or specify (e.g., 'GTR+G' for DNA,
|
||
'LG+G4' for proteins, 'JTT+G' for proteins)
|
||
bootstrap: Number of ultrafast bootstrap replicates (1000 recommended)
|
||
n_threads: Number of threads ('AUTO' to auto-detect)
|
||
extra_args: Additional IQ-TREE arguments
|
||
|
||
Returns:
|
||
Dict with paths to output files
|
||
"""
|
||
cmd = [
|
||
"iqtree2",
|
||
"-s", aligned_fasta,
|
||
"--prefix", output_prefix,
|
||
"-m", model,
|
||
"-B", str(bootstrap), # Ultrafast bootstrap
|
||
"-T", str(n_threads),
|
||
"--redo" # Overwrite existing results
|
||
]
|
||
|
||
if extra_args:
|
||
cmd.extend(extra_args)
|
||
|
||
result = subprocess.run(cmd, capture_output=True, text=True)
|
||
|
||
if result.returncode != 0:
|
||
raise RuntimeError(f"IQ-TREE failed:\n{result.stderr}")
|
||
|
||
# Print model selection result
|
||
log_file = f"{output_prefix}.log"
|
||
if os.path.exists(log_file):
|
||
with open(log_file) as f:
|
||
for line in f:
|
||
if "Best-fit model" in line:
|
||
print(f"IQ-TREE: {line.strip()}")
|
||
|
||
output_files = {
|
||
"tree": f"{output_prefix}.treefile",
|
||
"log": f"{output_prefix}.log",
|
||
"iqtree": f"{output_prefix}.iqtree", # Full report
|
||
"model": f"{output_prefix}.model.gz",
|
||
}
|
||
|
||
print(f"IQ-TREE: Tree saved to {output_files['tree']}")
|
||
return output_files
|
||
|
||
# IQ-TREE model selection guide:
|
||
# DNA: TEST → GTR+G, HKY+G, TrN+G
|
||
# Protein: TEST → LG+G4, WAG+G, JTT+G, Q.pfam+G
|
||
# Codon: TEST → MG+F3X4
|
||
|
||
# For temporal (molecular clock) analysis, add:
|
||
# extra_args = ["--date", "dates.txt", "--clock-test", "--date-CI", "95"]
|
||
```
|
||
|
||
### 4. FastTree — Fast Approximate ML
|
||
|
||
For large datasets (>1000 sequences) where IQ-TREE is too slow:
|
||
|
||
```python
|
||
def run_fasttree(aligned_fasta: str, output_tree: str,
|
||
sequence_type: str = "nt", model: str = "gtr",
|
||
n_threads: int = 4) -> str:
|
||
"""
|
||
Build a fast approximate ML tree with FastTree.
|
||
|
||
Args:
|
||
sequence_type: 'nt' for nucleotide or 'aa' for amino acid
|
||
model: For nt: 'gtr' (recommended) or 'jc'; for aa: 'lg', 'wag', 'jtt'
|
||
"""
|
||
if sequence_type == "nt":
|
||
cmd = ["FastTree", "-nt", "-gtr"]
|
||
else:
|
||
cmd = ["FastTree", f"-{model}"]
|
||
|
||
cmd += [aligned_fasta]
|
||
|
||
with open(output_tree, 'w') as out:
|
||
result = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE, text=True)
|
||
|
||
if result.returncode != 0:
|
||
raise RuntimeError(f"FastTree failed:\n{result.stderr}")
|
||
|
||
print(f"FastTree: Tree saved to {output_tree}")
|
||
return output_tree
|
||
```
|
||
|
||
### 5. Tree Analysis and Visualization with ETE3
|
||
|
||
```python
|
||
from ete3 import Tree, TreeStyle, NodeStyle, TextFace, PhyloTree
|
||
import matplotlib.pyplot as plt
|
||
|
||
def load_tree(tree_file: str) -> Tree:
|
||
"""Load a Newick tree file."""
|
||
t = Tree(tree_file)
|
||
print(f"Tree: {len(t)} leaves, {len(list(t.traverse()))} nodes")
|
||
return t
|
||
|
||
def basic_tree_stats(t: Tree) -> dict:
|
||
"""Compute basic tree statistics."""
|
||
leaves = t.get_leaves()
|
||
distances = [t.get_distance(l1, l2) for l1 in leaves[:min(50, len(leaves))]
|
||
for l2 in leaves[:min(50, len(leaves))] if l1 != l2]
|
||
|
||
stats = {
|
||
"n_leaves": len(leaves),
|
||
"n_internal_nodes": len(t) - len(leaves),
|
||
"total_branch_length": sum(n.dist for n in t.traverse()),
|
||
"max_leaf_distance": max(distances) if distances else 0,
|
||
"mean_leaf_distance": sum(distances)/len(distances) if distances else 0,
|
||
}
|
||
return stats
|
||
|
||
def find_mrca(t: Tree, leaf_names: list) -> Tree:
|
||
"""Find the most recent common ancestor of a set of leaves."""
|
||
return t.get_common_ancestor(*leaf_names)
|
||
|
||
def visualize_tree(t: Tree, output_file: str = "tree.png",
|
||
show_branch_support: bool = True,
|
||
color_groups: dict = None,
|
||
width: int = 800) -> None:
|
||
"""
|
||
Render phylogenetic tree to image.
|
||
|
||
Args:
|
||
t: ETE3 Tree object
|
||
color_groups: Dict mapping leaf_name → color (for coloring taxa)
|
||
show_branch_support: Show bootstrap values
|
||
"""
|
||
ts = TreeStyle()
|
||
ts.show_leaf_name = True
|
||
ts.show_branch_support = show_branch_support
|
||
ts.mode = "r" # 'r' = rectangular, 'c' = circular
|
||
|
||
if color_groups:
|
||
for node in t.traverse():
|
||
if node.is_leaf() and node.name in color_groups:
|
||
nstyle = NodeStyle()
|
||
nstyle["fgcolor"] = color_groups[node.name]
|
||
nstyle["size"] = 8
|
||
node.set_style(nstyle)
|
||
|
||
t.render(output_file, tree_style=ts, w=width, units="px")
|
||
print(f"Tree saved to: {output_file}")
|
||
|
||
def midpoint_root(t: Tree) -> Tree:
|
||
"""Root tree at midpoint (use when outgroup unknown)."""
|
||
t.set_outgroup(t.get_midpoint_outgroup())
|
||
return t
|
||
|
||
def prune_tree(t: Tree, keep_leaves: list) -> Tree:
|
||
"""Prune tree to keep only specified leaves."""
|
||
t.prune(keep_leaves, preserve_branch_length=True)
|
||
return t
|
||
```
|
||
|
||
### 6. Complete Analysis Script
|
||
|
||
```python
|
||
import subprocess, os
|
||
from ete3 import Tree
|
||
|
||
def full_phylogenetic_analysis(
|
||
input_fasta: str,
|
||
output_dir: str = "phylo_results",
|
||
sequence_type: str = "nt",
|
||
n_threads: int = 4,
|
||
bootstrap: int = 1000,
|
||
use_fasttree: bool = False
|
||
) -> dict:
|
||
"""
|
||
Complete phylogenetic pipeline: align → trim → tree → visualize.
|
||
|
||
Args:
|
||
input_fasta: Unaligned FASTA
|
||
sequence_type: 'nt' (nucleotide) or 'aa' (amino acid/protein)
|
||
use_fasttree: Use FastTree instead of IQ-TREE (faster for large datasets)
|
||
"""
|
||
os.makedirs(output_dir, exist_ok=True)
|
||
prefix = os.path.join(output_dir, "phylo")
|
||
|
||
print("=" * 50)
|
||
print("Step 1: Multiple Sequence Alignment (MAFFT)")
|
||
aligned = run_mafft(input_fasta, f"{prefix}_aligned.fasta",
|
||
method="auto", n_threads=n_threads)
|
||
|
||
print("\nStep 2: Tree Inference")
|
||
if use_fasttree:
|
||
tree_file = run_fasttree(
|
||
aligned, f"{prefix}.tree",
|
||
sequence_type=sequence_type,
|
||
model="gtr" if sequence_type == "nt" else "lg"
|
||
)
|
||
else:
|
||
model = "TEST" if sequence_type == "nt" else "TEST"
|
||
iqtree_files = run_iqtree(
|
||
aligned, prefix,
|
||
model=model,
|
||
bootstrap=bootstrap,
|
||
n_threads=n_threads
|
||
)
|
||
tree_file = iqtree_files["tree"]
|
||
|
||
print("\nStep 3: Tree Analysis")
|
||
t = Tree(tree_file)
|
||
t = midpoint_root(t)
|
||
|
||
stats = basic_tree_stats(t)
|
||
print(f"Tree statistics: {stats}")
|
||
|
||
print("\nStep 4: Visualization")
|
||
visualize_tree(t, f"{prefix}_tree.png", show_branch_support=True)
|
||
|
||
# Save rooted tree
|
||
rooted_tree_file = f"{prefix}_rooted.nwk"
|
||
t.write(format=1, outfile=rooted_tree_file)
|
||
|
||
results = {
|
||
"aligned_fasta": aligned,
|
||
"tree_file": tree_file,
|
||
"rooted_tree": rooted_tree_file,
|
||
"visualization": f"{prefix}_tree.png",
|
||
"stats": stats
|
||
}
|
||
|
||
print("\n" + "=" * 50)
|
||
print("Phylogenetic analysis complete!")
|
||
print(f"Results in: {output_dir}/")
|
||
return results
|
||
```
|
||
|
||
## IQ-TREE Model Guide
|
||
|
||
### DNA Models
|
||
|
||
| Model | Description | Use case |
|
||
|-------|-------------|---------|
|
||
| `GTR+G4` | General Time Reversible + Gamma | Most flexible DNA model |
|
||
| `HKY+G4` | Hasegawa-Kishino-Yano + Gamma | Two-rate model (common) |
|
||
| `TrN+G4` | Tamura-Nei | Unequal transitions |
|
||
| `JC` | Jukes-Cantor | Simplest; all rates equal |
|
||
|
||
### Protein Models
|
||
|
||
| Model | Description | Use case |
|
||
|-------|-------------|---------|
|
||
| `LG+G4` | Le-Gascuel + Gamma | Best average protein model |
|
||
| `WAG+G4` | Whelan-Goldman | Widely used |
|
||
| `JTT+G4` | Jones-Taylor-Thornton | Classical model |
|
||
| `Q.pfam+G4` | pfam-trained | For Pfam-like protein families |
|
||
| `Q.bird+G4` | Bird-specific | Vertebrate proteins |
|
||
|
||
**Tip:** Use `-m TEST` to let IQ-TREE automatically select the best model.
|
||
|
||
## Best Practices
|
||
|
||
- **Alignment quality first**: Poor alignment → unreliable trees; check alignment manually
|
||
- **Use `linsi` for small (<200 seq), `fftns` or `auto` for large alignments**
|
||
- **Model selection**: Always use `-m TEST` for IQ-TREE unless you have a specific reason
|
||
- **Bootstrap**: Use ≥1000 ultrafast bootstraps (`-B 1000`) for branch support
|
||
- **Root the tree**: Unrooted trees can be misleading; use outgroup or midpoint rooting
|
||
- **FastTree for >5000 sequences**: IQ-TREE becomes slow; FastTree is 10–100× faster
|
||
- **Trim long alignments**: TrimAl removes unreliable columns; improves tree accuracy
|
||
- **Check for recombination** in viral/bacterial sequences before building trees (`RDP4`, `GARD`)
|
||
|
||
## Additional Resources
|
||
|
||
- **MAFFT**: https://mafft.cbrc.jp/alignment/software/
|
||
- **IQ-TREE 2**: http://www.iqtree.org/ | Tutorial: https://www.iqtree.org/workshop/molevol2022
|
||
- **FastTree**: http://www.microbesonline.org/fasttree/
|
||
- **ETE3**: http://etetoolkit.org/
|
||
- **FigTree** (GUI visualization): https://tree.bio.ed.ac.uk/software/figtree/
|
||
- **iTOL** (web visualization): https://itol.embl.de/
|
||
- **MUSCLE** (alternative aligner): https://www.drive5.com/muscle/
|
||
- **TrimAl** (alignment trimming): https://vicfero.github.io/trimal/
|