claude-scientific-skills/scientific-skills/phylogenetics/references/iqtree_inference.md

# IQ-TREE 2 Phylogenetic Inference Reference

## Basic Command Syntax

```bash
iqtree2 -s alignment.fasta --prefix output -m TEST -B 1000 -T AUTO --redo
```

## Key Parameters

| Flag | Description | Default |
|------|-------------|---------|
| `-s` | Input alignment file | Required |
| `--prefix` | Output file prefix | alignment name |
| `-m` | Substitution model (or TEST) | GTR+G |
| `-B` | Ultrafast bootstrap replicates | Off |
| `-b` | Standard bootstrap replicates (slow) | Off |
| `-T` | Number of threads (or AUTO) | 1 |
| `-o` | Outgroup taxa name(s) | None (unrooted) |
| `--redo` | Overwrite existing results | Off |
| `-alrt` | SH-aLRT test replicates | Off |

## Model Selection

```bash
# Full model testing (automatically selects best model)
iqtree2 -s alignment.fasta -m TEST --prefix test_run -B 1000 -T 4

# Specify model explicitly
iqtree2 -s alignment.fasta -m GTR+G4 --prefix gtr_run -B 1000

# Protein sequences
iqtree2 -s protein.fasta -m TEST --prefix prot_tree -B 1000

# Codon-based analysis
iqtree2 -s codon.fasta -m GY --prefix codon_tree -B 1000
```

## Bootstrapping Methods

### Ultrafast Bootstrap (UFBoot, recommended)
```bash
iqtree2 -s alignment.fasta -B 1000  # 1000 replicates
# Values ≥95 are reliable
# ~10× faster than standard bootstrap
```

### Standard Bootstrap
```bash
iqtree2 -s alignment.fasta -b 100  # 100 replicates (very slow)
```

### SH-aLRT Test (fast alternative)
```bash
iqtree2 -s alignment.fasta -alrt 1000 -B 1000  # Both SH-aLRT and UFBoot
# SH-aLRT ≥80 AND UFBoot ≥95 = well-supported branch
```

## Branch Support Interpretation

| Bootstrap Value | Interpretation |
|----------------|----------------|
| ≥ 95 | Well-supported (strongly supported) |
| 70–94 | Moderately supported |
| 50–69 | Weakly supported |
| < 50 | Unreliable (not supported) |

## Output Files

| File | Description |
|------|-------------|
| `{prefix}.treefile` | Best ML tree in Newick format |
| `{prefix}.iqtree` | Full analysis report |
| `{prefix}.log` | Computation log |
| `{prefix}.contree` | Consensus tree from bootstrap |
| `{prefix}.splits.nex` | Network splits |
| `{prefix}.bionj` | BioNJ starting tree |
| `{prefix}.model.gz` | Saved model parameters |

## Advanced Analyses

### Molecular Clock (Dating)

```bash
# Temporal analysis with sampling dates
iqtree2 -s alignment.fasta -m GTR+G \
        --date dates.tsv \           # Tab-separated: taxon_name  YYYY-MM-DD
        --clock-test \               # Test for clock-like evolution
        --date-CI 95 \              # 95% CI for node dates
        --prefix dated_tree
```

### Concordance Factors

```bash
# Gene concordance factor (gCF) - requires multiple gene alignments
iqtree2 --gcf gene_trees.nwk \
        --tree main_tree.treefile \
        --cf-verbose \
        --prefix cf_analysis
```

### Ancestral Sequence Reconstruction

```bash
iqtree2 -s alignment.fasta -m LG+G4 \
        -asr \                      # Marginal ancestral state reconstruction
        --prefix anc_tree
# Output: {prefix}.state (ancestral sequences per node)
```

### Partition Model (Multi-Gene)

```bash
# Create partition file (partitions.txt):
# DNA, gene1 = 1-500
# DNA, gene2 = 501-1000

iqtree2 -s concat_alignment.fasta \
        -p partitions.txt \
        -m TEST \
        -B 1000 \
        --prefix partition_tree
```

## IQ-TREE Log Parsing

```python
def parse_iqtree_log(log_file: str) -> dict:
    """Extract key results from IQ-TREE log file."""
    results = {}
    with open(log_file) as f:
        for line in f:
            if "Best-fit model" in line:
                results["best_model"] = line.split(":")[1].strip()
            elif "Log-likelihood of the tree:" in line:
                results["log_likelihood"] = float(line.split(":")[1].strip())
            elif "Number of free parameters" in line:
                results["free_params"] = int(line.split(":")[1].strip())
            elif "Akaike information criterion" in line:
                results["AIC"] = float(line.split(":")[1].strip())
            elif "Bayesian information criterion" in line:
                results["BIC"] = float(line.split(":")[1].strip())
            elif "Total CPU time used" in line:
                results["cpu_time"] = line.split(":")[1].strip()
    return results

# Example:
# results = parse_iqtree_log("output.log")
# print(f"Best model: {results['best_model']}")
# print(f"Log-likelihood: {results['log_likelihood']:.2f}")
```

## Common Issues and Solutions

| Issue | Likely Cause | Solution |
|-------|-------------|---------|
| All bootstrap values = 0 | Too few taxa | Need ≥4 taxa for bootstrap |
| Very long branches | Alignment artifacts | Re-trim alignment; check for outliers |
| Memory error | Too many sequences | Use FastTree; or reduce `-T` to 1 |
| Poor model fit | Wrong alphabet | Check nucleotide vs. protein specification |
| Identical sequences | Duplicate sequences | Remove duplicates before alignment |

## MAFFT Alignment Guide

```bash
# Accurate (< 200 sequences)
mafft --localpair --maxiterate 1000 input.fasta > aligned.fasta

# Medium (200-1000 sequences)
mafft --auto input.fasta > aligned.fasta

# Fast (> 1000 sequences)
mafft --fftns input.fasta > aligned.fasta

# Very large (> 10000 sequences)
mafft --retree 1 input.fasta > aligned.fasta

# Using multiple threads
mafft --thread 8 --auto input.fasta > aligned.fasta
```