Add scVelo RNA velocity analysis workflow and IQ-TREE reference documentation

- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization.
- Added a script for running RNA velocity analysis with customizable parameters and output options.
- Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation.
- Included references for velocity models and their mathematical framework, along with a comparison of different models.
- Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
huangkuanlin
2026-03-03 07:15:36 -05:00
parent b271271df4
commit 7f94783fab
27 changed files with 6961 additions and 0 deletions


@@ -0,0 +1,321 @@
---
name: scvelo
description: RNA velocity analysis with scVelo. Estimate cell state transitions from unspliced/spliced mRNA dynamics, infer trajectory directions, compute latent time, and identify driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for trajectory inference.
license: BSD-3-Clause
metadata:
  skill-author: Kuan-lin Huang
---
# scVelo — RNA Velocity Analysis
## Overview
scVelo is a widely used Python package for RNA velocity analysis of single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing: the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundance indicates whether a gene is being upregulated or downregulated in each cell. This allows reconstruction of developmental trajectories and identification of cell fate decisions without time-course data.
**Installation:** `pip install scvelo`
**Key resources:**
- Documentation: https://scvelo.readthedocs.io/
- GitHub: https://github.com/theislab/scvelo
- Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759
## When to Use This Skill
Use scVelo when:
- **Trajectory inference from snapshot data**: Determine which direction cells are differentiating
- **Cell fate prediction**: Identify progenitor cells and their downstream fates
- **Driver gene identification**: Find genes whose dynamics best explain observed trajectories
- **Developmental biology**: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions
- **Latent time estimation**: Order cells along a pseudotime derived from splicing dynamics
- **Complement to Scanpy**: Add directional information to UMAP embeddings
## Prerequisites
scVelo requires count matrices for both **unspliced** and **spliced** RNA. These are generated by:
1. **STARsolo** (`--soloFeatures Gene Velocyto`) or **kallisto|bustools** (`kb count` with the `lamanno` workflow)
2. **velocyto** CLI: `velocyto run10x` / `velocyto run`
3. **alevin-fry** / **simpleaf** with spliced/unspliced output
Data is stored in an `AnnData` object with `layers["spliced"]` and `layers["unspliced"]`.
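Before running anything else, it can help to confirm that both layers are present. The following is a minimal sketch: `missing_velocity_layers` is a hypothetical helper (not part of scVelo) that only assumes the object exposes a dict-like `layers` attribute, as `AnnData` does. The `SimpleNamespace` stands in for a real `AnnData` object.

```python
from types import SimpleNamespace

REQUIRED_LAYERS = ("spliced", "unspliced")

def missing_velocity_layers(adata):
    """Return the required layer names that are absent from adata.layers."""
    return [name for name in REQUIRED_LAYERS if name not in adata.layers]

# Stand-in for an AnnData object that only carries spliced counts
mock = SimpleNamespace(layers={"spliced": [[1, 0], [2, 3]]})
missing = missing_velocity_layers(mock)
if missing:
    print(f"Missing layers: {missing}")
```

With a real dataset, pass the loaded `adata` instead of `mock`; an empty list means both layers are available.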
## Standard RNA Velocity Workflow
### 1. Setup and Data Loading
```python
import scvelo as scv
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt
# Configure settings
scv.settings.verbosity = 3 # Show computation steps
scv.settings.presenter_view = True
scv.settings.set_figure_params('scvelo')
# Load data (AnnData with spliced/unspliced layers)
# Option A: Load from loom (velocyto output)
adata = scv.read("cellranger_output.loom", cache=True)
# Option B: Merge velocyto loom with Scanpy-processed AnnData
adata_processed = sc.read_h5ad("processed.h5ad") # Has UMAP, clusters
adata_velocity = scv.read("velocyto.loom")
adata = scv.utils.merge(adata_processed, adata_velocity)
# Verify layers
print(adata)
# obs × var: N × G
# layers: 'spliced', 'unspliced' (required)
# obsm['X_umap'] (required for visualization)
```
### 2. Preprocessing
```python
# Filter and normalize (follows Scanpy conventions)
scv.pp.filter_and_normalize(
adata,
min_shared_counts=20, # Minimum counts in spliced+unspliced
n_top_genes=2000 # Top highly variable genes
)
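# Compute first- and second-order moments (neighborhood means and uncentered variances)
# scv.pp.moments builds the kNN graph itself if none exists; computing it explicitly
# here keeps the neighbor settings visible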
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30)
scv.pp.moments(
adata,
n_pcs=30,
n_neighbors=30
)
```
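To make the moments step concrete: a first-order moment is, in essence, a per-gene average over each cell and its nearest neighbors. The sketch below is a simplified illustration in plain Python. `knn_mean` and the hand-written neighbor lists are assumptions for the example, and scVelo itself weights neighbors using the graph connectivities rather than a plain mean.

```python
def knn_mean(values, neighbors):
    """Average each cell's value with the values of its listed neighbors."""
    smoothed = []
    for i, nbrs in enumerate(neighbors):
        group = [values[i]] + [values[j] for j in nbrs]
        smoothed.append(sum(group) / len(group))
    return smoothed

# Unspliced counts of one gene in four cells
unspliced = [0.0, 4.0, 2.0, 6.0]
# Indices of each cell's two nearest neighbors (illustrative)
neighbors = [[1, 2], [0, 2], [0, 1], [1, 2]]

Mu = knn_mean(unspliced, neighbors)
print(Mu)
```

Smoothing of this kind is why `n_neighbors` matters: a small neighborhood leaves count noise in place, while a large one averages away real differences between cell states.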
### 3. Velocity Estimation — Stochastic Model
The stochastic model is fast and suitable for exploratory analysis:
```python
# Stochastic velocity (faster, less accurate)
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)
# Visualize
scv.pl.velocity_embedding_stream(
adata,
basis='umap',
color='leiden',
title="RNA Velocity (Stochastic)"
)
```
### 4. Velocity Estimation — Dynamical Model (Recommended)
The dynamical model fits the full splicing kinetics and is more accurate:
```python
# Recover dynamics (computationally intensive; ~10-30 min for 10K cells)
scv.tl.recover_dynamics(adata, n_jobs=4)
# Compute velocity from dynamical model
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
```
### 5. Latent Time
The dynamical model enables computation of a shared latent time (pseudotime):
```python
# Compute latent time
scv.tl.latent_time(adata)
# Visualize latent time on UMAP
scv.pl.scatter(
adata,
color='latent_time',
color_map='gnuplot',
size=80,
title='Latent time'
)
# Identify top genes ordered by latent time
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(
adata,
var_names=top_genes,
sortby='latent_time',
col_color='leiden',
n_convolve=100
)
```
### 6. Driver Gene Analysis
```python
# Rank genes by cluster-specific differential velocity
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
df = scv.get_df(adata, 'rank_velocity_genes/names')
print(df.head(10))
# Speed and coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(
adata,
c=['velocity_length', 'velocity_confidence'],
cmap='coolwarm',
perc=[5, 95]
)
# Phase portraits for specific genes (pancreas markers shown; use genes present in adata.var_names)
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'],
ncols=3, figsize=(16, 4))
```
### 7. Velocity Arrows and Pseudotime
```python
# Arrow plot on UMAP
scv.pl.velocity_embedding(
adata,
arrow_length=3,
arrow_size=2,
color='leiden',
basis='umap'
)
# Stream plot (cleaner visualization)
scv.pl.velocity_embedding_stream(
adata,
basis='umap',
color='leiden',
smooth=0.8,
min_mass=4
)
# Velocity pseudotime (alternative to latent time)
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
```
### 8. PAGA Trajectory Graph
```python
# PAGA graph with velocity-informed transitions
scv.tl.paga(adata, groups='leiden')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')
# Plot PAGA with velocity
scv.pl.paga(
adata,
basis='umap',
size=50,
alpha=0.1,
min_edge_width=2,
node_size_scale=1.5
)
```
## Complete Workflow Script
```python
import scvelo as scv
import scanpy as sc

def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4):
    """
    Complete RNA velocity workflow.

    Args:
        adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm
        n_top_genes: Number of top HVGs for velocity
        mode: 'stochastic' (fast) or 'dynamical' (accurate)
        n_jobs: Parallel jobs for dynamical model
    Returns:
        Processed AnnData with velocity information
    """
    scv.settings.verbosity = 2
    # 1. Preprocessing
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)
    if 'neighbors' not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=30)
    scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
    # 2. Velocity estimation
    if mode == 'dynamical':
        scv.tl.recover_dynamics(adata, n_jobs=n_jobs)
    scv.tl.velocity(adata, mode=mode)
    scv.tl.velocity_graph(adata)
    # 3. Downstream analyses
    if mode == 'dynamical':
        scv.tl.latent_time(adata)
    if 'leiden' in adata.obs:  # Gene ranking needs cluster labels
        scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
    scv.tl.velocity_confidence(adata)
    scv.tl.velocity_pseudotime(adata)
    return adata
```
## Key Output Fields in AnnData
After running the workflow, the following fields are added (`latent_time`, `fit_t`, and the other `fit_*` entries only with the dynamical model):
| Location | Key | Description |
|----------|-----|-------------|
| `adata.layers` | `velocity` | RNA velocity per gene per cell |
| `adata.layers` | `fit_t` | Fitted latent time per gene per cell |
| `adata.obsm` | `velocity_umap` | 2D velocity vectors on UMAP |
| `adata.obs` | `velocity_pseudotime` | Pseudotime from velocity |
| `adata.obs` | `latent_time` | Latent time from dynamical model |
| `adata.obs` | `velocity_length` | Speed of each cell |
| `adata.obs` | `velocity_confidence` | Confidence score per cell |
| `adata.var` | `fit_likelihood` | Gene-level model fit quality |
| `adata.var` | `fit_alpha` | Transcription rate |
| `adata.var` | `fit_beta` | Splicing rate |
| `adata.var` | `fit_gamma` | Degradation rate |
| `adata.uns` | `velocity_graph` | Sparse cell-cell matrix of cosine similarities between each cell's velocity and the displacement to its neighbors |
## Velocity Models Comparison
| Model | Speed | Accuracy | When to Use |
|-------|-------|----------|-------------|
| `stochastic` | Fast | Moderate | Exploratory; large datasets |
| `deterministic` | Fast | Lower | Steady-state baseline with simple linear kinetics |
| `dynamical` | Slow | High | Publication-quality; identifies driver genes |
## Best Practices
- **Start with stochastic mode** for exploration; switch to dynamical for final analysis
- **Need good coverage of unspliced reads**: Short reads (< 100 bp) may miss intron coverage
- **Minimum 2,000 cells**: RNA velocity is noisy with fewer cells
- **Velocity should be coherent**: Arrows should follow known biology; randomness indicates issues
- **k-NN bandwidth matters**: Too few neighbors → noisy velocity; too many → oversmoothed
- **Sanity check**: Root cells (progenitors) should have high unspliced/spliced ratios for marker genes
- **Dynamical model requires distinct kinetic states**: Works best for clear differentiation processes
## Troubleshooting
| Problem | Solution |
|---------|---------|
| Missing unspliced layer | Re-run velocyto or use STARsolo with `--soloFeatures Gene Velocyto` |
| Very few velocity genes | Lower `min_shared_counts`; check sequencing depth |
| Random-looking arrows | Try different `n_neighbors` or velocity model |
| Memory error with dynamical | Set `n_jobs=1`; reduce `n_top_genes` |
| Negative velocity everywhere | Check that spliced/unspliced layers are not swapped |
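The swapped-layer check in the last row can be roughed out from total counts, since unspliced molecules are normally the minority. This is a minimal sketch on nested lists: `unspliced_fraction` is a hypothetical helper, and the 0.5 cutoff is an illustrative assumption, not a scVelo default.

```python
def unspliced_fraction(spliced_counts, unspliced_counts):
    """Fraction of all counted molecules that are unspliced."""
    total_s = sum(sum(row) for row in spliced_counts)
    total_u = sum(sum(row) for row in unspliced_counts)
    return total_u / (total_s + total_u)

# Toy cells x genes count matrices
spliced = [[8, 5, 0], [7, 9, 3]]
unspliced = [[2, 1, 0], [1, 3, 1]]

frac = unspliced_fraction(spliced, unspliced)
print(f"unspliced fraction: {frac:.2f}")
if frac > 0.5:  # illustrative cutoff
    print("Unspliced counts dominate; the layers may be swapped.")
```

On real data the same totals can be taken from `adata.layers["spliced"]` and `adata.layers["unspliced"]`.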
## Additional Resources
- **scVelo documentation**: https://scvelo.readthedocs.io/
- **Tutorial notebooks**: https://scvelo.readthedocs.io/tutorials/
- **GitHub**: https://github.com/theislab/scvelo
- **Paper**: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759
- **velocyto** (preprocessing): http://velocyto.org/
- **CellRank** (fate prediction, extends scVelo): https://cellrank.readthedocs.io/
- **dynamo** (metabolic labeling alternative): https://dynamo-release.readthedocs.io/


@@ -0,0 +1,168 @@
# scVelo Velocity Models Reference
## Mathematical Framework
RNA velocity is based on the kinetic model of transcription:
```
dx_s/dt = β·x_u - γ·x_s (spliced dynamics)
dx_u/dt = α(t) - β·x_u (unspliced dynamics)
```
Where:
- `x_s`: spliced mRNA abundance
- `x_u`: unspliced (pre-mRNA) abundance
- `α(t)`: transcription rate (varies over time)
- `β`: splicing rate
- `γ`: degradation rate
**Velocity** is defined as: `v = dx_s/dt = β·x_u - γ·x_s`
- **v > 0**: Gene is being upregulated (more unspliced than expected at steady state)
- **v < 0**: Gene is being downregulated (less unspliced than expected)
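The sign rule above can be checked numerically. The sketch below integrates the two rate equations with a forward-Euler step in plain Python; the rate values, step size, and the helper names `simulate` and `spliced_velocity` are illustrative assumptions, not scVelo code.

```python
def simulate(alpha, beta, gamma, u0, s0, dt=0.01, steps=200):
    """Forward-Euler integration of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s."""
    u, s = u0, s0
    for _ in range(steps):
        du = alpha - beta * u
        ds = beta * u - gamma * s
        u, s = u + dt * du, s + dt * ds
    return u, s

def spliced_velocity(u, s, beta, gamma):
    """Velocity of spliced mRNA: v = beta*u - gamma*s."""
    return beta * u - gamma * s

beta, gamma = 1.0, 0.5

# Induction: transcription switched on, starting with no mRNA
u_on, s_on = simulate(alpha=2.0, beta=beta, gamma=gamma, u0=0.0, s0=0.0)
v_on = spliced_velocity(u_on, s_on, beta, gamma)

# Repression: transcription switched off, starting from the induced state
u_off, s_off = simulate(alpha=0.0, beta=beta, gamma=gamma, u0=u_on, s0=s_on)
v_off = spliced_velocity(u_off, s_off, beta, gamma)

print(f"induction: v = {v_on:+.3f}, repression: v = {v_off:+.3f}")
```

During induction the unspliced pool runs ahead of the spliced pool, so `v_on` is positive; once transcription stops, the unspliced pool drains first and `v_off` turns negative.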
## Model Comparison
### Steady-State (Velocyto, original)
- Assumes constant α (transcription rate)
- Fits γ using linear regression on steady-state cells
- **Limitation**: Requires identifiable steady states; assumes constant transcription
```python
# scVelo exposes the steady-state model as mode='deterministic'
scv.tl.velocity(adata, mode='deterministic')
```
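The regression behind the steady-state model can be sketched in a few lines. For a line through the origin, the least-squares slope of unspliced on spliced abundance is Σ(u·s) / Σ(s²), which estimates the ratio γ/β; velocity is then the residual from that line. This example uses all cells for simplicity, whereas the steady-state approach fits on cells assumed to be at steady state, and `steady_state_ratio` is a hypothetical helper.

```python
def steady_state_ratio(s, u):
    """Least-squares slope of u on s for a line through the origin."""
    return sum(ui * si for ui, si in zip(u, s)) / sum(si * si for si in s)

# Toy steady-state cells lying exactly on u = 0.5 * s
s = [0.0, 1.0, 2.0, 4.0]
u = [0.0, 0.5, 1.0, 2.0]
ratio = steady_state_ratio(s, u)

# A new cell with s = 2.0 and more unspliced mRNA (u = 1.5) than the line predicts
residual = 1.5 - ratio * 2.0
print(ratio, residual)
```

A positive residual marks a cell above the steady-state line, that is, a gene in the process of being upregulated.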
### Stochastic Model (scVelo)
- Extends steady-state with variance/covariance terms
- Models cell-to-cell variability in mRNA counts
- More robust to noise than steady-state
```python
scv.tl.velocity(adata, mode='stochastic')
```
### Dynamical Model (scVelo, recommended)
- Jointly estimates all kinetic rates (α, β, γ) and cell-specific latent time
- Does not assume steady state
- Identifies induction vs. repression phases
- Computes fit_likelihood per gene (quality measure)
```python
scv.tl.recover_dynamics(adata, n_jobs=4)
scv.tl.velocity(adata, mode='dynamical')
```
**Kinetic states identified by dynamical model:**
| State | Description |
|-------|-------------|
| Induction | α > 0, x_u increasing |
| Steady-state on | α > 0, constant high expression |
| Repression | α = 0, x_u decreasing |
| Steady-state off | α = 0, constant low expression |
## Velocity Graph
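The velocity graph scores, for each cell, how well its velocity vector points toward each neighboring cell's state:
```python
scv.tl.velocity_graph(adata)
# Stored as a sparse matrix in adata.uns['velocity_graph']
# Entry [i, j] = cosine similarity between cell i's velocity and the displacement from cell i to cell j
# Transition probabilities are derived from it, e.g. via scv.utils.get_transition_matrix(adata)
```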
```
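The idea behind the graph can be sketched without scVelo: compare a cell's velocity vector with the displacement to each neighbor using cosine similarity, then turn the similarities into probabilities with an exponential kernel. This is a simplified illustration; the `scale` value and the helper names are assumptions, and scVelo's own transition matrix involves further steps.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def transition_probs(state, velocity, neighbor_states, scale=10.0):
    """Cosine similarities to each neighbor and their softmax-normalized weights."""
    sims = [
        cosine(velocity, [n - x for n, x in zip(nb, state)])
        for nb in neighbor_states
    ]
    weights = [math.exp(scale * c) for c in sims]
    total = sum(weights)
    return sims, [w / total for w in weights]

state = [1.0, 1.0]        # expression of two genes in the reference cell
velocity = [1.0, 0.0]     # the cell is moving along the first gene
neighbors = [[2.0, 1.0], [1.0, 2.0], [0.0, 1.0]]  # ahead, sideways, behind

sims, probs = transition_probs(state, velocity, neighbors)
print(sims, probs)
```

The neighbor lying in the direction of the velocity gets the highest similarity and therefore most of the transition probability.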
**Parameters:**
- `n_neighbors`: Number of neighbors considered
- `sqrt_transform`: Variance-stabilize velocities and state differences with a square-root transform before computing similarities
- `approx`: Compute similarities on the top principal components instead of the full gene space (faster for large datasets)
## Latent Time Interpretation
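The shared latent time in `adata.obs['latent_time']` is scaled to [0, 1]:
- 0: cells at the inferred start of the process (root)
- 1: cells at the inferred end of the process
Gene-specific times are stored in `adata.layers['fit_t']`. For each gene, the fitted switching time `adata.var['fit_t_']` marks the change from induction to repression, so the switch need not fall at the midpoint of the gene's time course.
**Shared latent time** is computed by aggregating the gene-specific times across well-fit genes (selected by `fit_likelihood`), relative to the inferred root cells.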
## Quality Metrics
### Gene-level
- `fit_likelihood`: Goodness-of-fit of dynamical model (0-1; higher = better)
- Use for filtering driver genes: `adata.var[adata.var['fit_likelihood'] > 0.1]`
- `fit_alpha`: Transcription rate during induction
- `fit_gamma`: mRNA degradation rate
- `fit_r2`: R² of kinetic fit
### Cell-level
- `velocity_length`: Magnitude of velocity vector (cell speed)
- `velocity_confidence`: Coherence of velocity with neighboring cells (0-1)
### Dataset-level
```python
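# Check overall velocity quality
scv.pl.proportions(adata, groupby='leiden')  # Spliced/unspliced proportions, overall and per cluster
# Mean speed and coherence per cluster (run scv.tl.velocity_confidence first)
print(adata.obs.groupby('leiden')[['velocity_length', 'velocity_confidence']].mean())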
```
## Parameter Tuning Guide
| Parameter | Function | Default | When to Change |
|-----------|----------|---------|----------------|
| `min_shared_counts` | Filter genes | 20 | Increase for deep sequencing; decrease for shallow |
| `n_top_genes` | HVG selection | 2000 | Increase for complex datasets |
| `n_neighbors` | kNN graph | 30 | Decrease for small datasets; increase for noisy |
| `n_pcs` | PCA dimensions | 30 | Match to elbow in scree plot |
| `root_key` | Root cells for `scv.tl.latent_time` | None | Set when the starting population is known |
## Integration with Other Tools
### CellRank (Fate Prediction)
```python
import cellrank as cr
from cellrank.kernels import VelocityKernel, ConnectivityKernel
# Combine velocity and connectivity kernels
vk = VelocityKernel(adata).compute_transition_matrix()
ck = ConnectivityKernel(adata).compute_transition_matrix()
combined = 0.8 * vk + 0.2 * ck
# Compute macrostates, then select the terminal states among them
g = cr.estimators.GPCCA(combined)
g.compute_macrostates(n_states=4, cluster_key='leiden')
g.plot_macrostates(which="all")
g.predict_terminal_states()
# Compute fate probabilities toward the terminal states
g.compute_fate_probabilities()
g.plot_fate_probabilities()
```
### Scanpy Integration
scVelo works natively with Scanpy's AnnData:
```python
import scanpy as sc
import scvelo as scv
# Run standard Scanpy pipeline first
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)
# Then add velocity on top
scv.pp.moments(adata)
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.tl.latent_time(adata)
```


@@ -0,0 +1,232 @@
"""
RNA Velocity Analysis Workflow using scVelo
===========================================
Complete pipeline from raw data to velocity visualization.
Usage:
python rna_velocity_workflow.py
Or import and use run_velocity_analysis() with your AnnData object.
"""
import scvelo as scv
import scanpy as sc
import numpy as np
import matplotlib
matplotlib.use('Agg') # Non-interactive backend
import matplotlib.pyplot as plt
import os


def run_velocity_analysis(
    adata,
    groupby="leiden",
    n_top_genes=2000,
    n_neighbors=30,
    mode="dynamical",
    n_jobs=4,
    output_dir="velocity_results",
):
    """
    Complete RNA velocity analysis workflow.

    Parameters
    ----------
    adata : AnnData
        AnnData object with 'spliced' and 'unspliced' layers.
        Should already have UMAP and cluster annotations.
    groupby : str
        Column in adata.obs for cell type labels.
    n_top_genes : int
        Number of top highly variable genes.
    n_neighbors : int
        Number of neighbors for moment computation.
    mode : str
        Velocity model: 'stochastic' (fast) or 'dynamical' (accurate).
    n_jobs : int
        Parallel jobs for dynamical model fitting.
    output_dir : str
        Directory for saving output figures.

    Returns
    -------
    AnnData with velocity annotations.
    """
    os.makedirs(output_dir, exist_ok=True)

    # ── Settings ──────────────────────────────────────────────────────────────
    scv.settings.verbosity = 2
    scv.settings.figdir = output_dir

    # ── Input check ───────────────────────────────────────────────────────────
    for layer in ("spliced", "unspliced"):
        if layer not in adata.layers:
            raise ValueError(f"Missing '{layer}' layer. Run velocyto first.")
    print(f"Input: {adata.n_obs} cells × {adata.n_vars} genes")

    # ── Step 1: Preprocessing ─────────────────────────────────────────────────
    print("Step 1/5: Preprocessing...")
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)
    if "neighbors" not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=30)
    scv.pp.moments(adata, n_pcs=30, n_neighbors=n_neighbors)
    print(f"  {adata.n_vars} genes retained after filtering")

    # ── Step 2: Velocity estimation ───────────────────────────────────────────
    print(f"Step 2/5: Fitting velocity model ({mode})...")
    if mode == "dynamical":
        scv.tl.recover_dynamics(adata, n_jobs=n_jobs)
    scv.tl.velocity(adata, mode=mode)
    scv.tl.velocity_graph(adata)
    print("  Velocity graph computed")

    # ── Step 3: Downstream analyses ───────────────────────────────────────────
    print("Step 3/5: Computing latent time and confidence...")
    scv.tl.velocity_confidence(adata)
    scv.tl.velocity_pseudotime(adata)
    if mode == "dynamical":
        scv.tl.latent_time(adata)
    if groupby in adata.obs.columns:
        scv.tl.rank_velocity_genes(adata, groupby=groupby, min_corr=0.3)

    # ── Step 4: Visualization ─────────────────────────────────────────────────
    print("Step 4/5: Generating figures...")
    # Stream plot
    scv.pl.velocity_embedding_stream(
        adata,
        basis="umap",
        color=groupby,
        title="RNA Velocity",
        save=f"{output_dir}/velocity_stream.png",
    )
    # Arrow plot
    scv.pl.velocity_embedding(
        adata,
        arrow_length=3,
        arrow_size=2,
        color=groupby,
        basis="umap",
        save=f"{output_dir}/velocity_arrows.png",
    )
    # Pseudotime
    scv.pl.scatter(
        adata,
        color="velocity_pseudotime",
        cmap="gnuplot",
        title="Velocity Pseudotime",
        save=f"{output_dir}/pseudotime.png",
    )
    if mode == "dynamical" and "latent_time" in adata.obs:
        scv.pl.scatter(
            adata,
            color="latent_time",
            color_map="gnuplot",
            title="Latent Time",
            save=f"{output_dir}/latent_time.png",
        )
    # Speed and coherence
    scv.pl.scatter(
        adata,
        c=["velocity_length", "velocity_confidence"],
        cmap="coolwarm",
        perc=[5, 95],
        save=f"{output_dir}/velocity_quality.png",
    )
    # Top driver genes heatmap (dynamical only)
    if mode == "dynamical" and "fit_likelihood" in adata.var:
        top_genes = adata.var["fit_likelihood"].sort_values(ascending=False).index[:50]
        scv.pl.heatmap(
            adata,
            var_names=top_genes,
            sortby="latent_time",
            col_color=groupby,
            n_convolve=50,
            save=f"{output_dir}/driver_gene_heatmap.png",
        )

    # ── Step 5: Save results ──────────────────────────────────────────────────
    print("Step 5/5: Saving results...")
    output_h5ad = os.path.join(output_dir, "adata_velocity.h5ad")
    adata.write_h5ad(output_h5ad)
    print(f"  Saved to {output_h5ad}")

    # Summary statistics
    confidence = adata.obs["velocity_confidence"].dropna()
    high_conf = confidence > 0.7
    # scv.tl.velocity flags the genes it actually used in adata.var['velocity_genes']
    if "velocity_genes" in adata.var:
        n_velocity_genes = int(adata.var["velocity_genes"].sum())
    else:
        n_velocity_genes = adata.n_vars
    print("\nSummary:")
    print(f"  Velocity model: {mode}")
    print(f"  Cells: {adata.n_obs}")
    print(f"  Velocity genes: {n_velocity_genes}")
    print(f"  Mean velocity confidence: {confidence.mean():.3f}")
    print(f"  High-confidence cells (>0.7): {high_conf.sum()} ({high_conf.mean():.1%})")
    if mode == "dynamical" and "fit_likelihood" in adata.var:
        good_genes = (adata.var["fit_likelihood"] > 0.1).sum()
        print(f"  Well-fit genes (likelihood>0.1): {good_genes}")
    print(f"\nOutput files saved to: {output_dir}/")
    return adata


def load_from_loom(loom_path, processed_h5ad=None):
    """
    Load velocity data from velocyto loom file.

    Args:
        loom_path: Path to velocyto output loom file
        processed_h5ad: Optional path to pre-processed Scanpy h5ad file
    """
    adata_loom = scv.read(loom_path, cache=True)
    if processed_h5ad:
        adata_processed = sc.read_h5ad(processed_h5ad)
        # Merge: keep processed metadata and add velocity layers
        adata = scv.utils.merge(adata_processed, adata_loom)
    else:
        adata = adata_loom
        # Run basic Scanpy pipeline
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        sc.pp.highly_variable_genes(adata, n_top_genes=3000)
        sc.pp.pca(adata)
        sc.pp.neighbors(adata)
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution=0.5)
    return adata


if __name__ == "__main__":
    # Example usage with scVelo's built-in pancreas dataset (downloaded on first use)
    print("scVelo RNA Velocity Workflow - Demo Mode")
    print("=" * 50)
    # Load example dataset
    adata = scv.datasets.pancreas()
    print(f"Loaded pancreas dataset: {adata}")
    # Run analysis
    adata = run_velocity_analysis(
        adata,
        groupby="clusters",
        n_top_genes=2000,
        mode="dynamical",
        n_jobs=2,
        output_dir="pancreas_velocity",
    )
    print("\nAnalysis complete!")
    print("Key results:")
    print("  adata.layers['velocity']: velocity per gene per cell")
    print("  adata.obs['latent_time']: pseudotime from dynamics")
    print("  adata.obs['velocity_confidence']: per-cell confidence")
    if "rank_velocity_genes" in adata.uns:
        print("  adata.uns['rank_velocity_genes']: driver genes per cluster")