mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization. - Added a script for running RNA velocity analysis with customizable parameters and output options. - Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation. - Included references for velocity models and their mathematical framework, along with a comparison of different models. - Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
322 lines
10 KiB
Markdown
322 lines
10 KiB
Markdown
---
|
||
name: scvelo
|
||
description: RNA velocity analysis with scVelo. Estimate cell state transitions from unspliced/spliced mRNA dynamics, infer trajectory directions, compute latent time, and identify driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for trajectory inference.
|
||
license: BSD-3-Clause
|
||
metadata:
|
||
skill-author: Kuan-lin Huang
|
||
---
|
||
|
||
# scVelo — RNA Velocity Analysis
|
||
|
||
## Overview
|
||
|
||
scVelo is the leading Python package for RNA velocity analysis in single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing — using the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundances to determine whether a gene is being upregulated or downregulated in each cell. This allows reconstruction of developmental trajectories and identification of cell fate decisions without requiring time-course data.
|
||
|
||
**Installation:** `pip install scvelo`
|
||
|
||
**Key resources:**
|
||
- Documentation: https://scvelo.readthedocs.io/
|
||
- GitHub: https://github.com/theislab/scvelo
|
||
- Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759
|
||
|
||
## When to Use This Skill
|
||
|
||
Use scVelo when:
|
||
|
||
- **Trajectory inference from snapshot data**: Determine which direction cells are differentiating
|
||
- **Cell fate prediction**: Identify progenitor cells and their downstream fates
|
||
- **Driver gene identification**: Find genes whose dynamics best explain observed trajectories
|
||
- **Developmental biology**: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions
|
||
- **Latent time estimation**: Order cells along a pseudotime derived from splicing dynamics
|
||
- **Complement to Scanpy**: Add directional information to UMAP embeddings
|
||
|
||
## Prerequisites
|
||
|
||
scVelo requires count matrices for both **unspliced** and **spliced** RNA. These are generated by:
|
||
1. **STARsolo** or **kallisto|bustools** with `lamanno` mode
|
||
2. **velocyto** CLI: `velocyto run10x` / `velocyto run`
|
||
3. **alevin-fry** / **simpleaf** with spliced/unspliced output
|
||
|
||
Data is stored in an `AnnData` object with `layers["spliced"]` and `layers["unspliced"]`.
|
||
|
||
## Standard RNA Velocity Workflow
|
||
|
||
### 1. Setup and Data Loading
|
||
|
||
```python
|
||
import scvelo as scv
|
||
import scanpy as sc
|
||
import numpy as np
|
||
import matplotlib.pyplot as plt
|
||
|
||
# Configure settings
|
||
scv.settings.verbosity = 3 # Show computation steps
|
||
scv.settings.presenter_view = True
|
||
scv.settings.set_figure_params('scvelo')
|
||
|
||
# Load data (AnnData with spliced/unspliced layers)
|
||
# Option A: Load from loom (velocyto output)
|
||
adata = scv.read("cellranger_output.loom", cache=True)
|
||
|
||
# Option B: Merge velocyto loom with Scanpy-processed AnnData
|
||
adata_processed = sc.read_h5ad("processed.h5ad") # Has UMAP, clusters
|
||
adata_velocity = scv.read("velocyto.loom")
|
||
adata = scv.utils.merge(adata_processed, adata_velocity)
|
||
|
||
# Verify layers
|
||
print(adata)
|
||
# obs × var: N × G
|
||
# layers: 'spliced', 'unspliced' (required)
|
||
# obsm['X_umap'] (required for visualization)
|
||
```
|
||
|
||
### 2. Preprocessing
|
||
|
||
```python
|
||
# Filter and normalize (follows Scanpy conventions)
|
||
scv.pp.filter_and_normalize(
|
||
adata,
|
||
min_shared_counts=20, # Minimum counts in spliced+unspliced
|
||
n_top_genes=2000 # Top highly variable genes
|
||
)
|
||
|
||
# Compute first and second order moments (means and variances)
|
||
# knn_connectivities must be computed first
|
||
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30)
|
||
scv.pp.moments(
|
||
adata,
|
||
n_pcs=30,
|
||
n_neighbors=30
|
||
)
|
||
```
|
||
|
||
### 3. Velocity Estimation — Stochastic Model
|
||
|
||
The stochastic model is fast and suitable for exploratory analysis:
|
||
|
||
```python
|
||
# Stochastic velocity (faster, less accurate)
|
||
scv.tl.velocity(adata, mode='stochastic')
|
||
scv.tl.velocity_graph(adata)
|
||
|
||
# Visualize
|
||
scv.pl.velocity_embedding_stream(
|
||
adata,
|
||
basis='umap',
|
||
color='leiden',
|
||
title="RNA Velocity (Stochastic)"
|
||
)
|
||
```
|
||
|
||
### 4. Velocity Estimation — Dynamical Model (Recommended)
|
||
|
||
The dynamical model fits the full splicing kinetics and is more accurate:
|
||
|
||
```python
|
||
# Recover dynamics (computationally intensive; ~10-30 min for 10K cells)
|
||
scv.tl.recover_dynamics(adata, n_jobs=4)
|
||
|
||
# Compute velocity from dynamical model
|
||
scv.tl.velocity(adata, mode='dynamical')
|
||
scv.tl.velocity_graph(adata)
|
||
```
|
||
|
||
### 5. Latent Time
|
||
|
||
The dynamical model enables computation of a shared latent time (pseudotime):
|
||
|
||
```python
|
||
# Compute latent time
|
||
scv.tl.latent_time(adata)
|
||
|
||
# Visualize latent time on UMAP
|
||
scv.pl.scatter(
|
||
adata,
|
||
color='latent_time',
|
||
color_map='gnuplot',
|
||
size=80,
|
||
title='Latent time'
|
||
)
|
||
|
||
# Identify top genes ordered by latent time
|
||
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
|
||
scv.pl.heatmap(
|
||
adata,
|
||
var_names=top_genes,
|
||
sortby='latent_time',
|
||
col_color='leiden',
|
||
n_convolve=100
|
||
)
|
||
```
|
||
|
||
### 6. Driver Gene Analysis
|
||
|
||
```python
|
||
# Identify genes with highest velocity fit
|
||
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
|
||
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
|
||
print(df.head(10))
|
||
|
||
# Speed and coherence
|
||
scv.tl.velocity_confidence(adata)
|
||
scv.pl.scatter(
|
||
adata,
|
||
c=['velocity_length', 'velocity_confidence'],
|
||
cmap='coolwarm',
|
||
perc=[5, 95]
|
||
)
|
||
|
||
# Phase portraits for specific genes
|
||
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'],
|
||
ncols=3, figsize=(16, 4))
|
||
```
|
||
|
||
### 7. Velocity Arrows and Pseudotime
|
||
|
||
```python
|
||
# Arrow plot on UMAP
|
||
scv.pl.velocity_embedding(
|
||
adata,
|
||
arrow_length=3,
|
||
arrow_size=2,
|
||
color='leiden',
|
||
basis='umap'
|
||
)
|
||
|
||
# Stream plot (cleaner visualization)
|
||
scv.pl.velocity_embedding_stream(
|
||
adata,
|
||
basis='umap',
|
||
color='leiden',
|
||
smooth=0.8,
|
||
min_mass=4
|
||
)
|
||
|
||
# Velocity pseudotime (alternative to latent time)
|
||
scv.tl.velocity_pseudotime(adata)
|
||
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
|
||
```
|
||
|
||
### 8. PAGA Trajectory Graph
|
||
|
||
```python
|
||
# PAGA graph with velocity-informed transitions
|
||
scv.tl.paga(adata, groups='leiden')
|
||
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
|
||
df.style.background_gradient(cmap='Blues').format('{:.2g}')
|
||
|
||
# Plot PAGA with velocity
|
||
scv.pl.paga(
|
||
adata,
|
||
basis='umap',
|
||
size=50,
|
||
alpha=0.1,
|
||
min_edge_width=2,
|
||
node_size_scale=1.5
|
||
)
|
||
```
|
||
|
||
## Complete Workflow Script
|
||
|
||
```python
|
||
import scvelo as scv
|
||
import scanpy as sc
|
||
|
||
def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4):
|
||
"""
|
||
Complete RNA velocity workflow.
|
||
|
||
Args:
|
||
adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm
|
||
n_top_genes: Number of top HVGs for velocity
|
||
mode: 'stochastic' (fast) or 'dynamical' (accurate)
|
||
n_jobs: Parallel jobs for dynamical model
|
||
|
||
Returns:
|
||
Processed AnnData with velocity information
|
||
"""
|
||
scv.settings.verbosity = 2
|
||
|
||
# 1. Preprocessing
|
||
scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)
|
||
|
||
if 'neighbors' not in adata.uns:
|
||
sc.pp.neighbors(adata, n_neighbors=30)
|
||
|
||
scv.pp.moments(adata, n_pcs=30, n_neighbors=30)
|
||
|
||
# 2. Velocity estimation
|
||
if mode == 'dynamical':
|
||
scv.tl.recover_dynamics(adata, n_jobs=n_jobs)
|
||
|
||
scv.tl.velocity(adata, mode=mode)
|
||
scv.tl.velocity_graph(adata)
|
||
|
||
# 3. Downstream analyses
|
||
if mode == 'dynamical':
|
||
scv.tl.latent_time(adata)
|
||
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
|
||
|
||
scv.tl.velocity_confidence(adata)
|
||
scv.tl.velocity_pseudotime(adata)
|
||
|
||
return adata
|
||
```
|
||
|
||
## Key Output Fields in AnnData
|
||
|
||
After running the workflow, the following fields are added:
|
||
|
||
| Location | Key | Description |
|
||
|----------|-----|-------------|
|
||
| `adata.layers` | `velocity` | RNA velocity per gene per cell |
|
||
| `adata.layers` | `fit_t` | Fitted latent time per gene per cell |
|
||
| `adata.obsm` | `velocity_umap` | 2D velocity vectors on UMAP |
|
||
| `adata.obs` | `velocity_pseudotime` | Pseudotime from velocity |
|
||
| `adata.obs` | `latent_time` | Latent time from dynamical model |
|
||
| `adata.obs` | `velocity_length` | Speed of each cell |
|
||
| `adata.obs` | `velocity_confidence` | Confidence score per cell |
|
||
| `adata.var` | `fit_likelihood` | Gene-level model fit quality |
|
||
| `adata.var` | `fit_alpha` | Transcription rate |
|
||
| `adata.var` | `fit_beta` | Splicing rate |
|
||
| `adata.var` | `fit_gamma` | Degradation rate |
|
||
| `adata.uns` | `velocity_graph` | Cell-cell transition probability matrix |
|
||
|
||
## Velocity Models Comparison
|
||
|
||
| Model | Speed | Accuracy | When to Use |
|
||
|-------|-------|----------|-------------|
|
||
| `stochastic` | Fast | Moderate | Exploratory; large datasets |
|
||
| `deterministic` | Medium | Moderate | Simple linear kinetics |
|
||
| `dynamical` | Slow | High | Publication-quality; identifies driver genes |
|
||
|
||
## Best Practices
|
||
|
||
- **Start with stochastic mode** for exploration; switch to dynamical for final analysis
|
||
- **Need good coverage of unspliced reads**: Short reads (< 100 bp) may miss intron coverage
|
||
- **Minimum 2,000 cells**: RNA velocity is noisy with fewer cells
|
||
- **Velocity should be coherent**: Arrows should follow known biology; randomness indicates issues
|
||
- **k-NN bandwidth matters**: Too few neighbors → noisy velocity; too many → oversmoothed
|
||
- **Sanity check**: Root cells (progenitors) should have high unspliced/spliced ratios for marker genes
|
||
- **Dynamical model requires distinct kinetic states**: Works best for clear differentiation processes
|
||
|
||
## Troubleshooting
|
||
|
||
| Problem | Solution |
|
||
|---------|---------|
|
||
| Missing unspliced layer | Re-run velocyto or use STARsolo with `--soloFeatures Gene Velocyto` |
|
||
| Very few velocity genes | Lower `min_shared_counts`; check sequencing depth |
|
||
| Random-looking arrows | Try different `n_neighbors` or velocity model |
|
||
| Memory error with dynamical | Set `n_jobs=1`; reduce `n_top_genes` |
|
||
| Negative velocity everywhere | Check that spliced/unspliced layers are not swapped |
|
||
|
||
## Additional Resources
|
||
|
||
- **scVelo documentation**: https://scvelo.readthedocs.io/
|
||
- **Tutorial notebooks**: https://scvelo.readthedocs.io/tutorials/
|
||
- **GitHub**: https://github.com/theislab/scvelo
|
||
- **Paper**: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759
|
||
- **velocyto** (preprocessing): http://velocyto.org/
|
||
- **CellRank** (fate prediction, extends scVelo): https://cellrank.readthedocs.io/
|
||
- **dynamo** (metabolic labeling alternative): https://dynamo-release.readthedocs.io/
|