mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-03-27 07:09:27 +08:00
Add scVelo RNA velocity analysis workflow and IQ-TREE reference documentation
- Introduced a comprehensive RNA velocity analysis pipeline using scVelo, including data loading, preprocessing, velocity estimation, and visualization.
- Added a script for running RNA velocity analysis with customizable parameters and output options.
- Created detailed documentation for IQ-TREE 2 phylogenetic inference, covering command syntax, model selection, bootstrapping methods, and output interpretation.
- Included references for velocity models and their mathematical framework, along with a comparison of different models.
- Enhanced the scVelo skill documentation with installation instructions, use cases, and best practices for RNA velocity analysis.
321
scientific-skills/scvelo/SKILL.md
Normal file
@@ -0,0 +1,321 @@
---
name: scvelo
description: RNA velocity analysis with scVelo. Estimate cell state transitions from unspliced/spliced mRNA dynamics, infer trajectory directions, compute latent time, and identify driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for trajectory inference.
license: BSD-3-Clause
metadata:
  skill-author: Kuan-lin Huang
---

# scVelo: RNA Velocity Analysis

## Overview

scVelo is a widely used Python package for RNA velocity analysis of single-cell RNA-seq data. It infers cell state transitions by modeling mRNA splicing kinetics: the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundance indicates whether a gene is being upregulated or downregulated in each cell. This makes it possible to reconstruct developmental trajectories and identify cell fate decisions without time-course data.

**Installation:** `pip install scvelo`

**Key resources:**
- Documentation: https://scvelo.readthedocs.io/
- GitHub: https://github.com/theislab/scvelo
- Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759

## When to Use This Skill

Use scVelo for:

- **Trajectory inference from snapshot data**: Determine the direction in which cells are differentiating
- **Cell fate prediction**: Identify progenitor cells and their downstream fates
- **Driver gene identification**: Find genes whose dynamics best explain observed trajectories
- **Developmental biology**: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions
- **Latent time estimation**: Order cells along a pseudotime derived from splicing dynamics
- **Complementing Scanpy**: Add directional information to UMAP embeddings

## Prerequisites

scVelo requires count matrices for both **unspliced** and **spliced** RNA. These can be generated by:
1. **STARsolo** (`--soloFeatures Gene Velocyto`) or **kallisto|bustools** (`kb count` with the `lamanno` workflow)
2. **velocyto** CLI: `velocyto run10x` / `velocyto run`
3. **alevin-fry** / **simpleaf** with spliced/unspliced output

The counts are stored in an `AnnData` object as `layers["spliced"]` and `layers["unspliced"]`.
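
Before running any velocity step, it can help to confirm that both layers are actually present. The following is a minimal sketch, assuming only that the object exposes a dict-like `layers` attribute as AnnData does. The helper name `missing_velocity_layers` is hypothetical and not part of scVelo.

```python
from types import SimpleNamespace

REQUIRED_LAYERS = ("spliced", "unspliced")


def missing_velocity_layers(adata):
    """Return the required layer names that are absent from adata.layers."""
    return [name for name in REQUIRED_LAYERS if name not in adata.layers]


# Stand-in object for illustration; a real AnnData is queried the same way.
toy = SimpleNamespace(layers={"spliced": [[3, 0], [1, 2]]})
print(missing_velocity_layers(toy))
```

An empty list means both layers are available; any returned name points to a quantification step that needs to be re-run.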

## Standard RNA Velocity Workflow

### 1. Setup and Data Loading

```python
import scvelo as scv
import scanpy as sc
import numpy as np
import matplotlib.pyplot as plt

# Configure settings
scv.settings.verbosity = 3  # Show computation steps
scv.settings.presenter_view = True
scv.settings.set_figure_params('scvelo')

# Load data (AnnData with spliced/unspliced layers); use one of the options below
# Option A: Load from loom (velocyto output)
adata = scv.read("cellranger_output.loom", cache=True)

# Option B: Merge velocyto loom with Scanpy-processed AnnData
adata_processed = sc.read_h5ad("processed.h5ad")  # Has UMAP, clusters
adata_velocity = scv.read("velocyto.loom")
adata = scv.utils.merge(adata_processed, adata_velocity)

# Verify layers
print(adata)
# obs × var: N × G
# layers: 'spliced', 'unspliced' (required)
# obsm['X_umap'] (required for visualization)
```

### 2. Preprocessing

```python
# Filter and normalize (follows Scanpy conventions)
scv.pp.filter_and_normalize(
    adata,
    min_shared_counts=20,  # Minimum counts per gene across spliced + unspliced
    n_top_genes=2000       # Top highly variable genes
)

# Compute first- and second-order moments (means and uncentered variances)
# over the kNN graph. scv.pp.moments builds the graph if it is missing;
# computing it explicitly here keeps the neighbor settings visible.
sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30)
scv.pp.moments(
    adata,
    n_pcs=30,
    n_neighbors=30
)
```
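
To make the moment computation less opaque: the first-order moment of a gene in a cell is, in essence, its mean expression over that cell's nearest neighbors. The sketch below illustrates the idea in plain Python, assuming a precomputed neighbor list that includes the cell itself. `first_order_moments` is a hypothetical helper, and scVelo's own implementation operates on sparse connectivity matrices rather than Python lists.

```python
def first_order_moments(values, neighbors):
    """Average each cell's value over its neighbor set.

    values    -- per-cell expression of one gene (list of floats)
    neighbors -- neighbors[i] lists the cell indices averaged for cell i
                 (assumed here to include i itself)
    """
    return [
        sum(values[j] for j in neigh) / len(neigh)
        for neigh in neighbors
    ]


# Toy data: four cells, one gene (hypothetical values).
spliced = [4.0, 0.0, 2.0, 6.0]
knn = [[0, 1], [1, 0, 2], [2, 3], [3, 2]]
print(first_order_moments(spliced, knn))
```

Smoothing of this kind is why the neighbor count matters: each cell borrows information from its neighborhood, so a larger neighborhood reduces noise at the cost of blurring local structure.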

### 3. Velocity Estimation: Stochastic Model

The stochastic model is fast and suitable for exploratory analysis:

```python
# Stochastic velocity (faster, less accurate than the dynamical model)
scv.tl.velocity(adata, mode='stochastic')
scv.tl.velocity_graph(adata)

# Visualize
scv.pl.velocity_embedding_stream(
    adata,
    basis='umap',
    color='leiden',
    title="RNA Velocity (Stochastic)"
)
```

### 4. Velocity Estimation: Dynamical Model (Recommended)

The dynamical model fits the full splicing kinetics and is more accurate:

```python
# Recover dynamics (computationally intensive; ~10-30 min for 10K cells)
scv.tl.recover_dynamics(adata, n_jobs=4)

# Compute velocity from dynamical model
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
```

### 5. Latent Time

The dynamical model enables computation of a shared latent time (pseudotime):

```python
# Compute latent time
scv.tl.latent_time(adata)

# Visualize latent time on UMAP
scv.pl.scatter(
    adata,
    color='latent_time',
    color_map='gnuplot',
    size=80,
    title='Latent time'
)

# Select the best-fit genes (by likelihood) and plot them
# with cells ordered by latent time
top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300]
scv.pl.heatmap(
    adata,
    var_names=top_genes,
    sortby='latent_time',
    col_color='leiden',
    n_convolve=100
)
```

### 6. Driver Gene Analysis

```python
# Rank genes by cluster-specific differential velocity
scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)
df = scv.DataFrame(adata.uns['rank_velocity_genes']['names'])
print(df.head(10))

# Speed and coherence
scv.tl.velocity_confidence(adata)
scv.pl.scatter(
    adata,
    c=['velocity_length', 'velocity_confidence'],
    cmap='coolwarm',
    perc=[5, 95]
)

# Phase portraits for specific genes (example mouse gene symbols; substitute your own)
scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'],
                ncols=3, figsize=(16, 4))
```

### 7. Velocity Arrows and Pseudotime

```python
# Arrow plot on UMAP
scv.pl.velocity_embedding(
    adata,
    arrow_length=3,
    arrow_size=2,
    color='leiden',
    basis='umap'
)

# Stream plot (cleaner visualization)
scv.pl.velocity_embedding_stream(
    adata,
    basis='umap',
    color='leiden',
    smooth=0.8,
    min_mass=4
)

# Velocity pseudotime (alternative to latent time)
scv.tl.velocity_pseudotime(adata)
scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot')
```

### 8. PAGA Trajectory Graph

```python
# PAGA graph with velocity-informed transitions
scv.tl.paga(adata, groups='leiden')
df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T
df.style.background_gradient(cmap='Blues').format('{:.2g}')  # Renders in a notebook

# Plot PAGA with velocity
scv.pl.paga(
    adata,
    basis='umap',
    size=50,
    alpha=0.1,
    min_edge_width=2,
    node_size_scale=1.5
)
```

## Complete Workflow Script

```python
import scvelo as scv
import scanpy as sc

def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4):
    """
    Complete RNA velocity workflow.

    Args:
        adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm
        n_top_genes: Number of top HVGs for velocity
        mode: 'stochastic' (fast) or 'dynamical' (accurate)
        n_jobs: Parallel jobs for dynamical model

    Returns:
        Processed AnnData with velocity information
    """
    scv.settings.verbosity = 2

    # 1. Preprocessing
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)

    if 'neighbors' not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=30)

    scv.pp.moments(adata, n_pcs=30, n_neighbors=30)

    # 2. Velocity estimation
    if mode == 'dynamical':
        scv.tl.recover_dynamics(adata, n_jobs=n_jobs)

    scv.tl.velocity(adata, mode=mode)
    scv.tl.velocity_graph(adata)

    # 3. Downstream analyses
    if mode == 'dynamical':
        scv.tl.latent_time(adata)
        # Gene ranking needs cluster labels; skip it if 'leiden' is absent
        if 'leiden' in adata.obs:
            scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3)

    scv.tl.velocity_confidence(adata)
    scv.tl.velocity_pseudotime(adata)

    return adata
```

## Key Output Fields in AnnData

After running the workflow, the following fields are added:

| Location | Key | Description |
|----------|-----|-------------|
| `adata.layers` | `velocity` | RNA velocity per gene per cell |
| `adata.layers` | `fit_t` | Fitted latent time per gene per cell |
| `adata.obsm` | `velocity_umap` | 2D velocity vectors on UMAP |
| `adata.obs` | `velocity_pseudotime` | Pseudotime from velocity |
| `adata.obs` | `latent_time` | Latent time from dynamical model |
| `adata.obs` | `velocity_length` | Speed of each cell |
| `adata.obs` | `velocity_confidence` | Confidence score per cell |
| `adata.var` | `fit_likelihood` | Gene-level model fit quality |
| `adata.var` | `fit_alpha` | Transcription rate |
| `adata.var` | `fit_beta` | Splicing rate |
| `adata.var` | `fit_gamma` | Degradation rate |
| `adata.uns` | `velocity_graph` | Cell-cell cosine similarities between velocities and state displacements (basis for transition probabilities) |

## Velocity Models Comparison

| Model | Speed | Accuracy | When to Use |
|-------|-------|----------|-------------|
| `stochastic` | Fast | Moderate | Exploratory; large datasets |
| `deterministic` | Fast | Moderate | Steady-state assumption with simple linear kinetics |
| `dynamical` | Slow | High | Publication-quality; identifies driver genes |

## Best Practices

- **Start with stochastic mode** for exploration; switch to dynamical for final analysis
- **Ensure good coverage of unspliced reads**: Short reads (< 100 bp) may miss intron coverage
- **Aim for roughly 2,000 cells or more**: RNA velocity is noisy with fewer cells
- **Velocity should be coherent**: Arrows should follow known biology; randomness indicates issues
- **k-NN bandwidth matters**: Too few neighbors → noisy velocity; too many → oversmoothed
- **Sanity check**: Root cells (progenitors) should have high unspliced/spliced ratios for marker genes
- **Dynamical model requires distinct kinetic states**: Works best for clear differentiation processes

## Troubleshooting

| Problem | Solution |
|---------|----------|
| Missing unspliced layer | Re-run velocyto or use STARsolo with `--soloFeatures Gene Velocyto` |
| Very few velocity genes | Lower `min_shared_counts`; check sequencing depth |
| Random-looking arrows | Try different `n_neighbors` or velocity model |
| Memory error with dynamical | Set `n_jobs=1`; reduce `n_top_genes` |
| Negative velocity everywhere | Check that spliced/unspliced layers are not swapped |

## Additional Resources

- **scVelo documentation**: https://scvelo.readthedocs.io/
- **Tutorial notebooks**: https://scvelo.readthedocs.io/tutorials/
- **GitHub**: https://github.com/theislab/scvelo
- **Paper**: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759
- **velocyto** (preprocessing): http://velocyto.org/
- **CellRank** (fate prediction, extends scVelo): https://cellrank.readthedocs.io/
- **dynamo** (metabolic labeling alternative): https://dynamo-release.readthedocs.io/
168
scientific-skills/scvelo/references/velocity_models.md
Normal file
@@ -0,0 +1,168 @@
# scVelo Velocity Models Reference

## Mathematical Framework

RNA velocity is based on a kinetic model of transcription, splicing, and degradation:

```
dx_s/dt = β·x_u - γ·x_s    (spliced dynamics)
dx_u/dt = α(t) - β·x_u     (unspliced dynamics)
```

Where:
- `x_s`: spliced mRNA abundance
- `x_u`: unspliced (pre-mRNA) abundance
- `α(t)`: transcription rate (varies over time)
- `β`: splicing rate
- `γ`: degradation rate

**Velocity** is defined as: `v = dx_s/dt = β·x_u - γ·x_s`

- **v > 0**: Gene is being upregulated (more unspliced than expected at steady state)
- **v < 0**: Gene is being downregulated (less unspliced than expected at steady state)
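
The sign rule above can be checked numerically. This is a small illustrative sketch with made-up rate constants (`beta = 1.0`, `gamma = 0.5`) and abundances. The function name `velocity` is chosen here for illustration and is unrelated to `scv.tl.velocity`.

```python
def velocity(u, s, beta, gamma):
    """Spliced velocity v = beta * u - gamma * s."""
    return beta * u - gamma * s


beta, gamma = 1.0, 0.5
s = 4.0
u_steady = gamma / beta * s  # unspliced level at which v == 0

for label, u in [("induction", 3.0), ("steady state", u_steady), ("repression", 1.0)]:
    print(f"{label:>12}: u={u:.1f}, v={velocity(u, s, beta, gamma):+.2f}")
```

For a fixed spliced level, the steady-state line `u = (γ/β)·s` separates cells with positive velocity (above the line) from cells with negative velocity (below it).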

## Model Comparison

### Steady-State (Velocyto, original)

- Assumes constant α (transcription rate)
- Fits γ by linear regression on cells assumed to be at steady state
- **Limitation**: Requires identifiable steady states; assumes constant transcription

```python
# Steady-state (deterministic) model, kept in scVelo for backward compatibility
scv.tl.velocity(adata, mode='deterministic')
```
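
The regression step can be sketched in a few lines. This is a deliberately simplified illustration, not velocyto's or scVelo's actual procedure: it uses every cell rather than only the extreme-quantile cells, fixes β = 1, and fits a line through the origin. The helper `fit_gamma` and the toy values are hypothetical.

```python
def fit_gamma(unspliced, spliced):
    """Least-squares slope through the origin of unspliced against spliced."""
    num = sum(u * s for u, s in zip(unspliced, spliced))
    den = sum(s * s for s in spliced)
    return num / den


# Toy cells lying on the line u = 0.5 * s (hypothetical values).
s = [0.0, 2.0, 4.0, 8.0]
u = [0.0, 1.0, 2.0, 4.0]
gamma = fit_gamma(u, s)

# With beta = 1, the residual from the fitted line is the velocity estimate.
new_cell_u, new_cell_s = 3.0, 4.0
v = new_cell_u - gamma * new_cell_s
print(gamma, v)
```

A cell sitting above the fitted line, like the one in the example, gets a positive velocity for that gene.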

### Stochastic Model (scVelo default mode)

- Extends steady-state with variance/covariance terms
- Models cell-to-cell variability in mRNA counts
- More robust to noise than steady-state

```python
scv.tl.velocity(adata, mode='stochastic')
```

### Dynamical Model (recommended)

- Jointly estimates all kinetic rates (α, β, γ) and cell-specific latent time
- Does not assume steady state
- Identifies induction vs. repression phases
- Computes `fit_likelihood` per gene (quality measure)

```python
scv.tl.recover_dynamics(adata, n_jobs=4)
scv.tl.velocity(adata, mode='dynamical')
```

**Kinetic states identified by dynamical model:**

| State | Description |
|-------|-------------|
| Induction | α > 0, x_u increasing |
| Steady-state on | α > 0, constant high expression |
| Repression | α = 0, x_u decreasing |
| Steady-state off | α = 0, constant low expression |
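
The induction and repression states follow directly from the kinetic equations when α is held constant within each phase. The sketch below evaluates the closed-form solution of those equations for made-up rates (`alpha = 2.0`, `beta = 1.0`, `gamma = 0.5`, with β ≠ γ assumed). It is meant only to illustrate the sign of the velocity in each phase, and the function names are hypothetical.

```python
import math


def unspliced(t, alpha, beta, u0):
    """u(t) for constant alpha, starting from u0 at t = 0."""
    e_b = math.exp(-beta * t)
    return u0 * e_b + alpha / beta * (1.0 - e_b)


def spliced(t, alpha, beta, gamma, u0, s0):
    """s(t) for constant alpha, starting from (u0, s0); requires beta != gamma."""
    e_b, e_g = math.exp(-beta * t), math.exp(-gamma * t)
    coupling = (alpha - beta * u0) / (gamma - beta)
    return s0 * e_g + alpha / gamma * (1.0 - e_g) + coupling * (e_g - e_b)


alpha, beta, gamma = 2.0, 1.0, 0.5

# Induction: start from the "off" state (0, 0) with transcription switched on.
u_on = unspliced(3.0, alpha, beta, 0.0)
s_on = spliced(3.0, alpha, beta, gamma, 0.0, 0.0)
v_on = beta * u_on - gamma * s_on

# Repression: switch transcription off (alpha = 0) from the state reached above.
u_off = unspliced(1.0, 0.0, beta, u_on)
s_off = spliced(1.0, 0.0, beta, gamma, u_on, s_on)
v_off = beta * u_off - gamma * s_off

print(f"induction:  u={u_on:.3f} s={s_on:.3f} v={v_on:+.3f}")
print(f"repression: u={u_off:.3f} s={s_off:.3f} v={v_off:+.3f}")
```

During induction the unspliced pool runs ahead of the spliced pool, so the velocity is positive; after the switch to α = 0 the unspliced pool decays first and the velocity turns negative.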
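
## Velocity Graph

The velocity graph scores, for each cell, how well its velocity vector points toward each neighboring cell, using the cosine similarity between the velocity and the displacement between the two cells' states:

```python
scv.tl.velocity_graph(adata)
# Stored in adata.uns['velocity_graph'] (sparse matrix)
# Entry [i, j] = cosine similarity between the velocity of cell i
#                and the displacement from cell i to cell j
# Transition probabilities are derived from these similarities,
# e.g. with scv.utils.get_transition_matrix(adata)
```

**Parameters:**
- `n_neighbors`: Number of neighbors considered
- `sqrt_transform`: Apply a square-root (variance-stabilizing) transform to state changes and velocities before computing similarities (disabled by default)
- `approx`: Compute similarities on the leading principal components instead of the full gene space (faster for large datasets)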
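
The idea behind the graph can be sketched without scVelo. The example below computes cosine similarities between one cell's velocity and the displacements to its neighbors, then turns them into transition probabilities with a row-normalized exponential kernel. The kernel width `scale`, the helper names, and the toy coordinates are assumptions made for illustration, not scVelo's internals.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def velocity_graph_row(i, states, velocities, neighbors):
    """Cosine similarity between cell i's velocity and the displacement to each neighbor."""
    row = {}
    for j in neighbors:
        displacement = [xj - xi for xi, xj in zip(states[i], states[j])]
        row[j] = cosine(velocities[i], displacement)
    return row


def transition_probabilities(row, scale=0.1):
    """Row-normalized exponential kernel over the cosine similarities."""
    weights = {j: math.exp(c / scale) for j, c in row.items()}
    total = sum(weights.values())
    return {j: w / total for j, w in weights.items()}


states = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
velocities = [[1.0, 0.0]] * 4  # cell 0 is moving along +x
row = velocity_graph_row(0, states, velocities, neighbors=[1, 2, 3])
probs = transition_probabilities(row)
print(row)
print(probs)
```

The neighbor lying in the direction of the velocity receives nearly all of the transition probability, while the neighbor behind the cell receives almost none.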

## Latent Time Interpretation

Latent time τ ∈ [0, 1] for each gene represents:
- τ = 0: Gene is at onset of induction
- τ = 0.5: Gene is at peak of induction (for a complete cycle)
- τ = 1: Gene has returned to steady-state off

**Shared latent time** is computed by taking the average over all velocity genes, weighted by `fit_likelihood`.

## Quality Metrics

### Gene-level
- `fit_likelihood`: Goodness-of-fit of dynamical model (0-1; higher = better)
  - Use for filtering driver genes: `adata.var[adata.var['fit_likelihood'] > 0.1]`
- `fit_alpha`: Transcription rate during induction
- `fit_gamma`: mRNA degradation rate
- `fit_r2`: R² of kinetic fit

### Cell-level
- `velocity_length`: Magnitude of velocity vector (cell speed)
- `velocity_confidence`: Coherence of velocity with neighboring cells (0-1)
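
Both cell-level metrics can be illustrated on toy vectors. In the sketch below, speed is the Euclidean norm of a cell's velocity vector, and confidence is the mean correlation between that vector and the velocities of neighboring cells. The centering step, the neighbor lists, and the helper names are simplifying assumptions for illustration, and scVelo's own computation may differ in detail.

```python
import math


def center(v):
    """Subtract the mean so the dot product below behaves like a correlation."""
    m = sum(v) / len(v)
    return [x - m for x in v]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def velocity_length(v):
    """Speed of a cell: magnitude of its velocity vector."""
    return norm(v)


def velocity_confidence(i, velocities, neighbors):
    """Mean correlation between cell i's velocity and its neighbors' velocities."""
    vi = center(velocities[i])
    scores = []
    for j in neighbors:
        vj = center(velocities[j])
        denom = norm(vi) * norm(vj)
        scores.append(sum(a * b for a, b in zip(vi, vj)) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


velocities = [
    [1.0, 0.0, -1.0],   # cell 0
    [2.0, 0.0, -2.0],   # cell 1: same direction as cell 0
    [-1.0, 0.0, 1.0],   # cell 2: opposite direction
]
print(velocity_length(velocities[0]))
print(velocity_confidence(0, velocities, neighbors=[1]))
print(velocity_confidence(0, velocities, neighbors=[1, 2]))
```

A cell whose neighbors all move the same way scores close to 1, while disagreement among neighbors pulls the score down.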

### Dataset-level
```python
# Check overall velocity quality
scv.pl.proportions(adata)  # Proportions of spliced/unspliced counts
scv.tl.velocity_confidence(adata)
# Mean speed and confidence per cluster
print(adata.obs.groupby('leiden')[['velocity_length', 'velocity_confidence']].mean())
```
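
What the proportions plot summarizes can be computed by hand. A minimal sketch, assuming per-cell count totals for each layer are already available as plain lists. `layer_proportions` is a hypothetical helper and the numbers are invented.

```python
def layer_proportions(spliced_counts, unspliced_counts):
    """Fraction of spliced and unspliced counts, pooled over all cells."""
    s_total = sum(spliced_counts)
    u_total = sum(unspliced_counts)
    total = s_total + u_total
    return {"spliced": s_total / total, "unspliced": u_total / total}


# Hypothetical per-cell totals for three cells.
props = layer_proportions([800, 750, 900], [200, 250, 100])
print(props)
```

A very small unspliced fraction is a hint that intronic reads are scarce, in which case velocity estimates are likely to be unreliable.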

## Parameter Tuning Guide

| Parameter | Function | Typical value | When to Change |
|-----------|----------|---------------|----------------|
| `min_shared_counts` | Filter genes | 20 | Increase for deep sequencing; decrease for shallow |
| `n_top_genes` | HVG selection | 2000 | Increase for complex datasets |
| `n_neighbors` | kNN graph | 30 | Decrease for small datasets; increase for noisy |
| `n_pcs` | PCA dimensions | 30 | Match to elbow in scree plot |
| `t_max_rank` | Latent time constraint | None | Set if known developmental direction |

## Integration with Other Tools

### CellRank (Fate Prediction)

```python
import cellrank as cr
from cellrank.kernels import VelocityKernel, ConnectivityKernel

# Combine velocity and connectivity kernels
vk = VelocityKernel(adata).compute_transition_matrix()
ck = ConnectivityKernel(adata).compute_transition_matrix()
combined = 0.8 * vk + 0.2 * ck

# Compute macrostates (candidate initial, intermediate, and terminal states)
g = cr.estimators.GPCCA(combined)
g.compute_macrostates(n_states=4, cluster_key='leiden')
g.plot_macrostates(which="all")

# Terminal states must be set before fate probabilities can be computed
g.predict_terminal_states()
g.compute_fate_probabilities()
g.plot_fate_probabilities()
```
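
The kernel combination `0.8 * vk + 0.2 * ck` amounts to a weighted sum of two transition matrices. The sketch below shows that arithmetic on small hand-written matrices and re-normalizes each row. It illustrates the idea only and is not CellRank's implementation. The `combine` helper and the matrices are hypothetical.

```python
def combine(weight_a, matrix_a, weight_b, matrix_b):
    """Weighted sum of two transition matrices, re-normalized per row."""
    combined = []
    for row_a, row_b in zip(matrix_a, matrix_b):
        row = [weight_a * a + weight_b * b for a, b in zip(row_a, row_b)]
        total = sum(row)
        combined.append([x / total for x in row])
    return combined


# Toy 3-cell transition matrices (each row sums to 1).
velocity_T = [[0.0, 1.0, 0.0], [0.0, 0.5, 0.5], [0.0, 0.0, 1.0]]
connectivity_T = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
T = combine(0.8, velocity_T, 0.2, connectivity_T)
for row in T:
    print([round(x, 3) for x in row])
```

Mixing in the connectivity kernel gives small non-zero probabilities to transitions that the velocity kernel rules out entirely, which softens the effect of noisy velocity estimates.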

### Scanpy Integration

scVelo works natively with Scanpy's AnnData:

```python
import scanpy as sc
import scvelo as scv

# Run standard Scanpy pipeline first
sc.pp.normalize_total(adata)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Then add velocity on top
scv.pp.moments(adata)
scv.tl.recover_dynamics(adata)
scv.tl.velocity(adata, mode='dynamical')
scv.tl.velocity_graph(adata)
scv.tl.latent_time(adata)
```
232
scientific-skills/scvelo/scripts/rna_velocity_workflow.py
Normal file
@@ -0,0 +1,232 @@
"""
RNA Velocity Analysis Workflow using scVelo
===========================================
Complete pipeline from an AnnData object with spliced/unspliced layers
to velocity visualization.

Usage:
    python rna_velocity_workflow.py

Or import and use run_velocity_analysis() with your AnnData object.
"""

import scvelo as scv
import scanpy as sc
import numpy as np
import matplotlib
matplotlib.use('Agg')  # Non-interactive backend
import matplotlib.pyplot as plt
import os


def run_velocity_analysis(
    adata,
    groupby="leiden",
    n_top_genes=2000,
    n_neighbors=30,
    mode="dynamical",
    n_jobs=4,
    output_dir="velocity_results",
):
    """
    Complete RNA velocity analysis workflow.

    Parameters
    ----------
    adata : AnnData
        AnnData object with 'spliced' and 'unspliced' layers.
        Should already have UMAP and cluster annotations.
    groupby : str
        Column in adata.obs for cell type labels.
    n_top_genes : int
        Number of top highly variable genes.
    n_neighbors : int
        Number of neighbors for moment computation.
    mode : str
        Velocity model: 'stochastic' (fast) or 'dynamical' (accurate).
    n_jobs : int
        Parallel jobs for dynamical model fitting.
    output_dir : str
        Directory for saving output figures.

    Returns
    -------
    AnnData with velocity annotations.
    """
    os.makedirs(output_dir, exist_ok=True)

    # ── Settings ──────────────────────────────────────────────────────────────
    scv.settings.verbosity = 2
    scv.settings.figdir = output_dir

    # ── Step 1: Check layers ───────────────────────────────────────────────────
    assert "spliced" in adata.layers, "Missing 'spliced' layer. Run velocyto first."
    assert "unspliced" in adata.layers, "Missing 'unspliced' layer. Run velocyto first."
    print(f"Input: {adata.n_obs} cells × {adata.n_vars} genes")

    # ── Step 2: Preprocessing ─────────────────────────────────────────────────
    print("Step 1/5: Preprocessing...")
    scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes)

    if "neighbors" not in adata.uns:
        sc.pp.neighbors(adata, n_neighbors=n_neighbors, n_pcs=30)

    scv.pp.moments(adata, n_pcs=30, n_neighbors=n_neighbors)
    print(f"  {adata.n_vars} velocity genes selected")

    # ── Step 3: Velocity estimation ────────────────────────────────────────────
    print(f"Step 2/5: Fitting velocity model ({mode})...")
    if mode == "dynamical":
        scv.tl.recover_dynamics(adata, n_jobs=n_jobs)
    scv.tl.velocity(adata, mode=mode)
    scv.tl.velocity_graph(adata)
    print("  Velocity graph computed")

    # ── Step 4: Downstream analyses ────────────────────────────────────────────
    print("Step 3/5: Computing latent time and confidence...")
    scv.tl.velocity_confidence(adata)
    scv.tl.velocity_pseudotime(adata)

    if mode == "dynamical":
        scv.tl.latent_time(adata)

    if groupby in adata.obs.columns:
        scv.tl.rank_velocity_genes(adata, groupby=groupby, min_corr=0.3)

    # ── Step 5: Visualization ─────────────────────────────────────────────────
    print("Step 4/5: Generating figures...")

    # Stream plot
    scv.pl.velocity_embedding_stream(
        adata,
        basis="umap",
        color=groupby,
        title="RNA Velocity",
        save=f"{output_dir}/velocity_stream.png",
    )

    # Arrow plot
    scv.pl.velocity_embedding(
        adata,
        arrow_length=3,
        arrow_size=2,
        color=groupby,
        basis="umap",
        save=f"{output_dir}/velocity_arrows.png",
    )

    # Pseudotime
    scv.pl.scatter(
        adata,
        color="velocity_pseudotime",
        cmap="gnuplot",
        title="Velocity Pseudotime",
        save=f"{output_dir}/pseudotime.png",
    )

    if mode == "dynamical" and "latent_time" in adata.obs:
        scv.pl.scatter(
            adata,
            color="latent_time",
            color_map="gnuplot",
            title="Latent Time",
            save=f"{output_dir}/latent_time.png",
        )

    # Speed and coherence
    scv.pl.scatter(
        adata,
        c=["velocity_length", "velocity_confidence"],
        cmap="coolwarm",
        perc=[5, 95],
        save=f"{output_dir}/velocity_quality.png",
    )

    # Top driver genes heatmap (dynamical only)
    if mode == "dynamical" and "fit_likelihood" in adata.var:
        top_genes = adata.var["fit_likelihood"].sort_values(ascending=False).index[:50]
        scv.pl.heatmap(
            adata,
            var_names=top_genes,
            sortby="latent_time",
            col_color=groupby,
            n_convolve=50,
            save=f"{output_dir}/driver_gene_heatmap.png",
        )

    # ── Step 6: Save results ───────────────────────────────────────────────────
    print("Step 5/5: Saving results...")
    output_h5ad = os.path.join(output_dir, "adata_velocity.h5ad")
    adata.write_h5ad(output_h5ad)
    print(f"  Saved to {output_h5ad}")

    # Summary statistics
    confidence = adata.obs["velocity_confidence"].dropna()
    print("\nSummary:")
    print(f"  Velocity model: {mode}")
    print(f"  Cells: {adata.n_obs}")
    print(f"  Velocity genes: {adata.n_vars}")
    print(f"  Mean velocity confidence: {confidence.mean():.3f}")
    print(f"  High-confidence cells (>0.7): {(confidence > 0.7).sum()} ({(confidence > 0.7).mean():.1%})")

    if mode == "dynamical" and "fit_likelihood" in adata.var:
        good_genes = (adata.var["fit_likelihood"] > 0.1).sum()
        print(f"  Well-fit genes (likelihood>0.1): {good_genes}")

    print(f"\nOutput files saved to: {output_dir}/")
    return adata


def load_from_loom(loom_path, processed_h5ad=None):
    """
    Load velocity data from velocyto loom file.

    Args:
        loom_path: Path to velocyto output loom file
        processed_h5ad: Optional path to pre-processed Scanpy h5ad file
    """
    adata_loom = scv.read(loom_path, cache=True)

    if processed_h5ad:
        adata_processed = sc.read_h5ad(processed_h5ad)
        # Merge: keep processed metadata and add velocity layers
        adata = scv.utils.merge(adata_processed, adata_loom)
    else:
        adata = adata_loom
        # Run basic Scanpy pipeline
        sc.pp.normalize_total(adata, target_sum=1e4)
        sc.pp.log1p(adata)
        sc.pp.highly_variable_genes(adata, n_top_genes=3000)
        sc.pp.pca(adata)
        sc.pp.neighbors(adata)
        sc.tl.umap(adata)
        sc.tl.leiden(adata, resolution=0.5)

    return adata


if __name__ == "__main__":
    # Example usage with scVelo's built-in pancreas dataset (for testing)
    print("scVelo RNA Velocity Workflow - Demo Mode")
    print("=" * 50)

    # Load example dataset
    adata = scv.datasets.pancreas()
    print(f"Loaded pancreas dataset: {adata}")

    # Run analysis
    adata = run_velocity_analysis(
        adata,
        groupby="clusters",
        n_top_genes=2000,
        mode="dynamical",
        n_jobs=2,
        output_dir="pancreas_velocity",
    )

    print("\nAnalysis complete!")
    print("Key results:")
    print("  adata.layers['velocity']: velocity per gene per cell")
    print("  adata.obs['latent_time']: pseudotime from dynamics")
    print("  adata.obs['velocity_confidence']: per-cell confidence")
    if "rank_velocity_genes" in adata.uns:
        print("  adata.uns['rank_velocity_genes']: driver genes per cluster")