--- name: scvelo description: RNA velocity analysis with scVelo. Estimate cell state transitions from unspliced/spliced mRNA dynamics, infer trajectory directions, compute latent time, and identify driver genes in single-cell RNA-seq data. Complements Scanpy/scVI-tools for trajectory inference. license: BSD-3-Clause metadata: skill-author: Kuan-lin Huang --- # scVelo — RNA Velocity Analysis ## Overview scVelo is the leading Python package for RNA velocity analysis in single-cell RNA-seq data. It infers cell state transitions by modeling the kinetics of mRNA splicing — using the ratio of unspliced (pre-mRNA) to spliced (mature mRNA) abundances to determine whether a gene is being upregulated or downregulated in each cell. This allows reconstruction of developmental trajectories and identification of cell fate decisions without requiring time-course data. **Installation:** `pip install scvelo` **Key resources:** - Documentation: https://scvelo.readthedocs.io/ - GitHub: https://github.com/theislab/scvelo - Paper: Bergen et al. (2020) Nature Biotechnology. PMID: 32747759 ## When to Use This Skill Use scVelo when: - **Trajectory inference from snapshot data**: Determine which direction cells are differentiating - **Cell fate prediction**: Identify progenitor cells and their downstream fates - **Driver gene identification**: Find genes whose dynamics best explain observed trajectories - **Developmental biology**: Model hematopoiesis, neurogenesis, epithelial-to-mesenchymal transitions - **Latent time estimation**: Order cells along a pseudotime derived from splicing dynamics - **Complement to Scanpy**: Add directional information to UMAP embeddings ## Prerequisites scVelo requires count matrices for both **unspliced** and **spliced** RNA. These are generated by: 1. **STARsolo** or **kallisto|bustools** with `lamanno` mode 2. **velocyto** CLI: `velocyto run10x` / `velocyto run` 3. **alevin-fry** / **simpleaf** with spliced/unspliced output Data is stored in an `AnnData` object with `layers["spliced"]` and `layers["unspliced"]`. ## Standard RNA Velocity Workflow ### 1. Setup and Data Loading ```python import scvelo as scv import scanpy as sc import numpy as np import matplotlib.pyplot as plt # Configure settings scv.settings.verbosity = 3 # Show computation steps scv.settings.presenter_view = True scv.settings.set_figure_params('scvelo') # Load data (AnnData with spliced/unspliced layers) # Option A: Load from loom (velocyto output) adata = scv.read("cellranger_output.loom", cache=True) # Option B: Merge velocyto loom with Scanpy-processed AnnData adata_processed = sc.read_h5ad("processed.h5ad") # Has UMAP, clusters adata_velocity = scv.read("velocyto.loom") adata = scv.utils.merge(adata_processed, adata_velocity) # Verify layers print(adata) # obs × var: N × G # layers: 'spliced', 'unspliced' (required) # obsm['X_umap'] (required for visualization) ``` ### 2. Preprocessing ```python # Filter and normalize (follows Scanpy conventions) scv.pp.filter_and_normalize( adata, min_shared_counts=20, # Minimum counts in spliced+unspliced n_top_genes=2000 # Top highly variable genes ) # Compute first and second order moments (means and variances) # knn_connectivities must be computed first sc.pp.neighbors(adata, n_neighbors=30, n_pcs=30) scv.pp.moments( adata, n_pcs=30, n_neighbors=30 ) ``` ### 3. Velocity Estimation — Stochastic Model The stochastic model is fast and suitable for exploratory analysis: ```python # Stochastic velocity (faster, less accurate) scv.tl.velocity(adata, mode='stochastic') scv.tl.velocity_graph(adata) # Visualize scv.pl.velocity_embedding_stream( adata, basis='umap', color='leiden', title="RNA Velocity (Stochastic)" ) ``` ### 4. Velocity Estimation — Dynamical Model (Recommended) The dynamical model fits the full splicing kinetics and is more accurate: ```python # Recover dynamics (computationally intensive; ~10-30 min for 10K cells) scv.tl.recover_dynamics(adata, n_jobs=4) # Compute velocity from dynamical model scv.tl.velocity(adata, mode='dynamical') scv.tl.velocity_graph(adata) ``` ### 5. Latent Time The dynamical model enables computation of a shared latent time (pseudotime): ```python # Compute latent time scv.tl.latent_time(adata) # Visualize latent time on UMAP scv.pl.scatter( adata, color='latent_time', color_map='gnuplot', size=80, title='Latent time' ) # Identify top genes ordered by latent time top_genes = adata.var['fit_likelihood'].sort_values(ascending=False).index[:300] scv.pl.heatmap( adata, var_names=top_genes, sortby='latent_time', col_color='leiden', n_convolve=100 ) ``` ### 6. Driver Gene Analysis ```python # Identify genes with highest velocity fit scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3) df = scv.DataFrame(adata.uns['rank_velocity_genes']['names']) print(df.head(10)) # Speed and coherence scv.tl.velocity_confidence(adata) scv.pl.scatter( adata, c=['velocity_length', 'velocity_confidence'], cmap='coolwarm', perc=[5, 95] ) # Phase portraits for specific genes scv.pl.velocity(adata, ['Cpe', 'Gnao1', 'Ins2'], ncols=3, figsize=(16, 4)) ``` ### 7. Velocity Arrows and Pseudotime ```python # Arrow plot on UMAP scv.pl.velocity_embedding( adata, arrow_length=3, arrow_size=2, color='leiden', basis='umap' ) # Stream plot (cleaner visualization) scv.pl.velocity_embedding_stream( adata, basis='umap', color='leiden', smooth=0.8, min_mass=4 ) # Velocity pseudotime (alternative to latent time) scv.tl.velocity_pseudotime(adata) scv.pl.scatter(adata, color='velocity_pseudotime', cmap='gnuplot') ``` ### 8. PAGA Trajectory Graph ```python # PAGA graph with velocity-informed transitions scv.tl.paga(adata, groups='leiden') df = scv.get_df(adata, 'paga/transitions_confidence', precision=2).T df.style.background_gradient(cmap='Blues').format('{:.2g}') # Plot PAGA with velocity scv.pl.paga( adata, basis='umap', size=50, alpha=0.1, min_edge_width=2, node_size_scale=1.5 ) ``` ## Complete Workflow Script ```python import scvelo as scv import scanpy as sc def run_rna_velocity(adata, n_top_genes=2000, mode='dynamical', n_jobs=4): """ Complete RNA velocity workflow. Args: adata: AnnData with 'spliced' and 'unspliced' layers, UMAP in obsm n_top_genes: Number of top HVGs for velocity mode: 'stochastic' (fast) or 'dynamical' (accurate) n_jobs: Parallel jobs for dynamical model Returns: Processed AnnData with velocity information """ scv.settings.verbosity = 2 # 1. Preprocessing scv.pp.filter_and_normalize(adata, min_shared_counts=20, n_top_genes=n_top_genes) if 'neighbors' not in adata.uns: sc.pp.neighbors(adata, n_neighbors=30) scv.pp.moments(adata, n_pcs=30, n_neighbors=30) # 2. Velocity estimation if mode == 'dynamical': scv.tl.recover_dynamics(adata, n_jobs=n_jobs) scv.tl.velocity(adata, mode=mode) scv.tl.velocity_graph(adata) # 3. Downstream analyses if mode == 'dynamical': scv.tl.latent_time(adata) scv.tl.rank_velocity_genes(adata, groupby='leiden', min_corr=0.3) scv.tl.velocity_confidence(adata) scv.tl.velocity_pseudotime(adata) return adata ``` ## Key Output Fields in AnnData After running the workflow, the following fields are added: | Location | Key | Description | |----------|-----|-------------| | `adata.layers` | `velocity` | RNA velocity per gene per cell | | `adata.layers` | `fit_t` | Fitted latent time per gene per cell | | `adata.obsm` | `velocity_umap` | 2D velocity vectors on UMAP | | `adata.obs` | `velocity_pseudotime` | Pseudotime from velocity | | `adata.obs` | `latent_time` | Latent time from dynamical model | | `adata.obs` | `velocity_length` | Speed of each cell | | `adata.obs` | `velocity_confidence` | Confidence score per cell | | `adata.var` | `fit_likelihood` | Gene-level model fit quality | | `adata.var` | `fit_alpha` | Transcription rate | | `adata.var` | `fit_beta` | Splicing rate | | `adata.var` | `fit_gamma` | Degradation rate | | `adata.uns` | `velocity_graph` | Cell-cell transition probability matrix | ## Velocity Models Comparison | Model | Speed | Accuracy | When to Use | |-------|-------|----------|-------------| | `stochastic` | Fast | Moderate | Exploratory; large datasets | | `deterministic` | Medium | Moderate | Simple linear kinetics | | `dynamical` | Slow | High | Publication-quality; identifies driver genes | ## Best Practices - **Start with stochastic mode** for exploration; switch to dynamical for final analysis - **Need good coverage of unspliced reads**: Short reads (< 100 bp) may miss intron coverage - **Minimum 2,000 cells**: RNA velocity is noisy with fewer cells - **Velocity should be coherent**: Arrows should follow known biology; randomness indicates issues - **k-NN bandwidth matters**: Too few neighbors → noisy velocity; too many → oversmoothed - **Sanity check**: Root cells (progenitors) should have high unspliced/spliced ratios for marker genes - **Dynamical model requires distinct kinetic states**: Works best for clear differentiation processes ## Troubleshooting | Problem | Solution | |---------|---------| | Missing unspliced layer | Re-run velocyto or use STARsolo with `--soloFeatures Gene Velocyto` | | Very few velocity genes | Lower `min_shared_counts`; check sequencing depth | | Random-looking arrows | Try different `n_neighbors` or velocity model | | Memory error with dynamical | Set `n_jobs=1`; reduce `n_top_genes` | | Negative velocity everywhere | Check that spliced/unspliced layers are not swapped | ## Additional Resources - **scVelo documentation**: https://scvelo.readthedocs.io/ - **Tutorial notebooks**: https://scvelo.readthedocs.io/tutorials/ - **GitHub**: https://github.com/theislab/scvelo - **Paper**: Bergen V et al. (2020) Nature Biotechnology. PMID: 32747759 - **velocyto** (preprocessing): http://velocyto.org/ - **CellRank** (fate prediction, extends scVelo): https://cellrank.readthedocs.io/ - **dynamo** (metabolic labeling alternative): https://dynamo-release.readthedocs.io/