diff --git a/scientific-packages/scvi-tools/SKILL.md b/scientific-packages/scvi-tools/SKILL.md new file mode 100644 index 0000000..c8aca95 --- /dev/null +++ b/scientific-packages/scvi-tools/SKILL.md @@ -0,0 +1,184 @@ +--- +name: scvi-tools +description: This skill should be used when working with single-cell omics data analysis using scvi-tools, including scRNA-seq, scATAC-seq, CITE-seq, spatial transcriptomics, and other single-cell modalities. Use this skill for probabilistic modeling, batch correction, dimensionality reduction, differential expression, cell type annotation, multimodal integration, and spatial analysis tasks. +--- + +# scvi-tools + +## Overview + +scvi-tools is a comprehensive Python framework for probabilistic models in single-cell genomics. Built on PyTorch and PyTorch Lightning, it provides deep generative models using variational inference for analyzing diverse single-cell data modalities. + +## When to Use This Skill + +Use this skill when: +- Analyzing single-cell RNA-seq data (dimensionality reduction, batch correction, integration) +- Working with single-cell ATAC-seq or chromatin accessibility data +- Integrating multimodal data (CITE-seq, multiome, paired/unpaired datasets) +- Analyzing spatial transcriptomics data (deconvolution, spatial mapping) +- Performing differential expression analysis on single-cell data +- Conducting cell type annotation or transfer learning tasks +- Working with specialized single-cell modalities (methylation, cytometry, RNA velocity) +- Building custom probabilistic models for single-cell analysis + +## Core Capabilities + +scvi-tools provides models organized by data modality: + +### 1. Single-Cell RNA-seq Analysis +Core models for expression analysis, batch correction, and integration. See `references/models-scrna-seq.md` for: +- **scVI**: Unsupervised dimensionality reduction and batch correction +- **scANVI**: Semi-supervised cell type annotation and integration +- **AUTOZI**: Zero-inflation detection and modeling +- **VeloVI**: RNA velocity analysis +- **contrastiveVI**: Perturbation effect isolation + +### 2. Chromatin Accessibility (ATAC-seq) +Models for analyzing single-cell chromatin data. See `references/models-atac-seq.md` for: +- **PeakVI**: Peak-based ATAC-seq analysis and integration +- **PoissonVI**: Quantitative fragment count modeling +- **scBasset**: Deep learning approach with motif analysis + +### 3. Multimodal & Multi-omics Integration +Joint analysis of multiple data types. See `references/models-multimodal.md` for: +- **totalVI**: CITE-seq protein and RNA joint modeling +- **MultiVI**: Paired and unpaired multi-omic integration +- **MrVI**: Multi-resolution cross-sample analysis + +### 4. Spatial Transcriptomics +Spatially-resolved transcriptomics analysis. See `references/models-spatial.md` for: +- **DestVI**: Multi-resolution spatial deconvolution +- **Stereoscope**: Cell type deconvolution +- **Tangram**: Spatial mapping and integration +- **scVIVA**: Cell-environment relationship analysis + +### 5. Specialized Modalities +Additional specialized analysis tools. See `references/models-specialized.md` for: +- **MethylVI/MethylANVI**: Single-cell methylation analysis +- **CytoVI**: Flow/mass cytometry batch correction +- **Solo**: Doublet detection +- **CellAssign**: Marker-based cell type annotation + +## Typical Workflow + +All scvi-tools models follow a consistent API pattern: + +```python +# 1. Load and preprocess data (AnnData format) +import scvi +import scanpy as sc + +adata = scvi.data.heart_cell_atlas_subsampled() +sc.pp.filter_genes(adata, min_counts=3) +sc.pp.highly_variable_genes(adata, n_top_genes=1200) + +# 2. Register data with model (specify layers, covariates) +scvi.model.SCVI.setup_anndata( + adata, + layer="counts", # Use raw counts, not log-normalized + batch_key="batch", + categorical_covariate_keys=["donor"], + continuous_covariate_keys=["percent_mito"] +) + +# 3. Create and train model +model = scvi.model.SCVI(adata) +model.train() + +# 4. Extract latent representations and normalized values +latent = model.get_latent_representation() +normalized = model.get_normalized_expression(library_size=1e4) + +# 5. Store in AnnData for downstream analysis +adata.obsm["X_scVI"] = latent +adata.layers["scvi_normalized"] = normalized + +# 6. Downstream analysis with scanpy +sc.pp.neighbors(adata, use_rep="X_scVI") +sc.tl.umap(adata) +sc.tl.leiden(adata) +``` + +**Key Design Principles:** +- **Raw counts required**: Models expect unnormalized count data for optimal performance +- **Unified API**: Consistent interface across all models (setup → train → extract) +- **AnnData-centric**: Seamless integration with the scanpy ecosystem +- **GPU acceleration**: Automatic utilization of available GPUs +- **Batch correction**: Handle technical variation through covariate registration + +## Common Analysis Tasks + +### Differential Expression +Probabilistic DE analysis using the learned generative models: + +```python +de_results = model.differential_expression( + groupby="cell_type", + group1="TypeA", + group2="TypeB", + mode="change", # Use composite hypothesis testing + delta=0.25 # Minimum effect size threshold +) +``` + +See `references/differential-expression.md` for detailed methodology and interpretation. + +### Model Persistence +Save and load trained models: + +```python +# Save model +model.save("./model_directory", overwrite=True) + +# Load model +model = scvi.model.SCVI.load("./model_directory", adata=adata) +``` + +### Batch Correction and Integration +Integrate datasets across batches or studies: + +```python +# Register batch information +scvi.model.SCVI.setup_anndata(adata, batch_key="study") + +# Model automatically learns batch-corrected representations +model = scvi.model.SCVI(adata) +model.train() +latent = model.get_latent_representation() # Batch-corrected +``` + +## Theoretical Foundations + +scvi-tools is built on: +- **Variational inference**: Approximate posterior distributions for scalable Bayesian inference +- **Deep generative models**: VAE architectures that learn complex data distributions +- **Amortized inference**: Shared neural networks for efficient learning across cells +- **Probabilistic modeling**: Principled uncertainty quantification and statistical testing + +See `references/theoretical-foundations.md` for detailed background on the mathematical framework. + +## Additional Resources + +- **Workflows**: `references/workflows.md` contains common workflows, best practices, hyperparameter tuning, and GPU optimization +- **Model References**: Detailed documentation for each model category in the `references/` directory +- **Official Documentation**: https://docs.scvi-tools.org/en/stable/ +- **Tutorials**: https://docs.scvi-tools.org/en/stable/tutorials/index.html +- **API Reference**: https://docs.scvi-tools.org/en/stable/api/index.html + +## Installation + +```bash +pip install scvi-tools +# For GPU support +pip install scvi-tools[cuda] +``` + +## Best Practices + +1. **Use raw counts**: Always provide unnormalized count data to models +2. **Filter genes**: Remove low-count genes before analysis (e.g., `min_counts=3`) +3. **Register covariates**: Include known technical factors (batch, donor, etc.) in `setup_anndata` +4. **Feature selection**: Use highly variable genes for improved performance +5. **Model saving**: Always save trained models to avoid retraining +6. **GPU usage**: Enable GPU acceleration for large datasets (`accelerator="gpu"`) +7. **Scanpy integration**: Store outputs in AnnData objects for downstream analysis diff --git a/scientific-packages/scvi-tools/references/differential-expression.md b/scientific-packages/scvi-tools/references/differential-expression.md new file mode 100644 index 0000000..e7eaa2b --- /dev/null +++ b/scientific-packages/scvi-tools/references/differential-expression.md @@ -0,0 +1,581 @@ +# Differential Expression Analysis in scvi-tools + +This document provides detailed information about differential expression (DE) analysis using scvi-tools' probabilistic framework. + +## Overview + +scvi-tools implements Bayesian differential expression testing that leverages the learned generative models to estimate expression differences between groups. This approach provides several advantages over traditional methods: + +- **Batch correction**: DE testing on batch-corrected representations +- **Uncertainty quantification**: Probabilistic estimates of effect sizes +- **Zero-inflation handling**: Proper modeling of dropout and zeros +- **Flexible comparisons**: Between any groups or cell types +- **Multiple modalities**: Works for RNA, proteins (totalVI), and accessibility (PeakVI) + +## Core Statistical Framework + +### Problem Definition + +The goal is to estimate the log fold-change in expression between two conditions: + +``` +log fold-change = log(μ_B) - log(μ_A) +``` + +Where μ_A and μ_B are the mean expression levels in conditions A and B. + +### Three-Stage Process + +**Stage 1: Estimating Expression Levels** +- Sample from posterior distribution of cellular states +- Generate expression values from the learned generative model +- Aggregate across cells to get population-level estimates + +**Stage 2: Detecting Relevant Features (Hypothesis Testing)** +- Test for differential expression using Bayesian framework +- Two testing modes available: + - **"vanilla" mode**: Point null hypothesis (β = 0) + - **"change" mode**: Composite hypothesis (|β| ≤ δ) + +**Stage 3: Controlling False Discovery** +- Posterior expected False Discovery Proportion (FDP) control +- Selects maximum number of discoveries ensuring E[FDP] ≤ α + +## Basic Usage + +### Simple Two-Group Comparison + +```python +import scvi + +# After training a model +model = scvi.model.SCVI(adata) +model.train() + +# Compare two cell types +de_results = model.differential_expression( + groupby="cell_type", + group1="T cells", + group2="B cells" +) + +# View top DE genes +top_genes = de_results.sort_values("lfc_mean", ascending=False).head(20) +print(top_genes[["lfc_mean", "lfc_std", "bayes_factor", "is_de_fdr_0.05"]]) +``` + +### One vs. Rest Comparison + +```python +# Compare one group against all others +de_results = model.differential_expression( + groupby="cell_type", + group1="T cells" # No group2 = compare to rest +) +``` + +### All Pairwise Comparisons + +```python +# Compare all cell types pairwise +all_comparisons = {} + +cell_types = adata.obs["cell_type"].unique() + +for ct1 in cell_types: + for ct2 in cell_types: + if ct1 != ct2: + key = f"{ct1}_vs_{ct2}" + all_comparisons[key] = model.differential_expression( + groupby="cell_type", + group1=ct1, + group2=ct2 + ) +``` + +## Key Parameters + +### `groupby` (required) +Column in `adata.obs` defining groups to compare. + +```python +# Must be a categorical variable +de_results = model.differential_expression(groupby="cell_type") +``` + +### `group1` and `group2` +Groups to compare. If `group2` is None, compares `group1` to all others. + +```python +# Specific comparison +de = model.differential_expression(groupby="condition", group1="treated", group2="control") + +# One vs rest +de = model.differential_expression(groupby="cell_type", group1="T cells") +``` + +### `mode` (Hypothesis Testing Mode) + +**"vanilla" mode** (default): Point null hypothesis +- Tests if β = 0 exactly +- More sensitive, but may find trivially small effects + +**"change" mode**: Composite null hypothesis +- Tests if |β| ≤ δ +- Requires biologically meaningful change +- Reduces false discoveries of tiny effects + +```python +# Change mode with minimum effect size +de = model.differential_expression( + groupby="cell_type", + group1="T cells", + group2="B cells", + mode="change", + delta=0.25 # Minimum log fold-change +) +``` + +### `delta` +Minimum effect size threshold for "change" mode. +- Typical values: 0.25, 0.5, 0.7 (log scale) +- log2(1.5) ≈ 0.58 (1.5-fold change) +- log2(2) = 1.0 (2-fold change) + +```python +# Require at least 1.5-fold change +de = model.differential_expression( + groupby="condition", + group1="disease", + group2="healthy", + mode="change", + delta=0.58 # log2(1.5) +) +``` + +### `fdr_target` +False discovery rate threshold (default: 0.05) + +```python +# More stringent FDR control +de = model.differential_expression( + groupby="cell_type", + group1="T cells", + fdr_target=0.01 +) +``` + +### `batch_correction` +Whether to perform batch correction during DE testing (default: True) + +```python +# Test within a specific batch +de = model.differential_expression( + groupby="cell_type", + group1="T cells", + group2="B cells", + batch_correction=False +) +``` + +### `n_samples` +Number of posterior samples for estimation (default: 5000) +- More samples = more accurate but slower +- Reduce for speed, increase for precision + +```python +# High precision analysis +de = model.differential_expression( + groupby="cell_type", + group1="T cells", + n_samples=10000 +) +``` + +## Interpreting Results + +### Output Columns + +The results DataFrame contains several important columns: + +**Effect Size Estimates**: +- `lfc_mean`: Mean log fold-change +- `lfc_median`: Median log fold-change +- `lfc_std`: Standard deviation of log fold-change +- `lfc_min`: Lower bound of effect size +- `lfc_max`: Upper bound of effect size + +**Statistical Significance**: +- `bayes_factor`: Bayes factor for differential expression + - Higher values = stronger evidence + - >3 often considered meaningful +- `is_de_fdr_0.05`: Boolean indicating if gene is DE at FDR 0.05 +- `is_de_fdr_0.1`: Boolean indicating if gene is DE at FDR 0.1 + +**Expression Levels**: +- `mean1`: Mean expression in group 1 +- `mean2`: Mean expression in group 2 +- `non_zeros_proportion1`: Proportion of non-zero cells in group 1 +- `non_zeros_proportion2`: Proportion of non-zero cells in group 2 + +### Example Interpretation + +```python +de_results = model.differential_expression( + groupby="cell_type", + group1="T cells", + group2="B cells" +) + +# Find significantly upregulated genes in T cells +upreg_tcells = de_results[ + (de_results["is_de_fdr_0.05"]) & + (de_results["lfc_mean"] > 0) +].sort_values("lfc_mean", ascending=False) + +print(f"Upregulated genes in T cells: {len(upreg_tcells)}") +print(upreg_tcells.head(10)) + +# Find genes with large effect sizes +large_effect = de_results[ + (de_results["is_de_fdr_0.05"]) & + (abs(de_results["lfc_mean"]) > 1) # 2-fold change +] +``` + +## Advanced Usage + +### DE Within Specific Cells + +```python +# Test DE only within a subset of cells +subset_indices = adata.obs["tissue"] == "lung" + +de = model.differential_expression( + idx1=adata.obs["cell_type"] == "T cells" & subset_indices, + idx2=adata.obs["cell_type"] == "B cells" & subset_indices +) +``` + +### Batch-Specific DE + +```python +# Test DE within each batch separately +batches = adata.obs["batch"].unique() + +batch_de_results = {} +for batch in batches: + batch_idx = adata.obs["batch"] == batch + batch_de_results[batch] = model.differential_expression( + idx1=(adata.obs["condition"] == "treated") & batch_idx, + idx2=(adata.obs["condition"] == "control") & batch_idx + ) +``` + +### Pseudo-bulk DE + +```python +# Aggregate cells before DE testing +# Useful for low cell counts per group + +de = model.differential_expression( + groupby="cell_type", + group1="rare_cell_type", + group2="common_cell_type", + n_samples=10000, # More samples for stability + batch_correction=True +) +``` + +## Visualization + +### Volcano Plot + +```python +import matplotlib.pyplot as plt +import numpy as np + +de = model.differential_expression( + groupby="condition", + group1="treated", + group2="control" +) + +# Volcano plot +plt.figure(figsize=(10, 6)) +plt.scatter( + de["lfc_mean"], + -np.log10(1 / (de["bayes_factor"] + 1)), + c=de["is_de_fdr_0.05"], + cmap="coolwarm", + alpha=0.5 +) +plt.xlabel("Log Fold Change") +plt.ylabel("-log10(1/Bayes Factor)") +plt.title("Volcano Plot: Treated vs Control") +plt.axvline(x=0, color='k', linestyle='--', linewidth=0.5) +plt.show() +``` + +### Heatmap of Top DE Genes + +```python +import seaborn as sns + +# Get top DE genes +top_genes = de.sort_values("lfc_mean", ascending=False).head(50).index + +# Get normalized expression +norm_expr = model.get_normalized_expression( + adata, + indices=adata.obs["condition"].isin(["treated", "control"]), + gene_list=top_genes +) + +# Plot heatmap +plt.figure(figsize=(12, 10)) +sns.heatmap( + norm_expr.T, + cmap="viridis", + xticklabels=False, + yticklabels=top_genes +) +plt.title("Top 50 DE Genes") +plt.show() +``` + +### Ranked Gene Plot + +```python +# Plot genes ranked by effect size +de_sorted = de.sort_values("lfc_mean", ascending=False) + +plt.figure(figsize=(12, 6)) +plt.plot(range(len(de_sorted)), de_sorted["lfc_mean"].values) +plt.axhline(y=0, color='r', linestyle='--') +plt.xlabel("Gene Rank") +plt.ylabel("Log Fold Change") +plt.title("Genes Ranked by Effect Size") +plt.show() +``` + +## Comparison with Traditional Methods + +### scvi-tools vs. Wilcoxon Test + +```python +import scanpy as sc + +# Traditional Wilcoxon test +sc.tl.rank_genes_groups( + adata, + groupby="cell_type", + method="wilcoxon", + key_added="wilcoxon" +) + +# scvi-tools DE +de_scvi = model.differential_expression( + groupby="cell_type", + group1="T cells" +) + +# Compare results +wilcox_results = sc.get.rank_genes_groups_df(adata, group="T cells", key="wilcoxon") +``` + +**Advantages of scvi-tools**: +- Accounts for batch effects automatically +- Handles zero-inflation properly +- Provides uncertainty quantification +- No arbitrary pseudocount needed +- Better statistical properties + +**When to use Wilcoxon**: +- Very quick exploratory analysis +- Comparison with published results using Wilcoxon + +## Multi-Modal DE + +### Protein DE (totalVI) + +```python +# Train totalVI on CITE-seq data +totalvi_model = scvi.model.TOTALVI(adata) +totalvi_model.train() + +# RNA differential expression +rna_de = totalvi_model.differential_expression( + groupby="cell_type", + group1="T cells", + group2="B cells", + protein_expression=False # Default +) + +# Protein differential expression +protein_de = totalvi_model.differential_expression( + groupby="cell_type", + group1="T cells", + group2="B cells", + protein_expression=True +) + +print(f"DE genes: {rna_de['is_de_fdr_0.05'].sum()}") +print(f"DE proteins: {protein_de['is_de_fdr_0.05'].sum()}") +``` + +### Differential Accessibility (PeakVI) + +```python +# Train PeakVI on ATAC-seq data +peakvi_model = scvi.model.PEAKVI(atac_adata) +peakvi_model.train() + +# Differential accessibility +da = peakvi_model.differential_accessibility( + groupby="cell_type", + group1="T cells", + group2="B cells" +) + +# Same interpretation as DE +``` + +## Handling Special Cases + +### Low Cell Count Groups + +```python +# Increase posterior samples for stability +de = model.differential_expression( + groupby="cell_type", + group1="rare_type", # e.g., 50 cells + group2="common_type", # e.g., 5000 cells + n_samples=10000 +) +``` + +### Imbalanced Comparisons + +```python +# When groups have very different sizes +# Use change mode to avoid tiny effects + +de = model.differential_expression( + groupby="condition", + group1="rare_condition", + group2="common_condition", + mode="change", + delta=0.5 +) +``` + +### Multiple Testing Correction + +```python +# Already included via FDP control +# But can apply additional corrections + +from statsmodels.stats.multitest import multipletests + +# Bonferroni correction (very conservative) +_, pvals_corrected, _, _ = multipletests( + 1 / (de["bayes_factor"] + 1), + method="bonferroni" +) +``` + +## Performance Considerations + +### Speed Optimization + +```python +# Faster DE testing for large datasets +de = model.differential_expression( + groupby="cell_type", + group1="T cells", + n_samples=1000, # Reduce samples + batch_size=512 # Increase batch size +) +``` + +### Memory Management + +```python +# For very large datasets +# Test one comparison at a time rather than all pairwise + +cell_types = adata.obs["cell_type"].unique() +for ct in cell_types: + de = model.differential_expression( + groupby="cell_type", + group1=ct + ) + # Save results + de.to_csv(f"de_results_{ct}.csv") +``` + +## Best Practices + +1. **Use "change" mode**: For biologically interpretable results +2. **Set appropriate delta**: Based on biological significance +3. **Check expression levels**: Filter lowly expressed genes +4. **Validate findings**: Check marker genes for sanity +5. **Visualize results**: Always plot top DE genes +6. **Report parameters**: Document mode, delta, FDR used +7. **Consider batch effects**: Use batch_correction=True +8. **Multiple comparisons**: Be aware of testing many groups +9. **Sample size**: Ensure sufficient cells per group (>50 recommended) +10. **Biological validation**: Follow up with functional experiments + +## Example: Complete DE Analysis Workflow + +```python +import scvi +import scanpy as sc +import matplotlib.pyplot as plt + +# 1. Train model +scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch") +model = scvi.model.SCVI(adata) +model.train() + +# 2. Perform DE analysis +de_results = model.differential_expression( + groupby="cell_type", + group1="Disease_T_cells", + group2="Healthy_T_cells", + mode="change", + delta=0.5, + fdr_target=0.05 +) + +# 3. Filter and analyze +sig_genes = de_results[de_results["is_de_fdr_0.05"]] +upreg = sig_genes[sig_genes["lfc_mean"] > 0].sort_values("lfc_mean", ascending=False) +downreg = sig_genes[sig_genes["lfc_mean"] < 0].sort_values("lfc_mean") + +print(f"Significant genes: {len(sig_genes)}") +print(f"Upregulated: {len(upreg)}") +print(f"Downregulated: {len(downreg)}") + +# 4. Visualize top genes +top_genes = upreg.head(10).index.tolist() + downreg.head(10).index.tolist() + +sc.pl.violin( + adata[adata.obs["cell_type"].isin(["Disease_T_cells", "Healthy_T_cells"])], + keys=top_genes, + groupby="cell_type", + rotation=90 +) + +# 5. Functional enrichment (using external tools) +# E.g., g:Profiler, DAVID, or gprofiler-official Python package +upreg_genes = upreg.head(100).index.tolist() +# Perform pathway analysis... + +# 6. Save results +de_results.to_csv("de_results_disease_vs_healthy.csv") +upreg.to_csv("upregulated_genes.csv") +downreg.to_csv("downregulated_genes.csv") +``` diff --git a/scientific-packages/scvi-tools/references/models-atac-seq.md b/scientific-packages/scvi-tools/references/models-atac-seq.md new file mode 100644 index 0000000..37b2f59 --- /dev/null +++ b/scientific-packages/scvi-tools/references/models-atac-seq.md @@ -0,0 +1,321 @@ +# ATAC-seq and Chromatin Accessibility Models + +This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools. + +## PeakVI + +**Purpose**: Analysis and integration of single-cell ATAC-seq data using peak counts. + +**Key Features**: +- Variational autoencoder specifically designed for scATAC-seq peak data +- Learns low-dimensional representations of chromatin accessibility +- Performs batch correction across samples +- Enables differential accessibility testing +- Integrates multiple ATAC-seq datasets + +**When to Use**: +- Analyzing scATAC-seq peak count matrices +- Integrating multiple ATAC-seq experiments +- Batch correction of chromatin accessibility data +- Dimensionality reduction for ATAC-seq +- Differential accessibility analysis between cell types or conditions + +**Data Requirements**: +- Peak count matrix (cells × peaks) +- Binary or count data for peak accessibility +- Batch/sample annotations (optional, for batch correction) + +**Basic Usage**: +```python +import scvi + +# Prepare data (peaks should be in adata.X) +# Optional: filter peaks +sc.pp.filter_genes(adata, min_cells=3) + +# Setup data +scvi.model.PEAKVI.setup_anndata( + adata, + batch_key="batch" +) + +# Train model +model = scvi.model.PEAKVI(adata) +model.train() + +# Get latent representation (batch-corrected) +latent = model.get_latent_representation() +adata.obsm["X_PeakVI"] = latent + +# Differential accessibility +da_results = model.differential_accessibility( + groupby="cell_type", + group1="TypeA", + group2="TypeB" +) +``` + +**Key Parameters**: +- `n_latent`: Dimensionality of latent space (default: 10) +- `n_hidden`: Number of nodes per hidden layer (default: 128) +- `n_layers`: Number of hidden layers (default: 1) +- `region_factors`: Whether to learn region-specific factors (default: True) +- `latent_distribution`: Distribution for latent space ("normal" or "ln") + +**Outputs**: +- `get_latent_representation()`: Low-dimensional embeddings for cells +- `get_accessibility_estimates()`: Normalized accessibility values +- `differential_accessibility()`: Statistical testing for differential peaks +- `get_region_factors()`: Peak-specific scaling factors + +**Best Practices**: +1. Filter out low-quality peaks (present in very few cells) +2. Include batch information if integrating multiple samples +3. Use latent representations for clustering and UMAP visualization +4. Consider using `region_factors=True` for datasets with high technical variation +5. Store latent embeddings in `adata.obsm` for downstream analysis with scanpy + +## PoissonVI + +**Purpose**: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts). + +**Key Features**: +- Models fragment counts directly (not just peak presence/absence) +- Poisson distribution for count data +- Captures quantitative differences in accessibility +- Enables fine-grained analysis of chromatin state + +**When to Use**: +- Analyzing fragment-level ATAC-seq data +- Need quantitative accessibility measurements +- Higher resolution analysis than binary peak calls +- Investigating gradual changes in chromatin accessibility + +**Data Requirements**: +- Fragment count matrix (cells × genomic regions) +- Count data (not binary) + +**Basic Usage**: +```python +scvi.model.POISSONVI.setup_anndata( + adata, + batch_key="batch" +) + +model = scvi.model.POISSONVI(adata) +model.train() + +# Get results +latent = model.get_latent_representation() +accessibility = model.get_accessibility_estimates() +``` + +**Key Differences from PeakVI**: +- **PeakVI**: Best for standard peak count matrices, faster +- **PoissonVI**: Best for quantitative fragment counts, more detailed + +**When to Choose PoissonVI over PeakVI**: +- Working with fragment counts rather than called peaks +- Need to capture quantitative differences +- Have high-quality, high-coverage data +- Interested in subtle accessibility changes + +## scBasset + +**Purpose**: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis. + +**Key Features**: +- Convolutional neural network (CNN) architecture for sequence-based analysis +- Models raw DNA sequences, not just peak counts +- Enables motif discovery and transcription factor (TF) binding prediction +- Provides interpretable feature importance +- Performs batch correction + +**When to Use**: +- Want to incorporate DNA sequence information +- Interested in TF motif analysis +- Need interpretable models (which sequences drive accessibility) +- Analyzing regulatory elements and TF binding sites +- Predicting accessibility from sequence alone + +**Data Requirements**: +- Peak sequences (extracted from genome) +- Peak accessibility matrix +- Genome reference (for sequence extraction) + +**Basic Usage**: +```python +# scBasset requires sequence information +# First, extract sequences for peaks +from scbasset import utils +sequences = utils.fetch_sequences(adata, genome="hg38") + +# Setup and train +scvi.model.SCBASSET.setup_anndata( + adata, + batch_key="batch" +) + +model = scvi.model.SCBASSET(adata, sequences=sequences) +model.train() + +# Get latent representation +latent = model.get_latent_representation() + +# Interpret model: which sequences/motifs are important +importance_scores = model.get_feature_importance() +``` + +**Key Parameters**: +- `n_latent`: Latent space dimensionality +- `conv_layers`: Number of convolutional layers +- `n_filters`: Number of filters per conv layer +- `filter_size`: Size of convolutional filters + +**Advanced Features**: +- **In silico mutagenesis**: Predict how sequence changes affect accessibility +- **Motif enrichment**: Identify enriched TF motifs in accessible regions +- **Batch correction**: Similar to other scvi-tools models +- **Transfer learning**: Fine-tune on new datasets + +**Interpretability Tools**: +```python +# Get importance scores for sequences +importance = model.get_sequence_importance(region_indices=[0, 1, 2]) + +# Predict accessibility for new sequences +predictions = model.predict_accessibility(new_sequences) +``` + +## Model Selection for ATAC-seq + +### PeakVI +**Choose when**: +- Standard scATAC-seq analysis workflow +- Have peak count matrices (most common format) +- Need fast, efficient batch correction +- Want straightforward differential accessibility +- Prioritize computational efficiency + +**Advantages**: +- Fast training and inference +- Proven track record for scATAC-seq +- Easy integration with scanpy workflow +- Robust batch correction + +### PoissonVI +**Choose when**: +- Have fragment-level count data +- Need quantitative accessibility measures +- Interested in subtle differences +- Have high-coverage, high-quality data + +**Advantages**: +- More detailed quantitative information +- Better for gradient changes +- Appropriate statistical model for counts + +### scBasset +**Choose when**: +- Want to incorporate DNA sequence +- Need biological interpretation (motifs, TFs) +- Interested in regulatory mechanisms +- Have computational resources for CNN training +- Want predictive power for new sequences + +**Advantages**: +- Sequence-based, biologically interpretable +- Motif and TF analysis built-in +- Predictive modeling capabilities +- In silico perturbation experiments + +## Workflow Example: Complete ATAC-seq Analysis + +```python +import scvi +import scanpy as sc + +# 1. Load and preprocess ATAC-seq data +adata = sc.read_h5ad("atac_data.h5ad") + +# 2. Filter low-quality peaks +sc.pp.filter_genes(adata, min_cells=10) + +# 3. Setup and train PeakVI +scvi.model.PEAKVI.setup_anndata( + adata, + batch_key="sample" +) + +model = scvi.model.PEAKVI(adata, n_latent=20) +model.train(max_epochs=400) + +# 4. Extract latent representation +latent = model.get_latent_representation() +adata.obsm["X_PeakVI"] = latent + +# 5. Downstream analysis +sc.pp.neighbors(adata, use_rep="X_PeakVI") +sc.tl.umap(adata) +sc.tl.leiden(adata, key_added="clusters") + +# 6. Differential accessibility +da_results = model.differential_accessibility( + groupby="clusters", + group1="0", + group2="1" +) + +# 7. Save model +model.save("peakvi_model") +``` + +## Integration with Gene Expression (RNA+ATAC) + +For paired multimodal data (RNA+ATAC from same cells), use **MultiVI** instead: + +```python +# For 10x Multiome or similar paired data +scvi.model.MULTIVI.setup_anndata( + adata, + batch_key="sample", + modality_key="modality" # "RNA" or "ATAC" +) + +model = scvi.model.MULTIVI(adata) +model.train() + +# Get joint latent space +latent = model.get_latent_representation() +``` + +See `models-multimodal.md` for more details on multimodal integration. + +## Best Practices for ATAC-seq Analysis + +1. **Quality Control**: + - Filter cells with very low or very high peak counts + - Remove peaks present in very few cells + - Filter mitochondrial and sex chromosome peaks if needed + +2. **Batch Correction**: + - Always include `batch_key` if integrating multiple samples + - Consider technical covariates (sequencing depth, TSS enrichment) + +3. **Feature Selection**: + - Unlike RNA-seq, all peaks are often used + - Consider filtering very rare peaks for efficiency + +4. **Latent Dimensions**: + - Start with `n_latent=10-30` depending on dataset complexity + - Larger values for more heterogeneous datasets + +5. **Downstream Analysis**: + - Use latent representations for clustering and visualization + - Link peaks to genes for regulatory analysis + - Perform motif enrichment on cluster-specific peaks + +6. **Computational Considerations**: + - ATAC-seq matrices are often very large (many peaks) + - Consider downsampling peaks for initial exploration + - Use GPU acceleration for large datasets diff --git a/scientific-packages/scvi-tools/references/models-multimodal.md b/scientific-packages/scvi-tools/references/models-multimodal.md new file mode 100644 index 0000000..4cd3b71 --- /dev/null +++ b/scientific-packages/scvi-tools/references/models-multimodal.md @@ -0,0 +1,367 @@ +# Multimodal and Multi-omics Integration Models + +This document covers models for joint analysis of multiple data modalities in scvi-tools. + +## totalVI (Total Variational Inference) + +**Purpose**: Joint analysis of CITE-seq data (simultaneous RNA and protein measurements from same cells). + +**Key Features**: +- Jointly models gene expression and protein abundance +- Learns shared low-dimensional representations +- Enables protein imputation from RNA data +- Performs differential expression for both modalities +- Handles batch effects in both RNA and protein layers + +**When to Use**: +- Analyzing CITE-seq or REAP-seq data +- Joint RNA + surface protein measurements +- Imputing missing proteins +- Integrating protein and RNA information +- Multi-batch CITE-seq integration + +**Data Requirements**: +- AnnData with gene expression in `.X` or a layer +- Protein measurements in `.obsm["protein_expression"]` +- Same cells measured for both modalities + +**Basic Usage**: +```python +import scvi + +# Setup data - specify both RNA and protein layers +scvi.model.TOTALVI.setup_anndata( + adata, + layer="counts", # RNA counts + protein_expression_obsm_key="protein_expression", # Protein counts + batch_key="batch" +) + +# Train model +model = scvi.model.TOTALVI(adata) +model.train() + +# Get joint latent representation +latent = model.get_latent_representation() + +# Get normalized values for both modalities +rna_normalized = model.get_normalized_expression() +protein_normalized = model.get_normalized_expression( + transform_batch="batch1", + protein_expression=True +) + +# Differential expression (works for both RNA and protein) +rna_de = model.differential_expression(groupby="cell_type") +protein_de = model.differential_expression( + groupby="cell_type", + protein_expression=True +) +``` + +**Key Parameters**: +- `n_latent`: Latent space dimensionality (default: 20) +- `n_layers_encoder`: Number of encoder layers (default: 1) +- `n_layers_decoder`: Number of decoder layers (default: 1) +- `protein_dispersion`: Protein dispersion handling ("protein" or "protein-batch") +- `empirical_protein_background_prior`: Use empirical background for proteins + +**Advanced Features**: + +**Protein Imputation**: +```python +# Impute missing proteins for RNA-only cells +# (useful for mapping RNA-seq to CITE-seq reference) +protein_foreground = model.get_protein_foreground_probability() +imputed_proteins = model.get_normalized_expression( + protein_expression=True, + n_samples=25 +) +``` + +**Denoising**: +```python +# Get denoised counts for both modalities +denoised_rna = model.get_normalized_expression(n_samples=25) +denoised_protein = model.get_normalized_expression( + protein_expression=True, + n_samples=25 +) +``` + +**Best Practices**: +1. Use empirical protein background prior for datasets with ambient protein +2. Consider protein-specific dispersion for heterogeneous protein data +3. Use joint latent space for clustering (better than RNA alone) +4. Validate protein imputation with known markers +5. Check protein QC metrics before training + +## MultiVI (Multi-modal Variational Inference) + +**Purpose**: Integration of paired and unpaired multi-omic data (e.g., RNA + ATAC, paired and unpaired cells). + +**Key Features**: +- Handles paired data (same cells) and unpaired data (different cells) +- Integrates multiple modalities: RNA, ATAC, proteins, etc. +- Missing modality imputation +- Learns shared representations across modalities +- Flexible integration strategy + +**When to Use**: +- 10x Multiome data (paired RNA + ATAC) +- Integrating separate RNA-seq and ATAC-seq experiments +- Some cells with both modalities, some with only one +- Cross-modality imputation tasks + +**Data Requirements**: +- AnnData with multiple modalities +- Modality indicators (which measurements each cell has) +- Can handle: + - All cells with both modalities (fully paired) + - Mix of paired and unpaired cells + - Completely unpaired datasets + +**Basic Usage**: +```python +# Prepare data with modality information +# adata.X should contain all features (genes + peaks) +# adata.var["modality"] indicates "Gene" or "Peak" +# adata.obs["modality"] indicates which modality each cell has + +scvi.model.MULTIVI.setup_anndata( + adata, + batch_key="batch", + modality_key="modality" # Column indicating cell modality +) + +model = scvi.model.MULTIVI(adata) +model.train() + +# Get joint latent representation +latent = model.get_latent_representation() + +# Impute missing modalities +# E.g., predict ATAC for RNA-only cells +imputed_accessibility = model.get_accessibility_estimates( + indices=rna_only_indices +) + +# Get normalized expression/accessibility +rna_normalized = model.get_normalized_expression() +atac_normalized = model.get_accessibility_estimates() +``` + +**Key Parameters**: +- `n_genes`: Number of gene features +- `n_regions`: Number of accessibility regions +- `n_latent`: Latent dimensionality (default: 20) + +**Integration Scenarios**: + +**Scenario 1: Fully Paired (10x Multiome)**: +```python +# All cells have both RNA and ATAC +# Single modality key: "paired" +adata.obs["modality"] = "paired" +``` + +**Scenario 2: Partially Paired**: +```python +# Some cells have both, some RNA-only, some ATAC-only +adata.obs["modality"] = ["RNA+ATAC", "RNA", "ATAC", ...] +``` + +**Scenario 3: Completely Unpaired**: +```python +# Separate RNA and ATAC experiments +adata.obs["modality"] = ["RNA"] * n_rna + ["ATAC"] * n_atac +``` + +**Advanced Use Cases**: + +**Cross-Modality Prediction**: +```python +# Predict peaks from gene expression +accessibility_from_rna = model.get_accessibility_estimates( + indices=rna_only_cells +) + +# Predict genes from accessibility +expression_from_atac = model.get_normalized_expression( + indices=atac_only_cells +) +``` + +**Modality-Specific Analysis**: +```python +# Separate analysis per modality +rna_subset = adata[adata.obs["modality"].str.contains("RNA")] +atac_subset = adata[adata.obs["modality"].str.contains("ATAC")] +``` + +## MrVI (Multi-resolution Variational Inference) + +**Purpose**: Multi-sample analysis accounting for sample-specific and shared variation. + +**Key Features**: +- Simultaneously analyzes multiple samples/conditions +- Decomposes variation into: + - Shared variation (common across samples) + - Sample-specific variation +- Enables sample-level comparisons +- Identifies sample-specific cell states + +**When to Use**: +- Comparing multiple biological samples or conditions +- Identifying sample-specific vs. shared cell states +- Disease vs. healthy sample comparisons +- Understanding inter-sample heterogeneity +- Multi-donor studies + +**Basic Usage**: +```python +scvi.model.MRVI.setup_anndata( + adata, + layer="counts", + batch_key="batch", + sample_key="sample" # Critical: defines biological samples +) + +model = scvi.model.MRVI(adata, n_latent=10, n_latent_sample=5) +model.train() + +# Get representations +shared_latent = model.get_latent_representation() # Shared across samples +sample_specific = model.get_sample_specific_representation() + +# Sample distance matrix +sample_distances = model.get_sample_distances() +``` + +**Key Parameters**: +- `n_latent`: Dimensionality of shared latent space +- `n_latent_sample`: Dimensionality of sample-specific space +- `sample_key`: Column defining biological samples + +**Analysis Workflow**: +```python +# 1. Identify shared cell types across samples +sc.pp.neighbors(adata, use_rep="X_MrVI_shared") +sc.tl.umap(adata) +sc.tl.leiden(adata, key_added="shared_clusters") + +# 2. Analyze sample-specific variation +sample_repr = model.get_sample_specific_representation() + +# 3. Compare samples +distances = model.get_sample_distances() + +# 4. Find sample-enriched genes +de_results = model.differential_expression( + groupby="sample", + group1="Disease", + group2="Healthy" +) +``` + +**Use Cases**: +- **Multi-donor studies**: Separate donor effects from cell type variation +- **Disease studies**: Identify disease-specific vs. shared biology +- **Time series**: Separate temporal from stable variation +- **Batch + biology**: Disentangle technical and biological variation + +## totalVI vs. MultiVI vs. MrVI: When to Use Which? + +### totalVI +**Use for**: CITE-seq (RNA + protein, same cells) +- Paired measurements +- Single modality type per feature +- Focus: protein imputation, joint analysis + +### MultiVI +**Use for**: Multiple modalities (RNA + ATAC, etc.) +- Paired, unpaired, or mixed +- Different feature types +- Focus: cross-modality integration and imputation + +### MrVI +**Use for**: Multi-sample RNA-seq +- Single modality (RNA) +- Multiple biological samples +- Focus: sample-level variation decomposition + +## Integration Best Practices + +### For CITE-seq (totalVI) +1. **Quality control proteins**: Remove low-quality antibodies +2. **Background subtraction**: Use empirical background prior +3. **Joint clustering**: Use joint latent space, not RNA alone +4. **Validation**: Check known markers in both modalities + +### For Multiome/Multi-modal (MultiVI) +1. **Feature filtering**: Filter genes and peaks independently +2. **Balance modalities**: Ensure reasonable representation of each +3. **Modality weights**: Consider if one modality dominates +4. **Imputation validation**: Validate imputed values carefully + +### For Multi-sample (MrVI) +1. **Sample definition**: Carefully define biological samples +2. **Sample size**: Need sufficient cells per sample +3. **Covariate handling**: Properly account for batch vs. sample +4. **Interpretation**: Distinguish technical from biological variation + +## Complete Example: CITE-seq Analysis with totalVI + +```python +import scvi +import scanpy as sc + +# 1. Load CITE-seq data +adata = sc.read_h5ad("cite_seq.h5ad") + +# 2. QC and filtering +sc.pp.filter_genes(adata, min_cells=3) +sc.pp.highly_variable_genes(adata, n_top_genes=4000) + +# Protein QC +protein_counts = adata.obsm["protein_expression"] +# Remove low-quality proteins + +# 3. Setup totalVI +scvi.model.TOTALVI.setup_anndata( + adata, + layer="counts", + protein_expression_obsm_key="protein_expression", + batch_key="batch" +) + +# 4. Train +model = scvi.model.TOTALVI(adata, n_latent=20) +model.train(max_epochs=400) + +# 5. Extract joint representation +latent = model.get_latent_representation() +adata.obsm["X_totalVI"] = latent + +# 6. Clustering on joint space +sc.pp.neighbors(adata, use_rep="X_totalVI") +sc.tl.umap(adata) +sc.tl.leiden(adata, resolution=0.5) + +# 7. Differential expression for both modalities +rna_de = model.differential_expression( + groupby="leiden", + group1="0", + group2="1" +) + +protein_de = model.differential_expression( + groupby="leiden", + group1="0", + group2="1", + protein_expression=True +) + +# 8. Save model +model.save("totalvi_model") +``` diff --git a/scientific-packages/scvi-tools/references/models-scrna-seq.md b/scientific-packages/scvi-tools/references/models-scrna-seq.md new file mode 100644 index 0000000..67cce68 --- /dev/null +++ b/scientific-packages/scvi-tools/references/models-scrna-seq.md @@ -0,0 +1,330 @@ +# Single-Cell RNA-seq Models + +This document covers core models for analyzing single-cell RNA sequencing data in scvi-tools. + +## scVI (Single-Cell Variational Inference) + +**Purpose**: Unsupervised analysis, dimensionality reduction, and batch correction for scRNA-seq data. + +**Key Features**: +- Deep generative model based on variational autoencoders (VAE) +- Learns low-dimensional latent representations that capture biological variation +- Automatically corrects for batch effects and technical covariates +- Enables normalized gene expression estimation +- Supports differential expression analysis + +**When to Use**: +- Initial exploration and dimensionality reduction of scRNA-seq datasets +- Integrating multiple batches or studies +- Generating batch-corrected expression matrices +- Performing probabilistic differential expression analysis + +**Basic Usage**: +```python +import scvi + +# Setup data +scvi.model.SCVI.setup_anndata( + adata, + layer="counts", + batch_key="batch" +) + +# Train model +model = scvi.model.SCVI(adata, n_latent=30) +model.train() + +# Extract results +latent = model.get_latent_representation() +normalized = model.get_normalized_expression() +``` + +**Key Parameters**: +- `n_latent`: Dimensionality of latent space (default: 10) +- `n_layers`: Number of hidden layers (default: 1) +- `n_hidden`: Number of nodes per hidden layer (default: 128) +- `dropout_rate`: Dropout rate for neural networks (default: 0.1) +- `dispersion`: Gene-specific or cell-specific dispersion ("gene" or "gene-batch") +- `gene_likelihood`: Distribution for data ("zinb", "nb", "poisson") + +**Outputs**: +- `get_latent_representation()`: Batch-corrected low-dimensional embeddings +- `get_normalized_expression()`: Denoised, normalized expression values +- `differential_expression()`: Probabilistic DE testing between groups +- `get_feature_correlation_matrix()`: Gene-gene correlation estimates + +## scANVI (Single-Cell ANnotation using Variational Inference) + +**Purpose**: Semi-supervised cell type annotation and integration using labeled and unlabeled cells. + +**Key Features**: +- Extends scVI with cell type labels +- Leverages partially labeled datasets for annotation transfer +- Performs simultaneous batch correction and cell type prediction +- Enables query-to-reference mapping + +**When to Use**: +- Annotating new datasets using reference labels +- Transfer learning from well-annotated to unlabeled datasets +- Joint analysis of labeled and unlabeled cells +- Building cell type classifiers with uncertainty quantification + +**Basic Usage**: +```python +# Option 1: Train from scratch +scvi.model.SCANVI.setup_anndata( + adata, + layer="counts", + batch_key="batch", + labels_key="cell_type", + unlabeled_category="Unknown" +) +model = scvi.model.SCANVI(adata) +model.train() + +# Option 2: Initialize from pretrained scVI +scvi_model = scvi.model.SCVI(adata) +scvi_model.train() +scanvi_model = scvi.model.SCANVI.from_scvi_model( + scvi_model, + unlabeled_category="Unknown" +) +scanvi_model.train() + +# Predict cell types +predictions = scanvi_model.predict() +``` + +**Key Parameters**: +- `labels_key`: Column in `adata.obs` containing cell type labels +- `unlabeled_category`: Label for cells without annotations +- All scVI parameters are also available + +**Outputs**: +- `predict()`: Cell type predictions for all cells +- `predict_proba()`: Prediction probabilities +- `get_latent_representation()`: Cell type-aware latent space + +## AUTOZI + +**Purpose**: Automatic identification and modeling of zero-inflated genes in scRNA-seq data. + +**Key Features**: +- Distinguishes biological zeros from technical dropout +- Learns which genes exhibit zero-inflation +- Provides gene-specific zero-inflation probabilities +- Improves downstream analysis by accounting for dropout + +**When to Use**: +- Detecting which genes are affected by technical dropout +- Improving imputation and normalization for sparse datasets +- Understanding the extent of zero-inflation in your data + +**Basic Usage**: +```python +scvi.model.AUTOZI.setup_anndata(adata, layer="counts") +model = scvi.model.AUTOZI(adata) +model.train() + +# Get zero-inflation probabilities per gene +zi_probs = model.get_alphas_betas() +``` + +## VeloVI + +**Purpose**: RNA velocity analysis using variational inference. + +**Key Features**: +- Joint modeling of spliced and unspliced RNA counts +- Probabilistic estimation of RNA velocity +- Accounts for technical noise and batch effects +- Provides uncertainty quantification for velocity estimates + +**When to Use**: +- Inferring cellular dynamics and differentiation trajectories +- Analyzing spliced/unspliced count data +- RNA velocity analysis with batch correction + +**Basic Usage**: +```python +import scvelo as scv + +# Prepare velocity data +scv.pp.filter_and_normalize(adata) +scv.pp.moments(adata) + +# Train VeloVI +scvi.model.VELOVI.setup_anndata(adata, spliced_layer="Ms", unspliced_layer="Mu") +model = scvi.model.VELOVI(adata) +model.train() + +# Get velocity estimates +latent_time = model.get_latent_time() +velocities = model.get_velocity() +``` + +## contrastiveVI + +**Purpose**: Isolating perturbation-specific variations from background biological variation. + +**Key Features**: +- Separates shared variation (common across conditions) from target-specific variation +- Useful for perturbation studies (drug treatments, genetic perturbations) +- Identifies condition-specific gene programs +- Enables discovery of treatment-specific effects + +**When to Use**: +- Analyzing perturbation experiments (drug screens, CRISPR, etc.) +- Identifying genes responding specifically to treatments +- Separating treatment effects from background variation +- Comparing control vs. perturbed conditions + +**Basic Usage**: +```python +scvi.model.CONTRASTIVEVI.setup_anndata( + adata, + layer="counts", + batch_key="batch", + categorical_covariate_keys=["condition"] # control vs treated +) + +model = scvi.model.CONTRASTIVEVI( + adata, + n_latent=10, # Shared variation + n_latent_target=5 # Target-specific variation +) +model.train() + +# Extract representations +shared = model.get_latent_representation(representation="shared") +target_specific = model.get_latent_representation(representation="target") +``` + +## CellAssign + +**Purpose**: Marker-based cell type annotation using known marker genes. + +**Key Features**: +- Uses prior knowledge of marker genes for cell types +- Probabilistic assignment of cells to types +- Handles marker gene overlap and ambiguity +- Provides soft assignments with uncertainty + +**When to Use**: +- Annotating cells using known marker genes +- Leveraging existing biological knowledge for classification +- Cases where marker gene lists are available but reference datasets are not + +**Basic Usage**: +```python +# Create marker gene matrix (cell types x genes) +marker_gene_mat = pd.DataFrame({ + "CD4 T cells": [1, 1, 0, 0], # CD3D, CD4, CD8A, CD19 + "CD8 T cells": [1, 0, 1, 0], + "B cells": [0, 0, 0, 1] +}, index=["CD3D", "CD4", "CD8A", "CD19"]) + +scvi.model.CELLASSIGN.setup_anndata(adata, layer="counts") +model = scvi.model.CELLASSIGN(adata, marker_gene_mat) +model.train() + +predictions = model.predict() +``` + +## Solo (Doublet Detection) + +**Purpose**: Identifying doublets (cells containing two or more cells) in scRNA-seq data. + +**Key Features**: +- Semi-supervised doublet detection using scVI embeddings +- Simulates artificial doublets for training +- Provides doublet probability scores +- Can be applied to any scVI model + +**When to Use**: +- Quality control of scRNA-seq datasets +- Removing doublets before downstream analysis +- Assessing doublet rates in your data + +**Basic Usage**: +```python +# Train scVI model first +scvi.model.SCVI.setup_anndata(adata, layer="counts") +scvi_model = scvi.model.SCVI(adata) +scvi_model.train() + +# Train Solo for doublet detection +solo_model = scvi.external.SOLO.from_scvi_model(scvi_model) +solo_model.train() + +# Predict doublets +predictions = solo_model.predict() +doublet_scores = predictions["doublet"] +adata.obs["doublet_score"] = doublet_scores +``` + +## Amortized LDA (Topic Modeling) + +**Purpose**: Topic modeling for gene expression using Latent Dirichlet Allocation. + +**Key Features**: +- Discovers gene expression programs (topics) +- Amortized variational inference for scalability +- Each cell is a mixture of topics +- Each topic is a distribution over genes + +**When to Use**: +- Discovering gene programs or expression modules +- Understanding compositional structure of expression +- Alternative dimensionality reduction approach +- Interpretable decomposition of expression patterns + +**Basic Usage**: +```python +scvi.model.AMORTIZEDLDA.setup_anndata(adata, layer="counts") +model = scvi.model.AMORTIZEDLDA(adata, n_topics=10) +model.train() + +# Get topic compositions per cell +topic_proportions = model.get_latent_representation() + +# Get gene loadings per topic +topic_gene_loadings = model.get_topic_distribution() +``` + +## Model Selection Guidelines + +**Choose scVI when**: +- Starting with unsupervised analysis +- Need batch correction and integration +- Want normalized expression and DE analysis + +**Choose scANVI when**: +- Have some labeled cells for training +- Need cell type annotation +- Want to transfer labels from reference to query + +**Choose AUTOZI when**: +- Concerned about technical dropout +- Need to identify zero-inflated genes +- Working with very sparse datasets + +**Choose VeloVI when**: +- Have spliced/unspliced count data +- Interested in cellular dynamics +- Need RNA velocity with batch correction + +**Choose contrastiveVI when**: +- Analyzing perturbation experiments +- Need to separate treatment effects +- Want to identify condition-specific programs + +**Choose CellAssign when**: +- Have marker gene lists available +- Want probabilistic marker-based annotation +- No reference dataset available + +**Choose Solo when**: +- Need doublet detection +- Already using scVI for analysis +- Want probabilistic doublet scores diff --git a/scientific-packages/scvi-tools/references/models-spatial.md b/scientific-packages/scvi-tools/references/models-spatial.md new file mode 100644 index 0000000..0d3e4d3 --- /dev/null +++ b/scientific-packages/scvi-tools/references/models-spatial.md @@ -0,0 +1,438 @@ +# Spatial Transcriptomics Models + +This document covers models for analyzing spatially-resolved transcriptomics data in scvi-tools. + +## DestVI (Deconvolution of Spatial Transcriptomics using Variational Inference) + +**Purpose**: Multi-resolution deconvolution of spatial transcriptomics using single-cell reference data. + +**Key Features**: +- Estimates cell type proportions at each spatial location +- Uses single-cell RNA-seq reference for deconvolution +- Multi-resolution approach (global and local patterns) +- Accounts for spatial correlation +- Provides uncertainty quantification + +**When to Use**: +- Deconvolving Visium or similar spatial transcriptomics +- Have scRNA-seq reference data with cell type labels +- Want to map cell types to spatial locations +- Interested in spatial organization of cell types +- Need probabilistic estimates of cell type abundance + +**Data Requirements**: +- **Spatial data**: Visium or similar spot-based measurements (target data) +- **Single-cell reference**: scRNA-seq with cell type annotations +- Both datasets should share genes + +**Basic Usage**: +```python +import scvi + +# Step 1: Train scVI on single-cell reference +scvi.model.SCVI.setup_anndata(sc_adata, layer="counts") +sc_model = scvi.model.SCVI(sc_adata) +sc_model.train() + +# Step 2: Setup spatial data +scvi.model.DESTVI.setup_anndata( + spatial_adata, + layer="counts" +) + +# Step 3: Train DestVI using reference +model = scvi.model.DESTVI.from_rna_model( + spatial_adata, + sc_model, + cell_type_key="cell_type" # Cell type labels in reference +) +model.train(max_epochs=2500) + +# Step 4: Get cell type proportions +proportions = model.get_proportions() +spatial_adata.obsm["proportions"] = proportions + +# Step 5: Get cell type-specific expression +# Expression of genes specific to each cell type at each spot +ct_expression = model.get_scale_for_ct("T cells") +``` + +**Key Parameters**: +- `amortization`: Amortization strategy ("both", "latent", "proportion") +- `n_latent`: Latent dimensionality (inherited from scVI model) + +**Outputs**: +- `get_proportions()`: Cell type proportions at each spot +- `get_scale_for_ct(cell_type)`: Cell type-specific expression patterns +- `get_gamma()`: Proportion-specific gene expression scaling + +**Visualization**: +```python +import scanpy as sc +import matplotlib.pyplot as plt + +# Visualize specific cell type proportions spatially +sc.pl.spatial( + spatial_adata, + color="T cells", # If proportions added to .obs + spot_size=150 +) + +# Or use obsm directly +for ct in cell_types: + plt.figure() + sc.pl.spatial( + spatial_adata, + color=spatial_adata.obsm["proportions"][ct], + title=f"{ct} proportions" + ) +``` + +## Stereoscope + +**Purpose**: Cell type deconvolution for spatial transcriptomics using probabilistic modeling. + +**Key Features**: +- Reference-based deconvolution +- Probabilistic framework for cell type proportions +- Works with various spatial technologies +- Handles gene selection and normalization + +**When to Use**: +- Similar to DestVI but simpler approach +- Deconvolving spatial data with reference +- Faster alternative for basic deconvolution + +**Basic Usage**: +```python +scvi.model.STEREOSCOPE.setup_anndata( + sc_adata, + labels_key="cell_type", + layer="counts" +) + +# Train on reference +ref_model = scvi.model.STEREOSCOPE(sc_adata) +ref_model.train() + +# Setup spatial data +scvi.model.STEREOSCOPE.setup_anndata(spatial_adata, layer="counts") + +# Transfer to spatial +spatial_model = scvi.model.STEREOSCOPE.from_reference_model( + spatial_adata, + ref_model +) +spatial_model.train() + +# Get proportions +proportions = spatial_model.get_proportions() +``` + +## Tangram + +**Purpose**: Spatial mapping and integration of single-cell data to spatial locations. + +**Key Features**: +- Maps single cells to spatial coordinates +- Learns optimal transport between single-cell and spatial data +- Gene imputation at spatial locations +- Cell type mapping + +**When to Use**: +- Mapping cells from scRNA-seq to spatial locations +- Imputing unmeasured genes in spatial data +- Understanding spatial organization at single-cell resolution +- Integrating scRNA-seq and spatial transcriptomics + +**Data Requirements**: +- Single-cell RNA-seq data with annotations +- Spatial transcriptomics data +- Shared genes between modalities + +**Basic Usage**: +```python +import tangram as tg + +# Map cells to spatial locations +ad_map = tg.map_cells_to_space( + adata_sc=sc_adata, + adata_sp=spatial_adata, + mode="cells", # or "clusters" for cell type mapping + density_prior="rna_count_based" +) + +# Get mapping matrix (cells × spots) +mapping = ad_map.X + +# Project cell annotations to space +tg.project_cell_annotations( + ad_map, + spatial_adata, + annotation="cell_type" +) + +# Impute genes in spatial data +genes_to_impute = ["CD3D", "CD8A", "CD4"] +tg.project_genes(ad_map, spatial_adata, genes=genes_to_impute) +``` + +**Visualization**: +```python +# Visualize cell type mapping +sc.pl.spatial( + spatial_adata, + color="cell_type_projected", + spot_size=100 +) +``` + +## gimVI (Gaussian Identity Multivi for Imputation) + +**Purpose**: Cross-modality imputation between spatial and single-cell data. + +**Key Features**: +- Joint model of spatial and single-cell data +- Imputes missing genes in spatial data +- Enables cross-dataset queries +- Learns shared representations + +**When to Use**: +- Imputing genes not measured in spatial data +- Joint analysis of spatial and single-cell datasets +- Mapping between modalities + +**Basic Usage**: +```python +# Combine datasets +combined_adata = sc.concat([sc_adata, spatial_adata]) + +scvi.model.GIMVI.setup_anndata( + combined_adata, + layer="counts" +) + +model = scvi.model.GIMVI(combined_adata) +model.train() + +# Impute genes in spatial data +imputed = model.get_imputed_values(spatial_indices) +``` + +## scVIVA (Variation in Variational Autoencoders for Spatial) + +**Purpose**: Analyzing cell-environment relationships in spatial data. + +**Key Features**: +- Models cellular neighborhoods and environments +- Identifies environment-associated gene expression +- Accounts for spatial correlation structure +- Cell-cell interaction analysis + +**When to Use**: +- Understanding how spatial context affects cells +- Identifying niche-specific gene programs +- Cell-cell interaction studies +- Microenvironment analysis + +**Data Requirements**: +- Spatial transcriptomics with coordinates +- Cell type annotations (optional) + +**Basic Usage**: +```python +scvi.model.SCVIVA.setup_anndata( + spatial_adata, + layer="counts", + spatial_key="spatial" # Coordinates in .obsm +) + +model = scvi.model.SCVIVA(spatial_adata) +model.train() + +# Get environment representations +env_latent = model.get_environment_representation() + +# Identify environment-associated genes +env_genes = model.get_environment_specific_genes() +``` + +## ResolVI + +**Purpose**: Addressing spatial transcriptomics noise through resolution-aware modeling. + +**Key Features**: +- Accounts for spatial resolution effects +- Denoises spatial data +- Multi-scale analysis +- Improves downstream analysis quality + +**When to Use**: +- Noisy spatial data +- Multiple spatial resolutions +- Need denoising before analysis +- Improving data quality + +**Basic Usage**: +```python +scvi.model.RESOLVI.setup_anndata( + spatial_adata, + layer="counts", + spatial_key="spatial" +) + +model = scvi.model.RESOLVI(spatial_adata) +model.train() + +# Get denoised expression +denoised = model.get_denoised_expression() +``` + +## Model Selection for Spatial Transcriptomics + +### DestVI +**Choose when**: +- Need detailed deconvolution with reference +- Have high-quality scRNA-seq reference +- Want multi-resolution analysis +- Need uncertainty quantification + +**Best for**: Visium, spot-based technologies + +### Stereoscope +**Choose when**: +- Need simpler, faster deconvolution +- Basic cell type proportion estimates +- Limited computational resources + +**Best for**: Quick deconvolution tasks + +### Tangram +**Choose when**: +- Want single-cell resolution mapping +- Need to impute many genes +- Interested in cell positioning +- Optimal transport approach preferred + +**Best for**: Detailed spatial mapping + +### gimVI +**Choose when**: +- Need bidirectional imputation +- Joint modeling of spatial and single-cell +- Cross-dataset queries + +**Best for**: Integration and imputation + +### scVIVA +**Choose when**: +- Interested in cellular environments +- Cell-cell interaction analysis +- Neighborhood effects + +**Best for**: Microenvironment studies + +### ResolVI +**Choose when**: +- Data quality is a concern +- Need denoising +- Multi-scale analysis + +**Best for**: Noisy data preprocessing + +## Complete Workflow: Spatial Deconvolution with DestVI + +```python +import scvi +import scanpy as sc +import squidpy as sq + +# ===== Part 1: Prepare single-cell reference ===== +# Load and process scRNA-seq reference +sc_adata = sc.read_h5ad("reference_scrna.h5ad") + +# QC and filtering +sc.pp.filter_genes(sc_adata, min_cells=10) +sc.pp.highly_variable_genes(sc_adata, n_top_genes=4000) + +# Train scVI on reference +scvi.model.SCVI.setup_anndata( + sc_adata, + layer="counts", + batch_key="batch" +) + +sc_model = scvi.model.SCVI(sc_adata) +sc_model.train(max_epochs=400) + +# ===== Part 2: Load spatial data ===== +spatial_adata = sc.read_visium("path/to/visium") +spatial_adata.var_names_make_unique() + +# QC spatial data +sc.pp.filter_genes(spatial_adata, min_cells=10) + +# ===== Part 3: Run DestVI ===== +scvi.model.DESTVI.setup_anndata( + spatial_adata, + layer="counts" +) + +destvi_model = scvi.model.DESTVI.from_rna_model( + spatial_adata, + sc_model, + cell_type_key="cell_type" +) + +destvi_model.train(max_epochs=2500) + +# ===== Part 4: Extract results ===== +# Get proportions +proportions = destvi_model.get_proportions() +spatial_adata.obsm["proportions"] = proportions + +# Add proportions to .obs for easy plotting +for i, ct in enumerate(sc_model.adata.obs["cell_type"].cat.categories): + spatial_adata.obs[f"prop_{ct}"] = proportions[:, i] + +# ===== Part 5: Visualization ===== +# Plot specific cell types +cell_types = ["T cells", "B cells", "Macrophages"] + +for ct in cell_types: + sc.pl.spatial( + spatial_adata, + color=f"prop_{ct}", + title=f"{ct} proportions", + spot_size=150, + cmap="viridis" + ) + +# ===== Part 6: Spatial analysis ===== +# Compute spatial neighbors +sq.gr.spatial_neighbors(spatial_adata) + +# Spatial autocorrelation of cell types +for ct in cell_types: + sq.gr.spatial_autocorr( + spatial_adata, + attr="obs", + mode="moran", + genes=[f"prop_{ct}"] + ) + +# ===== Part 7: Save results ===== +destvi_model.save("destvi_model") +spatial_adata.write("spatial_deconvolved.h5ad") +``` + +## Best Practices for Spatial Analysis + +1. **Reference quality**: Use high-quality, well-annotated scRNA-seq reference +2. **Gene overlap**: Ensure sufficient shared genes between reference and spatial +3. **Spatial coordinates**: Properly register spatial coordinates in `.obsm["spatial"]` +4. **Validation**: Use known marker genes to validate deconvolution +5. **Visualization**: Always visualize results spatially to check biological plausibility +6. **Cell type granularity**: Consider appropriate cell type resolution +7. **Computational resources**: Spatial models can be memory-intensive +8. **Quality control**: Filter low-quality spots before analysis diff --git a/scientific-packages/scvi-tools/references/models-specialized.md b/scientific-packages/scvi-tools/references/models-specialized.md new file mode 100644 index 0000000..b791789 --- /dev/null +++ b/scientific-packages/scvi-tools/references/models-specialized.md @@ -0,0 +1,408 @@ +# Specialized Modality Models + +This document covers models for specialized single-cell data modalities in scvi-tools. + +## MethylVI / MethylANVI (Methylation Analysis) + +**Purpose**: Analysis of single-cell bisulfite sequencing (scBS-seq) data for DNA methylation. + +**Key Features**: +- Models methylation patterns at single-cell resolution +- Handles sparsity in methylation data +- Batch correction for methylation experiments +- Label transfer (MethylANVI) for cell type annotation + +**When to Use**: +- Analyzing scBS-seq or similar methylation data +- Studying DNA methylation patterns across cell types +- Integrating methylation data across batches +- Cell type annotation based on methylation profiles + +**Data Requirements**: +- Methylation count matrices (methylated vs. total reads per CpG site) +- Format: Cells × CpG sites with methylation ratios or counts + +### MethylVI (Unsupervised) + +**Basic Usage**: +```python +import scvi + +# Setup methylation data +scvi.model.METHYLVI.setup_anndata( + adata, + layer="methylation_counts", # Methylation data + batch_key="batch" +) + +model = scvi.model.METHYLVI(adata) +model.train() + +# Get latent representation +latent = model.get_latent_representation() + +# Get normalized methylation values +normalized_meth = model.get_normalized_methylation() +``` + +### MethylANVI (Semi-supervised with cell types) + +**Basic Usage**: +```python +# Setup with cell type labels +scvi.model.METHYLANVI.setup_anndata( + adata, + layer="methylation_counts", + batch_key="batch", + labels_key="cell_type", + unlabeled_category="Unknown" +) + +model = scvi.model.METHYLANVI(adata) +model.train() + +# Predict cell types +predictions = model.predict() +``` + +**Key Parameters**: +- `n_latent`: Latent dimensionality +- `region_factors`: Model region-specific effects + +**Use Cases**: +- Epigenetic heterogeneity analysis +- Cell type identification via methylation +- Integration with gene expression data (separate analysis) +- Differential methylation analysis + +## CytoVI (Flow and Mass Cytometry) + +**Purpose**: Batch correction and integration of flow cytometry and mass cytometry (CyTOF) data. + +**Key Features**: +- Handles antibody-based protein measurements +- Corrects batch effects in cytometry data +- Enables integration across experiments +- Designed for high-dimensional protein panels + +**When to Use**: +- Analyzing flow cytometry or CyTOF data +- Integrating cytometry experiments across batches +- Batch correction for protein panels +- Cross-study cytometry integration + +**Data Requirements**: +- Protein expression matrix (cells × proteins) +- Flow cytometry or CyTOF measurements +- Batch/experiment annotations + +**Basic Usage**: +```python +scvi.model.CYTOVI.setup_anndata( + adata, + protein_expression_obsm_key="protein_expression", + batch_key="batch" +) + +model = scvi.model.CYTOVI(adata) +model.train() + +# Get batch-corrected representation +latent = model.get_latent_representation() + +# Get normalized protein values +normalized = model.get_normalized_expression() +``` + +**Key Parameters**: +- `n_latent`: Latent space dimensionality +- `n_layers`: Network depth + +**Typical Workflow**: +```python +import scanpy as sc + +# 1. Load cytometry data +adata = sc.read_h5ad("cytof_data.h5ad") + +# 2. Train CytoVI +scvi.model.CYTOVI.setup_anndata( + adata, + protein_expression_obsm_key="protein", + batch_key="experiment" +) +model = scvi.model.CYTOVI(adata) +model.train() + +# 3. Get batch-corrected values +latent = model.get_latent_representation() +adata.obsm["X_CytoVI"] = latent + +# 4. Downstream analysis +sc.pp.neighbors(adata, use_rep="X_CytoVI") +sc.tl.umap(adata) +sc.tl.leiden(adata) + +# 5. Visualize batch correction +sc.pl.umap(adata, color=["batch", "leiden"]) +``` + +## SysVI (Systems-level Integration) + +**Purpose**: Batch effect correction with emphasis on preserving biological variation. + +**Key Features**: +- Specialized batch integration approach +- Preserves biological signals while removing technical effects +- Designed for large-scale integration studies + +**When to Use**: +- Large-scale multi-batch integration +- Need to preserve subtle biological variation +- Systems-level analysis across many studies + +**Basic Usage**: +```python +scvi.model.SYSVI.setup_anndata( + adata, + layer="counts", + batch_key="batch" +) + +model = scvi.model.SYSVI(adata) +model.train() + +latent = model.get_latent_representation() +``` + +## Decipher (Trajectory Inference) + +**Purpose**: Trajectory inference and pseudotime analysis for single-cell data. + +**Key Features**: +- Learns cellular trajectories and differentiation paths +- Pseudotime estimation +- Accounts for uncertainty in trajectory structure +- Compatible with scVI embeddings + +**When to Use**: +- Studying cellular differentiation +- Time-course or developmental datasets +- Understanding cell state transitions +- Identifying branching points in development + +**Basic Usage**: +```python +# Typically used after scVI for embeddings +scvi_model = scvi.model.SCVI(adata) +scvi_model.train() + +# Decipher for trajectory +scvi.model.DECIPHER.setup_anndata(adata) +decipher_model = scvi.model.DECIPHER(adata, scvi_model) +decipher_model.train() + +# Get pseudotime +pseudotime = decipher_model.get_pseudotime() +adata.obs["pseudotime"] = pseudotime +``` + +**Visualization**: +```python +import scanpy as sc + +# Plot pseudotime on UMAP +sc.pl.umap(adata, color="pseudotime", cmap="viridis") + +# Gene expression along pseudotime +sc.pl.scatter(adata, x="pseudotime", y="gene_of_interest") +``` + +## peRegLM (Peak Regulatory Linear Model) + +**Purpose**: Linking chromatin accessibility to gene expression for regulatory analysis. + +**Key Features**: +- Links ATAC-seq peaks to gene expression +- Identifies regulatory relationships +- Works with paired multiome data + +**When to Use**: +- Multiome data (RNA + ATAC from same cells) +- Understanding gene regulation +- Linking peaks to target genes +- Regulatory network construction + +**Basic Usage**: +```python +# Requires paired RNA + ATAC data +scvi.model.PEREGLM.setup_anndata( + multiome_adata, + rna_layer="counts", + atac_layer="atac_counts" +) + +model = scvi.model.PEREGLM(multiome_adata) +model.train() + +# Get peak-gene links +peak_gene_links = model.get_regulatory_links() +``` + +## Model-Specific Best Practices + +### MethylVI/MethylANVI +1. **Sparsity**: Methylation data is inherently sparse; model accounts for this +2. **CpG selection**: Filter CpGs with very low coverage +3. **Biological interpretation**: Consider genomic context (promoters, enhancers) +4. **Integration**: For multi-omics, analyze separately then integrate results + +### CytoVI +1. **Protein QC**: Remove low-quality or uninformative proteins +2. **Compensation**: Ensure proper spectral compensation before analysis +3. **Batch design**: Include biological and technical replicates +4. **Controls**: Use control samples to validate batch correction + +### SysVI +1. **Sample size**: Designed for large-scale integration +2. **Batch definition**: Carefully define batch structure +3. **Biological validation**: Verify biological signals preserved + +### Decipher +1. **Start point**: Define trajectory start cells if known +2. **Branching**: Specify expected number of branches +3. **Validation**: Use known markers to validate pseudotime +4. **Integration**: Works well with scVI embeddings + +## Integration with Other Models + +Many specialized models work well in combination: + +**Methylation + Expression**: +```python +# Analyze separately, then integrate +methylvi_model = scvi.model.METHYLVI(meth_adata) +scvi_model = scvi.model.SCVI(rna_adata) + +# Integrate results at analysis level +# E.g., correlate methylation and expression patterns +``` + +**Cytometry + CITE-seq**: +```python +# CytoVI for flow/CyTOF +cyto_model = scvi.model.CYTOVI(cyto_adata) + +# totalVI for CITE-seq +cite_model = scvi.model.TOTALVI(cite_adata) + +# Compare protein measurements across platforms +``` + +**ATAC + RNA (Multiome)**: +```python +# MultiVI for joint analysis +multivi_model = scvi.model.MULTIVI(multiome_adata) + +# peRegLM for regulatory links +pereglm_model = scvi.model.PEREGLM(multiome_adata) +``` + +## Choosing Specialized Models + +### Decision Tree + +1. **What data modality?** + - Methylation → MethylVI/MethylANVI + - Flow/CyTOF → CytoVI + - Trajectory → Decipher + - Multi-batch integration → SysVI + - Regulatory links → peRegLM + +2. **Do you have labels?** + - Yes → MethylANVI (methylation) + - No → MethylVI (methylation) + +3. **What's your main goal?** + - Batch correction → CytoVI, SysVI + - Trajectory/pseudotime → Decipher + - Peak-gene links → peRegLM + - Methylation patterns → MethylVI/ANVI + +## Example: Complete Methylation Analysis + +```python +import scvi +import scanpy as sc + +# 1. Load methylation data +meth_adata = sc.read_h5ad("methylation_data.h5ad") + +# 2. QC: filter low-coverage CpG sites +sc.pp.filter_genes(meth_adata, min_cells=10) + +# 3. Setup MethylVI +scvi.model.METHYLVI.setup_anndata( + meth_adata, + layer="methylation", + batch_key="batch" +) + +# 4. Train model +model = scvi.model.METHYLVI(meth_adata, n_latent=15) +model.train(max_epochs=400) + +# 5. Get latent representation +latent = model.get_latent_representation() +meth_adata.obsm["X_MethylVI"] = latent + +# 6. Clustering +sc.pp.neighbors(meth_adata, use_rep="X_MethylVI") +sc.tl.umap(meth_adata) +sc.tl.leiden(meth_adata) + +# 7. Differential methylation +dm_results = model.differential_methylation( + groupby="leiden", + group1="0", + group2="1" +) + +# 8. Save +model.save("methylvi_model") +meth_adata.write("methylation_analyzed.h5ad") +``` + +## External Tools Integration + +Some specialized models are available as external packages: + +**SOLO** (doublet detection): +```python +from scvi.external import SOLO + +solo = SOLO.from_scvi_model(scvi_model) +solo.train() +doublets = solo.predict() +``` + +**scArches** (reference mapping): +```python +from scvi.external import SCARCHES + +# For transfer learning and query-to-reference mapping +``` + +These external tools extend scvi-tools functionality for specific use cases. + +## Summary Table + +| Model | Data Type | Primary Use | Supervised? | +|-------|-----------|-------------|-------------| +| MethylVI | Methylation | Unsupervised analysis | No | +| MethylANVI | Methylation | Cell type annotation | Semi | +| CytoVI | Cytometry | Batch correction | No | +| SysVI | scRNA-seq | Large-scale integration | No | +| Decipher | scRNA-seq | Trajectory inference | No | +| peRegLM | Multiome | Peak-gene links | No | +| SOLO | scRNA-seq | Doublet detection | Semi | diff --git a/scientific-packages/scvi-tools/references/workflows.md b/scientific-packages/scvi-tools/references/workflows.md new file mode 100644 index 0000000..d1fa86d --- /dev/null +++ b/scientific-packages/scvi-tools/references/workflows.md @@ -0,0 +1,546 @@ +# Common Workflows and Best Practices + +This document covers common workflows, best practices, and advanced usage patterns for scvi-tools. + +## Standard Analysis Workflow + +### 1. Data Loading and Preparation + +```python +import scvi +import scanpy as sc +import numpy as np + +# Load data (AnnData format required) +adata = sc.read_h5ad("data.h5ad") +# Or load from other formats +# adata = sc.read_10x_mtx("filtered_feature_bc_matrix/") +# adata = sc.read_csv("counts.csv") + +# Basic QC metrics +sc.pp.calculate_qc_metrics(adata, inplace=True) +adata.var['mt'] = adata.var_names.str.startswith('MT-') +sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], inplace=True) +``` + +### 2. Quality Control + +```python +# Filter cells +sc.pp.filter_cells(adata, min_genes=200) +sc.pp.filter_cells(adata, max_genes=5000) + +# Filter genes +sc.pp.filter_genes(adata, min_cells=3) + +# Filter by mitochondrial content +adata = adata[adata.obs['pct_counts_mt'] < 20, :] + +# Remove doublets (optional, before training) +sc.external.pp.scrublet(adata) +adata = adata[~adata.obs['predicted_doublet'], :] +``` + +### 3. Preprocessing for scvi-tools + +```python +# IMPORTANT: scvi-tools needs RAW counts +# If you've already normalized, use the raw layer or reload data + +# Save raw counts if not already available +if 'counts' not in adata.layers: + adata.layers['counts'] = adata.X.copy() + +# Feature selection (optional but recommended) +sc.pp.highly_variable_genes( + adata, + n_top_genes=4000, + subset=False, # Keep all genes, just mark HVGs + batch_key="batch" # If multiple batches +) + +# Filter to HVGs (optional) +# adata = adata[:, adata.var['highly_variable']] +``` + +### 4. Register Data with scvi-tools + +```python +# Setup AnnData for scvi-tools +scvi.model.SCVI.setup_anndata( + adata, + layer="counts", # Use raw counts + batch_key="batch", # Technical batches + categorical_covariate_keys=["donor", "condition"], + continuous_covariate_keys=["percent_mito", "n_counts"] +) + +# Check registration +adata.uns['_scvi']['summary_stats'] +``` + +### 5. Model Training + +```python +# Create model +model = scvi.model.SCVI( + adata, + n_latent=30, # Latent dimensions + n_layers=2, # Network depth + n_hidden=128, # Hidden layer size + dropout_rate=0.1, + gene_likelihood="zinb" # zero-inflated negative binomial +) + +# Train model +model.train( + max_epochs=400, + batch_size=128, + train_size=0.9, + early_stopping=True, + check_val_every_n_epoch=10 +) + +# View training history +train_history = model.history["elbo_train"] +val_history = model.history["elbo_validation"] +``` + +### 6. Extract Results + +```python +# Get latent representation +latent = model.get_latent_representation() +adata.obsm["X_scVI"] = latent + +# Get normalized expression +normalized = model.get_normalized_expression( + adata, + library_size=1e4, + n_samples=25 # Monte Carlo samples +) +adata.layers["scvi_normalized"] = normalized +``` + +### 7. Downstream Analysis + +```python +# Clustering on scVI latent space +sc.pp.neighbors(adata, use_rep="X_scVI", n_neighbors=15) +sc.tl.umap(adata, min_dist=0.3) +sc.tl.leiden(adata, resolution=0.8, key_added="leiden") + +# Visualization +sc.pl.umap(adata, color=["leiden", "batch", "cell_type"]) + +# Differential expression +de_results = model.differential_expression( + groupby="leiden", + group1="0", + group2="1", + mode="change", + delta=0.25 +) +``` + +### 8. Model Persistence + +```python +# Save model +model_dir = "./scvi_model/" +model.save(model_dir, overwrite=True) + +# Save AnnData with results +adata.write("analyzed_data.h5ad") + +# Load model later +model = scvi.model.SCVI.load(model_dir, adata=adata) +``` + +## Hyperparameter Tuning + +### Key Hyperparameters + +**Architecture**: +- `n_latent`: Latent space dimensionality (10-50) + - Larger for complex, heterogeneous datasets + - Smaller for simple datasets or to prevent overfitting +- `n_layers`: Number of hidden layers (1-3) + - More layers for complex data, but diminishing returns +- `n_hidden`: Nodes per hidden layer (64-256) + - Scale with dataset size and complexity + +**Training**: +- `max_epochs`: Training iterations (200-500) + - Use early stopping to prevent overfitting +- `batch_size`: Samples per batch (64-256) + - Larger for big datasets, smaller for limited memory +- `lr`: Learning rate (0.001 default, usually good) + +**Model-specific**: +- `gene_likelihood`: Distribution ("zinb", "nb", "poisson") + - "zinb" for sparse data with zero-inflation + - "nb" for less sparse data +- `dispersion`: Gene or gene-batch specific + - "gene" for simple, "gene-batch" for complex batch effects + +### Hyperparameter Search Example + +```python +from scvi.model import SCVI + +# Define search space +latent_dims = [10, 20, 30] +n_layers_options = [1, 2] + +best_score = float('-inf') +best_params = None + +for n_latent in latent_dims: + for n_layers in n_layers_options: + model = SCVI( + adata, + n_latent=n_latent, + n_layers=n_layers + ) + model.train(max_epochs=200) + + # Evaluate on validation set + val_elbo = model.history["elbo_validation"][-1] + + if val_elbo > best_score: + best_score = val_elbo + best_params = {"n_latent": n_latent, "n_layers": n_layers} + +print(f"Best params: {best_params}") +``` + +### Using Optuna for Hyperparameter Optimization + +```python +import optuna + +def objective(trial): + n_latent = trial.suggest_int("n_latent", 10, 50) + n_layers = trial.suggest_int("n_layers", 1, 3) + n_hidden = trial.suggest_categorical("n_hidden", [64, 128, 256]) + + model = scvi.model.SCVI( + adata, + n_latent=n_latent, + n_layers=n_layers, + n_hidden=n_hidden + ) + + model.train(max_epochs=200, early_stopping=True) + return model.history["elbo_validation"][-1] + +study = optuna.create_study(direction="maximize") +study.optimize(objective, n_trials=20) + +print(f"Best parameters: {study.best_params}") +``` + +## GPU Acceleration + +### Enable GPU Training + +```python +# Automatic GPU detection +model = scvi.model.SCVI(adata) +model.train(accelerator="auto") # Uses GPU if available + +# Force GPU +model.train(accelerator="gpu") + +# Multi-GPU +model.train(accelerator="gpu", devices=2) + +# Check if GPU is being used +import torch +print(f"CUDA available: {torch.cuda.is_available()}") +print(f"GPU count: {torch.cuda.device_count()}") +``` + +### GPU Memory Management + +```python +# Reduce batch size if OOM +model.train(batch_size=64) # Instead of default 128 + +# Mixed precision training (saves memory) +model.train(precision=16) + +# Clear cache between runs +import torch +torch.cuda.empty_cache() +``` + +## Batch Integration Strategies + +### Strategy 1: Simple Batch Key + +```python +# For standard batch correction +scvi.model.SCVI.setup_anndata(adata, batch_key="batch") +model = scvi.model.SCVI(adata) +``` + +### Strategy 2: Multiple Covariates + +```python +# Correct for multiple technical factors +scvi.model.SCVI.setup_anndata( + adata, + batch_key="sequencing_batch", + categorical_covariate_keys=["donor", "tissue"], + continuous_covariate_keys=["percent_mito"] +) +``` + +### Strategy 3: Hierarchical Batches + +```python +# When batches have hierarchical structure +# E.g., samples within studies +adata.obs["batch_hierarchy"] = ( + adata.obs["study"].astype(str) + "_" + + adata.obs["sample"].astype(str) +) + +scvi.model.SCVI.setup_anndata(adata, batch_key="batch_hierarchy") +``` + +## Reference Mapping (scArches) + +### Training Reference Model + +```python +# Train on reference dataset +scvi.model.SCVI.setup_anndata(ref_adata, batch_key="batch") +ref_model = scvi.model.SCVI(ref_adata) +ref_model.train() + +# Save reference +ref_model.save("reference_model") +``` + +### Mapping Query to Reference + +```python +# Load reference +ref_model = scvi.model.SCVI.load("reference_model", adata=ref_adata) + +# Setup query with same parameters +scvi.model.SCVI.setup_anndata(query_adata, batch_key="batch") + +# Transfer learning +query_model = scvi.model.SCVI.load_query_data( + query_adata, + "reference_model" +) + +# Fine-tune on query (optional) +query_model.train(max_epochs=200) + +# Get query embeddings +query_latent = query_model.get_latent_representation() + +# Transfer labels using KNN +from sklearn.neighbors import KNeighborsClassifier + +knn = KNeighborsClassifier(n_neighbors=15) +knn.fit(ref_model.get_latent_representation(), ref_adata.obs["cell_type"]) +query_adata.obs["predicted_cell_type"] = knn.predict(query_latent) +``` + +## Model Minification + +Reduce model size for faster inference: + +```python +# Train full model +model = scvi.model.SCVI(adata) +model.train() + +# Minify for deployment +minified = model.minify_adata(adata) + +# Save minified version +minified.write("minified_data.h5ad") +model.save("minified_model") + +# Load and use (much faster) +mini_model = scvi.model.SCVI.load("minified_model", adata=minified) +``` + +## Memory-Efficient Data Loading + +### Using AnnDataLoader + +```python +from scvi.data import AnnDataLoader + +# For very large datasets +dataloader = AnnDataLoader( + adata, + batch_size=128, + shuffle=True, + drop_last=False +) + +# Custom training loop (advanced) +for batch in dataloader: + # Process batch + pass +``` + +### Using Backed AnnData + +```python +# For data too large for memory +adata = sc.read_h5ad("huge_dataset.h5ad", backed='r') + +# scvi-tools works with backed mode +scvi.model.SCVI.setup_anndata(adata) +model = scvi.model.SCVI(adata) +model.train() +``` + +## Model Interpretation + +### Feature Importance with SHAP + +```python +import shap + +# Get SHAP values for interpretability +explainer = shap.DeepExplainer(model.module, background_data) +shap_values = explainer.shap_values(test_data) + +# Visualize +shap.summary_plot(shap_values, feature_names=adata.var_names) +``` + +### Gene Correlation Analysis + +```python +# Get gene-gene correlation matrix +correlation = model.get_feature_correlation_matrix( + adata, + transform_batch="batch1" +) + +# Visualize top correlated genes +import seaborn as sns +sns.heatmap(correlation[:50, :50], cmap="coolwarm") +``` + +## Troubleshooting Common Issues + +### Issue: NaN Loss During Training + +**Causes**: +- Learning rate too high +- Unnormalized input (must use raw counts) +- Data quality issues + +**Solutions**: +```python +# Reduce learning rate +model.train(lr=0.0001) + +# Check data +assert adata.X.min() >= 0 # No negative values +assert np.isnan(adata.X).sum() == 0 # No NaNs + +# Use more stable likelihood +model = scvi.model.SCVI(adata, gene_likelihood="nb") +``` + +### Issue: Poor Batch Correction + +**Solutions**: +```python +# Increase batch effect modeling +model = scvi.model.SCVI( + adata, + encode_covariates=True, # Encode batch in encoder + deeply_inject_covariates=False +) + +# Or try opposite +model = scvi.model.SCVI(adata, deeply_inject_covariates=True) + +# Use more latent dimensions +model = scvi.model.SCVI(adata, n_latent=50) +``` + +### Issue: Model Not Training (ELBO Not Decreasing) + +**Solutions**: +```python +# Increase learning rate +model.train(lr=0.005) + +# Increase network capacity +model = scvi.model.SCVI(adata, n_hidden=256, n_layers=2) + +# Train longer +model.train(max_epochs=500) +``` + +### Issue: Out of Memory (OOM) + +**Solutions**: +```python +# Reduce batch size +model.train(batch_size=64) + +# Use mixed precision +model.train(precision=16) + +# Reduce model size +model = scvi.model.SCVI(adata, n_latent=10, n_hidden=64) + +# Use backed AnnData +adata = sc.read_h5ad("data.h5ad", backed='r') +``` + +## Performance Benchmarking + +```python +import time + +# Time training +start = time.time() +model.train(max_epochs=400) +training_time = time.time() - start +print(f"Training time: {training_time:.2f}s") + +# Time inference +start = time.time() +latent = model.get_latent_representation() +inference_time = time.time() - start +print(f"Inference time: {inference_time:.2f}s") + +# Memory usage +import psutil +import os +process = psutil.Process(os.getpid()) +memory_gb = process.memory_info().rss / 1024**3 +print(f"Memory usage: {memory_gb:.2f} GB") +``` + +## Best Practices Summary + +1. **Always use raw counts**: Never log-normalize before scvi-tools +2. **Feature selection**: Use highly variable genes for efficiency +3. **Batch correction**: Register all known technical covariates +4. **Early stopping**: Use validation set to prevent overfitting +5. **Model saving**: Always save trained models +6. **GPU usage**: Use GPU for large datasets (>10k cells) +7. **Hyperparameter tuning**: Start with defaults, tune if needed +8. **Validation**: Check batch correction visually (UMAP colored by batch) +9. **Documentation**: Keep track of preprocessing steps +10. **Reproducibility**: Set random seeds (`scvi.settings.seed = 0`)