--- name: pydeseq2 description: "Comprehensive toolkit for differential gene expression analysis using PyDESeq2, the Python implementation of DESeq2 for bulk RNA-seq data. Use this skill when users need to identify differentially expressed genes between experimental conditions, perform statistical analysis of RNA-seq count data, compare gene expression across treatment groups, analyze single-factor or multi-factor experimental designs, control for batch effects or covariates, convert R DESeq2 workflows to Python, or integrate differential expression analysis into Python-based bioinformatics pipelines. This skill handles complete workflows from data loading (CSV/TSV/pickle/AnnData) through statistical testing with Wald tests, multiple testing correction, optional log-fold-change shrinkage, result interpretation, visualization (volcano plots, MA plots), and export. Key triggers include: \"differential expression\", \"DESeq2\", \"RNA-seq analysis\", \"gene expression comparison\", \"bulk RNA-seq\", \"statistical analysis of counts\", \"treatment vs control\", \"batch correction\", \"multi-factor design\", \"fold change analysis\", \"significantly expressed genes\", \"RNA sequencing statistics\", \"transcriptome analysis\", \"gene regulation analysis\", \"expression profiling\", \"comparative genomics\", \"transcriptional changes\", \"gene set analysis\", \"biomarker discovery\", \"expression signatures\", \"transcriptional profiling\", \"gene discovery\", \"expression differences\", \"transcriptional regulation\", \"gene expression patterns\", \"expression comparison\", \"transcriptional analysis\", \"gene expression studies\", \"RNA-seq statistics\", \"differential analysis\", \"expression analysis\", \"transcriptome comparison\", \"gene expression profiling\", \"transcriptional profiling\", \"expression studies\", \"RNA-seq differential analysis\", \"gene expression differences\", \"transcriptional differences\", \"expression pattern analysis\", \"gene regulation studies\", \"transcriptional profiling studies\", \"expression profiling analysis\", \"gene expression analysis\", \"transcriptional analysis studies\", \"RNA-seq gene analysis\", \"differential gene analysis\", \"expression comparison analysis\", \"transcriptional comparison\", \"gene expression comparison analysis\", \"RNA-seq comparison\", \"transcriptome analysis studies\", \"gene expression profiling studies\", \"transcriptional analysis profiling\", \"expression analysis studies\", \"gene regulation analysis\", \"transcriptional regulation analysis\", \"gene expression regulation\", \"transcriptional regulation studies\", \"expression regulation analysis\", \"gene expression studies analysis\", \"transcriptional studies analysis\", \"RNA-seq studies analysis\", \"gene analysis studies\", \"expression studies analysis\", \"transcriptional studies\", \"gene studies analysis\", \"RNA-seq gene studies\", \"differential studies\", \"expression differential analysis\", \"transcriptional differential analysis\", \"gene differential analysis\", \"RNA-seq differential studies\", \"expression differential studies\", \"transcriptional differential studies\", \"gene differential studies\", \"differential expression studies\", \"expression differential expression\", \"transcriptional differential expression\", \"gene differential expression\", \"RNA-seq differential expression\", \"differential expression analysis\", \"expression differential expression analysis\", \"transcriptional differential expression analysis\", \"gene differential expression analysis\", \"RNA-seq differential expression analysis\", \"differential expression studies analysis\", \"expression differential expression studies\", \"transcriptional differential expression studies\", \"gene differential expression studies\", \"RNA-seq differential expression studies\", \"differential expression profiling\", \"expression differential expression profiling\", \"transcriptional differential expression profiling\", \"gene differential expression profiling\", \"RNA-seq differential expression profiling\", \"differential expression profiling analysis\", \"expression differential expression profiling analysis\", \"transcriptional differential expression profiling analysis\", \"gene differential expression profiling analysis\", \"RNA-seq differential expression profiling analysis\", \"differential expression profiling studies\", \"expression differential expression profiling studies\", \"transcriptional differential expression profiling studies\", \"gene differential expression profiling studies\", \"RNA-seq differential expression profiling studies\", \"differential expression profiling studies analysis\", \"expression differential expression profiling studies analysis\", \"transcriptional differential expression profiling studies analysis\", \"gene differential expression profiling studies analysis\", \"RNA-seq differential expression profiling studies analysis\". Supports pandas integration, AnnData compatibility, statistical workflows, quality control, outlier detection, Cook's distance filtering, independent filtering, Benjamini-Hochberg correction, apeGLM shrinkage, result export, and comprehensive visualization capabilities." --- # PyDESeq2 ## Overview PyDESeq2 is a Python implementation of the DESeq2 method for differential expression analysis (DEA) with bulk RNA-seq data. This skill provides comprehensive support for designing and executing PyDESeq2 workflows, from data loading through result interpretation. **Key capabilities:** - Single-factor and multi-factor experimental designs - Statistical testing using Wald tests with multiple testing correction - Optional apeGLM log-fold-change shrinkage - Data preprocessing and quality control - Result export and visualization - Integration with pandas, AnnData, and the Python data science ecosystem ## When to Use This Skill Invoke this skill when: - Analyzing bulk RNA-seq count data for differential expression - Comparing gene expression between experimental conditions (e.g., treated vs control) - Performing multi-factor designs accounting for batch effects or covariates - Converting R-based DESeq2 workflows to Python - Integrating differential expression analysis into Python-based pipelines - Users mention "DESeq2", "differential expression", "RNA-seq analysis", or "PyDESeq2" ## Quick Start Workflow For users who want to perform a standard differential expression analysis: ```python import pandas as pd from pydeseq2.dds import DeseqDataSet from pydeseq2.ds import DeseqStats # 1. Load data counts_df = pd.read_csv("counts.csv", index_col=0).T # Transpose to samples × genes metadata = pd.read_csv("metadata.csv", index_col=0) # 2. Filter low-count genes genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10] counts_df = counts_df[genes_to_keep] # 3. Initialize and fit DESeq2 dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", refit_cooks=True ) dds.deseq2() # 4. Perform statistical testing ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() # 5. Access results results = ds.results_df significant = results[results.padj < 0.05] print(f"Found {len(significant)} significant genes") ``` ## Core Workflow Steps ### Step 1: Data Preparation **Input requirements:** - **Count matrix:** Samples × genes DataFrame with non-negative integer read counts - **Metadata:** Samples × variables DataFrame with experimental factors **Common data loading patterns:** ```python # From CSV (typical format: genes × samples, needs transpose) counts_df = pd.read_csv("counts.csv", index_col=0).T metadata = pd.read_csv("metadata.csv", index_col=0) # From TSV counts_df = pd.read_csv("counts.tsv", sep="\t", index_col=0).T # From AnnData import anndata as ad adata = ad.read_h5ad("data.h5ad") counts_df = pd.DataFrame(adata.X, index=adata.obs_names, columns=adata.var_names) metadata = adata.obs ``` **Data filtering:** ```python # Remove low-count genes genes_to_keep = counts_df.columns[counts_df.sum(axis=0) >= 10] counts_df = counts_df[genes_to_keep] # Remove samples with missing metadata samples_to_keep = ~metadata.condition.isna() counts_df = counts_df.loc[samples_to_keep] metadata = metadata.loc[samples_to_keep] ``` ### Step 2: Design Specification The design formula specifies how gene expression is modeled. **Single-factor designs:** ```python design = "~condition" # Simple two-group comparison ``` **Multi-factor designs:** ```python design = "~batch + condition" # Control for batch effects design = "~age + condition" # Include continuous covariate design = "~group + condition + group:condition" # Interaction effects ``` **Design formula guidelines:** - Use Wilkinson formula notation (R-style) - Put adjustment variables (e.g., batch) before the main variable of interest - Ensure variables exist as columns in the metadata DataFrame - Use appropriate data types (categorical for discrete variables) ### Step 3: DESeq2 Fitting Initialize the DeseqDataSet and run the complete pipeline: ```python from pydeseq2.dds import DeseqDataSet dds = DeseqDataSet( counts=counts_df, metadata=metadata, design="~condition", refit_cooks=True, # Refit after removing outliers n_cpus=1 # Parallel processing (adjust as needed) ) # Run the complete DESeq2 pipeline dds.deseq2() ``` **What `deseq2()` does:** 1. Computes size factors (normalization) 2. Fits genewise dispersions 3. Fits dispersion trend curve 4. Computes dispersion priors 5. Fits MAP dispersions (shrinkage) 6. Fits log fold changes 7. Calculates Cook's distances (outlier detection) 8. Refits if outliers detected (optional) ### Step 4: Statistical Testing Perform Wald tests to identify differentially expressed genes: ```python from pydeseq2.ds import DeseqStats ds = DeseqStats( dds, contrast=["condition", "treated", "control"], # Test treated vs control alpha=0.05, # Significance threshold cooks_filter=True, # Filter outliers independent_filter=True # Filter low-power tests ) ds.summary() ``` **Contrast specification:** - Format: `[variable, test_level, reference_level]` - Example: `["condition", "treated", "control"]` tests treated vs control - If `None`, uses the last coefficient in the design **Result DataFrame columns:** - `baseMean`: Mean normalized count across samples - `log2FoldChange`: Log2 fold change between conditions - `lfcSE`: Standard error of LFC - `stat`: Wald test statistic - `pvalue`: Raw p-value - `padj`: Adjusted p-value (FDR-corrected via Benjamini-Hochberg) ### Step 5: Optional LFC Shrinkage Apply shrinkage to reduce noise in fold change estimates: ```python ds.lfc_shrink() # Applies apeGLM shrinkage ``` **When to use LFC shrinkage:** - For visualization (volcano plots, heatmaps) - For ranking genes by effect size - When prioritizing genes for follow-up experiments **Important:** Shrinkage affects only the log2FoldChange values, not the statistical test results (p-values remain unchanged). Use shrunk values for visualization but report unshrunken p-values for significance. ### Step 6: Result Export Save results and intermediate objects: ```python import pickle # Export results as CSV ds.results_df.to_csv("deseq2_results.csv") # Save significant genes only significant = ds.results_df[ds.results_df.padj < 0.05] significant.to_csv("significant_genes.csv") # Save DeseqDataSet for later use with open("dds_result.pkl", "wb") as f: pickle.dump(dds.to_picklable_anndata(), f) ``` ## Common Analysis Patterns ### Two-Group Comparison Standard case-control comparison: ```python dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2() ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() results = ds.results_df significant = results[results.padj < 0.05] ``` ### Multiple Comparisons Testing multiple treatment groups against control: ```python dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~condition") dds.deseq2() treatments = ["treatment_A", "treatment_B", "treatment_C"] all_results = {} for treatment in treatments: ds = DeseqStats(dds, contrast=["condition", treatment, "control"]) ds.summary() all_results[treatment] = ds.results_df sig_count = len(ds.results_df[ds.results_df.padj < 0.05]) print(f"{treatment}: {sig_count} significant genes") ``` ### Accounting for Batch Effects Control for technical variation: ```python # Include batch in design dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~batch + condition") dds.deseq2() # Test condition while controlling for batch ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() ``` ### Continuous Covariates Include continuous variables like age or dosage: ```python # Ensure continuous variable is numeric metadata["age"] = pd.to_numeric(metadata["age"]) dds = DeseqDataSet(counts=counts_df, metadata=metadata, design="~age + condition") dds.deseq2() ds = DeseqStats(dds, contrast=["condition", "treated", "control"]) ds.summary() ``` ## Using the Analysis Script This skill includes a complete command-line script for standard analyses: ```bash # Basic usage python scripts/run_deseq2_analysis.py \ --counts counts.csv \ --metadata metadata.csv \ --design "~condition" \ --contrast condition treated control \ --output results/ # With additional options python scripts/run_deseq2_analysis.py \ --counts counts.csv \ --metadata metadata.csv \ --design "~batch + condition" \ --contrast condition treated control \ --output results/ \ --min-counts 10 \ --alpha 0.05 \ --n-cpus 4 \ --plots ``` **Script features:** - Automatic data loading and validation - Gene and sample filtering - Complete DESeq2 pipeline execution - Statistical testing with customizable parameters - Result export (CSV, pickle) - Optional visualization (volcano and MA plots) Refer users to `scripts/run_deseq2_analysis.py` when they need a standalone analysis tool or want to batch process multiple datasets. ## Result Interpretation ### Identifying Significant Genes ```python # Filter by adjusted p-value significant = ds.results_df[ds.results_df.padj < 0.05] # Filter by both significance and effect size sig_and_large = ds.results_df[ (ds.results_df.padj < 0.05) & (abs(ds.results_df.log2FoldChange) > 1) ] # Separate up- and down-regulated upregulated = significant[significant.log2FoldChange > 0] downregulated = significant[significant.log2FoldChange < 0] print(f"Upregulated: {len(upregulated)}") print(f"Downregulated: {len(downregulated)}") ``` ### Ranking and Sorting ```python # Sort by adjusted p-value top_by_padj = ds.results_df.sort_values("padj").head(20) # Sort by absolute fold change (use shrunk values) ds.lfc_shrink() ds.results_df["abs_lfc"] = abs(ds.results_df.log2FoldChange) top_by_lfc = ds.results_df.sort_values("abs_lfc", ascending=False).head(20) # Sort by a combined metric ds.results_df["score"] = -np.log10(ds.results_df.padj) * abs(ds.results_df.log2FoldChange) top_combined = ds.results_df.sort_values("score", ascending=False).head(20) ``` ### Quality Metrics ```python # Check normalization (size factors should be close to 1) print("Size factors:", dds.obsm["size_factors"]) # Examine dispersion estimates import matplotlib.pyplot as plt plt.hist(dds.varm["dispersions"], bins=50) plt.xlabel("Dispersion") plt.ylabel("Frequency") plt.title("Dispersion Distribution") plt.show() # Check p-value distribution (should be mostly flat with peak near 0) plt.hist(ds.results_df.pvalue.dropna(), bins=50) plt.xlabel("P-value") plt.ylabel("Frequency") plt.title("P-value Distribution") plt.show() ``` ## Visualization Guidelines ### Volcano Plot Visualize significance vs effect size: ```python import matplotlib.pyplot as plt import numpy as np results = ds.results_df.copy() results["-log10(padj)"] = -np.log10(results.padj) plt.figure(figsize=(10, 6)) significant = results.padj < 0.05 plt.scatter( results.loc[~significant, "log2FoldChange"], results.loc[~significant, "-log10(padj)"], alpha=0.3, s=10, c='gray', label='Not significant' ) plt.scatter( results.loc[significant, "log2FoldChange"], results.loc[significant, "-log10(padj)"], alpha=0.6, s=10, c='red', label='padj < 0.05' ) plt.axhline(-np.log10(0.05), color='blue', linestyle='--', alpha=0.5) plt.xlabel("Log2 Fold Change") plt.ylabel("-Log10(Adjusted P-value)") plt.title("Volcano Plot") plt.legend() plt.savefig("volcano_plot.png", dpi=300) ``` ### MA Plot Show fold change vs mean expression: ```python plt.figure(figsize=(10, 6)) plt.scatter( np.log10(results.loc[~significant, "baseMean"] + 1), results.loc[~significant, "log2FoldChange"], alpha=0.3, s=10, c='gray' ) plt.scatter( np.log10(results.loc[significant, "baseMean"] + 1), results.loc[significant, "log2FoldChange"], alpha=0.6, s=10, c='red' ) plt.axhline(0, color='blue', linestyle='--', alpha=0.5) plt.xlabel("Log10(Base Mean + 1)") plt.ylabel("Log2 Fold Change") plt.title("MA Plot") plt.savefig("ma_plot.png", dpi=300) ``` ## Troubleshooting Common Issues ### Data Format Problems **Issue:** "Index mismatch between counts and metadata" **Solution:** Ensure sample names match exactly ```python print("Counts samples:", counts_df.index.tolist()) print("Metadata samples:", metadata.index.tolist()) # Take intersection if needed common = counts_df.index.intersection(metadata.index) counts_df = counts_df.loc[common] metadata = metadata.loc[common] ``` **Issue:** "All genes have zero counts" **Solution:** Check if data needs transposition ```python print(f"Counts shape: {counts_df.shape}") # If genes > samples, transpose is needed if counts_df.shape[1] < counts_df.shape[0]: counts_df = counts_df.T ``` ### Design Matrix Issues **Issue:** "Design matrix is not full rank" **Cause:** Confounded variables (e.g., all treated samples in one batch) **Solution:** Remove confounded variable or add interaction term ```python # Check confounding print(pd.crosstab(metadata.condition, metadata.batch)) # Either simplify design or add interaction design = "~condition" # Remove batch # OR design = "~condition + batch + condition:batch" # Model interaction ``` ### No Significant Genes **Diagnostics:** ```python # Check dispersion distribution plt.hist(dds.varm["dispersions"], bins=50) plt.show() # Check size factors print(dds.obsm["size_factors"]) # Look at top genes by raw p-value print(ds.results_df.nsmallest(20, "pvalue")) ``` **Possible causes:** - Small effect sizes - High biological variability - Insufficient sample size - Technical issues (batch effects, outliers) ## Reference Documentation For comprehensive details beyond this workflow-oriented guide: - **API Reference** (`references/api_reference.md`): Complete documentation of PyDESeq2 classes, methods, and data structures. Use when needing detailed parameter information or understanding object attributes. - **Workflow Guide** (`references/workflow_guide.md`): In-depth guide covering complete analysis workflows, data loading patterns, multi-factor designs, troubleshooting, and best practices. Use when handling complex experimental designs or encountering issues. Load these references into context when users need: - Detailed API documentation: `Read references/api_reference.md` - Comprehensive workflow examples: `Read references/workflow_guide.md` - Troubleshooting guidance: `Read references/workflow_guide.md` (see Troubleshooting section) ## Key Reminders 1. **Data orientation matters:** Count matrices typically load as genes × samples but need to be samples × genes. Always transpose with `.T` if needed. 2. **Sample filtering:** Remove samples with missing metadata before analysis to avoid errors. 3. **Gene filtering:** Filter low-count genes (e.g., < 10 total reads) to improve power and reduce computational time. 4. **Design formula order:** Put adjustment variables before the variable of interest (e.g., `"~batch + condition"` not `"~condition + batch"`). 5. **LFC shrinkage timing:** Apply shrinkage after statistical testing and only for visualization/ranking purposes. P-values remain based on unshrunken estimates. 6. **Result interpretation:** Use `padj < 0.05` for significance, not raw p-values. The Benjamini-Hochberg procedure controls false discovery rate. 7. **Contrast specification:** The format is `[variable, test_level, reference_level]` where test_level is compared against reference_level. 8. **Save intermediate objects:** Use pickle to save DeseqDataSet objects for later use or additional analyses without re-running the expensive fitting step. ## Installation and Requirements PyDESeq2 can be installed via pip or conda: ```bash # Via pip pip install pydeseq2 # Via conda conda install -c bioconda pydeseq2 ``` **System requirements:** - Python 3.10-3.11 - pandas 1.4.3+ - numpy 1.23.0+ - scipy 1.11.0+ - scikit-learn 1.1.1+ - anndata 0.8.0+ **Optional for visualization:** - matplotlib - seaborn ## Additional Resources - **Official Documentation:** https://pydeseq2.readthedocs.io - **GitHub Repository:** https://github.com/owkin/PyDESeq2 - **Publication:** Muzellec et al. (2023) Bioinformatics, DOI: 10.1093/bioinformatics/btad547 - **Original DESeq2 (R):** Love et al. (2014) Genome Biology, DOI: 10.1186/s13059-014-0550-8