# Biomni Task Examples

Comprehensive collection of biomedical task examples with code patterns and best practices.

## Table of Contents

1. [Single-Cell RNA-seq Analysis](#single-cell-rna-seq-analysis)
2. [CRISPR Screening](#crispr-screening)
3. [Genomic Analysis (GWAS, Variant Calling)](#genomic-analysis)
4. [Protein Structure and Function](#protein-structure-and-function)
5. [Drug Discovery and ADMET](#drug-discovery-and-admet)
6. [Pathway and Network Analysis](#pathway-and-network-analysis)
7. [Disease Classification](#disease-classification)
8. [Multi-Omics Integration](#multi-omics-integration)
9. [Proteomics Analysis](#proteomics-analysis)
10. [Biomarker Discovery](#biomarker-discovery)

---

## Single-Cell RNA-seq Analysis

### Basic scRNA-seq Pipeline

```python
from biomni.agent import A1

agent = A1(path='./data', llm='claude-sonnet-4-20250514')

agent.go("""
Analyze the 10X Genomics scRNA-seq dataset located at 'data/pbmc_10k.h5ad'.

Workflow:
1. Load the data and perform QC:
   - Filter cells with <200 genes or >5000 genes
   - Filter cells with >10% mitochondrial reads
   - Filter genes expressed in <3 cells
2. Normalize and identify highly variable genes:
   - Use SCTransform or standard log-normalization
   - Identify top 2000 HVGs
3. Dimensionality reduction:
   - PCA (50 components)
   - UMAP for visualization
4. Clustering:
   - Find neighbors (k=10)
   - Leiden clustering with resolution 0.5
5. Visualization:
   - UMAP colored by cluster
   - QC metrics on UMAP

Save processed data as 'results/pbmc_processed.h5ad'
""")
```

### Cell Type Annotation

```python
agent.go("""
Using the processed PBMC data at 'results/pbmc_processed.h5ad':

1. Find marker genes for each cluster:
   - Wilcoxon rank-sum test
   - Log fold change > 0.5
   - Adjusted p-value < 0.01
   - Present in >25% of cluster cells
2. Annotate cell types using markers:
   - T cells: CD3D, CD3E, CD3G
   - B cells: CD19, MS4A1 (CD20)
   - NK cells: GNLY, NKG7, NCAM1
   - Monocytes: CD14, LYZ, CD68
   - Dendritic cells: FCER1A, CD1C
3. Create visualization:
   - UMAP with cell type labels
   - Dotplot of marker genes by cell type
   - Proportion of cell types (bar plot)
4. Save annotated data with cell types
""")
```

### Differential Expression Between Conditions

```python
agent.go("""
Compare gene expression between stimulated and control conditions:

Data: 'data/immune_stim_experiment.h5ad' (contains 'condition' metadata)

Analysis:
1. Subset to T cells only (cell_type == 'T cell')
2. Differential expression between stim vs control:
   - Use pseudobulk approach (aggregate by donor + condition)
   - DESeq2 or edgeR for statistical testing
   - Filter: |log2FC| > 1, padj < 0.05
3. Pathway enrichment on DEGs:
   - Use GO biological processes
   - Use KEGG pathways
   - Run enrichment analysis with gprofiler or enrichr
4. Visualization:
   - Volcano plot of DEGs
   - Heatmap of top 50 DEGs
   - Bar plot of top enriched pathways
5. Export results table with gene symbols, log2FC, p-values, and pathway annotations
""")
```
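For orientation, the pseudobulk step that the prompt above asks the agent to perform can be sketched directly with scanpy and pandas. This is a minimal sketch rather than Biomni's internal implementation, and it assumes raw counts in `.X` plus `donor`, `condition`, and `cell_type` columns in `.obs` (those column names are assumptions about the dataset).

```python
import pandas as pd
import scanpy as sc

# Load the experiment and keep only T cells, mirroring step 1 of the prompt
adata = sc.read_h5ad("data/immune_stim_experiment.h5ad")
tcells = adata[adata.obs["cell_type"] == "T cell"].copy()

# Dense cells x genes count matrix (assumes raw counts are stored in .X)
counts = pd.DataFrame(
    tcells.X.toarray() if hasattr(tcells.X, "toarray") else tcells.X,
    index=tcells.obs_names,
    columns=tcells.var_names,
)

# Sum counts per donor + condition to form pseudobulk samples for DESeq2/edgeR
groups = tcells.obs.groupby(["donor", "condition"]).groups
pseudobulk = pd.DataFrame(
    {f"{donor}_{cond}": counts.loc[cells].sum(axis=0) for (donor, cond), cells in groups.items()}
)
pseudobulk.to_csv("results/tcell_pseudobulk_counts.csv")  # genes x pseudobulk samples
```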
### Trajectory Analysis

```python
agent.go("""
Perform pseudotime trajectory analysis on hematopoietic differentiation data:

Data: 'data/hematopoiesis.h5ad'

Steps:
1. Subset to progenitor and mature cell types:
   - HSC, MPP, GMP, Monocytes, Neutrophils
2. Run trajectory inference:
   - Use PAGA or Monocle3
   - Set HSC as root cell type
3. Calculate pseudotime for all cells
4. Identify trajectory-associated genes:
   - Genes that change along pseudotime
   - Statistical test with FDR < 0.05
   - Cluster genes by expression pattern (early, middle, late)
5. Visualization:
   - UMAP colored by pseudotime
   - Heatmap of trajectory genes
   - Gene expression along pseudotime for key TFs
6. Functional analysis:
   - GO enrichment for early/middle/late gene clusters
""")
```

### Integration of Multiple Datasets

```python
agent.go("""
Integrate three scRNA-seq datasets from different batches:

Data files:
- 'data/batch1_pbmc.h5ad'
- 'data/batch2_pbmc.h5ad'
- 'data/batch3_pbmc.h5ad'

Integration workflow:
1. Load all three datasets
2. Perform individual QC on each batch:
   - Same filters as standard QC
   - Note batch-specific statistics
3. Integration using Harmony or Scanorama:
   - Concatenate datasets
   - Identify HVGs on combined data
   - Run batch correction
   - Verify batch mixing with LISI score
4. Re-cluster integrated data:
   - Use corrected embeddings
   - Leiden clustering
5. Cell type annotation on integrated data
6. Visualization:
   - UMAP split by batch (before/after correction)
   - UMAP colored by cell type
   - Batch mixing statistics
7. Save integrated dataset
""")
```

---

## CRISPR Screening

### Guide RNA Design

```python
agent.go("""
Design guide RNAs for CRISPR knockout screening of cell cycle genes:

Target genes:
- CDK1, CDK2, CDK4, CDK6
- CCNA2, CCNB1, CCND1, CCNE1
- TP53, RB1, MYC

Requirements:
1. Design 4-6 guides per gene targeting early exons
2. For each guide, evaluate:
   - On-target efficiency score (Doench 2016)
   - Off-target potential (CFD score < 0.3)
   - Avoid common SNPs (1000 Genomes)
3. Add control guides:
   - 100 non-targeting controls
   - 20 positive controls (essential genes)
4. Output:
   - Table with: gene, guide_sequence, PAM, position, on_target_score, off_target_count
   - Sequences in format for oligonucleotide ordering
   - Visual summary of guide distribution per gene

Reference genome: hg38
""")
```

### CRISPR Screen Analysis

```python
agent.go("""
Analyze data from a genome-wide CRISPR knockout screen:

Data: 'data/crispr_screen_counts.csv'
- Columns: guide_id, gene, sample_T0, sample_T15, replicate
- ~80,000 guides targeting ~18,000 genes

Analysis:
1. Quality control:
   - Guide representation (reads per guide)
   - Sample correlation
   - Remove guides with <30 reads in T0
2. Normalize counts:
   - Reads per million (RPM)
   - Log2 fold change (T15 vs T0)
3. Statistical analysis using MAGeCK:
   - Identify significantly depleted/enriched genes
   - FDR < 0.05
   - Rank genes by robust rank aggregation (RRA)
4. Functional analysis:
   - Pathway enrichment of hit genes
   - Known vs novel essential genes
   - Correlation with Cancer Dependency Map
5. Visualization:
   - Scatterplot: log2FC vs -log10(FDR)
   - Heatmap: top 50 depleted genes across replicates
   - Network: PPI network of hit genes
6. Export:
   - Ranked gene list with statistics
   - Enriched pathways table
""")
```
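To make the normalization step above concrete, a first-pass reads-per-million and guide-level log2 fold-change calculation could look like the sketch below. Column names follow the CSV layout described in the prompt but are assumptions about the actual file; MAGeCK would still be used for the formal gene-level statistics.

```python
import numpy as np
import pandas as pd

counts = pd.read_csv("data/crispr_screen_counts.csv")
counts = counts[counts["sample_T0"] >= 30]  # drop guides poorly represented at T0

# Reads-per-million normalization per sample
for col in ["sample_T0", "sample_T15"]:
    counts[col + "_rpm"] = counts[col] / counts[col].sum() * 1e6

# Guide-level log2 fold change with a pseudocount for stability
pseudo = 1.0
counts["log2fc"] = np.log2((counts["sample_T15_rpm"] + pseudo) / (counts["sample_T0_rpm"] + pseudo))

# Quick gene-level view: median log2FC across guides (pre-MAGeCK sanity check)
gene_lfc = counts.groupby("gene")["log2fc"].median().sort_values()
print(gene_lfc.head(20))  # most depleted genes
```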
### Pooled Optical Screening Analysis

```python
agent.go("""
Analyze pooled CRISPR screen with imaging readout (e.g., Cell Painting):

Data structure:
- 'data/guide_assignments.csv': cell_id, guide_id, gene
- 'data/morphology_features.csv': cell_id, feature_1...feature_500

Analysis:
1. Feature preprocessing:
   - Remove low-variance features
   - Normalize features (z-score per plate)
   - PCA for dimensionality reduction
2. Associate phenotypes with perturbations:
   - Aggregate cells by guide (mean/median)
   - Calculate morphological distance from controls
   - Statistical test for phenotype change
3. Identify phenotype-altering genes:
   - Mahalanobis distance from control distribution
   - Bonferroni correction for multiple testing
   - Effect size threshold
4. Cluster genes by phenotype similarity:
   - Hierarchical clustering of gene profiles
   - Identify phenotype classes
5. Validation and interpretation:
   - Compare to known gene functions
   - Pathway enrichment per phenotype cluster
6. Visualization:
   - UMAP of all perturbations
   - Heatmap of gene clusters × morphology features
   - Representative images for each cluster
""")
```

---

## Genomic Analysis

### GWAS Analysis

```python
agent.go("""
Perform genome-wide association study for Type 2 Diabetes:

Data:
- 'data/genotypes.bed' (PLINK format, 500K SNPs, 5000 cases, 5000 controls)
- 'data/phenotypes.txt' (sample_id, T2D_status, age, sex, BMI, ancestry_PCs)

Workflow:
1. Quality control:
   - SNP QC: MAF > 0.01, HWE p > 1e-6, genotyping rate > 0.95
   - Sample QC: genotyping rate > 0.95, heterozygosity check
   - Remove related individuals (kinship > 0.125)
2. Association testing:
   - Logistic regression: T2D ~ SNP + age + sex + BMI + PC1-10
   - Genome-wide significance threshold: p < 5e-8
   - Suggestive threshold: p < 1e-5
3. Post-GWAS analysis:
   - LD clumping (r² > 0.1, 500kb window)
   - Annotate lead SNPs with nearby genes (±100kb)
   - Query GWAS Catalog for known associations
4. Functional annotation:
   - Overlap with regulatory elements (ENCODE)
   - eQTL colocalization (GTEx)
   - GWAS prioritization scores (PoPS, ABC)
5. Visualization:
   - Manhattan plot
   - QQ plot
   - Regional association plots for top loci
   - Locus zoom plots
6. Heritability and genetic correlation:
   - SNP heritability (LDSC)
   - Genetic correlation with related traits

Export summary statistics for meta-analysis
""")
```
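The association model in step 2 of the GWAS prompt is an ordinary covariate-adjusted logistic regression. The sketch below shows the per-SNP test with statsmodels, purely to illustrate the statistic; a real genome-wide run would use PLINK or regenie rather than looping over 500K SNPs in Python. The `geno` dosage vector and phenotype column names are assumptions matching the prompt.

```python
import pandas as pd
import statsmodels.api as sm

def test_snp(geno: pd.Series, pheno: pd.DataFrame) -> float:
    """Logistic regression: T2D_status ~ SNP dosage + age + sex + BMI + PC1-10."""
    covars = ["age", "sex", "BMI"] + [f"PC{i}" for i in range(1, 11)]
    X = sm.add_constant(pd.concat([geno.rename("snp"), pheno[covars]], axis=1))
    fit = sm.Logit(pheno["T2D_status"], X).fit(disp=0)
    return fit.pvalues["snp"]  # compare against the 5e-8 genome-wide threshold
```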
### Whole Exome Sequencing Analysis

```python
agent.go("""
Analyze whole exome sequencing data for rare disease diagnosis:

Data: Family trio (proband, mother, father)
- 'data/proband.bam'
- 'data/mother.bam'
- 'data/father.bam'

Phenotype: Developmental delay, seizures, intellectual disability

Pipeline:
1. Variant calling:
   - GATK HaplotypeCaller on each sample
   - Joint genotyping across trio
   - VQSR filtering (SNPs and indels separately)
2. Variant annotation:
   - Functional consequence (VEP or ANNOVAR)
   - Population frequencies (gnomAD)
   - Pathogenicity predictions (CADD, REVEL, SpliceAI)
   - Disease databases (ClinVar, OMIM)
3. Inheritance analysis:
   - De novo variants (absent in both parents)
   - Compound heterozygous variants
   - Rare homozygous variants (autozygosity)
   - X-linked variants (if proband is male)
4. Filtering strategy:
   - Population AF < 0.001 (gnomAD)
   - High-quality variants (GQ > 20, DP > 10)
   - Loss-of-function or missense with CADD > 20
   - Match phenotype to gene function (HPO terms)
5. Prioritization:
   - Known disease genes for phenotype
   - De novo in intolerant genes (pLI > 0.9)
   - Protein-truncating variants
6. Report:
   - Top candidate variants with evidence
   - Gene function and disease association
   - Segregation analysis
   - Recommended validation (Sanger sequencing)
   - ACMG variant classification

Save VCF with annotations and prioritized candidate list
""")
```

### Variant Calling from RNA-seq

```python
agent.go("""
Identify expressed variants from RNA-seq data:

Data: Tumor RNA-seq BAM file
- 'data/tumor_RNAseq.bam'
- Reference: hg38

Purpose: Identify expressed somatic mutations for neoantigen prediction

Steps:
1. Pre-processing:
   - Mark duplicates (Picard)
   - Split reads at junctions (GATK SplitNCigarReads)
   - Base quality recalibration
2. Variant calling:
   - GATK HaplotypeCaller (RNA-seq mode)
   - Filter: DP > 10, AF > 0.05
3. Filtering artifacts:
   - Remove common SNPs (gnomAD AF > 0.001)
   - Filter intronic/intergenic variants
   - Remove known RNA editing sites (RADAR database)
   - Panel of normals (if available)
4. Annotation:
   - Functional impact (VEP)
   - Identify non-synonymous variants
   - Predict MHC binding (NetMHCpan for patient HLA type)
5. Prioritize neoantigens:
   - Strong MHC binding (IC50 < 500nM)
   - High expression (TPM > 5)
   - High variant allele frequency
6. Output:
   - Annotated VCF
   - Neoantigen candidates table
   - Peptide sequences for validation

This requires patient HLA typing data
""")
```

---

## Protein Structure and Function

### Protein Structure Prediction and Analysis

```python
agent.go("""
Predict and analyze structure for novel protein sequence:

Sequence (FASTA format):
>Novel_Kinase_Domain
MKLLVVDDDGVADYSKRDGAFMVAYCIEPGDG...

Tasks:
1. Structure prediction:
   - Use AlphaFold2 or ESMFold
   - Generate 5 models, rank by confidence
2. Quality assessment:
   - pLDDT scores (per-residue confidence)
   - pTM score (global confidence)
   - Identify low-confidence regions
3. Domain identification:
   - InterProScan for domain architecture
   - Pfam domain search
   - Identify catalytic residues
4. Functional site prediction:
   - Active site prediction
   - Substrate binding pocket identification
   - Post-translational modification sites
5. Structural alignment:
   - Search for similar structures (PDB)
   - Align to close homologs
   - Identify conserved structural motifs
6. Mutation analysis:
   - Known disease mutations in homologs
   - Predict impact on structure (Rosetta ddG)
7. Visualization and output:
   - PyMOL/Chimera visualization scripts
   - Structural alignment figures
   - Annotated PDB file with functional sites
   - Summary report with predictions
""")
```
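One small, checkable piece of the quality-assessment step: AlphaFold-style model files store per-residue pLDDT in the B-factor column, so low-confidence regions can be flagged with a simple parse. The sketch below uses Biopython; the model file name is a placeholder for whatever the prediction step produces.

```python
from Bio.PDB import PDBParser

# pLDDT is stored in the B-factor field of AlphaFold-style PDB output
structure = PDBParser(QUIET=True).get_structure("model", "results/novel_kinase_model.pdb")
plddt = [atom.get_bfactor() for atom in structure.get_atoms() if atom.get_name() == "CA"]
low_conf = [i + 1 for i, score in enumerate(plddt) if score < 50]
print(f"Mean pLDDT: {sum(plddt) / len(plddt):.1f}; residues with pLDDT < 50: {len(low_conf)}")
```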
### Protein-Protein Interaction Prediction

```python
agent.go("""
Predict and validate protein-protein interactions:

Target protein: BRCA1
Species: Human

Analysis:
1. Literature-based interactions:
   - Query BioGRID, STRING, IntAct databases
   - Extract high-confidence interactors (score > 0.7)
2. Structure-based prediction:
   - Predict BRCA1 structure (if not available)
   - Dock with known interactors (BRCA2, BARD1, etc.)
   - Score interfaces (PISA, PDBePISA)
3. Sequence-based prediction:
   - Coevolution analysis (EVcouplings)
   - Domain-domain interaction prediction
   - Linear motif search (ELM database)
4. Functional analysis of interactors:
   - GO enrichment analysis
   - KEGG pathway membership
   - Tissue/cell type expression patterns
5. Network analysis:
   - Build PPI network
   - Identify network modules
   - Central hub proteins
6. Experimental validation suggestions:
   - Prioritize interactions for validation
   - Suggest Co-IP or Y2H experiments
   - Identify commercially available antibodies
7. Output:
   - Ranked interaction list with evidence
   - PPI network visualization
   - Structural models of key interactions
""")
```

### Protein Engineering Design

```python
agent.go("""
Design improved enzyme variant with enhanced thermostability:

Target enzyme: TEM-1 β-lactamase
Goal: Increase melting temperature by >10°C while maintaining activity

Strategy:
1. Analyze current structure:
   - Load PDB structure (1BTL)
   - Identify flexible regions (B-factors)
   - Find potential disulfide bond sites
2. Computational design:
   - Rosetta design simulations
   - Identify stabilizing mutations (ΔΔG < -1.0 kcal/mol)
   - Avoid active site and substrate binding regions
3. Prioritize mutations:
   - Surface entropy reduction (SER)
   - Disulfide bond introduction
   - Salt bridge formation
   - Hydrophobic core packing
4. Check conservation:
   - Multiple sequence alignment of β-lactamases
   - Avoid highly conserved positions
   - Prefer positions with natural variation
5. Design library:
   - Rank top 20 single mutants
   - Design 5 combinatorial variants (2-3 mutations)
   - Ensure codon optimization for E. coli
6. Validation plan:
   - Expression and purification protocol
   - Thermal shift assay (DSF)
   - Activity assay (nitrocefin)
   - Recommended high-throughput screening
7. Output:
   - Ranked mutation list with predicted ΔΔG
   - Structural visualizations
   - Codon-optimized sequences
   - Cloning primers
   - Experimental validation protocol
""")
```

---

## Drug Discovery and ADMET

### Virtual Screening

```python
agent.go("""
Perform virtual screening for SARS-CoV-2 Mpro inhibitors:

Target: SARS-CoV-2 Main protease (Mpro)
Crystal structure: PDB 6LU7

Compound library: ZINC15 drug-like subset (~100K compounds)
File: 'data/zinc_druglike_100k.smi' (SMILES format)

Workflow:
1. Protein preparation:
   - Remove crystallographic waters (keep catalytic waters)
   - Add hydrogens, optimize H-bond network
   - Define binding site (residues within 5Å of native ligand)
2. Ligand preparation:
   - Generate 3D coordinates from SMILES
   - Enumerate tautomers and protonation states
   - Energy minimization
3. Molecular docking:
   - Dock all compounds (AutoDock Vina or Glide)
   - Generate top 3 poses per compound
   - Score binding affinity
4. Consensus scoring:
   - Combine multiple scoring functions
   - Rank compounds by consensus score
5. ADMET filtering:
   - Lipinski's rule of 5
   - BBB permeability (not needed for this target)
   - hERG liability (pIC50 > 5)
   - CYP450 inhibition prediction
   - Toxicity prediction (Tox21)
6. Visual inspection:
   - Top 100 compounds
   - Check key interactions (His41, Cys145 catalytic dyad)
   - Remove PAINS and frequent hitters
7. Final selection:
   - Top 20 compounds for experimental testing
   - Cluster by scaffold diversity
8. Output:
   - Ranked compound list with scores and ADMET properties
   - Docking poses (mol2 or PDB format)
   - 2D interaction diagrams
   - Purchase availability from vendors
""")
```
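As a concrete example of the Lipinski filter in the ADMET step above, a plain RDKit pass over the SMILES library might look like this. It is a sketch of the pre-filter only (the remaining ADMET endpoints need dedicated models), and the one-violation allowance is a common convention rather than something the prompt specifies.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski

passing = []
with open("data/zinc_druglike_100k.smi") as fh:
    for line in fh:
        fields = line.split()
        if not fields:
            continue
        mol = Chem.MolFromSmiles(fields[0])
        if mol is None:
            continue  # skip unparsable SMILES
        violations = sum([
            Descriptors.MolWt(mol) > 500,
            Crippen.MolLogP(mol) > 5,
            Lipinski.NumHDonors(mol) > 5,
            Lipinski.NumHAcceptors(mol) > 10,
        ])
        if violations <= 1:  # allow at most one rule-of-5 violation
            passing.append(fields[0])

print(f"{len(passing)} compounds pass the Ro5 pre-filter")
```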
### ADMET Property Prediction

```python
agent.go("""
Predict ADMET properties for drug candidate series:

Input: 'data/compound_series.smi' (25 analogs, SMILES format)
Lead scaffold: Novel kinase inhibitor series

Properties to predict:
1. Absorption:
   - Caco-2 permeability
   - Human intestinal absorption (HIA)
   - P-glycoprotein substrate
2. Distribution:
   - Plasma protein binding (% bound)
   - Volume of distribution (VDss)
   - Blood-brain barrier permeability (LogBB)
3. Metabolism:
   - CYP450 substrate (1A2, 2C9, 2C19, 2D6, 3A4)
   - CYP450 inhibition (same isoforms)
   - Sites of metabolism (SOM prediction)
4. Excretion:
   - Clearance estimation
   - Half-life prediction
   - Renal excretion likelihood
5. Toxicity:
   - hERG inhibition (cardiotoxicity)
   - AMES mutagenicity
   - Hepatotoxicity
   - Skin sensitization
   - Rat acute toxicity (LD50)
6. Drug-likeness:
   - Lipinski's Ro5
   - QED score
   - Synthetic accessibility

Analysis:
- Compare all analogs in the series
- Structure-property relationships
- Identify best balanced compound
- Suggest modifications for improvement

Output:
- Comprehensive ADMET table
- Radar plots for each compound
- SAR analysis for each property
- Recommendations for next design iteration
""")
```

### Lead Optimization

```python
agent.go("""
Optimize lead compound balancing potency and selectivity:

Current lead:
- IC50 (target kinase): 50 nM
- IC50 (off-target kinases): 100-500 nM (poor selectivity)
- Microsomal stability: t1/2 = 20 min (too short)
- Solubility: 5 μM (low)

Goal: Maintain potency, improve selectivity (>100x), improve PK properties

Strategy:
1. Analyze current binding mode:
   - Docking to target and off-targets
   - Identify selectivity-determining residues
   - Map interaction hotspots
2. Design focused library:
   - Modifications to improve selectivity:
     * Target residues unique to on-target
     * Avoid conserved kinase regions
   - Modifications to improve solubility:
     * Add polar groups to solvent-exposed regions
     * Replace lipophilic groups
   - Modifications to improve metabolic stability:
     * Block metabolically labile positions
     * Replace metabolically unstable groups
3. Virtual enumeration:
   - Generate ~200 analogs
   - Predict binding affinity (docking)
   - Predict ADMET properties
4. Multi-parameter optimization:
   - Calculate MPO score (potency + selectivity + ADMET)
   - Pareto optimization
   - Select top 20 compounds
5. Clustering and diversity:
   - Ensure structural diversity
   - Test different modification strategies
6. Synthetic feasibility:
   - Retrosynthetic analysis
   - Flag difficult syntheses
   - Prioritize 10 compounds for synthesis
7. Deliverables:
   - Ranked compound designs
   - Predicted properties table
   - Binding mode visualizations
   - Synthetic routes
   - Recommended testing cascade
""")
```

---

## Pathway and Network Analysis

### Pathway Enrichment Analysis

```python
agent.go("""
Perform comprehensive pathway enrichment on differentially expressed genes:

Input: 'data/DEGs.csv'
Columns: gene_symbol, log2FC, padj
Significant DEGs: padj < 0.05, |log2FC| > 1
Total: 450 upregulated, 380 downregulated genes
Background: all detected genes in the experiment (~15,000)

Analysis:
1. GO enrichment (biological processes):
   - Test upregulated and downregulated genes separately
   - Use hypergeometric test
   - FDR correction (Benjamini-Hochberg)
   - Filter: padj < 0.05, fold enrichment > 2
2. KEGG pathway enrichment:
   - Same approach as GO
   - Focus on signaling and metabolic pathways
3. Reactome pathway enrichment:
   - More detailed pathway hierarchy
4. Disease association:
   - DisGeNET disease enrichment
   - Compare to disease gene signatures (MSigDB)
5. Transcription factor enrichment:
   - Predict upstream regulators (ChEA3)
   - ENCODE ChIP-seq enrichment
6. Drug/compound perturbations:
   - L1000 connectivity map
   - Identify drugs that reverse/mimic signature
7. Cross-pathway analysis:
   - Pathway crosstalk
   - Hierarchical clustering of pathways by gene overlap
   - Network visualization of enriched pathways
8. Visualization:
   - Dot plots (GO, KEGG, Reactome)
   - Enrichment map network
   - Chord diagram (genes-pathways)
   - Treemap of hierarchical GO terms
9. Export:
   - All enrichment tables
   - Pathway gene lists
   - Interactive HTML report
""")
```
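For reference, the hypergeometric test behind the GO/KEGG enrichment steps reduces to a single scipy call per term. The sketch below shows the statistic for one gene set; a full analysis would loop over all terms and apply Benjamini-Hochberg correction, as the prompt specifies, typically via gprofiler or enrichr.

```python
from scipy.stats import hypergeom

def enrichment_pvalue(degs: set, term_genes: set, background: set) -> float:
    """P(X >= k) where X ~ Hypergeom(N=background, K=term genes, n=DEGs tested)."""
    N = len(background)                          # ~15,000 detected genes
    K = len(term_genes & background)             # term genes present in background
    n = len(degs & background)                   # DEGs considered
    k = len(degs & term_genes & background)      # DEGs that hit the term
    return hypergeom.sf(k - 1, N, K, n)
```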
### Protein-Protein Interaction Network

```python
agent.go("""
Build and analyze PPI network for Alzheimer's disease genes:

Seed genes: Known AD risk genes (APP, PSEN1, PSEN2, APOE, MAPT, etc.)
File: 'data/AD_seed_genes.txt'

Network construction:
1. Build network from seed genes:
   - Query STRING database (confidence > 0.7)
   - Include direct and second-degree interactors
   - Maximum network size: 500 proteins
2. Network enrichment:
   - Add disease associations (DisGeNET)
   - Add tissue expression (GTEx - prioritize brain)
   - Add functional annotations (GO, Reactome)
3. Network analysis:
   - Calculate centrality measures:
     * Degree centrality
     * Betweenness centrality
     * Eigenvector centrality
   - Identify hub proteins
   - Community detection (Louvain algorithm)
4. Module analysis:
   - Functional enrichment per community
   - Identify disease-relevant modules
   - Key bridge proteins between modules
5. Druggability analysis:
   - Identify druggable targets (DGIdb)
   - Known drugs targeting network proteins
   - Clinical trial status
6. Network perturbation:
   - Simulate gene knockout
   - Network robustness analysis
   - Identify critical nodes
7. Visualization:
   - Interactive network (Cytoscape format)
   - Layout by module membership
   - Color by centrality/expression
   - Size by degree
8. Prioritization:
   - Rank proteins by:
     * Network centrality
     * Brain expression
     * Druggability
     * Genetic evidence (GWAS)
   - Top therapeutic targets

Output:
- Network file (GraphML, SIF)
- Module membership table
- Prioritized target list
- Druggable targets with existing compounds
""")
```
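The centrality and community-detection pieces of the network-analysis step map onto a few networkx calls, sketched below. The edge-list file is a hypothetical export of the STRING-derived network, and greedy modularity maximization stands in here for the Louvain algorithm named in the prompt.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Hypothetical whitespace-separated edge list exported from the STRING query
G = nx.read_edgelist("results/ad_ppi_edges.txt")

centrality = {
    "degree": nx.degree_centrality(G),
    "betweenness": nx.betweenness_centrality(G),
    "eigenvector": nx.eigenvector_centrality(G, max_iter=1000),
}
hubs = sorted(centrality["degree"], key=centrality["degree"].get, reverse=True)[:20]
communities = greedy_modularity_communities(G)
print(f"Top hub candidate: {hubs[0]}; communities detected: {len(communities)}")
```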
### Gene Regulatory Network Inference

```python
agent.go("""
Infer gene regulatory network from scRNA-seq data:

Data: 'data/development_timecourse.h5ad'
- Cells from 5 developmental timepoints
- 3000 HVGs quantified

Goal: Identify TF→target relationships during development

Methods:
1. Preprocessing:
   - Select TFs (from TF census list)
   - Select potential target genes (HVGs)
   - Normalize expression
2. GRN inference using multiple methods:
   - GENIE3 (random forest)
   - SCENIC (motif-based)
   - CellOracle (perturbation-based)
   - Pearson/Spearman correlation (baseline)
3. Integrate predictions:
   - Combine scores from multiple methods
   - Weight by motif evidence (JASPAR)
   - Filter low-confidence edges
4. Network refinement:
   - Remove indirect edges (transitive reduction)
   - Validate with ChIP-seq data (if available)
   - Literature validation (TRRUST database)
5. Dynamic network analysis:
   - TF activity per timepoint/cell state
   - Identify stage-specific regulators
   - Find regulatory switches
6. Downstream analysis:
   - Master regulators (high out-degree)
   - Regulatory cascades
   - Feed-forward loops
   - Coherent vs incoherent motifs
7. Experimental validation priorities:
   - Rank TF→target edges for validation
   - Suggest ChIP-seq or CUT&RUN experiments
   - Suggest perturbation experiments (knockout/CRISPRi)
8. Visualization:
   - Full GRN network (Cytoscape)
   - Key TF subnetworks
   - TF activity heatmap across development
   - Sankey diagram of regulatory flow

Output:
- Edge list with confidence scores
- TF activity matrix
- Validated vs novel interactions
- Prioritized validation experiments
""")
```

---

## Disease Classification

### Cancer Type Classification from Gene Expression

```python
agent.go("""
Build multi-class classifier for cancer type prediction:

Data: TCGA pan-cancer RNA-seq data
- Training: 8000 samples across 33 cancer types
- Expression: 'data/tcga_expression.csv' (samples × genes)
- Labels: 'data/tcga_labels.csv' (sample_id, cancer_type)

Task: Classify tumor samples by cancer type

Pipeline:
1. Data preprocessing:
   - Log2(TPM + 1) transformation
   - Remove low-variance genes (variance < 0.1)
   - Z-score normalization
2. Feature selection:
   - Variance filtering (top 5000 genes)
   - Univariate feature selection (ANOVA F-test)
   - Select top 500 features
3. Train-test split:
   - 80% train, 20% test
   - Stratified by cancer type
4. Model training (compare multiple algorithms):
   - Random Forest
   - Gradient Boosting (XGBoost)
   - Neural Network (MLP)
   - Elastic Net logistic regression
5. Model evaluation:
   - Accuracy, precision, recall per class
   - Confusion matrix
   - ROC curves (one-vs-rest)
   - Feature importance ranking
6. Model interpretation:
   - SHAP values for predictions
   - Top predictive genes per cancer type
   - Pathway enrichment of predictive features
7. Clinical validation:
   - Test on independent dataset (if available)
   - Analyze misclassifications
   - Identify hard-to-classify subtypes
8. Deliverables:
   - Trained model (pickle)
   - Performance metrics report
   - Feature importance table
   - Confusion matrix heatmap
   - Prediction script for new samples
""")
```
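A condensed version of the training and evaluation loop described above, using scikit-learn: ANOVA-based feature selection feeding a random forest. This is an illustrative sketch; file names and the label column follow the prompt, and the expression matrix is assumed to already be log2(TPM + 1) with samples as rows.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X = pd.read_csv("data/tcga_expression.csv", index_col=0)            # samples x genes
y = pd.read_csv("data/tcga_labels.csv", index_col=0)["cancer_type"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
clf = Pipeline([
    ("select", SelectKBest(f_classif, k=500)),                      # top 500 ANOVA features
    ("rf", RandomForestClassifier(n_estimators=500, random_state=0)),
])
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```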
### Disease Risk Prediction from Multi-Omics

```python
agent.go("""
Develop integrative model predicting cardiovascular disease risk:

Data sources:
1. Genotypes: 'data/genotypes.csv' (500K SNPs, polygenic risk scores)
2. Clinical: 'data/clinical.csv' (age, sex, BMI, blood pressure, cholesterol)
3. Proteomics: 'data/proteomics.csv' (200 plasma proteins, Olink panel)
4. Metabolomics: 'data/metabolomics.csv' (150 metabolites)

Outcome: 10-year CVD incidence (binary)
- Cases: 800
- Controls: 3200

Approach:
1. Data preprocessing:
   - Impute missing values (missForest)
   - Transform skewed features (log/Box-Cox)
   - Normalize each omics layer separately
2. Feature engineering:
   - Calculate PRS from SNP data
   - Interaction terms (age × metabolites, etc.)
   - Metabolite ratios (known CVD markers)
3. Feature selection per omics:
   - Lasso for each data type
   - Select informative features
4. Integration strategies (compare):
   - Early integration: concatenate all features
   - Late integration: separate models, combine predictions
   - Intermediate integration: Multi-omics factor analysis (MOFA)
5. Model development:
   - Logistic regression (interpretable baseline)
   - Random Forest
   - Elastic Net
   - Neural network with omics-specific layers
6. Cross-validation:
   - 5-fold CV, stratified
   - Hyperparameter tuning
   - Calculate confidence intervals
7. Model evaluation:
   - AUC-ROC, AUC-PR
   - Calibration plots
   - Net reclassification improvement (NRI)
   - Compare to clinical models (Framingham, SCORE)
8. Interpretation:
   - Feature importance (permutation importance)
   - SHAP values for individuals
   - Identify most informative omics layer
9. Clinical utility:
   - Decision curve analysis
   - Risk stratification groups
   - Biomarker panel selection

Outputs:
- Model comparison table
- ROC curves for all models
- Feature importance per omics
- Reclassification table
- Clinical implementation recommendations
""")
```

---

## Multi-Omics Integration

### Multi-Omics Data Integration

```python
agent.go("""
Integrate transcriptomics, proteomics, and metabolomics data:

Study: Drug response in cancer cell lines

Data:
- RNA-seq: 'data/transcriptomics.csv' (15000 genes × 50 cell lines)
- Proteomics: 'data/proteomics.csv' (3000 proteins × 50 cell lines)
- Metabolomics: 'data/metabolomics.csv' (200 metabolites × 50 cell lines)
- Drug response: 'data/drug_response.csv' (cell line, drug, IC50)

Goal: Identify multi-omics signatures of drug sensitivity

Analysis:
1. Data preprocessing:
   - Match samples across omics layers
   - Filter low-variance features per omics
   - Normalize each omics separately (z-score)
2. Integration methods (compare):

   **Method 1: MOFA (Multi-Omics Factor Analysis)**
   - Identify latent factors capturing variance across omics
   - Determine factor contributions per omics
   - Relate factors to drug response

   **Method 2: DIABLO (sparse PLS-DA)**
   - Supervised integration
   - Maximize covariance between omics and drug response
   - Select features from each omics layer

   **Method 3: Similarity Network Fusion (SNF)**
   - Build cell line similarity networks per omics
   - Fuse networks
   - Cluster cell lines by integrated similarity

3. Association with drug response:
   - Correlation of factors/components with IC50
   - Identify drug-sensitive vs resistant groups
   - Multi-omics biomarkers
4. Network analysis:
   - Build multi-layer network:
     * Gene regulatory network (RNA)
     * Protein-protein interactions (proteins)
     * Gene-metabolite associations
   - Integrate layers
   - Find dysregulated pathways
5. Predictive modeling:
   - Train model predicting drug response from multi-omics
   - Compare: using all omics vs individual omics
   - Feature selection across omics
6. Biological interpretation:
   - Map features to pathways
   - Identify mechanism of drug action
   - Suggest combination therapies
7. Validation:
   - Leave-one-out cross-validation
   - Test in independent cell line panel

Outputs:
- Factor loadings per omics (MOFA)
- Multi-omics biomarker signature
- Integrated network visualization
- Predictive model of drug response
- Mechanistic hypotheses
""")
```

---

## Proteomics Analysis

### Label-Free Quantitative Proteomics

```python
agent.go("""
Analyze label-free proteomics data from mass spectrometry:

Study: Comparison of normal vs diseased tissue (n=6 per group)

Data: MaxQuant output
- 'data/proteinGroups.txt' (MaxQuant protein quantification)
- 'data/peptides.txt' (peptide-level data)

Experimental design:
- 6 normal samples
- 6 disease samples
- Label-free quantification (LFQ), 3 fractions each

Analysis:
1. Data loading and QC:
   - Load proteinGroups.txt
   - Remove contaminants, reverse hits
   - Filter: valid values in ≥50% of samples per group
   - Check sample correlations and outliers
   - PCA for quality assessment
2. Imputation:
   - Impute missing values (MAR vs MNAR approach)
   - Use MinProb for low-abundance missing values
   - Use kNN for random missing values
3. Normalization:
   - Median normalization
   - Or: VSN (variance stabilizing normalization)
4. Differential expression:
   - Two-sample t-test (for each protein)
   - Moderated t-test (limma)
   - Filter: |log2FC| > 0.58 (~1.5-fold), adj.p < 0.05
5. Visualization:
   - Volcano plot
   - Heatmap of significant proteins
   - PCA colored by condition
   - Intensity distributions (before/after normalization)
6. Functional enrichment:
   - GO enrichment (up and down separately)
   - KEGG pathways
   - Reactome pathways
   - STRING PPI network of DEPs
7. PTM analysis (if available):
   - Phosphorylation site analysis
   - Kinase enrichment analysis (KEA3)
8. Orthogonal validation:
   - Compare to RNA-seq data (if available)
   - Protein-RNA correlation
   - Identify discordant genes
9. Biomarker candidates:
   - Rank proteins by fold-change and significance
   - Filter for secreted proteins (potential biomarkers)
   - Check if targetable (druggable)

Outputs:
- Differential abundance table
- QC report with plots
- Enrichment analysis results
- PPI network of DEPs
- Candidate biomarkers list
""")
```
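The differential-abundance step above can be approximated with a plain Welch t-test plus Benjamini-Hochberg correction, as sketched below, although limma's moderated t-test (run in R) is preferable with n=6 per group. The log2-intensity matrix file and the sample-name prefixes are assumptions about intermediate outputs, not MaxQuant file names.

```python
import pandas as pd
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical intermediate: filtered, imputed, normalized log2 intensities (proteins x samples)
intens = pd.read_csv("results/log2_intensities.csv", index_col=0)
normal = [c for c in intens.columns if c.startswith("normal")]
disease = [c for c in intens.columns if c.startswith("disease")]

res = pd.DataFrame(index=intens.index)
res["log2fc"] = intens[disease].mean(axis=1) - intens[normal].mean(axis=1)
res["pval"] = ttest_ind(intens[disease], intens[normal], axis=1, equal_var=False).pvalue
res["padj"] = multipletests(res["pval"], method="fdr_bh")[1]
hits = res[(res["log2fc"].abs() > 0.58) & (res["padj"] < 0.05)]
print(f"{len(hits)} differentially abundant proteins")
```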
---

## Biomarker Discovery

### Diagnostic Biomarker Discovery

```python
agent.go("""
Discover diagnostic biomarkers for early cancer detection:

Study: Plasma proteomics comparing early-stage cancer vs healthy controls

Data:
- 'data/proteomics.csv' (1000 proteins × 200 samples)
- 'data/metadata.csv' (sample_id, group [cancer/healthy], age, sex)

Groups:
- Early-stage cancer: 100 samples
- Healthy controls: 100 samples

Goal: Identify protein panel for early detection (target AUC > 0.90)

Workflow:
1. Exploratory analysis:
   - PCA, tSNE to visualize separation
   - Univariate differential abundance
   - Volcano plot
2. Feature selection:
   - Rank proteins by:
     * Fold change
     * Statistical significance (t-test, Mann-Whitney)
     * AUC (each protein individually)
   - Select proteins with AUC > 0.70
3. Biomarker panel construction:
   - Correlation analysis (remove redundant markers)
   - Forward selection:
     * Start with best single marker
     * Add markers improving panel performance
     * Stop when no improvement
   - Aim for 5-10 marker panel (practical for assay)
4. Model building:
   - Logistic regression on selected panel
   - Calculate combined risk score
   - Cross-validation (10-fold)
5. Performance evaluation:
   - AUC-ROC, AUC-PR
   - Sensitivity/specificity at different thresholds
   - Clinical decision threshold (e.g., 90% sensitivity)
   - Calibration plot
6. Biological validation:
   - Literature support for cancer association
   - Expression in tumor vs blood
   - Mechanism of release/shedding
7. Clinical utility:
   - Compare to existing biomarkers (CEA, CA19-9, etc.)
   - Cost-effectiveness consideration
   - Assay feasibility (ELISA, MRM, etc.)
8. Independent validation plan:
   - Power calculation for validation cohort
   - Suggested sample size
   - Pre-analytical variables to control

Outputs:
- Ranked protein list with individual performance
- Final biomarker panel
- Logistic regression model
- ROC curves (individual + panel)
- Clinical characteristics table
- Validation study protocol
""")
```
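To illustrate how a candidate panel from the forward-selection step might be scored, the sketch below fits a logistic regression on a few selected proteins and reports cross-validated AUC with scikit-learn. The panel members are placeholders, and the proteomics matrix is transposed on load because the prompt describes it as proteins × samples.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

proteins = pd.read_csv("data/proteomics.csv", index_col=0).T   # transpose to samples x proteins
meta = pd.read_csv("data/metadata.csv", index_col=0)
panel = ["protein_A", "protein_B", "protein_C"]                # hypothetical selected markers

X = proteins.loc[meta.index, panel]
y = (meta["group"] == "cancer").astype(int)
auc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10, scoring="roc_auc")
print(f"10-fold CV AUC: {auc.mean():.3f} ± {auc.std():.3f}")
```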
---

## Additional Advanced Examples

### Spatial Transcriptomics Analysis

```python
agent.go("""
Analyze Visium spatial transcriptomics data:

Data: 'data/visium_brain_tumor.h5ad'
- Contains spatial coordinates and gene expression
- Tissue: Brain tumor biopsy

Analysis:
1. Data QC and normalization:
   - Filter low-quality spots (total counts, detected genes)
   - Normalize, log-transform
   - Calculate spatial statistics
2. Spatial clustering:
   - Graph-based clustering considering spatial proximity
   - Identify tumor regions, stroma, necrosis, etc.
3. Spatially variable genes:
   - Test for spatial patterns (Moran's I, SpatialDE)
   - Identify genes with spatial gradients
4. Deconvolution:
   - Estimate cell type composition per spot
   - Use scRNA-seq reference (if available)
   - Methods: Cell2location, RCTD, SPOTlight
5. Niche analysis:
   - Define tissue niches by cell type composition
   - Identify tumor-stroma interface
   - Analyze cell-cell interactions
6. Spatial pathway analysis:
   - Map pathway activity onto tissue
   - Identify spatially localized processes
7. Visualization:
   - Spatial plots colored by cluster, gene expression
   - Cell type composition maps
   - Pathway activity maps

Output:
- Annotated spatial data object
- Spatially variable gene list
- Cell type composition per spot
- Niche definitions and cell-cell interactions
""")
```

---

## Tips for Effective Task Specification

### 1. Be Specific About Data Formats and Locations

✅ Good:
```python
agent.go("Analyze scRNA-seq data in AnnData format at 'data/experiment1.h5ad'")
```

❌ Vague:
```python
agent.go("Analyze my data")
```

### 2. Specify Analysis Parameters

✅ Good:
```python
agent.go("""
Cluster cells using Leiden algorithm with resolution 0.5,
k-neighbors=10, using PCA components 1-30
""")
```

❌ Vague:
```python
agent.go("Cluster the cells")
```

### 3. Request Specific Outputs

✅ Good:
```python
agent.go("""
... and save results as:
- CSV table with statistics
- PNG figures at 300 DPI
- Processed data as AnnData at 'results/processed.h5ad'
""")
```

❌ Vague:
```python
agent.go("... and save the results")
```

### 4. Provide Biological Context

✅ Good:
```python
agent.go("""
This is a drug treatment experiment. Compare vehicle vs treated groups
to identify drug-induced transcriptional changes.
Focus on apoptosis and cell cycle pathways.
""")
```

❌ Vague:
```python
agent.go("Compare the two groups")
```

### 5. Break Complex Analyses into Steps

✅ Good:
```python
# Step 1
agent.go("Load and QC the data, save QC metrics")

# Step 2
agent.go("Based on QC, normalize and find HVGs")

# Step 3
agent.go("Cluster and annotate cell types")
```

❌ Overwhelming:
```python
agent.go("Do a complete scRNA-seq analysis pipeline")
```