
Biomni Task Examples

Comprehensive collection of biomedical task examples with code patterns and best practices.

Table of Contents

  1. Single-Cell RNA-seq Analysis
  2. CRISPR Screening
  3. Genomic Analysis (GWAS, Variant Calling)
  4. Protein Structure and Function
  5. Drug Discovery and ADMET
  6. Pathway and Network Analysis
  7. Disease Classification
  8. Multi-Omics Integration
  9. Proteomics Analysis
  10. Biomarker Discovery
  11. Additional Advanced Examples
  12. Tips for Effective Task Specification

Single-Cell RNA-seq Analysis

Basic scRNA-seq Pipeline

from biomni.agent import A1

agent = A1(path='./data', llm='claude-sonnet-4-20250514')

agent.go("""
Analyze the 10X Genomics scRNA-seq dataset located at 'data/pbmc_10k.h5ad'.

Workflow:
1. Load the data and perform QC:
   - Filter cells with <200 genes or >5000 genes
   - Filter cells with >10% mitochondrial reads
   - Filter genes expressed in <3 cells

2. Normalize and identify highly variable genes:
   - Use SCTransform or standard log-normalization
   - Identify top 2000 HVGs

3. Dimensionality reduction:
   - PCA (50 components)
   - UMAP for visualization

4. Clustering:
   - Find neighbors (k=10)
   - Leiden clustering with resolution 0.5

5. Visualization:
   - UMAP colored by cluster
   - QC metrics on UMAP

Save processed data as 'results/pbmc_processed.h5ad'
""")

Cell Type Annotation

agent.go("""
Using the processed PBMC data at 'results/pbmc_processed.h5ad':

1. Find marker genes for each cluster:
   - Wilcoxon rank-sum test
   - Log fold change > 0.5
   - Adjusted p-value < 0.01
   - Present in >25% of cluster cells

2. Annotate cell types using markers:
   - T cells: CD3D, CD3E, CD3G
   - B cells: CD19, MS4A1 (CD20)
   - NK cells: GNLY, NKG7, NCAM1
   - Monocytes: CD14, LYZ, CD68
   - Dendritic cells: FCER1A, CD1C

3. Create visualization:
   - UMAP with cell type labels
   - Dotplot of marker genes by cell type
   - Proportion of cell types (bar plot)

4. Save annotated data with cell types
""")

Differential Expression Between Conditions

agent.go("""
Compare gene expression between stimulated and control conditions:

Data: 'data/immune_stim_experiment.h5ad' (contains 'condition' metadata)

Analysis:
1. Subset to T cells only (cell_type == 'T cell')

2. Differential expression between stim vs control:
   - Use pseudobulk approach (aggregate by donor + condition)
   - DESeq2 or edgeR for statistical testing
   - Filter: |log2FC| > 1, padj < 0.05

3. Pathway enrichment on DEGs:
   - Use GO biological processes
   - Use KEGG pathways
   - Run enrichment analysis with gprofiler or enrichr

4. Visualization:
   - Volcano plot of DEGs
   - Heatmap of top 50 DEGs
   - Bar plot of top enriched pathways

5. Export results table with gene symbols, log2FC, p-values, and pathway annotations
""")

Trajectory Analysis

agent.go("""
Perform pseudotime trajectory analysis on hematopoietic differentiation data:

Data: 'data/hematopoiesis.h5ad'

Steps:
1. Subset to progenitor and mature cell types:
   - HSC, MPP, GMP, Monocytes, Neutrophils

2. Run trajectory inference:
   - Use PAGA or Monocle3
   - Set HSC as root cell type

3. Calculate pseudotime for all cells

4. Identify trajectory-associated genes:
   - Genes that change along pseudotime
   - Statistical test with FDR < 0.05
   - Cluster genes by expression pattern (early, middle, late)

5. Visualization:
   - UMAP colored by pseudotime
   - Heatmap of trajectory genes
   - Gene expression along pseudotime for key TFs

6. Functional analysis:
   - GO enrichment for early/middle/late gene clusters
""")

Integration of Multiple Datasets

agent.go("""
Integrate three scRNA-seq datasets from different batches:

Data files:
- 'data/batch1_pbmc.h5ad'
- 'data/batch2_pbmc.h5ad'
- 'data/batch3_pbmc.h5ad'

Integration workflow:
1. Load all three datasets

2. Perform individual QC on each batch:
   - Same filters as standard QC
   - Note batch-specific statistics

3. Integration using Harmony or Scanorama:
   - Concatenate datasets
   - Identify HVGs on combined data
   - Run batch correction
   - Verify batch mixing with LISI score

4. Re-cluster integrated data:
   - Use corrected embeddings
   - Leiden clustering

5. Cell type annotation on integrated data

6. Visualization:
   - UMAP split by batch (before/after correction)
   - UMAP colored by cell type
   - Batch mixing statistics

7. Save integrated dataset
""")

CRISPR Screening

Guide RNA Design

agent.go("""
Design guide RNAs for CRISPR knockout screening of cell cycle genes:

Target genes:
- CDK1, CDK2, CDK4, CDK6
- CCNA2, CCNB1, CCND1, CCNE1
- TP53, RB1, MYC

Requirements:
1. Design 4-6 guides per gene targeting early exons

2. For each guide, evaluate:
   - On-target efficiency score (Doench 2016)
   - Off-target potential (CFD score < 0.3)
   - Avoid common SNPs (1000 Genomes)

3. Add control guides:
   - 100 non-targeting controls
   - 20 positive controls (essential genes)

4. Output:
   - Table with: gene, guide_sequence, PAM, position, on_target_score, off_target_count
   - Sequences in format for oligonucleotide ordering
   - Visual summary of guide distribution per gene

Reference genome: hg38
""")

CRISPR Screen Analysis

agent.go("""
Analyze data from a genome-wide CRISPR knockout screen:

Data: 'data/crispr_screen_counts.csv'
- Columns: guide_id, gene, sample_T0, sample_T15, replicate
- ~80,000 guides targeting ~18,000 genes

Analysis:
1. Quality control:
   - Guide representation (reads per guide)
   - Sample correlation
   - Remove guides with <30 reads in T0

2. Normalize counts:
   - Reads per million (RPM)
   - Log2 fold change (T15 vs T0)

3. Statistical analysis using MAGeCK:
   - Identify significantly depleted/enriched genes
   - FDR < 0.05
   - Rank genes by robust rank aggregation (RRA)

4. Functional analysis:
   - Pathway enrichment of hit genes
   - Known vs novel essential genes
   - Correlation with Cancer Dependency Map

5. Visualization:
   - Scatterplot: log2FC vs -log10(FDR)
   - Heatmap: top 50 depleted genes across replicates
   - Network: PPI network of hit genes

6. Export:
   - Ranked gene list with statistics
   - Enriched pathways table
""")

Pooled Optical Screening Analysis

agent.go("""
Analyze pooled CRISPR screen with imaging readout (e.g., Cell Painting):

Data structure:
- 'data/guide_assignments.csv': cell_id, guide_id, gene
- 'data/morphology_features.csv': cell_id, feature_1...feature_500

Analysis:
1. Feature preprocessing:
   - Remove low-variance features
   - Normalize features (z-score per plate)
   - PCA for dimensionality reduction

2. Associate phenotypes with perturbations:
   - Aggregate cells by guide (mean/median)
   - Calculate morphological distance from controls
   - Statistical test for phenotype change

3. Identify phenotype-altering genes:
   - Mahalanobis distance from control distribution
   - Bonferroni correction for multiple testing
   - Effect size threshold

4. Cluster genes by phenotype similarity:
   - Hierarchical clustering of gene profiles
   - Identify phenotype classes

5. Validation and interpretation:
   - Compare to known gene functions
   - Pathway enrichment per phenotype cluster

6. Visualization:
   - UMAP of all perturbations
   - Heatmap of gene clusters × morphology features
   - Representative images for each cluster
""")

Genomic Analysis

GWAS Analysis

agent.go("""
Perform genome-wide association study for Type 2 Diabetes:

Data:
- 'data/genotypes.bed' (PLINK format, 500K SNPs, 5000 cases, 5000 controls)
- 'data/phenotypes.txt' (sample_id, T2D_status, age, sex, BMI, ancestry_PCs)

Workflow:
1. Quality control:
   - SNP QC: MAF > 0.01, HWE p > 1e-6, genotyping rate > 0.95
   - Sample QC: genotyping rate > 0.95, heterozygosity check
   - Remove related individuals (kinship > 0.125)

2. Association testing:
   - Logistic regression: T2D ~ SNP + age + sex + BMI + PC1-10
   - Genome-wide significance threshold: p < 5e-8
   - Suggestive threshold: p < 1e-5

3. Post-GWAS analysis:
   - LD clumping (r² > 0.1, 500kb window)
   - Annotate lead SNPs with nearby genes (±100kb)
   - Query GWAS Catalog for known associations

4. Functional annotation:
   - Overlap with regulatory elements (ENCODE)
   - eQTL colocalization (GTEx)
   - GWAS prioritization scores (PoPS, ABC)

5. Visualization:
   - Manhattan plot
   - QQ plot
   - Regional association plots for top loci
   - Locus zoom plots

6. Heritability and genetic correlation:
   - SNP heritability (LDSC)
   - Genetic correlation with related traits

Export summary statistics for meta-analysis
""")

Whole Exome Sequencing Analysis

agent.go("""
Analyze whole exome sequencing data for rare disease diagnosis:

Data: Family trio (proband, mother, father)
- 'data/proband.bam'
- 'data/mother.bam'
- 'data/father.bam'

Phenotype: Developmental delay, seizures, intellectual disability

Pipeline:
1. Variant calling:
   - GATK HaplotypeCaller on each sample
   - Joint genotyping across trio
   - VQSR filtering (SNPs and indels separately)

2. Variant annotation:
   - Functional consequence (VEP or ANNOVAR)
   - Population frequencies (gnomAD)
   - Pathogenicity predictions (CADD, REVEL, SpliceAI)
   - Disease databases (ClinVar, OMIM)

3. Inheritance analysis:
   - De novo variants (absent in both parents)
   - Compound heterozygous variants
   - Rare homozygous variants (autozygosity)
   - X-linked variants (if proband is male)

4. Filtering strategy:
   - Population AF < 0.001 (gnomAD)
   - High-quality variants (GQ > 20, DP > 10)
   - Loss-of-function or missense with CADD > 20
   - Match phenotype to gene function (HPO terms)

5. Prioritization:
   - Known disease genes for phenotype
   - De novo in intolerant genes (pLI > 0.9)
   - Protein-truncating variants

6. Report:
   - Top candidate variants with evidence
   - Gene function and disease association
   - Segregation analysis
   - Recommended validation (Sanger sequencing)
   - ACMG variant classification

Save VCF with annotations and prioritized candidate list
""")

Variant Calling from RNA-seq

agent.go("""
Identify expressed variants from RNA-seq data:

Data: Tumor RNA-seq BAM file
- 'data/tumor_RNAseq.bam'
- Reference: hg38

Purpose: Identify expressed somatic mutations for neoantigen prediction

Steps:
1. Pre-processing:
   - Mark duplicates (Picard)
   - Split reads at junctions (GATK SplitNCigarReads)
   - Base quality recalibration

2. Variant calling:
   - GATK HaplotypeCaller (RNA-seq mode)
   - Filter: DP > 10, AF > 0.05

3. Filtering artifacts:
   - Remove common SNPs (gnomAD AF > 0.001)
   - Filter intronic/intergenic variants
   - Remove known RNA editing sites (RADAR database)
   - Panel of normals (if available)

4. Annotation:
   - Functional impact (VEP)
   - Identify non-synonymous variants
   - Predict MHC binding (NetMHCpan for patient HLA type)

5. Prioritize neoantigens:
   - Strong MHC binding (IC50 < 500nM)
   - High expression (TPM > 5)
   - High variant allele frequency

6. Output:
   - Annotated VCF
   - Neoantigen candidates table
   - Peptide sequences for validation

Note: this analysis requires the patient's HLA typing data.
""")

Protein Structure and Function

Protein Structure Prediction and Analysis

agent.go("""
Predict and analyze structure for novel protein sequence:

Sequence (FASTA format):
>Novel_Kinase_Domain
MKLLVVDDDGVADYSKRDGAFMVAYCIEPGDG...

Tasks:
1. Structure prediction:
   - Use AlphaFold2 or ESMFold
   - Generate 5 models, rank by confidence

2. Quality assessment:
   - pLDDT scores (per-residue confidence)
   - pTM score (global confidence)
   - Identify low-confidence regions

3. Domain identification:
   - InterProScan for domain architecture
   - Pfam domain search
   - Identify catalytic residues

4. Functional site prediction:
   - Active site prediction
   - Substrate binding pocket identification
   - Post-translational modification sites

5. Structural alignment:
   - Search for similar structures (PDB)
   - Align to close homologs
   - Identify conserved structural motifs

6. Mutation analysis:
   - Known disease mutations in homologs
   - Predict impact on structure (Rosetta ddG)

7. Visualization and output:
   - PyMOL/Chimera visualization scripts
   - Structural alignment figures
   - Annotated PDB file with functional sites
   - Summary report with predictions
""")

Protein-Protein Interaction Prediction

agent.go("""
Predict and validate protein-protein interactions:

Target protein: BRCA1
Species: Human

Analysis:
1. Literature-based interactions:
   - Query BioGRID, STRING, IntAct databases
   - Extract high-confidence interactors (score > 0.7)

2. Structure-based prediction:
   - Predict BRCA1 structure (if not available)
   - Dock with known interactors (BRCA2, BARD1, etc.)
   - Score interfaces (PISA, PDBePISA)

3. Sequence-based prediction:
   - Coevolution analysis (EVcouplings)
   - Domain-domain interaction prediction
   - Linear motif search (ELM database)

4. Functional analysis of interactors:
   - GO enrichment analysis
   - KEGG pathway membership
   - Tissue/cell type expression patterns

5. Network analysis:
   - Build PPI network
   - Identify network modules
   - Central hub proteins

6. Experimental validation suggestions:
   - Prioritize interactions for validation
   - Suggest Co-IP or Y2H experiments
   - Identify commercially available antibodies

7. Output:
   - Ranked interaction list with evidence
   - PPI network visualization
   - Structural models of key interactions
""")

Protein Engineering Design

agent.go("""
Design improved enzyme variant with enhanced thermostability:

Target enzyme: TEM-1 β-lactamase
Goal: Increase melting temperature by >10°C while maintaining activity

Strategy:
1. Analyze current structure:
   - Load PDB structure (1BTL)
   - Identify flexible regions (B-factors)
   - Find potential disulfide bond sites

2. Computational design:
   - Rosetta design simulations
   - Identify stabilizing mutations (ΔΔG < -1.0 kcal/mol)
   - Avoid active site and substrate binding regions

3. Prioritize mutations:
   - Surface entropy reduction (SER)
   - Disulfide bond introduction
   - Salt bridge formation
   - Hydrophobic core packing

4. Check conservation:
   - Multiple sequence alignment of β-lactamases
   - Avoid highly conserved positions
   - Prefer positions with natural variation

5. Design library:
   - Rank top 20 single mutants
   - Design 5 combinatorial variants (2-3 mutations)
   - Ensure codon optimization for E. coli

6. Validation plan:
   - Expression and purification protocol
   - Thermal shift assay (DSF)
   - Activity assay (nitrocefin)
   - Recommended high-throughput screening

7. Output:
   - Ranked mutation list with predicted ΔΔG
   - Structural visualizations
   - Codon-optimized sequences
   - Cloning primers
   - Experimental validation protocol
""")

Drug Discovery and ADMET

Virtual Screening

agent.go("""
Perform virtual screening for SARS-CoV-2 Mpro inhibitors:

Target: SARS-CoV-2 Main protease (Mpro)
Crystal structure: PDB 6LU7

Compound library: ZINC15 drug-like subset (~100K compounds)
File: 'data/zinc_druglike_100k.smi' (SMILES format)

Workflow:
1. Protein preparation:
   - Remove crystallographic waters (keep catalytic waters)
   - Add hydrogens, optimize H-bond network
   - Define binding site (residues within 5Å of native ligand)

2. Ligand preparation:
   - Generate 3D coordinates from SMILES
   - Enumerate tautomers and protonation states
   - Energy minimization

3. Molecular docking:
   - Dock all compounds (AutoDock Vina or Glide)
   - Generate top 3 poses per compound
   - Score binding affinity

4. Consensus scoring:
   - Combine multiple scoring functions
   - Rank compounds by consensus score

5. ADMET filtering:
   - Lipinski's rule of 5
   - BBB permeability (not needed for this target)
   - hERG liability (flag compounds with predicted pIC50 > 5)
   - CYP450 inhibition prediction
   - Toxicity prediction (Tox21)

6. Visual inspection:
   - Top 100 compounds
   - Check key interactions (His41, Cys145 catalytic dyad)
   - Remove PAINS and frequent hitters

7. Final selection:
   - Top 20 compounds for experimental testing
   - Cluster by scaffold diversity

8. Output:
   - Ranked compound list with scores and ADMET properties
   - Docking poses (mol2 or PDB format)
   - 2D interaction diagrams
   - Purchase availability from vendors
""")

ADMET Property Prediction

agent.go("""
Predict ADMET properties for drug candidate series:

Input: 'data/compound_series.smi' (25 analogs, SMILES format)
Lead scaffold: Novel kinase inhibitor series

Properties to predict:
1. Absorption:
   - Caco-2 permeability
   - Human intestinal absorption (HIA)
   - P-glycoprotein substrate

2. Distribution:
   - Plasma protein binding (% bound)
   - Volume of distribution (VDss)
   - Blood-brain barrier permeability (LogBB)

3. Metabolism:
   - CYP450 substrate (1A2, 2C9, 2C19, 2D6, 3A4)
   - CYP450 inhibition (same isoforms)
   - Sites of metabolism (SOM prediction)

4. Excretion:
   - Clearance estimation
   - Half-life prediction
   - Renal excretion likelihood

5. Toxicity:
   - hERG inhibition (cardiotoxicity)
   - AMES mutagenicity
   - Hepatotoxicity
   - Skin sensitization
   - Rat acute toxicity (LD50)

6. Drug-likeness:
   - Lipinski's Ro5
   - QED score
   - Synthetic accessibility

Analysis:
- Compare all analogs in the series
- Structure-property relationships
- Identify best balanced compound
- Suggest modifications for improvement

Output:
- Comprehensive ADMET table
- Radar plots for each compound
- SAR analysis for each property
- Recommendations for next design iteration
""")

Lead Optimization

agent.go("""
Optimize lead compound balancing potency and selectivity:

Current lead:
- IC50 (target kinase): 50 nM
- IC50 (off-target kinases): 100-500 nM (poor selectivity)
- Microsomal stability: t1/2 = 20 min (too short)
- Solubility: 5 μM (low)

Goal: Maintain potency, improve selectivity (>100x), improve PK properties

Strategy:
1. Analyze current binding mode:
   - Docking to target and off-targets
   - Identify selectivity-determining residues
   - Map interaction hotspots

2. Design focused library:
   - Modifications to improve selectivity:
     * Target residues unique to on-target
     * Avoid conserved kinase regions
   - Modifications to improve solubility:
     * Add polar groups to solvent-exposed regions
     * Replace lipophilic groups
   - Modifications to improve metabolic stability:
     * Block metabolically labile positions
     * Replace metabolically unstable groups

3. Virtual enumeration:
   - Generate ~200 analogs
   - Predict binding affinity (docking)
   - Predict ADMET properties

4. Multi-parameter optimization:
   - Calculate MPO score (potency + selectivity + ADMET)
   - Pareto optimization
   - Select top 20 compounds

5. Clustering and diversity:
   - Ensure structural diversity
   - Test different modification strategies

6. Synthetic feasibility:
   - Retrosynthetic analysis
   - Flag difficult syntheses
   - Prioritize 10 compounds for synthesis

7. Deliverables:
   - Ranked compound designs
   - Predicted properties table
   - Binding mode visualizations
   - Synthetic routes
   - Recommended testing cascade
""")

Pathway and Network Analysis

Pathway Enrichment Analysis

agent.go("""
Perform comprehensive pathway enrichment on differentially expressed genes:

Input: 'data/DEGs.csv'
Columns: gene_symbol, log2FC, padj
Significant DEGs: padj < 0.05, |log2FC| > 1
Total: 450 upregulated, 380 downregulated genes

Background: all detected genes in the experiment (~15,000)

Analysis:
1. GO enrichment (biological processes):
   - Test upregulated and downregulated genes separately
   - Use hypergeometric test
   - FDR correction (Benjamini-Hochberg)
   - Filter: padj < 0.05, fold enrichment > 2

2. KEGG pathway enrichment:
   - Same approach as GO
   - Focus on signaling and metabolic pathways

3. Reactome pathway enrichment:
   - More detailed pathway hierarchy

4. Disease association:
   - DisGeNET disease enrichment
   - Compare to disease gene signatures (MSigDB)

5. Transcription factor enrichment:
   - Predict upstream regulators (ChEA3)
   - ENCODE ChIP-seq enrichment

6. Drug/compound perturbations:
   - L1000 connectivity map
   - Identify drugs that reverse/mimic signature

7. Cross-pathway analysis:
   - Pathway crosstalk
   - Hierarchical clustering of pathways by gene overlap
   - Network visualization of enriched pathways

8. Visualization:
   - Dot plots (GO, KEGG, Reactome)
   - Enrichment map network
   - Chord diagram (genes-pathways)
   - Treemap of hierarchical GO terms

9. Export:
   - All enrichment tables
   - Pathway gene lists
   - Interactive HTML report
""")

Protein-Protein Interaction Network

agent.go("""
Build and analyze PPI network for Alzheimer's disease genes:

Seed genes: Known AD risk genes (APP, PSEN1, PSEN2, APOE, MAPT, etc.)
File: 'data/AD_seed_genes.txt'

Network construction:
1. Build network from seed genes:
   - Query STRING database (confidence > 0.7)
   - Include direct and second-degree interactors
   - Maximum network size: 500 proteins

2. Network enrichment:
   - Add disease associations (DisGeNET)
   - Add tissue expression (GTEx - prioritize brain)
   - Add functional annotations (GO, Reactome)

3. Network analysis:
   - Calculate centrality measures:
     * Degree centrality
     * Betweenness centrality
     * Eigenvector centrality
   - Identify hub proteins
   - Community detection (Louvain algorithm)

4. Module analysis:
   - Functional enrichment per community
   - Identify disease-relevant modules
   - Key bridge proteins between modules

5. Druggability analysis:
   - Identify druggable targets (DGIdb)
   - Known drugs targeting network proteins
   - Clinical trial status

6. Network perturbation:
   - Simulate gene knockout
   - Network robustness analysis
   - Identify critical nodes

7. Visualization:
   - Interactive network (Cytoscape format)
   - Layout by module membership
   - Color by centrality/expression
   - Size by degree

8. Prioritization:
   - Rank proteins by:
     * Network centrality
     * Brain expression
     * Druggability
     * Genetic evidence (GWAS)
   - Top therapeutic targets

Output:
- Network file (graphML, SIF)
- Module membership table
- Prioritized target list
- Druggable targets with existing compounds
""")

Gene Regulatory Network Inference

agent.go("""
Infer gene regulatory network from scRNA-seq data:

Data: 'data/development_timecourse.h5ad'
- Cells from 5 developmental timepoints
- 3000 HVGs quantified

Goal: Identify TF→target relationships during development

Methods:
1. Preprocessing:
   - Select TFs (from TF census list)
   - Select potential target genes (HVGs)
   - Normalize expression

2. GRN inference using multiple methods:
   - GENIE3 (random forest)
   - SCENIC (motif-based)
   - CellOracle (perturbation-based)
   - Pearson/Spearman correlation (baseline)

3. Integrate predictions:
   - Combine scores from multiple methods
   - Weight by motif evidence (JASPAR)
   - Filter low-confidence edges

4. Network refinement:
   - Remove indirect edges (transitive reduction)
   - Validate with ChIP-seq data (if available)
   - Literature validation (TRRUST database)

5. Dynamic network analysis:
   - TF activity per timepoint/cell state
   - Identify stage-specific regulators
   - Find regulatory switches

6. Downstream analysis:
   - Master regulators (high out-degree)
   - Regulatory cascades
   - Feed-forward loops
   - Coherent vs incoherent motifs

7. Experimental validation priorities:
   - Rank TF→target edges for validation
   - Suggest ChIP-seq or CUT&RUN experiments
   - Suggest perturbation experiments (knockout/CRISPRi)

8. Visualization:
   - Full GRN network (Cytoscape)
   - Key TF subnetworks
   - TF activity heatmap across development
   - Sankey diagram of regulatory flow

Output:
- Edge list with confidence scores
- TF activity matrix
- Validated vs novel interactions
- Prioritized validation experiments
""")

Disease Classification

Cancer Type Classification from Gene Expression

agent.go("""
Build multi-class classifier for cancer type prediction:

Data: TCGA pan-cancer RNA-seq data
- Training: 8000 samples across 33 cancer types
- Expression: 'data/tcga_expression.csv' (samples × genes)
- Labels: 'data/tcga_labels.csv' (sample_id, cancer_type)

Task: Classify tumor samples by cancer type

Pipeline:
1. Data preprocessing:
   - Log2(TPM + 1) transformation
   - Remove low-variance genes (variance < 0.1)
   - Z-score normalization

2. Feature selection:
   - Variance filtering (top 5000 genes)
   - Univariate feature selection (ANOVA F-test)
   - Select top 500 features

3. Train-test split:
   - 80% train, 20% test
   - Stratified by cancer type

4. Model training (compare multiple algorithms):
   - Random Forest
   - Gradient Boosting (XGBoost)
   - Neural Network (MLP)
   - Elastic Net logistic regression

5. Model evaluation:
   - Accuracy, precision, recall per class
   - Confusion matrix
   - ROC curves (one-vs-rest)
   - Feature importance ranking

6. Model interpretation:
   - SHAP values for predictions
   - Top predictive genes per cancer type
   - Pathway enrichment of predictive features

7. Clinical validation:
   - Test on independent dataset (if available)
   - Analyze misclassifications
   - Identify hard-to-classify subtypes

8. Deliverables:
   - Trained model (pickle)
   - Performance metrics report
   - Feature importance table
   - Confusion matrix heatmap
   - Prediction script for new samples
""")

Disease Risk Prediction from Multi-Omics

agent.go("""
Develop integrative model predicting cardiovascular disease risk:

Data sources:
1. Genotypes: 'data/genotypes.csv' (500K SNPs, polygenic risk scores)
2. Clinical: 'data/clinical.csv' (age, sex, BMI, blood pressure, cholesterol)
3. Proteomics: 'data/proteomics.csv' (200 plasma proteins, Olink panel)
4. Metabolomics: 'data/metabolomics.csv' (150 metabolites)

Outcome: 10-year CVD incidence (binary)
- Cases: 800
- Controls: 3200

Approach:
1. Data preprocessing:
   - Impute missing values (missForest)
   - Transform skewed features (log/Box-Cox)
   - Normalize each omics layer separately

2. Feature engineering:
   - Calculate PRS from SNP data
   - Interaction terms (age × metabolites, etc.)
   - Metabolite ratios (known CVD markers)

3. Feature selection per omics:
   - Lasso for each data type
   - Select informative features

4. Integration strategies (compare):
   - Early integration: concatenate all features
   - Late integration: separate models, combine predictions
   - Intermediate integration: Multi-omics factor analysis (MOFA)

5. Model development:
   - Logistic regression (interpretable baseline)
   - Random Forest
   - Elastic Net
   - Neural network with omics-specific layers

6. Cross-validation:
   - 5-fold CV, stratified
   - Hyperparameter tuning
   - Calculate confidence intervals

7. Model evaluation:
   - AUC-ROC, AUC-PR
   - Calibration plots
   - Net reclassification improvement (NRI)
   - Compare to clinical models (Framingham, SCORE)

8. Interpretation:
   - Feature importance (permutation importance)
   - SHAP values for individuals
   - Identify most informative omics layer

9. Clinical utility:
   - Decision curve analysis
   - Risk stratification groups
   - Biomarker panel selection

Outputs:
- Model comparison table
- ROC curves for all models
- Feature importance per omics
- Reclassification table
- Clinical implementation recommendations
""")

Multi-Omics Integration

Multi-Omics Data Integration

agent.go("""
Integrate transcriptomics, proteomics, and metabolomics data:

Study: Drug response in cancer cell lines
Data:
- RNA-seq: 'data/transcriptomics.csv' (15000 genes × 50 cell lines)
- Proteomics: 'data/proteomics.csv' (3000 proteins × 50 cell lines)
- Metabolomics: 'data/metabolomics.csv' (200 metabolites × 50 cell lines)
- Drug response: 'data/drug_response.csv' (cell line, drug, IC50)

Goal: Identify multi-omics signatures of drug sensitivity

Analysis:
1. Data preprocessing:
   - Match samples across omics layers
   - Filter low-variance features per omics
   - Normalize each omics separately (z-score)

2. Integration methods (compare):

   **Method 1: MOFA (Multi-Omics Factor Analysis)**
   - Identify latent factors capturing variance across omics
   - Determine factor contributions per omics
   - Relate factors to drug response

   **Method 2: DIABLO (sparse PLS-DA)**
   - Supervised integration
   - Maximize covariance between omics and drug response
   - Select features from each omics layer

   **Method 3: Similarity Network Fusion (SNF)**
   - Build patient similarity networks per omics
   - Fuse networks
   - Cluster cell lines by integrated similarity

3. Association with drug response:
   - Correlation of factors/components with IC50
   - Identify drug-sensitive vs resistant groups
   - Multi-omics biomarkers

4. Network analysis:
   - Build multi-layer network:
     * Gene regulatory network (RNA)
     * Protein-protein interactions (proteins)
     * Gene-metabolite associations
   - Integrate layers
   - Find dysregulated pathways

5. Predictive modeling:
   - Train model predicting drug response from multi-omics
   - Compare: using all omics vs individual omics
   - Feature selection across omics

6. Biological interpretation:
   - Map features to pathways
   - Identify mechanism of drug action
   - Suggest combination therapies

7. Validation:
   - Leave-one-out cross-validation
   - Test in independent cell line panel

Outputs:
- Factor loadings per omics (MOFA)
- Multi-omics biomarker signature
- Integrated network visualization
- Predictive model of drug response
- Mechanistic hypotheses
""")

Proteomics Analysis

Label-Free Quantitative Proteomics

agent.go("""
Analyze label-free proteomics data from mass spectrometry:

Study: Comparison of normal vs diseased tissue (n=6 per group)
Data: MaxQuant output
- 'data/proteinGroups.txt' (MaxQuant protein quantification)
- 'data/peptides.txt' (peptide-level data)

Experimental design:
- 6 normal samples
- 6 disease samples
- Label-free quantification (LFQ), 3 fractions per sample

Analysis:
1. Data loading and QC:
   - Load proteinGroups.txt
   - Remove contaminants, reverse hits
   - Filter: valid values in ≥50% of samples per group
   - Check sample correlations and outliers
   - PCA for quality assessment

2. Imputation:
   - Impute missing values (MAR vs MNAR approach)
   - Use MinProb for low-abundance missing values
   - Use kNN for random missing values

3. Normalization:
   - Median normalization
   - Or: VSN (variance stabilizing normalization)

4. Differential expression:
   - Two-sample t-test (for each protein)
   - Moderated t-test (limma)
   - Filter: |log2FC| > 0.58 (~1.5-fold), adj.p < 0.05

5. Visualization:
   - Volcano plot
   - Heatmap of significant proteins
   - PCA colored by condition
   - Intensity distributions (before/after normalization)

6. Functional enrichment:
   - GO enrichment (up and down separately)
   - KEGG pathways
   - Reactome pathways
   - STRING PPI network of DEPs

7. PTM analysis (if available):
   - Phosphorylation site analysis
   - Kinase enrichment analysis (KEA3)

8. Orthogonal validation:
   - Compare to RNA-seq data (if available)
   - Protein-RNA correlation
   - Identify discordant genes

9. Biomarker candidates:
   - Rank proteins by fold-change and significance
   - Filter for secreted proteins (potential biomarkers)
   - Check if targetable (druggable)

Outputs:
- Differential abundance table
- QC report with plots
- Enrichment analysis results
- PPI network of DEPs
- Candidate biomarkers list
""")

Biomarker Discovery

Diagnostic Biomarker Discovery

agent.go("""
Discover diagnostic biomarkers for early cancer detection:

Study: Plasma proteomics comparing early-stage cancer vs healthy controls
Data:
- 'data/proteomics.csv' (1000 proteins × 200 samples)
- 'data/metadata.csv' (sample_id, group [cancer/healthy], age, sex)

Groups:
- Early-stage cancer: 100 samples
- Healthy controls: 100 samples

Goal: Identify protein panel for early detection (target AUC > 0.90)

Workflow:
1. Exploratory analysis:
   - PCA, tSNE to visualize separation
   - Univariate differential abundance
   - Volcano plot

2. Feature selection:
   - Rank proteins by:
     * Fold change
     * Statistical significance (t-test, Mann-Whitney)
     * AUC (each protein individually)
   - Select proteins with AUC > 0.70

3. Biomarker panel construction:
   - Correlation analysis (remove redundant markers)
   - Forward selection:
     * Start with best single marker
     * Add markers improving panel performance
     * Stop when no improvement
   - Aim for a 5-10 marker panel (practical for assay development)

4. Model building:
   - Logistic regression on selected panel
   - Calculate combined risk score
   - Cross-validation (10-fold)

5. Performance evaluation:
   - AUC-ROC, AUC-PR
   - Sensitivity/specificity at different thresholds
   - Clinical decision threshold (e.g., 90% sensitivity)
   - Calibration plot

6. Biological validation:
   - Literature support for cancer association
   - Expression in tumor vs blood
   - Mechanism of release/shedding

7. Clinical utility:
   - Compare to existing biomarkers (CEA, CA19-9, etc.)
   - Cost-effectiveness consideration
   - Assay feasibility (ELISA, MRM, etc.)

8. Independent validation plan:
   - Power calculation for validation cohort
   - Suggested sample size
   - Pre-analytical variables to control

Outputs:
- Ranked protein list with individual performance
- Final biomarker panel
- Logistic regression model
- ROC curves (individual + panel)
- Clinical characteristics table
- Validation study protocol
""")

Additional Advanced Examples

Spatial Transcriptomics Analysis

agent.go("""
Analyze Visium spatial transcriptomics data:

Data: 'data/visium_brain_tumor.h5ad'
- Contains spatial coordinates and gene expression
- Tissue: Brain tumor biopsy

Analysis:
1. Data QC and normalization:
   - Filter low-quality spots (total counts, detected genes)
   - Normalize, log-transform
   - Calculate spatial statistics

2. Spatial clustering:
   - Graph-based clustering considering spatial proximity
   - Identify tumor regions, stroma, necrosis, etc.

3. Spatially variable genes:
   - Test for spatial patterns (Moran's I, SpatialDE)
   - Identify genes with spatial gradients

4. Deconvolution:
   - Estimate cell type composition per spot
   - Use scRNA-seq reference (if available)
   - Methods: Cell2location, RCTD, SPOTlight

5. Niche analysis:
   - Define tissue niches by cell type composition
   - Identify tumor-stroma interface
   - Analyze cell-cell interactions

6. Spatial pathway analysis:
   - Map pathway activity onto tissue
   - Identify spatially localized processes

7. Visualization:
   - Spatial plots colored by cluster, gene expression
   - Cell type composition maps
   - Pathway activity maps

Output:
- Annotated spatial data object
- Spatially variable gene list
- Cell type composition per spot
- Niche definitions and cell-cell interactions
""")

Tips for Effective Task Specification

1. Be Specific About Data Formats and Locations

Good:

agent.go("Analyze scRNA-seq data in AnnData format at 'data/experiment1.h5ad'")

Vague:

agent.go("Analyze my data")

2. Specify Analysis Parameters

Good:

agent.go("""
Cluster cells using Leiden algorithm with resolution 0.5,
k-neighbors=10, using PCA components 1-30
""")

Vague:

agent.go("Cluster the cells")

3. Request Specific Outputs

Good:

agent.go("""
... and save results as:
- CSV table with statistics
- PNG figures at 300 DPI
- Processed data as AnnData at 'results/processed.h5ad'
""")

Vague:

agent.go("... and save the results")

4. Provide Biological Context

Good:

agent.go("""
This is a drug treatment experiment. Compare vehicle vs treated groups
to identify drug-induced transcriptional changes. Focus on apoptosis and
cell cycle pathways.
""")

Vague:

agent.go("Compare the two groups")

5. Break Complex Analyses into Steps

Good:

# Step 1
agent.go("Load and QC the data, save QC metrics")

# Step 2
agent.go("Based on QC, normalize and find HVGs")

# Step 3
agent.go("Cluster and annotate cell types")

Overwhelming:

agent.go("Do a complete scRNA-seq analysis pipeline")