# Biomni Task Examples
A comprehensive collection of biomedical task examples with code patterns and best practices.
## Table of Contents
1. [Single-Cell RNA-seq Analysis](#single-cell-rna-seq-analysis)
2. [CRISPR Screening](#crispr-screening)
3. [Genomic Analysis (GWAS, Variant Calling)](#genomic-analysis)
4. [Protein Structure and Function](#protein-structure-and-function)
5. [Drug Discovery and ADMET](#drug-discovery-and-admet)
6. [Pathway and Network Analysis](#pathway-and-network-analysis)
7. [Disease Classification](#disease-classification)
8. [Multi-Omics Integration](#multi-omics-integration)
9. [Proteomics Analysis](#proteomics-analysis)
10. [Biomarker Discovery](#biomarker-discovery)
---
## Single-Cell RNA-seq Analysis
### Basic scRNA-seq Pipeline
```python
from biomni.agent import A1
agent = A1(path='./data', llm='claude-sonnet-4-20250514')
agent.go("""
Analyze the 10X Genomics scRNA-seq dataset located at 'data/pbmc_10k.h5ad'.
Workflow:
1. Load the data and perform QC:
- Filter cells with <200 genes or >5000 genes
- Filter cells with >10% mitochondrial reads
- Filter genes expressed in <3 cells
2. Normalize and identify highly variable genes:
- Use SCTransform or standard log-normalization
- Identify top 2000 HVGs
3. Dimensionality reduction:
- PCA (50 components)
- UMAP for visualization
4. Clustering:
- Find neighbors (k=10)
- Leiden clustering with resolution 0.5
5. Visualization:
- UMAP colored by cluster
- QC metrics on UMAP
Save processed data as 'results/pbmc_processed.h5ad'
""")
```
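Behind the scenes, the agent translates step 1 into filtering calls (typically via scanpy). The thresholds can be sketched in plain Python; the dict layout and `passes_qc` helper below are illustrative, not Biomni API:

```python
# Minimal sketch of the QC thresholds in step 1, applied to per-cell
# summary metrics rather than a full AnnData object.

def passes_qc(n_genes, pct_mito,
              min_genes=200, max_genes=5000, max_pct_mito=10.0):
    """Return True if a cell passes the gene-count and mito-fraction filters."""
    return min_genes <= n_genes <= max_genes and pct_mito <= max_pct_mito

def filter_cells(cells):
    """cells: list of dicts with 'n_genes' and 'pct_mito' keys."""
    return [c for c in cells if passes_qc(c["n_genes"], c["pct_mito"])]

cells = [
    {"id": "A", "n_genes": 1500, "pct_mito": 3.2},   # keep
    {"id": "B", "n_genes": 150,  "pct_mito": 2.0},   # too few genes
    {"id": "C", "n_genes": 6200, "pct_mito": 4.0},   # likely doublet
    {"id": "D", "n_genes": 2100, "pct_mito": 18.5},  # dying cell
]
kept = filter_cells(cells)
print([c["id"] for c in kept])  # → ['A']
```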
### Cell Type Annotation
```python
agent.go("""
Using the processed PBMC data at 'results/pbmc_processed.h5ad':
1. Find marker genes for each cluster:
- Wilcoxon rank-sum test
- Log fold change > 0.5
- Adjusted p-value < 0.01
- Present in >25% of cluster cells
2. Annotate cell types using markers:
- T cells: CD3D, CD3E, CD3G
- B cells: CD19, MS4A1 (CD20)
- NK cells: GNLY, NKG7, NCAM1
- Monocytes: CD14, LYZ, CD68
- Dendritic cells: FCER1A, CD1C
3. Create visualization:
- UMAP with cell type labels
- Dotplot of marker genes by cell type
- Proportion of cell types (bar plot)
4. Save annotated data with cell types
""")
```
### Differential Expression Between Conditions
```python
agent.go("""
Compare gene expression between stimulated and control conditions:
Data: 'data/immune_stim_experiment.h5ad' (contains 'condition' metadata)
Analysis:
1. Subset to T cells only (cell_type == 'T cell')
2. Differential expression between stim vs control:
- Use pseudobulk approach (aggregate by donor + condition)
- DESeq2 or edgeR for statistical testing
- Filter: |log2FC| > 1, padj < 0.05
3. Pathway enrichment on DEGs:
- Use GO biological processes
- Use KEGG pathways
- Run enrichment analysis with gprofiler or enrichr
4. Visualization:
- Volcano plot of DEGs
- Heatmap of top 50 DEGs
- Bar plot of top enriched pathways
5. Export results table with gene symbols, log2FC, p-values, and pathway annotations
""")
```
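The pseudobulk step in the prompt above sums raw counts per donor-condition pair before handing the aggregates to DESeq2/edgeR. A minimal sketch, with a hypothetical tuple layout for the per-cell counts:

```python
# Sum raw counts per (donor, condition) group — the pseudobulk aggregation
# that precedes bulk-style differential expression testing.
from collections import defaultdict

def pseudobulk(cells):
    """cells: list of (donor, condition, {gene: count}) tuples.
    Returns {(donor, condition): {gene: summed_count}}."""
    agg = defaultdict(lambda: defaultdict(int))
    for donor, cond, counts in cells:
        for gene, n in counts.items():
            agg[(donor, cond)][gene] += n
    return {k: dict(v) for k, v in agg.items()}

cells = [
    ("d1", "stim", {"IFIT1": 4, "ACTB": 10}),
    ("d1", "stim", {"IFIT1": 6, "ACTB": 12}),
    ("d1", "ctrl", {"IFIT1": 1, "ACTB": 11}),
]
pb = pseudobulk(cells)
print(pb[("d1", "stim")])  # → {'IFIT1': 10, 'ACTB': 22}
```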
### Trajectory Analysis
```python
agent.go("""
Perform pseudotime trajectory analysis on hematopoietic differentiation data:
Data: 'data/hematopoiesis.h5ad'
Steps:
1. Subset to progenitor and mature cell types:
- HSC, MPP, GMP, Monocytes, Neutrophils
2. Run trajectory inference:
- Use PAGA or Monocle3
- Set HSC as root cell type
3. Calculate pseudotime for all cells
4. Identify trajectory-associated genes:
- Genes that change along pseudotime
- Statistical test with FDR < 0.05
- Cluster genes by expression pattern (early, middle, late)
5. Visualization:
- UMAP colored by pseudotime
- Heatmap of trajectory genes
- Gene expression along pseudotime for key TFs
6. Functional analysis:
- GO enrichment for early/middle/late gene clusters
""")
```
### Integration of Multiple Datasets
```python
agent.go("""
Integrate three scRNA-seq datasets from different batches:
Data files:
- 'data/batch1_pbmc.h5ad'
- 'data/batch2_pbmc.h5ad'
- 'data/batch3_pbmc.h5ad'
Integration workflow:
1. Load all three datasets
2. Perform individual QC on each batch:
- Same filters as standard QC
- Note batch-specific statistics
3. Integration using Harmony or Scanorama:
- Concatenate datasets
- Identify HVGs on combined data
- Run batch correction
- Verify batch mixing with LISI score
4. Re-cluster integrated data:
- Use corrected embeddings
- Leiden clustering
5. Cell type annotation on integrated data
6. Visualization:
- UMAP split by batch (before/after correction)
- UMAP colored by cell type
- Batch mixing statistics
7. Save integrated dataset
""")
```
---
## CRISPR Screening
### Guide RNA Design
```python
agent.go("""
Design guide RNAs for CRISPR knockout screening of cell cycle genes:
Target genes:
- CDK1, CDK2, CDK4, CDK6
- CCNA2, CCNB1, CCND1, CCNE1
- TP53, RB1, MYC
Requirements:
1. Design 4-6 guides per gene targeting early exons
2. For each guide, evaluate:
- On-target efficiency score (Doench 2016)
- Off-target potential (CFD score < 0.3)
- Avoid common SNPs (1000 Genomes)
3. Add control guides:
- 100 non-targeting controls
- 20 positive controls (essential genes)
4. Output:
- Table with: gene, guide_sequence, PAM, position, on_target_score, off_target_count
- Sequences in format for oligonucleotide ordering
- Visual summary of guide distribution per gene
Reference genome: hg38
""")
```
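The core of guide design is enumerating protospacers next to an NGG PAM; scoring (Doench 2016, CFD) is left to dedicated tools. A toy plus-strand-only enumerator, with an illustrative demo sequence:

```python
# Scan a sequence for SpCas9 NGG PAMs and report the 20-nt protospacer
# upstream of each, plus the approximate cut site (~3 bp 5' of the PAM).

def find_guides(seq, guide_len=20):
    """Return (protospacer, PAM, cut_position) for each + strand NGG PAM."""
    seq = seq.upper()
    guides = []
    for i in range(guide_len, len(seq) - 2):
        pam = seq[i:i + 3]
        if pam[1:] == "GG":                              # NGG PAM
            guides.append((seq[i - guide_len:i], pam, i - 3))
    return guides

demo = "A" * 20 + "TGG" + "C" * 5
print(find_guides(demo))  # → [('AAAAAAAAAAAAAAAAAAAA', 'TGG', 17)]
```

A real design run would also scan the reverse strand and filter against SNPs and off-targets, as the prompt specifies.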
### CRISPR Screen Analysis
```python
agent.go("""
Analyze data from a genome-wide CRISPR knockout screen:
Data: 'data/crispr_screen_counts.csv'
- Columns: guide_id, gene, sample_T0, sample_T15, replicate
- ~80,000 guides targeting ~18,000 genes
Analysis:
1. Quality control:
- Guide representation (reads per guide)
- Sample correlation
- Remove guides with <30 reads in T0
2. Normalize counts:
- Reads per million (RPM)
- Log2 fold change (T15 vs T0)
3. Statistical analysis using MAGeCK:
- Identify significantly depleted/enriched genes
- FDR < 0.05
- Rank genes by robust rank aggregation (RRA)
4. Functional analysis:
- Pathway enrichment of hit genes
- Known vs novel essential genes
- Correlation with Cancer Dependency Map
5. Visualization:
- Scatterplot: log2FC vs -log10(FDR)
- Heatmap: top 50 depleted genes across replicates
- Network: PPI network of hit genes
6. Export:
- Ranked gene list with statistics
- Enriched pathways table
""")
```
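Step 2 of the prompt above (RPM normalization, then log2 fold change with a pseudocount) can be sketched directly; the two-guide example is illustrative:

```python
# Reads-per-million normalization and per-guide log2 fold change,
# as typically computed before handing counts to MAGeCK.
import math

def rpm(counts):
    total = sum(counts.values())
    return {g: c * 1e6 / total for g, c in counts.items()}

def log2fc(t0, t15, pseudo=1.0):
    n0, n15 = rpm(t0), rpm(t15)
    return {g: math.log2((n15[g] + pseudo) / (n0[g] + pseudo)) for g in t0}

t0  = {"gA_1": 500, "gB_1": 500}
t15 = {"gA_1": 100, "gB_1": 900}
fc = log2fc(t0, t15)
print(round(fc["gA_1"], 2))  # depleted guide → about -2.32
```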
### Pooled Optical Screening Analysis
```python
agent.go("""
Analyze pooled CRISPR screen with imaging readout (e.g., Cell Painting):
Data structure:
- 'data/guide_assignments.csv': cell_id, guide_id, gene
- 'data/morphology_features.csv': cell_id, feature_1...feature_500
Analysis:
1. Feature preprocessing:
- Remove low-variance features
- Normalize features (z-score per plate)
- PCA for dimensionality reduction
2. Associate phenotypes with perturbations:
- Aggregate cells by guide (mean/median)
- Calculate morphological distance from controls
- Statistical test for phenotype change
3. Identify phenotype-altering genes:
- Mahalanobis distance from control distribution
- Bonferroni correction for multiple testing
- Effect size threshold
4. Cluster genes by phenotype similarity:
- Hierarchical clustering of gene profiles
- Identify phenotype classes
5. Validation and interpretation:
- Compare to known gene functions
- Pathway enrichment per phenotype cluster
6. Visualization:
- UMAP of all perturbations
- Heatmap of gene clusters × morphology features
- Representative images for each cluster
""")
```
---
## Genomic Analysis
### GWAS Analysis
```python
agent.go("""
Perform genome-wide association study for Type 2 Diabetes:
Data:
- 'data/genotypes.bed' (PLINK format, 500K SNPs, 5000 cases, 5000 controls)
- 'data/phenotypes.txt' (sample_id, T2D_status, age, sex, BMI, ancestry_PCs)
Workflow:
1. Quality control:
- SNP QC: MAF > 0.01, HWE p > 1e-6, genotyping rate > 0.95
- Sample QC: genotyping rate > 0.95, heterozygosity check
- Remove related individuals (kinship > 0.125)
2. Association testing:
- Logistic regression: T2D ~ SNP + age + sex + BMI + PC1-10
- Genome-wide significance threshold: p < 5e-8
- Suggestive threshold: p < 1e-5
3. Post-GWAS analysis:
- LD clumping (r² > 0.1, 500kb window)
- Annotate lead SNPs with nearby genes (±100kb)
- Query GWAS Catalog for known associations
4. Functional annotation:
- Overlap with regulatory elements (ENCODE)
- eQTL colocalization (GTEx)
- GWAS prioritization scores (PoPS, ABC)
5. Visualization:
- Manhattan plot
- QQ plot
- Regional association plots for top loci
- Locus zoom plots
6. Heritability and genetic correlation:
- SNP heritability (LDSC)
- Genetic correlation with related traits
Export summary statistics for meta-analysis
""")
```
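The HWE filter in step 1 is a chi-square goodness-of-fit test per SNP. For one degree of freedom the survival function reduces to `erfc(sqrt(x/2))`, so it needs only the standard library:

```python
# Hardy-Weinberg equilibrium chi-square test for one biallelic SNP.
import math

def hwe_pvalue(n_AA, n_Aa, n_aa):
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # allele frequency of A
    q = 1 - p
    exp = (n * p * p, 2 * n * p * q, n * q * q)
    obs = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(obs, exp))
    return math.erfc(math.sqrt(chi2 / 2))    # 1-df chi-square survival

# Genotypes in perfect HWE (p = 0.5): observed 25/50/25 matches expectation
print(hwe_pvalue(25, 50, 25))  # → 1.0
```

SNPs with `hwe_pvalue(...) < 1e-6` would be dropped per the QC thresholds above (PLINK uses an exact test, but the chi-square version conveys the idea).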
### Whole Exome Sequencing Analysis
```python
agent.go("""
Analyze whole exome sequencing data for rare disease diagnosis:
Data: Family trio (proband, mother, father)
- 'data/proband.bam'
- 'data/mother.bam'
- 'data/father.bam'
Phenotype: Developmental delay, seizures, intellectual disability
Pipeline:
1. Variant calling:
- GATK HaplotypeCaller on each sample
- Joint genotyping across trio
- VQSR filtering (SNPs and indels separately)
2. Variant annotation:
- Functional consequence (VEP or ANNOVAR)
- Population frequencies (gnomAD)
- Pathogenicity predictions (CADD, REVEL, SpliceAI)
- Disease databases (ClinVar, OMIM)
3. Inheritance analysis:
- De novo variants (absent in both parents)
- Compound heterozygous variants
- Rare homozygous variants (autozygosity)
- X-linked variants (if proband is male)
4. Filtering strategy:
- Population AF < 0.001 (gnomAD)
- High-quality variants (GQ > 20, DP > 10)
- Loss-of-function or missense with CADD > 20
- Match phenotype to gene function (HPO terms)
5. Prioritization:
- Known disease genes for phenotype
- De novo in intolerant genes (pLI > 0.9)
- Protein-truncating variants
6. Report:
- Top candidate variants with evidence
- Gene function and disease association
- Segregation analysis
- Recommended validation (Sanger sequencing)
- ACMG variant classification
Save VCF with annotations and prioritized candidate list
""")
```
### Variant Calling from RNA-seq
```python
agent.go("""
Identify expressed variants from RNA-seq data:
Data: Tumor RNA-seq BAM file
- 'data/tumor_RNAseq.bam'
- Reference: hg38
Purpose: Identify expressed somatic mutations for neoantigen prediction
Steps:
1. Pre-processing:
- Mark duplicates (Picard)
- Split reads at junctions (GATK SplitNCigarReads)
- Base quality recalibration
2. Variant calling:
- GATK HaplotypeCaller (RNA-seq mode)
- Filter: DP > 10, AF > 0.05
3. Filtering artifacts:
- Remove common SNPs (gnomAD AF > 0.001)
- Filter intronic/intergenic variants
- Remove known RNA editing sites (RADAR database)
- Panel of normals (if available)
4. Annotation:
- Functional impact (VEP)
- Identify non-synonymous variants
- Predict MHC binding (NetMHCpan for patient HLA type)
5. Prioritize neoantigens:
- Strong MHC binding (IC50 < 500nM)
- High expression (TPM > 5)
- High variant allele frequency
6. Output:
- Annotated VCF
- Neoantigen candidates table
- Peptide sequences for validation
Note: this analysis requires the patient's HLA typing data.
""")
```
---
## Protein Structure and Function
### Protein Structure Prediction and Analysis
```python
agent.go("""
Predict and analyze structure for novel protein sequence:
Sequence (FASTA format):
>Novel_Kinase_Domain
MKLLVVDDDGVADYSKRDGAFMVAYCIEPGDG...
Tasks:
1. Structure prediction:
- Use AlphaFold2 or ESMFold
- Generate 5 models, rank by confidence
2. Quality assessment:
- pLDDT scores (per-residue confidence)
- pTM score (global confidence)
- Identify low-confidence regions
3. Domain identification:
- InterProScan for domain architecture
- Pfam domain search
- Identify catalytic residues
4. Functional site prediction:
- Active site prediction
- Substrate binding pocket identification
- Post-translational modification sites
5. Structural alignment:
- Search for similar structures (PDB)
- Align to close homologs
- Identify conserved structural motifs
6. Mutation analysis:
- Known disease mutations in homologs
- Predict impact on structure (Rosetta ddG)
7. Visualization and output:
- PyMOL/Chimera visualization scripts
- Structural alignment figures
- Annotated PDB file with functional sites
- Summary report with predictions
""")
```
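Step 2's "identify low-confidence regions" is a run-finding pass over the per-residue pLDDT track (AlphaFold conventionally treats <50 as very low and <70 as low confidence). A minimal sketch with made-up scores:

```python
# Flag contiguous runs of residues whose pLDDT falls below a cutoff.

def low_confidence_regions(plddt, cutoff=70.0):
    """Return (start, end) index pairs (inclusive) of runs below cutoff."""
    regions, start = [], None
    for i, score in enumerate(plddt):
        if score < cutoff and start is None:
            start = i
        elif score >= cutoff and start is not None:
            regions.append((start, i - 1))
            start = None
    if start is not None:
        regions.append((start, len(plddt) - 1))
    return regions

scores = [92, 95, 60, 55, 48, 88, 91, 65, 70]
print(low_confidence_regions(scores))  # → [(2, 4), (7, 7)]
```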
### Protein-Protein Interaction Prediction
```python
agent.go("""
Predict and validate protein-protein interactions:
Target protein: BRCA1
Species: Human
Analysis:
1. Literature-based interactions:
- Query BioGRID, STRING, IntAct databases
- Extract high-confidence interactors (score > 0.7)
2. Structure-based prediction:
- Predict BRCA1 structure (if not available)
- Dock with known interactors (BRCA2, BARD1, etc.)
- Score interfaces (PISA, PDBePISA)
3. Sequence-based prediction:
- Coevolution analysis (EVcouplings)
- Domain-domain interaction prediction
- Linear motif search (ELM database)
4. Functional analysis of interactors:
- GO enrichment analysis
- KEGG pathway membership
- Tissue/cell type expression patterns
5. Network analysis:
- Build PPI network
- Identify network modules
- Central hub proteins
6. Experimental validation suggestions:
- Prioritize interactions for validation
- Suggest Co-IP or Y2H experiments
- Identify commercially available antibodies
7. Output:
- Ranked interaction list with evidence
- PPI network visualization
- Structural models of key interactions
""")
```
### Protein Engineering Design
```python
agent.go("""
Design improved enzyme variant with enhanced thermostability:
Target enzyme: TEM-1 β-lactamase
Goal: Increase melting temperature by >10°C while maintaining activity
Strategy:
1. Analyze current structure:
- Load PDB structure (1BTL)
- Identify flexible regions (B-factors)
- Find potential disulfide bond sites
2. Computational design:
- Rosetta design simulations
- Identify stabilizing mutations (ΔΔG < -1.0 kcal/mol)
- Avoid active site and substrate binding regions
3. Prioritize mutations:
- Surface entropy reduction (SER)
- Disulfide bond introduction
- Salt bridge formation
- Hydrophobic core packing
4. Check conservation:
- Multiple sequence alignment of β-lactamases
- Avoid highly conserved positions
- Prefer positions with natural variation
5. Design library:
- Rank top 20 single mutants
- Design 5 combinatorial variants (2-3 mutations)
- Ensure codon optimization for E. coli
6. Validation plan:
- Expression and purification protocol
- Thermal shift assay (DSF)
- Activity assay (nitrocefin)
- Recommended high-throughput screening
7. Output:
- Ranked mutation list with predicted ΔΔG
- Structural visualizations
- Codon-optimized sequences
- Cloning primers
- Experimental validation protocol
""")
```
---
## Drug Discovery and ADMET
### Virtual Screening
```python
agent.go("""
Perform virtual screening for SARS-CoV-2 Mpro inhibitors:
Target: SARS-CoV-2 Main protease (Mpro)
Crystal structure: PDB 6LU7
Compound library: ZINC15 drug-like subset (~100K compounds)
File: 'data/zinc_druglike_100k.smi' (SMILES format)
Workflow:
1. Protein preparation:
- Remove crystallographic waters (keep catalytic waters)
- Add hydrogens, optimize H-bond network
- Define binding site (residues within 5Å of native ligand)
2. Ligand preparation:
- Generate 3D coordinates from SMILES
- Enumerate tautomers and protonation states
- Energy minimization
3. Molecular docking:
- Dock all compounds (AutoDock Vina or Glide)
- Generate top 3 poses per compound
- Score binding affinity
4. Consensus scoring:
- Combine multiple scoring functions
- Rank compounds by consensus score
5. ADMET filtering:
- Lipinski's rule of 5
- BBB permeability (not needed for this target)
- hERG liability (pIC50 > 5)
- CYP450 inhibition prediction
- Toxicity prediction (Tox21)
6. Visual inspection:
- Top 100 compounds
- Check key interactions (His41, Cys145 catalytic dyad)
- Remove PAINS and frequent hitters
7. Final selection:
- Top 20 compounds for experimental testing
- Cluster by scaffold diversity
8. Output:
- Ranked compound list with scores and ADMET properties
- Docking poses (mol2 or PDB format)
- 2D interaction diagrams
- Purchase availability from vendors
""")
```
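The Lipinski filter in step 5 counts violations of four cutoffs (MW ≤ 500, LogP ≤ 5, H-bond donors ≤ 5, acceptors ≤ 10). In practice the descriptors come from RDKit; the values below are illustrative:

```python
# Lipinski rule-of-five filter over precomputed molecular descriptors.

def lipinski_violations(mw, logp, hbd, hba):
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])

def passes_ro5(desc, max_violations=1):
    return lipinski_violations(**desc) <= max_violations

drug_like = {"mw": 320.4, "logp": 2.1, "hbd": 2, "hba": 5}
greasy    = {"mw": 612.8, "logp": 6.7, "hbd": 1, "hba": 4}
print(passes_ro5(drug_like), passes_ro5(greasy))  # → True False
```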
### ADMET Property Prediction
```python
agent.go("""
Predict ADMET properties for drug candidate series:
Input: 'data/compound_series.smi' (25 analogs, SMILES format)
Lead scaffold: Novel kinase inhibitor series
Properties to predict:
1. Absorption:
- Caco-2 permeability
- Human intestinal absorption (HIA)
- P-glycoprotein substrate
2. Distribution:
- Plasma protein binding (% bound)
- Volume of distribution (VDss)
- Blood-brain barrier permeability (LogBB)
3. Metabolism:
- CYP450 substrate (1A2, 2C9, 2C19, 2D6, 3A4)
- CYP450 inhibition (same isoforms)
- Sites of metabolism (SOM prediction)
4. Excretion:
- Clearance estimation
- Half-life prediction
- Renal excretion likelihood
5. Toxicity:
- hERG inhibition (cardiotoxicity)
- AMES mutagenicity
- Hepatotoxicity
- Skin sensitization
- Rat acute toxicity (LD50)
6. Drug-likeness:
- Lipinski's Ro5
- QED score
- Synthetic accessibility
Analysis:
- Compare all analogs in the series
- Structure-property relationships
- Identify best balanced compound
- Suggest modifications for improvement
Output:
- Comprehensive ADMET table
- Radar plots for each compound
- SAR analysis for each property
- Recommendations for next design iteration
""")
```
### Lead Optimization
```python
agent.go("""
Optimize lead compound balancing potency and selectivity:
Current lead:
- IC50 (target kinase): 50 nM
- IC50 (off-target kinases): 100-500 nM (poor selectivity)
- Microsomal stability: t1/2 = 20 min (too short)
- Solubility: 5 μM (low)
Goal: Maintain potency, improve selectivity (>100x), improve PK properties
Strategy:
1. Analyze current binding mode:
- Docking to target and off-targets
- Identify selectivity-determining residues
- Map interaction hotspots
2. Design focused library:
- Modifications to improve selectivity:
* Target residues unique to on-target
* Avoid conserved kinase regions
- Modifications to improve solubility:
* Add polar groups to solvent-exposed regions
* Replace lipophilic groups
- Modifications to improve metabolic stability:
* Block metabolically labile positions
* Replace metabolically unstable groups
3. Virtual enumeration:
- Generate ~200 analogs
- Predict binding affinity (docking)
- Predict ADMET properties
4. Multi-parameter optimization:
- Calculate MPO score (potency + selectivity + ADMET)
- Pareto optimization
- Select top 20 compounds
5. Clustering and diversity:
- Ensure structural diversity
- Test different modification strategies
6. Synthetic feasibility:
- Retrosynthetic analysis
- Flag difficult syntheses
- Prioritize 10 compounds for synthesis
7. Deliverables:
- Ranked compound designs
- Predicted properties table
- Binding mode visualizations
- Synthetic routes
- Recommended testing cascade
""")
```
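The MPO score in step 4 combines each property into a 0-1 desirability and averages them. The weights, cutoffs, and property set below are assumptions chosen to match the lead's stated liabilities, not a standard scheme:

```python
# Toy multi-parameter optimization score: map each property onto a
# 0-1 desirability, then average. All scaling choices are illustrative.

def clamp01(x):
    return max(0.0, min(1.0, x))

def mpo_score(pic50, selectivity_fold, sol_um, t_half_min):
    d_potency = clamp01((pic50 - 6.0) / 3.0)        # pIC50 6 → 0, 9 → 1
    d_select  = clamp01(selectivity_fold / 100.0)   # 100x → 1
    d_sol     = clamp01(sol_um / 50.0)              # 50 uM → 1
    d_stab    = clamp01(t_half_min / 60.0)          # 60 min t1/2 → 1
    return (d_potency + d_select + d_sol + d_stab) / 4.0

lead   = mpo_score(pic50=7.3, selectivity_fold=5,   sol_um=5,  t_half_min=20)
analog = mpo_score(pic50=7.1, selectivity_fold=150, sol_um=40, t_half_min=55)
print(analog > lead)  # slightly less potent analog wins on balance → True
```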
---
## Pathway and Network Analysis
### Pathway Enrichment Analysis
```python
agent.go("""
Perform comprehensive pathway enrichment on differentially expressed genes:
Input: 'data/DEGs.csv'
Columns: gene_symbol, log2FC, padj
Significant DEGs: padj < 0.05, |log2FC| > 1
Total: 450 upregulated, 380 downregulated genes
Background: all detected genes in the experiment (~15,000)
Analysis:
1. GO enrichment (biological processes):
- Test upregulated and downregulated genes separately
- Use hypergeometric test
- FDR correction (Benjamini-Hochberg)
- Filter: padj < 0.05, fold enrichment > 2
2. KEGG pathway enrichment:
- Same approach as GO
- Focus on signaling and metabolic pathways
3. Reactome pathway enrichment:
- More detailed pathway hierarchy
4. Disease association:
- DisGeNET disease enrichment
- Compare to disease gene signatures (MSigDB)
5. Transcription factor enrichment:
- Predict upstream regulators (ChEA3)
- ENCODE ChIP-seq enrichment
6. Drug/compound perturbations:
- L1000 connectivity map
- Identify drugs that reverse/mimic signature
7. Cross-pathway analysis:
- Pathway crosstalk
- Hierarchical clustering of pathways by gene overlap
- Network visualization of enriched pathways
8. Visualization:
- Dot plots (GO, KEGG, Reactome)
- Enrichment map network
- Chord diagram (genes-pathways)
- Treemap of hierarchical GO terms
9. Export:
- All enrichment tables
- Pathway gene lists
- Interactive HTML report
""")
```
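The hypergeometric test named in step 1 asks: given a DEG list of size n drawn from a background of N genes, how surprising is an overlap of k genes with a pathway of size K? A stdlib-only sketch using `math.comb`:

```python
# One-sided hypergeometric enrichment p-value: P(X >= k) for
# X ~ Hypergeom(N, K, n). Exact integer arithmetic via math.comb.
import math

def hypergeom_pval(k, n, K, N):
    denom = math.comb(N, n)
    return sum(math.comb(K, i) * math.comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / denom

# 8 of 50 DEGs hit a 100-gene pathway in a 15,000-gene background
# (expected overlap is only ~0.33 genes)
p = hypergeom_pval(k=8, n=50, K=100, N=15000)
print(p < 0.05)  # → True, strongly enriched
```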
### Protein-Protein Interaction Network
```python
agent.go("""
Build and analyze PPI network for Alzheimer's disease genes:
Seed genes: Known AD risk genes (APP, PSEN1, PSEN2, APOE, MAPT, etc.)
File: 'data/AD_seed_genes.txt'
Network construction:
1. Build network from seed genes:
- Query STRING database (confidence > 0.7)
- Include direct and second-degree interactors
- Maximum network size: 500 proteins
2. Network enrichment:
- Add disease associations (DisGeNET)
- Add tissue expression (GTEx - prioritize brain)
- Add functional annotations (GO, Reactome)
3. Network analysis:
- Calculate centrality measures:
* Degree centrality
* Betweenness centrality
* Eigenvector centrality
- Identify hub proteins
- Community detection (Louvain algorithm)
4. Module analysis:
- Functional enrichment per community
- Identify disease-relevant modules
- Key bridge proteins between modules
5. Druggability analysis:
- Identify druggable targets (DGIdb)
- Known drugs targeting network proteins
- Clinical trial status
6. Network perturbation:
- Simulate gene knockout
- Network robustness analysis
- Identify critical nodes
7. Visualization:
- Interactive network (Cytoscape format)
- Layout by module membership
- Color by centrality/expression
- Size by degree
8. Prioritization:
- Rank proteins by:
* Network centrality
* Brain expression
* Druggability
* Genetic evidence (GWAS)
- Top therapeutic targets
Output:
- Network file (graphML, SIF)
- Module membership table
- Prioritized target list
- Druggable targets with existing compounds
""")
```
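The hub-identification step reduces to degree centrality over the edge list. A stdlib sketch on a tiny illustrative subnetwork of the seed genes:

```python
# Degree centrality from an undirected edge list; the node with the
# highest value is the network hub.
from collections import Counter

def degree_centrality(edges):
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    n = len(deg)
    return {node: d / (n - 1) for node, d in deg.items()}

edges = [("APP", "APOE"), ("APP", "PSEN1"), ("APP", "MAPT"),
         ("PSEN1", "PSEN2"), ("APOE", "MAPT")]
cent = degree_centrality(edges)
hub = max(cent, key=cent.get)
print(hub)  # → APP
```

Betweenness and eigenvector centrality follow the same pattern but need a graph library (e.g. networkx) in practice.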
### Gene Regulatory Network Inference
```python
agent.go("""
Infer gene regulatory network from scRNA-seq data:
Data: 'data/development_timecourse.h5ad'
- Cells from 5 developmental timepoints
- 3000 HVGs quantified
Goal: Identify TF→target relationships during development
Methods:
1. Preprocessing:
- Select TFs (from TF census list)
- Select potential target genes (HVGs)
- Normalize expression
2. GRN inference using multiple methods:
- GENIE3 (random forest)
- SCENIC (motif-based)
- CellOracle (perturbation-based)
- Pearson/Spearman correlation (baseline)
3. Integrate predictions:
- Combine scores from multiple methods
- Weight by motif evidence (JASPAR)
- Filter low-confidence edges
4. Network refinement:
- Remove indirect edges (transitive reduction)
- Validate with ChIP-seq data (if available)
- Literature validation (TRRUST database)
5. Dynamic network analysis:
- TF activity per timepoint/cell state
- Identify stage-specific regulators
- Find regulatory switches
6. Downstream analysis:
- Master regulators (high out-degree)
- Regulatory cascades
- Feed-forward loops
- Coherent vs incoherent motifs
7. Experimental validation priorities:
- Rank TF→target edges for validation
- Suggest ChIP-seq or CUT&RUN experiments
- Suggest perturbation experiments (knockout/CRISPRi)
8. Visualization:
- Full GRN network (Cytoscape)
- Key TF subnetworks
- TF activity heatmap across development
- Sankey diagram of regulatory flow
Output:
- Edge list with confidence scores
- TF activity matrix
- Validated vs novel interactions
- Prioritized validation experiments
""")
```
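The correlation baseline in step 2 can be sketched without any ML machinery: keep a TF→target edge whenever the absolute Pearson correlation of their expression vectors clears a cutoff. The toy vectors below are illustrative stand-ins for real expression profiles:

```python
# Correlation-based GRN baseline: Pearson r between a TF and each
# candidate target, thresholded into an edge list.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def infer_edges(tf_expr, targets, cutoff=0.8):
    return [(gene, round(pearson(tf_expr, expr), 3))
            for gene, expr in targets.items()
            if abs(pearson(tf_expr, expr)) >= cutoff]

tf = [1, 2, 3, 4, 5]
targets = {"geneA": [2, 4, 6, 8, 10],      # perfectly correlated
           "geneB": [5, 1, 4, 2, 3]}       # uncorrelated noise
print(infer_edges(tf, targets))  # → [('geneA', 1.0)]
```

GENIE3 and SCENIC replace the correlation with regression importance and motif support, respectively, but the edge-list output has the same shape.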
---
## Disease Classification
### Cancer Type Classification from Gene Expression
```python
agent.go("""
Build multi-class classifier for cancer type prediction:
Data: TCGA pan-cancer RNA-seq data
- Training: 8000 samples across 33 cancer types
- Expression: 'data/tcga_expression.csv' (samples × genes)
- Labels: 'data/tcga_labels.csv' (sample_id, cancer_type)
Task: Classify tumor samples by cancer type
Pipeline:
1. Data preprocessing:
- Log2(TPM + 1) transformation
- Remove low-variance genes (variance < 0.1)
- Z-score normalization
2. Feature selection:
- Variance filtering (top 5000 genes)
- Univariate feature selection (ANOVA F-test)
- Select top 500 features
3. Train-test split:
- 80% train, 20% test
- Stratified by cancer type
4. Model training (compare multiple algorithms):
- Random Forest
- Gradient Boosting (XGBoost)
- Neural Network (MLP)
- Elastic Net logistic regression
5. Model evaluation:
- Accuracy, precision, recall per class
- Confusion matrix
- ROC curves (one-vs-rest)
- Feature importance ranking
6. Model interpretation:
- SHAP values for predictions
- Top predictive genes per cancer type
- Pathway enrichment of predictive features
7. Clinical validation:
- Test on independent dataset (if available)
- Analyze misclassifications
- Identify hard-to-classify subtypes
8. Deliverables:
- Trained model (pickle)
- Performance metrics report
- Feature importance table
- Confusion matrix heatmap
- Prediction script for new samples
""")
```
### Disease Risk Prediction from Multi-Omics
```python
agent.go("""
Develop integrative model predicting cardiovascular disease risk:
Data sources:
1. Genotypes: 'data/genotypes.csv' (500K SNPs, polygenic risk scores)
2. Clinical: 'data/clinical.csv' (age, sex, BMI, blood pressure, cholesterol)
3. Proteomics: 'data/proteomics.csv' (200 plasma proteins, Olink panel)
4. Metabolomics: 'data/metabolomics.csv' (150 metabolites)
Outcome: 10-year CVD incidence (binary)
- Cases: 800
- Controls: 3200
Approach:
1. Data preprocessing:
- Impute missing values (missForest)
- Transform skewed features (log/Box-Cox)
- Normalize each omics layer separately
2. Feature engineering:
- Calculate PRS from SNP data
- Interaction terms (age × metabolites, etc.)
- Metabolite ratios (known CVD markers)
3. Feature selection per omics:
- Lasso for each data type
- Select informative features
4. Integration strategies (compare):
- Early integration: concatenate all features
- Late integration: separate models, combine predictions
- Intermediate integration: Multi-omics factor analysis (MOFA)
5. Model development:
- Logistic regression (interpretable baseline)
- Random Forest
- Elastic Net
- Neural network with omics-specific layers
6. Cross-validation:
- 5-fold CV, stratified
- Hyperparameter tuning
- Calculate confidence intervals
7. Model evaluation:
- AUC-ROC, AUC-PR
- Calibration plots
- Net reclassification improvement (NRI)
- Compare to clinical models (Framingham, SCORE)
8. Interpretation:
- Feature importance (permutation importance)
- SHAP values for individuals
- Identify most informative omics layer
9. Clinical utility:
- Decision curve analysis
- Risk stratification groups
- Biomarker panel selection
Outputs:
- Model comparison table
- ROC curves all models
- Feature importance per omics
- Reclassification table
- Clinical implementation recommendations
""")
```
---
## Multi-Omics Integration
### Multi-Omics Data Integration
```python
agent.go("""
Integrate transcriptomics, proteomics, and metabolomics data:
Study: Drug response in cancer cell lines
Data:
- RNA-seq: 'data/transcriptomics.csv' (15000 genes × 50 cell lines)
- Proteomics: 'data/proteomics.csv' (3000 proteins × 50 cell lines)
- Metabolomics: 'data/metabolomics.csv' (200 metabolites × 50 cell lines)
- Drug response: 'data/drug_response.csv' (cell line, drug, IC50)
Goal: Identify multi-omics signatures of drug sensitivity
Analysis:
1. Data preprocessing:
- Match samples across omics layers
- Filter low-variance features per omics
- Normalize each omics separately (z-score)
2. Integration methods (compare):
**Method 1: MOFA (Multi-Omics Factor Analysis)**
- Identify latent factors capturing variance across omics
- Determine factor contributions per omics
- Relate factors to drug response
**Method 2: DIABLO (sparse PLS-DA)**
- Supervised integration
- Maximize covariance between omics and drug response
- Select features from each omics layer
**Method 3: Similarity Network Fusion (SNF)**
- Build patient similarity networks per omics
- Fuse networks
- Cluster cell lines by integrated similarity
3. Association with drug response:
- Correlation of factors/components with IC50
- Identify drug-sensitive vs resistant groups
- Multi-omics biomarkers
4. Network analysis:
- Build multi-layer network:
* Gene regulatory network (RNA)
* Protein-protein interactions (proteins)
* Gene-metabolite associations
- Integrate layers
- Find dysregulated pathways
5. Predictive modeling:
- Train model predicting drug response from multi-omics
- Compare: using all omics vs individual omics
- Feature selection across omics
6. Biological interpretation:
- Map features to pathways
- Identify mechanism of drug action
- Suggest combination therapies
7. Validation:
- Leave-one-out cross-validation
- Test in independent cell line panel
Outputs:
- Factor loadings per omics (MOFA)
- Multi-omics biomarker signature
- Integrated network visualization
- Predictive model of drug response
- Mechanistic hypotheses
""")
```
---
## Proteomics Analysis
### Label-Free Quantitative Proteomics
```python
agent.go("""
Analyze label-free proteomics data from mass spectrometry:
Study: Comparison of normal vs diseased tissue (n=6 per group)
Data: MaxQuant output
- 'data/proteinGroups.txt' (MaxQuant protein quantification)
- 'data/peptides.txt' (peptide-level data)
Experimental design:
- 6 normal samples
- 6 disease samples
- Label-free (LFQ) acquisition, 3 fractions each
Analysis:
1. Data loading and QC:
- Load proteinGroups.txt
- Remove contaminants, reverse hits
- Filter: valid values in ≥50% of samples per group
- Check sample correlations and outliers
- PCA for quality assessment
2. Imputation:
- Impute missing values (MAR vs MNAR approach)
- Use MinProb for low-abundance missing values
- Use kNN for random missing values
3. Normalization:
- Median normalization
- Or: VSN (variance stabilizing normalization)
4. Differential expression:
- Two-sample t-test (for each protein)
- Moderated t-test (limma)
- Filter: |log2FC| > 0.58 (~1.5-fold), adj.p < 0.05
5. Visualization:
- Volcano plot
- Heatmap of significant proteins
- PCA colored by condition
- Intensity distributions (before/after normalization)
6. Functional enrichment:
- GO enrichment (up and down separately)
- KEGG pathways
- Reactome pathways
- STRING PPI network of DEPs
7. PTM analysis (if available):
- Phosphorylation site analysis
- Kinase enrichment analysis (KEA3)
8. Orthogonal validation:
- Compare to RNA-seq data (if available)
- Protein-RNA correlation
- Identify discordant genes
9. Biomarker candidates:
- Rank proteins by fold-change and significance
- Filter for secreted proteins (potential biomarkers)
- Check if targetable (druggable)
Outputs:
- Differential abundance table
- QC report with plots
- Enrichment analysis results
- PPI network of DEPs
- Candidate biomarkers list
""")
```
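Median normalization (step 3) shifts each sample's log-intensity distribution so all sample medians coincide with the global median. A stdlib sketch on two toy samples:

```python
# Median normalization of log-intensities: subtract each sample's median,
# add back the global median so values stay on the original scale.
import statistics

def median_normalize(samples):
    """samples: {sample_id: [log2 intensities]}. Returns a normalized copy."""
    medians = {s: statistics.median(v) for s, v in samples.items()}
    global_med = statistics.median(medians.values())
    return {s: [x - medians[s] + global_med for x in v]
            for s, v in samples.items()}

data = {"normal_1":  [20.0, 22.0, 24.0],   # median 22
        "disease_1": [21.0, 23.0, 25.0]}   # median 23
norm = median_normalize(data)
print(statistics.median(norm["normal_1"]),
      statistics.median(norm["disease_1"]))  # → 22.5 22.5
```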
---
## Biomarker Discovery
### Diagnostic Biomarker Discovery
```python
agent.go("""
Discover diagnostic biomarkers for early cancer detection:
Study: Plasma proteomics comparing early-stage cancer vs healthy controls
Data:
- 'data/proteomics.csv' (1000 proteins × 200 samples)
- 'data/metadata.csv' (sample_id, group [cancer/healthy], age, sex)
Groups:
- Early-stage cancer: 100 samples
- Healthy controls: 100 samples
Goal: Identify protein panel for early detection (target AUC > 0.90)
Workflow:
1. Exploratory analysis:
- PCA, tSNE to visualize separation
- Univariate differential abundance
- Volcano plot
2. Feature selection:
- Rank proteins by:
* Fold change
* Statistical significance (t-test, Mann-Whitney)
* AUC (each protein individually)
- Select proteins with AUC > 0.70
3. Biomarker panel construction:
- Correlation analysis (remove redundant markers)
- Forward selection:
* Start with best single marker
* Add markers improving panel performance
* Stop when no improvement
- Aim for 5-10 marker panel (practical for assay)
4. Model building:
- Logistic regression on selected panel
- Calculate combined risk score
- Cross-validation (10-fold)
5. Performance evaluation:
- AUC-ROC, AUC-PR
- Sensitivity/specificity at different thresholds
- Clinical decision threshold (e.g., 90% sensitivity)
- Calibration plot
6. Biological validation:
- Literature support for cancer association
- Expression in tumor vs blood
- Mechanism of release/shedding
7. Clinical utility:
- Compare to existing biomarkers (CEA, CA19-9, etc.)
- Cost-effectiveness consideration
- Assay feasibility (ELISA, MRM, etc.)
8. Independent validation plan:
- Power calculation for validation cohort
- Suggested sample size
- Pre-analytical variables to control
Outputs:
- Ranked protein list with individual performance
- Final biomarker panel
- Logistic regression model
- ROC curves (individual + panel)
- Clinical characteristics table
- Validation study protocol
""")
```
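The per-protein AUC screen in step 2 has a simple rank-based form: the AUC equals the probability that a randomly chosen case scores higher than a randomly chosen control (the normalized Mann-Whitney U statistic). An O(n·m) sketch, fine for illustration:

```python
# Rank-based AUC for a single candidate marker: fraction of (case, control)
# pairs where the case scores higher, counting ties as half.

def auc(case_scores, control_scores):
    wins = 0.0
    for c in case_scores:
        for h in control_scores:
            if c > h:
                wins += 1
            elif c == h:
                wins += 0.5
    return wins / (len(case_scores) * len(control_scores))

cases    = [3.2, 4.1, 5.0, 2.9]
controls = [1.0, 2.5, 3.0, 1.8]
marker_auc = auc(cases, controls)
print(marker_auc)  # → 0.9375
```

Markers with individual AUC > 0.70 would pass to the panel-construction step; the panel itself is then scored the same way on its combined logistic-regression output.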
---
## Additional Advanced Examples
### Spatial Transcriptomics Analysis
```python
agent.go("""
Analyze Visium spatial transcriptomics data:
Data: 'data/visium_brain_tumor.h5ad'
- Contains spatial coordinates and gene expression
- Tissue: Brain tumor biopsy
Analysis:
1. Data QC and normalization:
- Filter low-quality spots (total counts, detected genes)
- Normalize, log-transform
- Calculate spatial statistics
2. Spatial clustering:
- Graph-based clustering considering spatial proximity
- Identify tumor regions, stroma, necrosis, etc.
3. Spatially variable genes:
- Test for spatial patterns (Moran's I, SpatialDE)
- Identify genes with spatial gradients
4. Deconvolution:
- Estimate cell type composition per spot
- Use scRNA-seq reference (if available)
- Methods: Cell2location, RCTD, SPOTlight
5. Niche analysis:
- Define tissue niches by cell type composition
- Identify tumor-stroma interface
- Analyze cell-cell interactions
6. Spatial pathway analysis:
- Map pathway activity onto tissue
- Identify spatially localized processes
7. Visualization:
- Spatial plots colored by cluster, gene expression
- Cell type composition maps
- Pathway activity maps
Output:
- Annotated spatial data object
- Spatially variable gene list
- Cell type composition per spot
- Niche definitions and cell-cell interactions
""")
```
---
## Tips for Effective Task Specification
### 1. Be Specific About Data Formats and Locations
✅ Good:
```python
agent.go("Analyze scRNA-seq data in AnnData format at 'data/experiment1.h5ad'")
```
❌ Vague:
```python
agent.go("Analyze my data")
```
### 2. Specify Analysis Parameters
✅ Good:
```python
agent.go("""
Cluster cells using Leiden algorithm with resolution 0.5,
k-neighbors=10, using PCA components 1-30
""")
```
❌ Vague:
```python
agent.go("Cluster the cells")
```
### 3. Request Specific Outputs
✅ Good:
```python
agent.go("""
... and save results as:
- CSV table with statistics
- PNG figures at 300 DPI
- Processed data as AnnData at 'results/processed.h5ad'
""")
```
❌ Vague:
```python
agent.go("... and save the results")
```
### 4. Provide Biological Context
✅ Good:
```python
agent.go("""
This is a drug treatment experiment. Compare vehicle vs treated groups
to identify drug-induced transcriptional changes. Focus on apoptosis and
cell cycle pathways.
""")
```
❌ Vague:
```python
agent.go("Compare the two groups")
```
### 5. Break Complex Analyses into Steps
✅ Good:
```python
# Step 1
agent.go("Load and QC the data, save QC metrics")
# Step 2
agent.go("Based on QC, normalize and find HVGs")
# Step 3
agent.go("Cluster and annotate cell types")
```
❌ Overwhelming:
```python
agent.go("Do a complete scRNA-seq analysis pipeline")
```