mirror of
https://github.com/K-Dense-AI/claude-scientific-skills.git
synced 2026-01-26 16:58:56 +08:00
# Biomni Task Examples

Comprehensive collection of biomedical task examples with code patterns and best practices.

## Table of Contents

1. [Single-Cell RNA-seq Analysis](#single-cell-rna-seq-analysis)
2. [CRISPR Screening](#crispr-screening)
3. [Genomic Analysis (GWAS, Variant Calling)](#genomic-analysis)
4. [Protein Structure and Function](#protein-structure-and-function)
5. [Drug Discovery and ADMET](#drug-discovery-and-admet)
6. [Pathway and Network Analysis](#pathway-and-network-analysis)
7. [Disease Classification](#disease-classification)
8. [Multi-Omics Integration](#multi-omics-integration)
9. [Proteomics Analysis](#proteomics-analysis)
10. [Biomarker Discovery](#biomarker-discovery)

---
## Single-Cell RNA-seq Analysis

### Basic scRNA-seq Pipeline

```python
from biomni.agent import A1

# Create the agent once; the remaining examples reuse this `agent` object.
agent = A1(path='./data', llm='claude-sonnet-4-20250514')

agent.go("""
Analyze the 10X Genomics scRNA-seq dataset located at 'data/pbmc_10k.h5ad'.

Workflow:
1. Load the data and perform QC:
   - Filter cells with <200 genes or >5000 genes
   - Filter cells with >10% mitochondrial reads
   - Filter genes expressed in <3 cells

2. Normalize and identify highly variable genes:
   - Use SCTransform or standard log-normalization
   - Identify top 2000 HVGs

3. Dimensionality reduction:
   - PCA (50 components)
   - UMAP for visualization

4. Clustering:
   - Find neighbors (k=10)
   - Leiden clustering with resolution 0.5

5. Visualization:
   - UMAP colored by cluster
   - QC metrics on UMAP

Save processed data as 'results/pbmc_processed.h5ad'
""")
```
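The cell-level QC cutoffs in step 1 of the prompt above can be illustrated without any single-cell library. The following is a minimal sketch, not Biomni's or Scanpy's implementation: `CellMetrics` and `passes_qc` are hypothetical names, and the three example cells are made-up values chosen to show one cell passing and two failing.

```python
from dataclasses import dataclass


@dataclass
class CellMetrics:
    """Per-cell QC summary (hypothetical structure, for illustration only)."""
    n_genes: int        # number of genes detected in the cell
    total_counts: int   # total UMI counts in the cell
    mito_counts: int    # UMI counts assigned to mitochondrial genes


def passes_qc(cell, min_genes=200, max_genes=5000, max_mito_frac=0.10):
    """Return True if the cell survives the cutoffs named in the prompt."""
    if cell.n_genes < min_genes or cell.n_genes > max_genes:
        return False
    # Treat cells with no counts as failing rather than dividing by zero.
    mito_frac = cell.mito_counts / cell.total_counts if cell.total_counts else 1.0
    return mito_frac <= max_mito_frac


# Toy input: values are invented to exercise each branch.
cells = [
    CellMetrics(n_genes=1500, total_counts=6000, mito_counts=300),   # 5% mito, kept
    CellMetrics(n_genes=150, total_counts=400, mito_counts=10),      # too few genes
    CellMetrics(n_genes=2500, total_counts=9000, mito_counts=1800),  # 20% mito
]
kept = [c for c in cells if passes_qc(c)]
print(f"{len(kept)} of {len(cells)} cells pass QC")
```

Keeping the thresholds as keyword arguments makes it easy to mirror whatever cutoffs a given prompt states.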
### Cell Type Annotation

```python
agent.go("""
Using the processed PBMC data at 'results/pbmc_processed.h5ad':

1. Find marker genes for each cluster:
   - Wilcoxon rank-sum test
   - Log fold change > 0.5
   - Adjusted p-value < 0.01
   - Present in >25% of cluster cells

2. Annotate cell types using markers:
   - T cells: CD3D, CD3E, CD3G
   - B cells: CD19, MS4A1 (CD20)
   - NK cells: GNLY, NKG7, NCAM1
   - Monocytes: CD14, LYZ, CD68
   - Dendritic cells: FCER1A, CD1C

3. Create visualization:
   - UMAP with cell type labels
   - Dotplot of marker genes by cell type
   - Proportion of cell types (bar plot)

4. Save annotated data with cell types
""")
```
### Differential Expression Between Conditions

```python
agent.go("""
Compare gene expression between stimulated and control conditions:

Data: 'data/immune_stim_experiment.h5ad' (contains 'condition' metadata)

Analysis:
1. Subset to T cells only (cell_type == 'T cell')

2. Differential expression between stim vs control:
   - Use pseudobulk approach (aggregate by donor + condition)
   - DESeq2 or edgeR for statistical testing
   - Filter: |log2FC| > 1, padj < 0.05

3. Pathway enrichment on DEGs:
   - Use GO biological processes
   - Use KEGG pathways
   - Run enrichment analysis with gprofiler or enrichr

4. Visualization:
   - Volcano plot of DEGs
   - Heatmap of top 50 DEGs
   - Bar plot of top enriched pathways

5. Export results table with gene symbols, log2FC, p-values, and pathway annotations
""")
```
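The `|log2FC| > 1, padj < 0.05` filter in the prompt above relies on adjusted p-values. As a rough sketch of what that adjustment and filter amount to, the block below implements the Benjamini-Hochberg procedure in plain Python and applies the two cutoffs. The function name and the gene table are illustrative assumptions with invented numbers; DESeq2 and edgeR perform their own testing and adjustment internally.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that adjusted
    # values never increase as the raw p-value decreases.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy table of (gene, log2FC, raw p-value); the numbers are made up.
genes = [
    ("IFNG", 2.5, 0.0001),
    ("IL2", 1.8, 0.004),
    ("ACTB", 0.1, 0.60),
    ("CD69", -1.4, 0.03),
    ("GAPDH", 0.3, 0.02),
]
padj = benjamini_hochberg([p for _, _, p in genes])
degs = [
    name
    for (name, lfc, _), q in zip(genes, padj)
    if abs(lfc) > 1 and q < 0.05
]
print(degs)
```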
### Trajectory Analysis

```python
agent.go("""
Perform pseudotime trajectory analysis on hematopoietic differentiation data:

Data: 'data/hematopoiesis.h5ad'

Steps:
1. Subset to progenitor and mature cell types:
   - HSC, MPP, GMP, Monocytes, Neutrophils

2. Run trajectory inference:
   - Use PAGA or Monocle3
   - Set HSC as root cell type

3. Calculate pseudotime for all cells

4. Identify trajectory-associated genes:
   - Genes that change along pseudotime
   - Statistical test with FDR < 0.05
   - Cluster genes by expression pattern (early, middle, late)

5. Visualization:
   - UMAP colored by pseudotime
   - Heatmap of trajectory genes
   - Gene expression along pseudotime for key TFs

6. Functional analysis:
   - GO enrichment for early/middle/late gene clusters
""")
```
### Integration of Multiple Datasets

```python
agent.go("""
Integrate three scRNA-seq datasets from different batches:

Data files:
- 'data/batch1_pbmc.h5ad'
- 'data/batch2_pbmc.h5ad'
- 'data/batch3_pbmc.h5ad'

Integration workflow:
1. Load all three datasets

2. Perform individual QC on each batch:
   - Same filters as standard QC
   - Note batch-specific statistics

3. Integration using Harmony or Scanorama:
   - Concatenate datasets
   - Identify HVGs on combined data
   - Run batch correction
   - Verify batch mixing with LISI score

4. Re-cluster integrated data:
   - Use corrected embeddings
   - Leiden clustering

5. Cell type annotation on integrated data

6. Visualization:
   - UMAP split by batch (before/after correction)
   - UMAP colored by cell type
   - Batch mixing statistics

7. Save integrated dataset
""")
```
---

## CRISPR Screening

### Guide RNA Design

```python
agent.go("""
Design guide RNAs for CRISPR knockout screening of cell cycle genes:

Target genes:
- CDK1, CDK2, CDK4, CDK6
- CCNA2, CCNB1, CCND1, CCNE1
- TP53, RB1, MYC

Requirements:
1. Design 4-6 guides per gene targeting early exons

2. For each guide, evaluate:
   - On-target efficiency score (Doench 2016)
   - Off-target potential (CFD score < 0.3)
   - Avoid common SNPs (1000 Genomes)

3. Add control guides:
   - 100 non-targeting controls
   - 20 positive controls (essential genes)

4. Output:
   - Table with: gene, guide_sequence, PAM, position, on_target_score, off_target_count
   - Sequences in format for oligonucleotide ordering
   - Visual summary of guide distribution per gene

Reference genome: hg38
""")
```
### CRISPR Screen Analysis

```python
agent.go("""
Analyze data from a genome-wide CRISPR knockout screen:

Data: 'data/crispr_screen_counts.csv'
- Columns: guide_id, gene, sample_T0, sample_T15, replicate
- ~80,000 guides targeting ~18,000 genes

Analysis:
1. Quality control:
   - Guide representation (reads per guide)
   - Sample correlation
   - Remove guides with <30 reads in T0

2. Normalize counts:
   - Reads per million (RPM)
   - Log2 fold change (T15 vs T0)

3. Statistical analysis using MAGeCK:
   - Identify significantly depleted/enriched genes
   - FDR < 0.05
   - Rank genes by robust rank aggregation (RRA)

4. Functional analysis:
   - Pathway enrichment of hit genes
   - Known vs novel essential genes
   - Correlation with Cancer Dependency Map

5. Visualization:
   - Scatterplot: log2FC vs -log10(FDR)
   - Heatmap: top 50 depleted genes across replicates
   - Network: PPI network of hit genes

6. Export:
   - Ranked gene list with statistics
   - Enriched pathways table
""")
```
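Steps 1 and 2 of the screen analysis (drop low-coverage guides, RPM-normalize, compute log2 fold change) reduce to a few lines of arithmetic. The sketch below is one plausible reading of those steps, not MAGeCK's own normalization: the function names, the pseudocount of 1, and the guide names and counts are all assumptions made for illustration.

```python
import math


def reads_per_million(counts):
    """Scale raw guide counts so that the sample sums to one million."""
    total = sum(counts.values())
    return {guide: c * 1e6 / total for guide, c in counts.items()}


def guide_log2fc(t0, t15, min_t0_reads=30, pseudocount=1.0):
    """Log2 fold change (T15 vs T0) for guides with enough T0 coverage."""
    # QC first: guides with fewer than `min_t0_reads` reads at T0 are dropped.
    kept = [g for g, c in t0.items() if c >= min_t0_reads]
    t0_rpm = reads_per_million({g: t0[g] for g in kept})
    t15_rpm = reads_per_million({g: t15.get(g, 0) for g in kept})
    # The pseudocount (an assumption) avoids log2(0) for fully depleted guides.
    return {
        g: math.log2((t15_rpm[g] + pseudocount) / (t0_rpm[g] + pseudocount))
        for g in kept
    }


# Toy counts with hypothetical guide IDs.
t0 = {"sgCDK1_1": 400, "sgCTRL_1": 400, "sgLOW_1": 12}
t15 = {"sgCDK1_1": 100, "sgCTRL_1": 700, "sgLOW_1": 3}
lfc = guide_log2fc(t0, t15)
for guide, value in lfc.items():
    print(f"{guide}\t{value:.2f}")
```

Negative values indicate depletion between T0 and T15; gene-level significance still needs a dedicated method such as the RRA ranking requested in step 3.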
### Pooled Optical Screening Analysis

```python
agent.go("""
Analyze pooled CRISPR screen with imaging readout (e.g., Cell Painting):

Data structure:
- 'data/guide_assignments.csv': cell_id, guide_id, gene
- 'data/morphology_features.csv': cell_id, feature_1...feature_500

Analysis:
1. Feature preprocessing:
   - Remove low-variance features
   - Normalize features (z-score per plate)
   - PCA for dimensionality reduction

2. Associate phenotypes with perturbations:
   - Aggregate cells by guide (mean/median)
   - Calculate morphological distance from controls
   - Statistical test for phenotype change

3. Identify phenotype-altering genes:
   - Mahalanobis distance from control distribution
   - Bonferroni correction for multiple testing
   - Effect size threshold

4. Cluster genes by phenotype similarity:
   - Hierarchical clustering of gene profiles
   - Identify phenotype classes

5. Validation and interpretation:
   - Compare to known gene functions
   - Pathway enrichment per phenotype cluster

6. Visualization:
   - UMAP of all perturbations
   - Heatmap of gene clusters × morphology features
   - Representative images for each cluster
""")
```
---

## Genomic Analysis

### GWAS Analysis

```python
agent.go("""
Perform genome-wide association study for Type 2 Diabetes:

Data:
- 'data/genotypes.bed' (PLINK format, 500K SNPs, 5000 cases, 5000 controls)
- 'data/phenotypes.txt' (sample_id, T2D_status, age, sex, BMI, ancestry_PCs)

Workflow:
1. Quality control:
   - SNP QC: MAF > 0.01, HWE p > 1e-6, genotyping rate > 0.95
   - Sample QC: genotyping rate > 0.95, heterozygosity check
   - Remove related individuals (kinship > 0.125)

2. Association testing:
   - Logistic regression: T2D ~ SNP + age + sex + BMI + PC1-10
   - Genome-wide significance threshold: p < 5e-8
   - Suggestive threshold: p < 1e-5

3. Post-GWAS analysis:
   - LD clumping (r² > 0.1, 500kb window)
   - Annotate lead SNPs with nearby genes (±100kb)
   - Query GWAS Catalog for known associations

4. Functional annotation:
   - Overlap with regulatory elements (ENCODE)
   - eQTL colocalization (GTEx)
   - GWAS prioritization scores (PoPS, ABC)

5. Visualization:
   - Manhattan plot
   - QQ plot
   - Regional association (LocusZoom) plots for top loci

6. Heritability and genetic correlation:
   - SNP heritability (LDSC)
   - Genetic correlation with related traits

Export summary statistics for meta-analysis
""")
```
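Two of the SNP-level QC filters in step 1 (minor allele frequency and genotyping rate) are simple counts over the genotype vector. The block below is a minimal sketch under the assumption that genotypes are coded as alternate-allele counts (0, 1, 2) with `None` for missing calls; `snp_qc` is a hypothetical helper, and the Hardy-Weinberg test is deliberately left out. In practice a tool such as PLINK would apply these filters.

```python
def snp_qc(genotypes, min_maf=0.01, min_call_rate=0.95):
    """Summarize one SNP; genotypes are alt-allele counts 0/1/2, None = missing."""
    called = [g for g in genotypes if g is not None]
    call_rate = len(called) / len(genotypes)
    if called:
        # Each called sample contributes two alleles.
        alt_freq = sum(called) / (2 * len(called))
        maf = min(alt_freq, 1.0 - alt_freq)
    else:
        maf = 0.0
    return {
        "call_rate": call_rate,
        "maf": maf,
        "passed": call_rate > min_call_rate and maf > min_maf,
    }


# Toy genotype vectors (invented) for three SNPs.
common = [0] * 90 + [1] * 8 + [2] * 2       # MAF 0.06, fully called
sparse = [0] * 80 + [1] * 10 + [None] * 10  # 10% missing calls
rare = [0] * 199 + [1]                      # MAF 0.0025

for name, snp in [("common", common), ("sparse", sparse), ("rare", rare)]:
    print(name, snp_qc(snp))
```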
### Whole Exome Sequencing Analysis

```python
agent.go("""
Analyze whole exome sequencing data for rare disease diagnosis:

Data: Family trio (proband, mother, father)
- 'data/proband.bam'
- 'data/mother.bam'
- 'data/father.bam'

Phenotype: Developmental delay, seizures, intellectual disability

Pipeline:
1. Variant calling:
   - GATK HaplotypeCaller on each sample
   - Joint genotyping across trio
   - VQSR filtering (SNPs and indels separately)

2. Variant annotation:
   - Functional consequence (VEP or ANNOVAR)
   - Population frequencies (gnomAD)
   - Pathogenicity predictions (CADD, REVEL, SpliceAI)
   - Disease databases (ClinVar, OMIM)

3. Inheritance analysis:
   - De novo variants (absent in both parents)
   - Compound heterozygous variants
   - Rare homozygous variants (autozygosity)
   - X-linked variants (if proband is male)

4. Filtering strategy:
   - Population AF < 0.001 (gnomAD)
   - High-quality variants (GQ > 20, DP > 10)
   - Loss-of-function or missense with CADD > 20
   - Match phenotype to gene function (HPO terms)

5. Prioritization:
   - Known disease genes for phenotype
   - De novo in intolerant genes (pLI > 0.9)
   - Protein-truncating variants

6. Report:
   - Top candidate variants with evidence
   - Gene function and disease association
   - Segregation analysis
   - Recommended validation (Sanger sequencing)
   - ACMG variant classification

Save VCF with annotations and prioritized candidate list
""")
```
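The inheritance analysis in step 3 comes down to comparing the proband's genotype with each parent's. The sketch below shows that logic for de novo and compound heterozygous candidates only. It assumes genotypes are already reduced to alternate-allele counts (0, 1, 2); the function names, record layout, and placeholder gene labels are hypothetical, and a real pipeline would also weigh genotype quality and read depth.

```python
from collections import defaultdict


def is_de_novo(proband, mother, father):
    """Variant carried by the proband and absent from both parents."""
    return proband > 0 and mother == 0 and father == 0


def compound_het_genes(variants):
    """Genes where the proband has one het variant from each parent."""
    maternal = defaultdict(int)
    paternal = defaultdict(int)
    for v in variants:
        if v["proband"] != 1:
            continue  # only heterozygous proband calls are relevant here
        if v["mother"] == 1 and v["father"] == 0:
            maternal[v["gene"]] += 1
        elif v["father"] == 1 and v["mother"] == 0:
            paternal[v["gene"]] += 1
    return sorted(g for g in maternal if paternal.get(g, 0) > 0)


# Toy trio calls with placeholder gene names.
variants = [
    {"gene": "GENE_A", "proband": 1, "mother": 0, "father": 0},
    {"gene": "GENE_B", "proband": 1, "mother": 1, "father": 0},
    {"gene": "GENE_B", "proband": 1, "mother": 0, "father": 1},
    {"gene": "GENE_C", "proband": 1, "mother": 1, "father": 0},
]
de_novo = [
    v["gene"] for v in variants
    if is_de_novo(v["proband"], v["mother"], v["father"])
]
print("de novo:", de_novo)
print("compound het:", compound_het_genes(variants))
```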
### Variant Calling from RNA-seq

```python
agent.go("""
Identify expressed variants from RNA-seq data:

Data: Tumor RNA-seq BAM file
- 'data/tumor_RNAseq.bam'
- Reference: hg38

Purpose: Identify expressed somatic mutations for neoantigen prediction

Steps:
1. Pre-processing:
   - Mark duplicates (Picard)
   - Split reads at junctions (GATK SplitNCigarReads)
   - Base quality recalibration

2. Variant calling:
   - GATK HaplotypeCaller (RNA-seq mode)
   - Filter: DP > 10, AF > 0.05

3. Filtering artifacts:
   - Remove common SNPs (gnomAD AF > 0.001)
   - Filter intronic/intergenic variants
   - Remove known RNA editing sites (RADAR database)
   - Panel of normals (if available)

4. Annotation:
   - Functional impact (VEP)
   - Identify non-synonymous variants
   - Predict MHC binding (NetMHCpan for patient HLA type)

5. Prioritize neoantigens:
   - Strong MHC binding (IC50 < 500nM)
   - High expression (TPM > 5)
   - High variant allele frequency

6. Output:
   - Annotated VCF
   - Neoantigen candidates table
   - Peptide sequences for validation

This requires patient HLA typing data
""")
```
---

## Protein Structure and Function

### Protein Structure Prediction and Analysis

```python
agent.go("""
Predict and analyze structure for novel protein sequence:

Sequence (FASTA format):
>Novel_Kinase_Domain
MKLLVVDDDGVADYSKRDGAFMVAYCIEPGDG...

Tasks:
1. Structure prediction:
   - Use AlphaFold2 or ESMFold
   - Generate 5 models, rank by confidence

2. Quality assessment:
   - pLDDT scores (per-residue confidence)
   - pTM score (global confidence)
   - Identify low-confidence regions

3. Domain identification:
   - InterProScan for domain architecture
   - Pfam domain search
   - Identify catalytic residues

4. Functional site prediction:
   - Active site prediction
   - Substrate binding pocket identification
   - Post-translational modification sites

5. Structural alignment:
   - Search for similar structures (PDB)
   - Align to close homologs
   - Identify conserved structural motifs

6. Mutation analysis:
   - Known disease mutations in homologs
   - Predict impact on structure (Rosetta ddG)

7. Visualization and output:
   - PyMOL/Chimera visualization scripts
   - Structural alignment figures
   - Annotated PDB file with functional sites
   - Summary report with predictions
""")
```
### Protein-Protein Interaction Prediction

```python
agent.go("""
Predict and validate protein-protein interactions:

Target protein: BRCA1
Species: Human

Analysis:
1. Literature-based interactions:
   - Query BioGRID, STRING, IntAct databases
   - Extract high-confidence interactors (score > 0.7)

2. Structure-based prediction:
   - Predict BRCA1 structure (if not available)
   - Dock with known interactors (BRCA2, BARD1, etc.)
   - Score interfaces (PDBePISA)

3. Sequence-based prediction:
   - Coevolution analysis (EVcouplings)
   - Domain-domain interaction prediction
   - Linear motif search (ELM database)

4. Functional analysis of interactors:
   - GO enrichment analysis
   - KEGG pathway membership
   - Tissue/cell type expression patterns

5. Network analysis:
   - Build PPI network
   - Identify network modules
   - Central hub proteins

6. Experimental validation suggestions:
   - Prioritize interactions for validation
   - Suggest Co-IP or Y2H experiments
   - Identify commercially available antibodies

7. Output:
   - Ranked interaction list with evidence
   - PPI network visualization
   - Structural models of key interactions
""")
```
### Protein Engineering Design

```python
agent.go("""
Design improved enzyme variant with enhanced thermostability:

Target enzyme: TEM-1 β-lactamase
Goal: Increase melting temperature by >10°C while maintaining activity

Strategy:
1. Analyze current structure:
   - Load PDB structure (1BTL)
   - Identify flexible regions (B-factors)
   - Find potential disulfide bond sites

2. Computational design:
   - Rosetta design simulations
   - Identify stabilizing mutations (ΔΔG < -1.0 kcal/mol)
   - Avoid active site and substrate binding regions

3. Prioritize mutations:
   - Surface entropy reduction (SER)
   - Disulfide bond introduction
   - Salt bridge formation
   - Hydrophobic core packing

4. Check conservation:
   - Multiple sequence alignment of β-lactamases
   - Avoid highly conserved positions
   - Prefer positions with natural variation

5. Design library:
   - Rank top 20 single mutants
   - Design 5 combinatorial variants (2-3 mutations)
   - Ensure codon optimization for E. coli

6. Validation plan:
   - Expression and purification protocol
   - Thermal shift assay (DSF)
   - Activity assay (nitrocefin)
   - Recommended high-throughput screening

7. Output:
   - Ranked mutation list with predicted ΔΔG
   - Structural visualizations
   - Codon-optimized sequences
   - Cloning primers
   - Experimental validation protocol
""")
```
---

## Drug Discovery and ADMET

### Virtual Screening

```python
agent.go("""
Perform virtual screening for SARS-CoV-2 Mpro inhibitors:

Target: SARS-CoV-2 Main protease (Mpro)
Crystal structure: PDB 6LU7

Compound library: ZINC15 drug-like subset (~100K compounds)
File: 'data/zinc_druglike_100k.smi' (SMILES format)

Workflow:
1. Protein preparation:
   - Remove crystallographic waters (keep catalytic waters)
   - Add hydrogens, optimize H-bond network
   - Define binding site (residues within 5Å of native ligand)

2. Ligand preparation:
   - Generate 3D coordinates from SMILES
   - Enumerate tautomers and protonation states
   - Energy minimization

3. Molecular docking:
   - Dock all compounds (AutoDock Vina or Glide)
   - Generate top 3 poses per compound
   - Score binding affinity

4. Consensus scoring:
   - Combine multiple scoring functions
   - Rank compounds by consensus score

5. ADMET filtering:
   - Lipinski's rule of 5
   - BBB permeability (not needed for this target)
   - hERG liability (flag predicted pIC50 > 5)
   - CYP450 inhibition prediction
   - Toxicity prediction (Tox21)

6. Visual inspection:
   - Top 100 compounds
   - Check key interactions (His41, Cys145 catalytic dyad)
   - Remove PAINS and frequent hitters

7. Final selection:
   - Top 20 compounds for experimental testing
   - Cluster by scaffold diversity

8. Output:
   - Ranked compound list with scores and ADMET properties
   - Docking poses (mol2 or PDB format)
   - 2D interaction diagrams
   - Purchase availability from vendors
""")
```
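The Lipinski rule-of-5 filter named in step 5 is easy to state in code once the four descriptors are known. The sketch below assumes the descriptors (molecular weight, logP, hydrogen-bond donors and acceptors) have already been computed by a cheminformatics toolkit; `lipinski_violations`, `passes_ro5`, and the compound values are illustrative and not taken from any real library or dataset.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count how many of Lipinski's four criteria a compound violates."""
    checks = [
        mol_weight > 500,   # molecular weight above 500 Da
        logp > 5,           # octanol-water partition coefficient above 5
        h_donors > 5,       # more than 5 hydrogen-bond donors
        h_acceptors > 10,   # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


def passes_ro5(mol_weight, logp, h_donors, h_acceptors, max_violations=1):
    """The rule is commonly applied as 'no more than one violation'."""
    count = lipinski_violations(mol_weight, logp, h_donors, h_acceptors)
    return count <= max_violations


# Hypothetical compounds: (name, MW, logP, HBD, HBA).
compounds = [
    ("cmpd_1", 350.4, 2.1, 2, 5),
    ("cmpd_2", 650.8, 6.2, 3, 8),
    ("cmpd_3", 520.0, 3.0, 2, 6),
]
drug_like = [name for name, *desc in compounds if passes_ro5(*desc)]
print(drug_like)
```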
### ADMET Property Prediction

```python
agent.go("""
Predict ADMET properties for drug candidate series:

Input: 'data/compound_series.smi' (25 analogs, SMILES format)
Lead scaffold: Novel kinase inhibitor series

Properties to predict:
1. Absorption:
   - Caco-2 permeability
   - Human intestinal absorption (HIA)
   - P-glycoprotein substrate

2. Distribution:
   - Plasma protein binding (% bound)
   - Volume of distribution (VDss)
   - Blood-brain barrier permeability (LogBB)

3. Metabolism:
   - CYP450 substrate (1A2, 2C9, 2C19, 2D6, 3A4)
   - CYP450 inhibition (same isoforms)
   - Sites of metabolism (SOM prediction)

4. Excretion:
   - Clearance estimation
   - Half-life prediction
   - Renal excretion likelihood

5. Toxicity:
   - hERG inhibition (cardiotoxicity)
   - AMES mutagenicity
   - Hepatotoxicity
   - Skin sensitization
   - Rat acute toxicity (LD50)

6. Drug-likeness:
   - Lipinski's Ro5
   - QED score
   - Synthetic accessibility

Analysis:
- Compare all analogs in the series
- Structure-property relationships
- Identify best balanced compound
- Suggest modifications for improvement

Output:
- Comprehensive ADMET table
- Radar plots for each compound
- SAR analysis for each property
- Recommendations for next design iteration
""")
```
### Lead Optimization

```python
agent.go("""
Optimize lead compound balancing potency and selectivity:

Current lead:
- IC50 (target kinase): 50 nM
- IC50 (off-target kinases): 100-500 nM (poor selectivity)
- Microsomal stability: t1/2 = 20 min (too short)
- Solubility: 5 μM (low)

Goal: Maintain potency, improve selectivity (>100x), improve PK properties

Strategy:
1. Analyze current binding mode:
   - Docking to target and off-targets
   - Identify selectivity-determining residues
   - Map interaction hotspots

2. Design focused library:
   - Modifications to improve selectivity:
     * Target residues unique to on-target
     * Avoid conserved kinase regions
   - Modifications to improve solubility:
     * Add polar groups to solvent-exposed regions
     * Replace lipophilic groups
   - Modifications to improve metabolic stability:
     * Block metabolically labile positions
     * Replace metabolically unstable groups

3. Virtual enumeration:
   - Generate ~200 analogs
   - Predict binding affinity (docking)
   - Predict ADMET properties

4. Multi-parameter optimization:
   - Calculate MPO score (potency + selectivity + ADMET)
   - Pareto optimization
   - Select top 20 compounds

5. Clustering and diversity:
   - Ensure structural diversity
   - Test different modification strategies

6. Synthetic feasibility:
   - Retrosynthetic analysis
   - Flag difficult syntheses
   - Prioritize 10 compounds for synthesis

7. Deliverables:
   - Ranked compound designs
   - Predicted properties table
   - Binding mode visualizations
   - Synthetic routes
   - Recommended testing cascade
""")
```
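The Pareto optimization mentioned in step 4 keeps every analog that is not beaten on all objectives by some other analog. A minimal sketch follows, assuming every objective is expressed so that larger is better (here potency as pIC50, selectivity fold, and solubility in μM). The function names, analog labels, and scores are hypothetical.

```python
def dominates(a, b):
    """True if `a` is at least as good as `b` everywhere and better somewhere."""
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(candidates):
    """Return names of candidates not dominated by any other candidate."""
    front = []
    for name, scores in candidates.items():
        dominated = any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return sorted(front)


# Invented scores: (pIC50, selectivity fold, solubility in uM).
analogs = {
    "analog_A": (7.5, 120, 20),
    "analog_B": (7.0, 50, 10),    # worse than analog_A on every axis
    "analog_C": (8.0, 30, 5),     # most potent, so it stays on the front
    "analog_D": (7.5, 120, 25),   # matches analog_A and is more soluble
}
print(pareto_front(analogs))
```

This pairwise comparison is quadratic in the number of candidates, which is fine for the ~200 analogs discussed here.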
---

## Pathway and Network Analysis

### Pathway Enrichment Analysis

```python
agent.go("""
Perform comprehensive pathway enrichment on differentially expressed genes:

Input: 'data/DEGs.csv'
Columns: gene_symbol, log2FC, padj
Significant DEGs: padj < 0.05, |log2FC| > 1
Total: 450 upregulated, 380 downregulated genes

Background: all detected genes in the experiment (~15,000)

Analysis:
1. GO enrichment (biological processes):
   - Test upregulated and downregulated genes separately
   - Use hypergeometric test
   - FDR correction (Benjamini-Hochberg)
   - Filter: padj < 0.05, fold enrichment > 2

2. KEGG pathway enrichment:
   - Same approach as GO
   - Focus on signaling and metabolic pathways

3. Reactome pathway enrichment:
   - More detailed pathway hierarchy

4. Disease association:
   - DisGeNET disease enrichment
   - Compare to disease gene signatures (MSigDB)

5. Transcription factor enrichment:
   - Predict upstream regulators (ChEA3)
   - ENCODE ChIP-seq enrichment

6. Drug/compound perturbations:
   - L1000 connectivity map
   - Identify drugs that reverse/mimic signature

7. Cross-pathway analysis:
   - Pathway crosstalk
   - Hierarchical clustering of pathways by gene overlap
   - Network visualization of enriched pathways

8. Visualization:
   - Dot plots (GO, KEGG, Reactome)
   - Enrichment map network
   - Chord diagram (genes-pathways)
   - Treemap of hierarchical GO terms

9. Export:
   - All enrichment tables
   - Pathway gene lists
   - Interactive HTML report
""")
```
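The hypergeometric test and fold-enrichment filter from step 1 can be written with the standard library alone. In the sketch below, `N` is the background size, `K` the number of background genes in the pathway, `n` the size of the DEG list, and `k` the overlap; the p-value is the upper tail P(X >= k). The background of 15,000 and the 450 upregulated genes come from the prompt, while the pathway size of 200 and overlap of 20 are assumed values for illustration.

```python
from math import comb


def enrichment_test(k, n, K, N):
    """Upper-tail hypergeometric p-value and fold enrichment for one gene set."""
    # P(X >= k) when drawing n genes from N, of which K belong to the pathway.
    numerator = sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    )
    p_value = numerator / comb(N, n)
    # Observed overlap fraction divided by the fraction expected by chance.
    fold = (k / n) / (K / N)
    return p_value, fold


# 450 upregulated genes against a 15,000-gene background (from the prompt);
# pathway size and overlap are hypothetical.
p_value, fold = enrichment_test(k=20, n=450, K=200, N=15000)
print(f"p = {p_value:.3g}, fold enrichment = {fold:.2f}")
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow; the resulting p-values would still need Benjamini-Hochberg correction across all tested gene sets.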
### Protein-Protein Interaction Network

```python
agent.go("""
Build and analyze PPI network for Alzheimer's disease genes:

Seed genes: Known AD risk genes (APP, PSEN1, PSEN2, APOE, MAPT, etc.)
File: 'data/AD_seed_genes.txt'

Network construction:
1. Build network from seed genes:
   - Query STRING database (confidence > 0.7)
   - Include direct and second-degree interactors
   - Maximum network size: 500 proteins

2. Network enrichment:
   - Add disease associations (DisGeNET)
   - Add tissue expression (GTEx - prioritize brain)
   - Add functional annotations (GO, Reactome)

3. Network analysis:
   - Calculate centrality measures:
     * Degree centrality
     * Betweenness centrality
     * Eigenvector centrality
   - Identify hub proteins
   - Community detection (Louvain algorithm)

4. Module analysis:
   - Functional enrichment per community
   - Identify disease-relevant modules
   - Key bridge proteins between modules

5. Druggability analysis:
   - Identify druggable targets (DGIdb)
   - Known drugs targeting network proteins
   - Clinical trial status

6. Network perturbation:
   - Simulate gene knockout
   - Network robustness analysis
   - Identify critical nodes

7. Visualization:
   - Interactive network (Cytoscape format)
   - Layout by module membership
   - Color by centrality/expression
   - Size by degree

8. Prioritization:
   - Rank proteins by:
     * Network centrality
     * Brain expression
     * Druggability
     * Genetic evidence (GWAS)
   - Top therapeutic targets

Output:
- Network file (graphML, SIF)
- Module membership table
- Prioritized target list
- Druggable targets with existing compounds
""")
```
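Of the centrality measures listed in step 3, degree centrality is the simplest: the fraction of other nodes a protein is directly connected to. The sketch below computes it from an undirected edge list and picks the highest-scoring node as the hub. The edges are a made-up toy network that reuses the seed gene names purely as labels; they are not STRING query results, and `degree_centrality` is a hypothetical helper.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Fraction of other nodes each node touches in an undirected graph."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Toy edge list for illustration only (not real interaction data).
edges = [
    ("APP", "PSEN1"),
    ("APP", "PSEN2"),
    ("APP", "APOE"),
    ("APP", "MAPT"),
    ("PSEN1", "PSEN2"),
]
centrality = degree_centrality(edges)
hub = max(centrality, key=centrality.get)
print(hub, centrality[hub])
```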
### Gene Regulatory Network Inference

```python
agent.go("""
Infer gene regulatory network from scRNA-seq data:

Data: 'data/development_timecourse.h5ad'
- Cells from 5 developmental timepoints
- 3000 HVGs quantified

Goal: Identify TF→target relationships during development

Methods:
1. Preprocessing:
   - Select TFs (from TF census list)
   - Select potential target genes (HVGs)
   - Normalize expression

2. GRN inference using multiple methods:
   - GENIE3 (random forest)
   - SCENIC (motif-based)
   - CellOracle (perturbation-based)
   - Pearson/Spearman correlation (baseline)

3. Integrate predictions:
   - Combine scores from multiple methods
   - Weight by motif evidence (JASPAR)
   - Filter low-confidence edges

4. Network refinement:
   - Remove indirect edges (transitive reduction)
   - Validate with ChIP-seq data (if available)
   - Literature validation (TRRUST database)

5. Dynamic network analysis:
   - TF activity per timepoint/cell state
   - Identify stage-specific regulators
   - Find regulatory switches

6. Downstream analysis:
   - Master regulators (high out-degree)
   - Regulatory cascades
   - Feed-forward loops
   - Coherent vs incoherent motifs

7. Experimental validation priorities:
   - Rank TF→target edges for validation
   - Suggest ChIP-seq or CUT&RUN experiments
   - Suggest perturbation experiments (knockout/CRISPRi)

8. Visualization:
   - Full GRN network (Cytoscape)
   - Key TF subnetworks
   - TF activity heatmap across development
   - Sankey diagram of regulatory flow

Output:
- Edge list with confidence scores
- TF activity matrix
- Validated vs novel interactions
- Prioritized validation experiments
""")
```
---

## Disease Classification

### Cancer Type Classification from Gene Expression

```python
agent.go("""
Build multi-class classifier for cancer type prediction:

Data: TCGA pan-cancer RNA-seq data
- Training: 8000 samples across 33 cancer types
- Expression: 'data/tcga_expression.csv' (samples × genes)
- Labels: 'data/tcga_labels.csv' (sample_id, cancer_type)

Task: Classify tumor samples by cancer type

Pipeline:
1. Train-test split:
   - 80% train, 20% test
   - Stratified by cancer type
   - Fit all later filtering, scaling, and feature selection on the training set only

2. Data preprocessing:
   - Log2(TPM + 1) transformation
   - Remove low-variance genes (variance < 0.1)
   - Z-score normalization

3. Feature selection:
   - Variance filtering (top 5000 genes)
   - Univariate feature selection (ANOVA F-test)
   - Select top 500 features

4. Model training (compare multiple algorithms):
   - Random Forest
   - Gradient Boosting (XGBoost)
   - Neural Network (MLP)
   - Elastic Net logistic regression

5. Model evaluation:
   - Accuracy, precision, recall per class
   - Confusion matrix
   - ROC curves (one-vs-rest)
   - Feature importance ranking

6. Model interpretation:
   - SHAP values for predictions
   - Top predictive genes per cancer type
   - Pathway enrichment of predictive features

7. Clinical validation:
   - Test on independent dataset (if available)
   - Analyze misclassifications
   - Identify hard-to-classify subtypes

8. Deliverables:
   - Trained model (pickle)
   - Performance metrics report
   - Feature importance table
   - Confusion matrix heatmap
   - Prediction script for new samples
""")
```
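The stratified 80/20 split requested in the prompt keeps each cancer type represented in the same proportion in both partitions. The sketch below shows the idea with the standard library only; `stratified_split` is a hypothetical helper (scikit-learn offers an equivalent through `train_test_split` with its `stratify` argument), and the label counts are invented.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_indices, test_indices) preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)
        # Hold out at least one sample per class, even for rare classes.
        n_test = max(1, round(len(idxs) * test_frac))
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy labels: 100 samples over three cancer types (counts are made up).
labels = ["BRCA"] * 50 + ["LUAD"] * 30 + ["GBM"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting before any scaling or feature selection keeps information from the held-out samples out of the fitted preprocessing steps.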
### Disease Risk Prediction from Multi-Omics

```python
agent.go("""
Develop an integrative model predicting cardiovascular disease (CVD) risk:

Data sources:
1. Genotypes: 'data/genotypes.csv' (500K SNPs, used for polygenic risk scores)
2. Clinical: 'data/clinical.csv' (age, sex, BMI, blood pressure, cholesterol)
3. Proteomics: 'data/proteomics.csv' (200 plasma proteins, Olink panel)
4. Metabolomics: 'data/metabolomics.csv' (150 metabolites)

Outcome: 10-year CVD incidence (binary)
- Cases: 800
- Controls: 3200

Approach:
1. Data preprocessing:
   - Impute missing values (missForest)
   - Transform skewed features (log/Box-Cox)
   - Normalize each omics layer separately

2. Feature engineering:
   - Calculate PRS from SNP data
   - Interaction terms (age × metabolites, etc.)
   - Metabolite ratios (known CVD markers)

3. Feature selection per omics:
   - Lasso for each data type
   - Select informative features

4. Integration strategies (compare):
   - Early integration: concatenate all features
   - Late integration: separate models, combine predictions
   - Intermediate integration: Multi-omics factor analysis (MOFA)

5. Model development:
   - Logistic regression (interpretable baseline)
   - Random Forest
   - Elastic Net
   - Neural network with omics-specific layers

6. Cross-validation:
   - 5-fold CV, stratified (fit imputation and feature selection inside each training fold to avoid leakage)
   - Hyperparameter tuning
   - Calculate confidence intervals

7. Model evaluation:
   - AUC-ROC, AUC-PR
   - Calibration plots
   - Net reclassification improvement (NRI)
   - Compare to clinical models (Framingham, SCORE)

8. Interpretation:
   - Feature importance (permutation importance)
   - SHAP values for individuals
   - Identify most informative omics layer

9. Clinical utility:
   - Decision curve analysis
   - Risk stratification groups
   - Biomarker panel selection

Outputs:
- Model comparison table
- ROC curves for all models
- Feature importance per omics
- Reclassification table
- Clinical implementation recommendations
""")
```
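
Step 7 of the prompt above asks for net reclassification improvement (NRI), a metric that is easy to mis-specify. As a reference for sanity-checking whatever the agent reports, the categorical NRI can be sketched in plain Python as follows. The function names, the 10%/20% risk cutoffs, and the toy inputs are illustrative assumptions; none of them come from Biomni or from the dataset described above.

```python
from bisect import bisect_right


def risk_category(probability, cutoffs):
    """Map a predicted risk to a category index (0 = lowest risk band)."""
    return bisect_right(cutoffs, probability)


def net_reclassification_improvement(y_true, p_old, p_new, cutoffs=(0.10, 0.20)):
    """Categorical NRI of a new model over a reference model.

    Returns (nri_total, nri_events, nri_nonevents). The default cutoffs are
    hypothetical risk thresholds chosen only for illustration.
    """
    moves = {1: [0, 0, 0], 0: [0, 0, 0]}  # per outcome: [moved up, moved down, total]
    for y, old, new in zip(y_true, p_old, p_new):
        old_cat = risk_category(old, cutoffs)
        new_cat = risk_category(new, cutoffs)
        counts = moves[y]
        counts[0] += new_cat > old_cat
        counts[1] += new_cat < old_cat
        counts[2] += 1
    up_e, down_e, n_e = moves[1]
    up_n, down_n, n_n = moves[0]
    # Cases should move up; controls should move down.
    nri_events = (up_e - down_e) / n_e
    nri_nonevents = (down_n - up_n) / n_n
    return nri_events + nri_nonevents, nri_events, nri_nonevents


# Toy example: 2 cases (y=1) and 2 controls (y=0)
y = [1, 1, 0, 0]
clinical_only = [0.05, 0.15, 0.15, 0.25]
multi_omics = [0.15, 0.25, 0.05, 0.25]
total, events, nonevents = net_reclassification_improvement(y, clinical_only, multi_omics)
print(f"NRI={total:.2f} (events={events:.2f}, non-events={nonevents:.2f})")
```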

---

## Multi-Omics Integration

### Multi-Omics Data Integration

```python
agent.go("""
Integrate transcriptomics, proteomics, and metabolomics data:

Study: Drug response in cancer cell lines
Data:
- RNA-seq: 'data/transcriptomics.csv' (15000 genes × 50 cell lines)
- Proteomics: 'data/proteomics.csv' (3000 proteins × 50 cell lines)
- Metabolomics: 'data/metabolomics.csv' (200 metabolites × 50 cell lines)
- Drug response: 'data/drug_response.csv' (cell line, drug, IC50)

Goal: Identify multi-omics signatures of drug sensitivity

Analysis:
1. Data preprocessing:
   - Match samples across omics layers
   - Filter low-variance features per omics
   - Normalize each omics separately (z-score)

2. Integration methods (compare):

   **Method 1: MOFA (Multi-Omics Factor Analysis)**
   - Identify latent factors capturing variance across omics
   - Determine factor contributions per omics
   - Relate factors to drug response

   **Method 2: DIABLO (multiblock sparse PLS-DA)**
   - Supervised integration (needs a categorical outcome, e.g., sensitive vs resistant)
   - Maximize covariance between omics and drug response
   - Select features from each omics layer

   **Method 3: Similarity Network Fusion (SNF)**
   - Build sample (cell line) similarity networks per omics
   - Fuse networks
   - Cluster cell lines by integrated similarity

3. Association with drug response:
   - Correlation of factors/components with IC50
   - Identify drug-sensitive vs resistant groups
   - Multi-omics biomarkers

4. Network analysis:
   - Build multi-layer network:
     * Gene regulatory network (RNA)
     * Protein-protein interactions (proteins)
     * Gene-metabolite associations
   - Integrate layers
   - Find dysregulated pathways

5. Predictive modeling:
   - Train model predicting drug response from multi-omics
   - Compare: using all omics vs individual omics
   - Feature selection across omics

6. Biological interpretation:
   - Map features to pathways
   - Identify mechanism of drug action
   - Suggest combination therapies

7. Validation:
   - Leave-one-out cross-validation
   - Test in independent cell line panel

Outputs:
- Factor loadings per omics (MOFA)
- Multi-omics biomarker signature
- Integrated network visualization
- Predictive model of drug response
- Mechanistic hypotheses
""")
```
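
The preprocessing in step 1 (match samples, then z-score each omics layer separately) is the part most likely to go wrong silently. A dependency-free sketch of that step, followed by a simple concatenation of the layers, might look like the following. The data layout (`{layer: {sample_id: [feature values]}}`), the function names, and the toy values are assumptions made for illustration; in practice the agent would more likely work with pandas data frames.

```python
from statistics import mean, pstdev


def match_samples(layers):
    """Return the sorted sample IDs that are present in every omics layer."""
    common = set.intersection(*(set(layer) for layer in layers.values()))
    return sorted(common)


def zscore_columns(rows):
    """Z-score each feature (column) across samples (rows)."""
    scaled_columns = []
    for column in zip(*rows):
        mu = mean(column)
        sd = pstdev(column)
        # Constant features carry no information; map them to 0 instead of dividing by 0.
        scaled_columns.append([(v - mu) / sd if sd > 0 else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]


def concatenate_layers(layers):
    """Scale each layer on the shared samples, then concatenate features per sample."""
    samples = match_samples(layers)
    blocks = [zscore_columns([layer[s] for s in samples]) for layer in layers.values()]
    matrix = [sum((block[i] for block in blocks), []) for i in range(len(samples))]
    return samples, matrix


# Toy example: cell line "C" lacks proteomics and is dropped.
layers = {
    "rna": {"A": [1.0, 2.0], "B": [3.0, 2.0], "C": [5.0, 1.0]},
    "protein": {"A": [10.0], "B": [30.0]},
}
samples, matrix = concatenate_layers(layers)
print(samples)
print(matrix)
```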

---

## Proteomics Analysis

### Label-Free Quantitative Proteomics

```python
agent.go("""
Analyze label-free proteomics data from mass spectrometry:

Study: Comparison of normal vs diseased tissue (n=6 per group)
Data: MaxQuant output
- 'data/proteinGroups.txt' (MaxQuant protein quantification)
- 'data/peptides.txt' (peptide-level data)

Experimental design:
- 6 normal samples
- 6 disease samples
- Label-free quantification (LFQ), 3 fractions each

Analysis:
1. Data loading and QC:
   - Load proteinGroups.txt
   - Remove contaminants, reverse hits
   - Filter: valid values in ≥50% of samples per group
   - Check sample correlations and outliers
   - PCA for quality assessment

2. Imputation:
   - Impute missing values (MAR vs MNAR approach)
   - Use MinProb for low-abundance missing values
   - Use kNN for random missing values

3. Normalization:
   - Median normalization
   - Or: VSN (variance stabilizing normalization)

4. Differential expression:
   - Two-sample t-test (for each protein)
   - Moderated t-test (limma)
   - Filter: |log2FC| > 0.58 (~1.5-fold), adj.p < 0.05

5. Visualization:
   - Volcano plot
   - Heatmap of significant proteins
   - PCA colored by condition
   - Intensity distributions (before/after normalization)

6. Functional enrichment:
   - GO enrichment (up and down separately)
   - KEGG pathways
   - Reactome pathways
   - STRING PPI network of differentially expressed proteins (DEPs)

7. PTM analysis (if available):
   - Phosphorylation site analysis
   - Kinase enrichment analysis (KEA3)

8. Orthogonal validation:
   - Compare to RNA-seq data (if available)
   - Protein-RNA correlation
   - Identify discordant genes

9. Biomarker candidates:
   - Rank proteins by fold-change and significance
   - Filter for secreted proteins (potential biomarkers)
   - Check if targetable (druggable)

Outputs:
- Differential abundance table
- QC report with plots
- Enrichment analysis results
- PPI network of DEPs
- Candidate biomarkers list
""")
```
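
The filter in step 4 combines a fold-change cutoff with an adjusted p-value, and "adj.p" is usually taken to mean Benjamini-Hochberg correction. To make that criterion concrete, here is a minimal sketch in plain Python. It assumes the raw p-values and log2 fold changes have already been computed (for example by limma); the function names and toy numbers are illustrative only.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


def select_deps(proteins, log2fc, pvalues, fc_cutoff=0.58, alpha=0.05):
    """Keep proteins with |log2FC| > fc_cutoff and BH-adjusted p < alpha."""
    adjusted = benjamini_hochberg(pvalues)
    return [
        (name, fc, q)
        for name, fc, q in zip(proteins, log2fc, adjusted)
        if abs(fc) > fc_cutoff and q < alpha
    ]


# Toy example with four hypothetical proteins
proteins = ["P1", "P2", "P3", "P4"]
log2fc = [1.2, -0.3, -0.9, 0.5]
pvalues = [0.01, 0.04, 0.03, 0.005]
for name, fc, q in select_deps(proteins, log2fc, pvalues):
    print(f"{name}\tlog2FC={fc:+.2f}\tadj.p={q:.3f}")
```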

---

## Biomarker Discovery

### Diagnostic Biomarker Discovery

```python
agent.go("""
Discover diagnostic biomarkers for early cancer detection:

Study: Plasma proteomics comparing early-stage cancer vs healthy controls
Data:
- 'data/proteomics.csv' (1000 proteins × 200 samples)
- 'data/metadata.csv' (sample_id, group [cancer/healthy], age, sex)

Groups:
- Early-stage cancer: 100 samples
- Healthy controls: 100 samples

Goal: Identify protein panel for early detection (target AUC > 0.90)

Workflow:
1. Exploratory analysis:
   - PCA, t-SNE to visualize separation
   - Univariate differential abundance
   - Volcano plot

2. Feature selection:
   - Rank proteins by:
     * Fold change
     * Statistical significance (t-test, Mann-Whitney)
     * AUC (each protein individually)
   - Select proteins with AUC > 0.70

3. Biomarker panel construction:
   - Correlation analysis (remove redundant markers)
   - Forward selection:
     * Start with best single marker
     * Add markers improving panel performance
     * Stop when no improvement
   - Aim for 5-10 marker panel (practical for assay)

4. Model building:
   - Logistic regression on selected panel
   - Calculate combined risk score
   - Cross-validation (10-fold; repeat feature selection within each fold)

5. Performance evaluation:
   - AUC-ROC, AUC-PR
   - Sensitivity/specificity at different thresholds
   - Clinical decision threshold (e.g., 90% sensitivity)
   - Calibration plot

6. Biological validation:
   - Literature support for cancer association
   - Expression in tumor vs blood
   - Mechanism of release/shedding

7. Clinical utility:
   - Compare to existing biomarkers (CEA, CA19-9, etc.)
   - Cost-effectiveness consideration
   - Assay feasibility (ELISA, MRM, etc.)

8. Independent validation plan:
   - Power calculation for validation cohort
   - Suggested sample size
   - Pre-analytical variables to control

Outputs:
- Ranked protein list with individual performance
- Final biomarker panel
- Logistic regression model
- ROC curves (individual + panel)
- Clinical characteristics table
- Validation study protocol
""")
```
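
Step 2 ranks each protein by its individual AUC and keeps those above 0.70. The sketch below shows one way that screen could be computed without external libraries, using the equivalence between AUC and the Mann-Whitney statistic. The function names, the direction-agnostic handling of markers that are lower in cancer, and the toy abundances are assumptions for illustration.

```python
def auc_score(values, labels):
    """AUC as P(case value > control value), counting ties as 0.5."""
    cases = [v for v, y in zip(values, labels) if y == 1]
    controls = [v for v, y in zip(values, labels) if y == 0]
    wins = sum((c > h) + 0.5 * (c == h) for c in cases for h in controls)
    return wins / (len(cases) * len(controls))


def rank_single_markers(abundance, labels, min_auc=0.70):
    """Rank proteins by direction-agnostic AUC and keep those above min_auc."""
    ranked = []
    for protein, values in abundance.items():
        auc = auc_score(values, labels)
        direction = "up" if auc >= 0.5 else "down"
        # A marker that is consistently lower in cases is as useful as one that is higher.
        ranked.append((protein, max(auc, 1 - auc), direction))
    ranked.sort(key=lambda row: row[1], reverse=True)
    return [row for row in ranked if row[1] > min_auc]


# Toy example: 3 cancer samples (1) followed by 3 healthy controls (0)
labels = [1, 1, 1, 0, 0, 0]
abundance = {
    "PROT_A": [5.1, 4.8, 4.9, 3.0, 3.2, 4.85],
    "PROT_B": [1.0, 2.0, 3.0, 1.5, 2.5, 3.5],
    "PROT_C": [2.0, 2.1, 1.9, 4.0, 4.2, 3.9],
}
for protein, auc, direction in rank_single_markers(abundance, labels):
    print(f"{protein}\tAUC={auc:.3f}\t{direction} in cancer")
```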

---

## Additional Advanced Examples

### Spatial Transcriptomics Analysis

```python
agent.go("""
Analyze Visium spatial transcriptomics data:

Data: 'data/visium_brain_tumor.h5ad'
- Contains spatial coordinates and gene expression
- Tissue: Brain tumor biopsy

Analysis:
1. Data QC and normalization:
   - Filter low-quality spots (total counts, detected genes)
   - Normalize, log-transform
   - Calculate spatial statistics

2. Spatial clustering:
   - Graph-based clustering considering spatial proximity
   - Identify tumor regions, stroma, necrosis, etc.

3. Spatially variable genes:
   - Test for spatial patterns (Moran's I, SpatialDE)
   - Identify genes with spatial gradients

4. Deconvolution:
   - Estimate cell type composition per spot
   - Use scRNA-seq reference (if available)
   - Methods: Cell2location, RCTD, SPOTlight

5. Niche analysis:
   - Define tissue niches by cell type composition
   - Identify tumor-stroma interface
   - Analyze cell-cell interactions

6. Spatial pathway analysis:
   - Map pathway activity onto tissue
   - Identify spatially localized processes

7. Visualization:
   - Spatial plots colored by cluster, gene expression
   - Cell type composition maps
   - Pathway activity maps

Output:
- Annotated spatial data object
- Spatially variable gene list
- Cell type composition per spot
- Niche definitions and cell-cell interactions
""")
```
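
Step 3 names Moran's I as a test for spatial patterns. For readers who want to see what that statistic measures, a minimal sketch for a single gene follows. It assumes binary neighbor weights within a fixed radius, which is one common choice but not necessarily what a given spatial toolkit uses by default; the function name, the radius, and the toy coordinates are hypothetical.

```python
from math import dist


def morans_i(coords, values, radius=1.0):
    """Moran's I with binary weights: spots within `radius` are neighbors.

    Values near +1 indicate spatial clustering, values near -1 indicate a
    checkerboard pattern, and values near 0 indicate no spatial structure.
    Raises ZeroDivisionError if no spots are neighbors or all values are equal.
    """
    n = len(values)
    mean_value = sum(values) / n
    deviations = [v - mean_value for v in values]
    weighted_cross = 0.0
    weight_sum = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and dist(coords[i], coords[j]) <= radius:
                weighted_cross += deviations[i] * deviations[j]
                weight_sum += 1.0
    variance_sum = sum(d * d for d in deviations)
    return (n / weight_sum) * (weighted_cross / variance_sum)


# Toy example: four spots in a row, one unit apart
coords = [(0, 0), (1, 0), (2, 0), (3, 0)]
clustered = [1, 1, 0, 0]    # high expression on one side of the tissue
alternating = [1, 0, 1, 0]  # neighbors always differ
print(f"clustered:   {morans_i(coords, clustered):.3f}")
print(f"alternating: {morans_i(coords, alternating):.3f}")
```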

---

## Tips for Effective Task Specification

### 1. Be Specific About Data Formats and Locations

✅ Good:
```python
agent.go("Analyze scRNA-seq data in AnnData format at 'data/experiment1.h5ad'")
```

❌ Vague:
```python
agent.go("Analyze my data")
```

### 2. Specify Analysis Parameters

✅ Good:
```python
agent.go("""
Cluster cells using Leiden algorithm with resolution 0.5,
k-neighbors=10, using PCA components 1-30
""")
```

❌ Vague:
```python
agent.go("Cluster the cells")
```

### 3. Request Specific Outputs

✅ Good:
```python
agent.go("""
... and save results as:
- CSV table with statistics
- PNG figures at 300 DPI
- Processed data as AnnData at 'results/processed.h5ad'
""")
```

❌ Vague:
```python
agent.go("... and save the results")
```

### 4. Provide Biological Context

✅ Good:
```python
agent.go("""
This is a drug treatment experiment. Compare vehicle vs treated groups
to identify drug-induced transcriptional changes. Focus on apoptosis and
cell cycle pathways.
""")
```

❌ Vague:
```python
agent.go("Compare the two groups")
```

### 5. Break Complex Analyses into Steps

✅ Good:
```python
# Step 1
agent.go("Load and QC the data, save QC metrics")

# Step 2
agent.go("Based on QC, normalize and find HVGs")

# Step 3
agent.go("Cluster and annotate cell types")
```

❌ Overwhelming:
```python
agent.go("Do a complete scRNA-seq analysis pipeline")
```
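
The five tips can be combined by generating each step's prompt from explicit fields rather than writing it free-hand. The helper below is a hypothetical convenience function, not part of Biomni; its name, arguments, and prompt layout are assumptions, and the `agent.go` call is left commented out so the sketch runs without an agent.

```python
def build_task_prompt(goal, data_path, data_format, parameters=None, outputs=None, context=None):
    """Compose a specific task prompt from explicit fields (hypothetical helper)."""
    lines = []
    if context:
        lines.append(context)  # Tip 4: biological context
    lines.append(f"{goal} using {data_format} data at '{data_path}'.")  # Tip 1
    if parameters:
        lines.append("Parameters:")  # Tip 2
        lines.extend(f"- {name}: {value}" for name, value in parameters.items())
    if outputs:
        lines.append("Save results as:")  # Tip 3
        lines.extend(f"- {item}" for item in outputs)
    return "\n".join(lines)


# Tip 5: one prompt per step instead of a single monolithic request
steps = [
    build_task_prompt(
        goal="Load and QC the scRNA-seq dataset",
        data_path="data/experiment1.h5ad",
        data_format="AnnData",
        outputs=["CSV table with QC metrics"],
    ),
    build_task_prompt(
        goal="Cluster cells with the Leiden algorithm",
        data_path="results/processed.h5ad",
        data_format="AnnData",
        parameters={"resolution": 0.5, "k-neighbors": 10, "PCA components": "1-30"},
        outputs=["PNG figures at 300 DPI"],
        context="This is a drug treatment experiment comparing vehicle vs treated groups.",
    ),
]

for prompt in steps:
    print(prompt)
    print()
    # agent.go(prompt)  # uncomment once `agent` has been created with A1(...)
```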