Real-World Scientific Examples

This document provides comprehensive, practical examples demonstrating how to combine Claude Scientific Skills to solve real scientific problems across multiple domains.


📋 Table of Contents

  1. Drug Discovery & Medicinal Chemistry
  2. Cancer Genomics & Precision Medicine
  3. Single-Cell Transcriptomics
  4. Protein Structure & Function
  5. Chemical Safety & Toxicology
  6. Clinical Trial Analysis
  7. Metabolomics & Systems Biology
  8. Materials Science & Chemistry
  9. Digital Pathology
  10. Lab Automation & Protocol Design
  11. Agricultural Genomics
  12. Neuroscience & Brain Imaging
  13. Environmental Microbiology
  14. Infectious Disease Research
  15. Multi-Omics Integration
  16. Computational Chemistry & Synthesis
  17. Clinical Research & Real-World Evidence
  18. Experimental Physics & Data Analysis
  19. Chemical Engineering & Process Optimization

Drug Discovery & Medicinal Chemistry

Example 1: Discovery of Novel EGFR Inhibitors for Lung Cancer

Objective: Identify novel small molecule inhibitors of EGFR with improved properties compared to existing drugs.

Skills Used:

  • chembl-database - Query bioactivity data
  • pubchem-database - Search compound libraries
  • rdkit - Analyze molecular properties
  • datamol - Generate analogs
  • diffdock - Molecular docking
  • alphafold-database - Retrieve protein structure
  • pubmed-database - Literature review
  • cosmic-database - Query mutations
  • deepchem - Property prediction
  • scientific-visualization - Create figures

Workflow:

# Always use available 'skills' when possible. Keep the output organized.

Step 1: Query ChEMBL for known EGFR inhibitors with high potency
- Search for compounds targeting EGFR (CHEMBL203)
- Filter: IC50 < 50 nM, pChEMBL value > 7
- Extract SMILES strings and activity data
- Export to DataFrame for analysis

Step 2: Analyze structure-activity relationships
- Load compounds into RDKit
- Calculate molecular descriptors (MW, LogP, TPSA, HBD, HBA)
- Generate Morgan fingerprints (radius=2, 2048 bits)
- Perform hierarchical clustering to identify scaffolds
- Visualize top scaffolds with activity annotations
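
A minimal sketch of the Step 2 calculations with RDKit and SciPy, assuming the ChEMBL hits from Step 1 are available as a list of SMILES strings (`smiles_list` is a placeholder); clustering here groups compounds by Morgan-fingerprint Tanimoto distance as a proxy for scaffold families:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, DataStructs
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder input: SMILES exported from the ChEMBL query in Step 1
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccc2ncccc2c1", "Nc1ncnc2[nH]cnc12"]
mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles_list) if m is not None]

# Physicochemical descriptors (MW, LogP, TPSA, HBD, HBA)
descriptors = [
    {
        "MW": Descriptors.MolWt(m),
        "LogP": Descriptors.MolLogP(m),
        "TPSA": Descriptors.TPSA(m),
        "HBD": Descriptors.NumHDonors(m),
        "HBA": Descriptors.NumHAcceptors(m),
    }
    for m in mols
]

# Morgan fingerprints (radius=2, 2048 bits) and hierarchical clustering on
# Tanimoto distances (condensed distance vector for scipy's linkage)
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
dists = [
    1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
    for i in range(len(fps)) for j in range(i + 1, len(fps))
]
clusters = fcluster(linkage(dists, method="average"), t=0.4, criterion="distance")
print(descriptors[0], list(clusters))
```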

Step 3: Identify resistance mutations from COSMIC
- Query COSMIC for EGFR mutations in lung cancer
- Focus on gatekeeper mutations (T790M, C797S)
- Extract mutation frequencies and clinical significance
- Cross-reference with literature in PubMed

Step 4: Retrieve EGFR structure from AlphaFold
- Download AlphaFold prediction for EGFR kinase domain
- Alternatively, use experimental structure from PDB (if available)
- Prepare structure for docking (add hydrogens, optimize)

Step 5: Generate novel analogs using datamol
- Select top 5 scaffolds from ChEMBL analysis
- Use scaffold decoration to generate 100 analogs per scaffold
- Apply Lipinski's Rule of Five filtering
- Ensure synthetic accessibility (SA score < 4)
- Check for PAINS and unwanted substructures
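
The scaffold decoration itself is handled by datamol; the sketch below covers only the filtering half of Step 5 (Lipinski's Rule of Five plus a PAINS substructure check) using RDKit's bundled PAINS catalog. The SA-score filter needs RDKit's sascorer contrib module and is omitted here:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# PAINS catalog shipped with RDKit
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains = FilterCatalog(params)

def passes_filters(mol):
    """Lipinski Ro5 plus PAINS rejection."""
    ro5_ok = (
        Descriptors.MolWt(mol) <= 500
        and Descriptors.MolLogP(mol) <= 5
        and Descriptors.NumHDonors(mol) <= 5
        and Descriptors.NumHAcceptors(mol) <= 10
    )
    return ro5_ok and not pains.HasMatch(mol)

# 'analog_smiles' stands in for the analogs generated with datamol
analog_smiles = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1"]
kept = [s for s in analog_smiles
        if (m := Chem.MolFromSmiles(s)) is not None and passes_filters(m)]
print(f"{len(kept)}/{len(analog_smiles)} analogs pass the filters")
```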

Step 6: Predict properties with DeepChem
- Train graph convolutional model on ChEMBL EGFR data
- Predict pIC50 for generated analogs
- Predict ADMET properties (solubility, permeability, hERG)
- Rank candidates by predicted potency and drug-likeness

Step 7: Virtual screening with DiffDock
- Perform molecular docking on top 50 candidates
- Dock into wild-type EGFR and T790M mutant
- Calculate binding energies and interaction patterns
- Identify compounds with favorable binding to both forms

Step 8: Search PubChem for commercial availability
- Query PubChem for top 10 candidates by InChI key
- Check supplier information and purchasing options
- Identify close analogs if exact matches unavailable

Step 9: Literature validation with PubMed
- Search for any prior art on top scaffolds
- Query: "[scaffold_name] AND EGFR AND inhibitor"
- Summarize relevant findings and potential liabilities

Step 10: Create comprehensive report
- Generate 2D structure visualizations of top hits
- Create scatter plots: MW vs LogP, TPSA vs potency
- Produce binding pose figures for top 3 compounds
- Generate table comparing properties to approved drugs (gefitinib, erlotinib)
- Write scientific summary with methodology, results, and recommendations
- Export to PDF with proper citations

Expected Output: 
- Ranked list of 10-20 novel EGFR inhibitor candidates
- Predicted activity and ADMET properties
- Docking poses and binding analysis
- Comprehensive scientific report with publication-quality figures

Example 2: Drug Repurposing for Rare Diseases

Objective: Identify FDA-approved drugs that could be repurposed for treating a rare metabolic disorder.

Skills Used:

  • drugbank-database - Query approved drugs
  • opentargets-database - Target-disease associations
  • string-database - Protein interactions
  • kegg-database - Pathway analysis
  • reactome-database - Pathway enrichment
  • clinicaltrials-database - Check ongoing trials
  • fda-database - Drug approvals and safety
  • networkx - Network analysis
  • literature-review - Systematic review

Workflow:

Step 1: Define disease pathway
- Query KEGG and Reactome for disease-associated pathways
- Identify key proteins and enzymes involved
- Map upstream and downstream pathway components

Step 2: Find protein-protein interactions
- Query STRING database for interaction partners
- Build protein interaction network around key disease proteins
- Identify hub proteins and bottlenecks using NetworkX
- Calculate centrality metrics (betweenness, closeness)
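
A small NetworkX sketch for the hub/bottleneck analysis in Step 2; the edge list is a hypothetical placeholder for interactions returned by the STRING query (confidence scores are ignored here, but could be converted to distances for weighted centrality):

```python
import networkx as nx

# Placeholder edges (protein A, protein B) from a STRING interaction query
string_edges = [
    ("GBA", "SCARB2"), ("GBA", "PSAP"), ("PSAP", "CTSD"),
    ("SCARB2", "CTSD"), ("CTSD", "LAMP1"), ("PSAP", "LAMP1"),
]

G = nx.Graph()
G.add_edges_from(string_edges)

# Centrality metrics used to flag hubs and bottlenecks
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
degree = dict(G.degree())

for node in sorted(G.nodes, key=betweenness.get, reverse=True):
    print(f"{node}: degree={degree[node]}, "
          f"betweenness={betweenness[node]:.3f}, closeness={closeness[node]:.3f}")
```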

Step 3: Query Open Targets for druggable targets
- Search for targets associated with disease phenotype
- Filter by clinical precedence and tractability
- Prioritize targets with existing approved drugs

Step 4: Search DrugBank for drugs targeting identified proteins
- Query for approved drugs and their targets
- Filter by mechanism of action relevant to disease
- Retrieve drug properties and safety information

Step 5: Query FDA databases for safety profiles
- Check FDA adverse event database (FAERS)
- Review drug labels and black box warnings
- Assess risk-benefit for rare disease population

Step 6: Search ClinicalTrials.gov for prior repurposing attempts
- Query for disease name + drug names
- Check for failed trials (and reasons for failure)
- Identify ongoing trials that may compete

Step 7: Perform pathway enrichment analysis
- Map drug targets to disease pathways
- Calculate enrichment scores with Reactome
- Identify drugs affecting multiple pathway nodes

Step 8: Conduct systematic literature review
- Search PubMed for drug name + disease associations
- Include bioRxiv for recent unpublished findings
- Document any case reports or off-label use
- Use literature-review skill to generate comprehensive review

Step 9: Prioritize candidates
- Rank by: pathway relevance, safety profile, existing evidence
- Consider factors: oral availability, blood-brain barrier penetration
- Assess commercial viability and patent status

Step 10: Generate repurposing report
- Create network visualization of drug-target-pathway relationships
- Generate comparison table of top 5 candidates
- Write detailed rationale for each candidate
- Include mechanism of action diagrams
- Provide recommendations for preclinical validation
- Format as professional PDF with citations

Expected Output:
- Ranked list of 5-10 repurposing candidates
- Network analysis of drug-target-disease relationships
- Safety and efficacy evidence summary
- Repurposing strategy report with next steps

Cancer Genomics & Precision Medicine

Example 3: Clinical Variant Interpretation Pipeline

Objective: Analyze a patient's tumor sequencing data to identify actionable mutations and therapeutic recommendations.

Skills Used:

  • pysam - Parse VCF files
  • ensembl-database - Variant annotation
  • clinvar-database - Clinical significance
  • cosmic-database - Somatic mutations
  • gene-database - Gene information
  • uniprot-database - Protein impact
  • drugbank-database - Drug-gene associations
  • clinicaltrials-database - Matching trials
  • opentargets-database - Target validation
  • pubmed-database - Literature evidence
  • reportlab - Generate clinical report

Workflow:

Step 1: Parse and filter VCF file
- Use pysam to read tumor VCF
- Filter for high-quality variants (QUAL > 30, DP > 20)
- Extract variant positions, alleles, and VAF (variant allele frequency)
- Separate SNVs, indels, and structural variants
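
A minimal pysam sketch of the Step 1 filters, assuming a bgzipped, indexed tumor VCF at a placeholder path and taking VAF from the first sample's AD field when it is present:

```python
import pysam

vcf = pysam.VariantFile("tumor.vcf.gz")  # placeholder path

passing = []
for rec in vcf:
    dp = rec.info.get("DP", 0)
    if rec.qual is None or rec.qual <= 30 or dp <= 20:
        continue  # require QUAL > 30 and DP > 20
    sample = next(iter(rec.samples.values()))
    ad = sample["AD"] if "AD" in sample else None
    vaf = ad[1] / sum(ad) if ad and sum(ad) > 0 else None
    alts = rec.alts or ()
    kind = "SNV" if all(len(a) == 1 for a in (rec.ref, *alts)) else "indel/SV"
    passing.append((rec.chrom, rec.pos, rec.ref, alts, vaf, kind))

print(f"{len(passing)} variants pass the quality filters")
```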

Step 2: Annotate variants with Ensembl
- Query Ensembl VEP API for functional consequences
- Classify variants: missense, nonsense, frameshift, splice site
- Extract transcript information and protein changes
- Identify canonical transcripts for each gene

Step 3: Query ClinVar for known pathogenic variants
- Search ClinVar by genomic coordinates
- Extract clinical significance classifications
- Note conflicting interpretations and review status
- Prioritize variants with "Pathogenic" or "Likely Pathogenic" labels

Step 4: Query COSMIC for somatic cancer mutations
- Search COSMIC for each variant
- Extract mutation frequency across cancer types
- Identify hotspot mutations (high recurrence)
- Note drug resistance mutations

Step 5: Retrieve gene information from NCBI Gene
- Get detailed gene descriptions
- Extract associated phenotypes and diseases
- Identify oncogene vs tumor suppressor classification
- Note gene function and biological pathways

Step 6: Assess protein-level impact with UniProt
- Query UniProt for protein domain information
- Map variants to functional domains (kinase domain, binding site)
- Check if variant affects active sites or protein stability
- Retrieve post-translational modification sites

Step 7: Search DrugBank for targetable alterations
- Query for drugs targeting mutated genes
- Filter for FDA-approved and investigational drugs
- Extract mechanism of action and indications
- Prioritize variants with approved targeted therapies

Step 8: Query Open Targets for target-disease associations
- Validate therapeutic hypotheses
- Assess target tractability scores
- Review clinical precedence for each gene-disease pair

Step 9: Search ClinicalTrials.gov for matching trials
- Build query with: cancer type + gene names + variants
- Filter for: recruiting status, phase II/III trials
- Extract trial eligibility criteria
- Note geographic locations and contact information

Step 10: Literature search for clinical evidence
- PubMed query: "[gene] AND [variant] AND [cancer type]"
- Focus on: case reports, clinical outcomes, resistance mechanisms
- Extract relevant prognostic or predictive information

Step 11: Classify variants by actionability
Tier 1: FDA-approved therapy for this variant
Tier 2: Clinical trial available for this variant
Tier 3: Therapy approved for variant in different cancer
Tier 4: Biological evidence but no approved therapy

Step 12: Generate clinical genomics report
- Executive summary of key findings
- Table of actionable variants with evidence levels
- Therapeutic recommendations with supporting evidence
- Clinical trial options with eligibility information
- Prognostic implications based on mutation profile
- References to guidelines (NCCN, ESMO, AMP/ASCO/CAP)
- Generate professional PDF using ReportLab

Expected Output:
- Annotated variant list with clinical significance
- Tiered list of actionable mutations
- Therapeutic recommendations with evidence levels
- Matching clinical trials
- Comprehensive clinical genomics report (PDF)

Example 4: Cancer Subtype Classification from Gene Expression

Objective: Classify breast cancer subtypes using RNA-seq data and identify subtype-specific therapeutic vulnerabilities.

Skills Used:

  • pydeseq2 - Differential expression
  • scanpy - Clustering and visualization
  • scikit-learn - Machine learning classification
  • gene-database - Gene annotation
  • reactome-database - Pathway analysis
  • opentargets-database - Drug targets
  • pubmed-database - Literature validation
  • matplotlib - Visualization
  • seaborn - Heatmaps

Workflow:

Step 1: Load and preprocess RNA-seq data
- Load count matrix (genes × samples)
- Filter low-expression genes (mean counts < 10)
- Normalize with DESeq2 size factors
- Apply variance-stabilizing transformation (VST)

Step 2: Classify samples using PAM50 genes
- Query NCBI Gene for PAM50 classifier gene list
- Extract expression values for PAM50 genes
- Train Random Forest classifier on labeled training data
- Predict subtypes: Luminal A, Luminal B, HER2+, Basal, Normal-like
- Validate with published markers (ESR1, PGR, ERBB2, MKI67)

Step 3: Perform differential expression for each subtype
- Use PyDESeq2 to compare each subtype vs all others
- Apply multiple testing correction (FDR < 0.05)
- Filter by log2 fold change (|LFC| > 1.5)
- Identify subtype-specific signature genes
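
A sketch of the one-vs-rest differential expression in Step 3 with PyDESeq2, assuming the 0.4-style API (`design_factors`; newer releases use a `design` formula instead), raw counts as a samples × genes DataFrame, and a metadata column `subtype` coded as "Basal" vs "Other":

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Placeholder inputs: integer counts (samples x genes, transpose if needed)
# and per-sample metadata with a 'subtype' column
counts = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)

dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="subtype")
dds.deseq2()  # size factors, dispersions, and GLM fitting

stats = DeseqStats(dds, contrast=["subtype", "Basal", "Other"])
stats.summary()
res = stats.results_df

# FDR < 0.05 and |log2FC| > 1.5, as specified above
signature = res[(res["padj"] < 0.05) & (res["log2FoldChange"].abs() > 1.5)]
print(signature.sort_values("padj").head())
```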

Step 4: Annotate differentially expressed genes
- Query NCBI Gene for detailed annotations
- Classify as oncogene, tumor suppressor, or other
- Extract biological process and molecular function terms

Step 5: Pathway enrichment analysis
- Submit gene lists to Reactome API
- Identify enriched pathways for each subtype (p < 0.01)
- Focus on druggable pathways (kinase signaling, metabolism)
- Compare pathway profiles across subtypes

Step 6: Identify therapeutic targets with Open Targets
- Query Open Targets for each upregulated gene
- Filter by tractability score > 5
- Prioritize targets with clinical precedence
- Extract associated drugs and development phase

Step 7: Create comprehensive visualization
- Generate UMAP projection of all samples colored by subtype
- Create heatmap of PAM50 genes across subtypes
- Produce volcano plots for each subtype comparison
- Generate pathway enrichment dot plots
- Create drug target-pathway network diagrams

Step 8: Literature validation
- Search PubMed for each predicted therapeutic target
- Query: "[gene] AND [subtype] AND breast cancer AND therapy"
- Summarize clinical evidence and ongoing trials
- Note any resistance mechanisms reported

Step 9: Generate subtype-specific recommendations
For each subtype:
- List top 5 differentially expressed genes
- Identify enriched biological pathways
- Recommend therapeutic strategies based on vulnerabilities
- Cite supporting evidence from literature

Step 10: Create comprehensive report
- Classification results with confidence scores
- Differential expression tables for each subtype
- Pathway enrichment summaries
- Therapeutic target recommendations
- Publication-quality figures
- Export to PDF with citations

Expected Output:
- Sample classification into molecular subtypes
- Subtype-specific gene signatures
- Pathway enrichment profiles
- Prioritized therapeutic targets for each subtype
- Scientific report with visualizations and recommendations

Single-Cell Transcriptomics

Example 5: Single-Cell Atlas of Tumor Microenvironment

Objective: Characterize immune cell populations in tumor microenvironment and identify immunotherapy biomarkers.

Skills Used:

  • scanpy - Single-cell analysis
  • scvi-tools - Batch correction and integration
  • cellxgene-census - Reference data
  • gene-database - Cell type markers
  • anndata - Data structure
  • arboreto - Gene regulatory networks
  • pytorch-lightning - Deep learning
  • matplotlib - Visualization
  • statistical-analysis - Hypothesis testing

Workflow:

Step 1: Load and QC 10X Genomics data
- Use Scanpy to read 10X h5 files
- Calculate QC metrics: n_genes, n_counts, pct_mitochondrial
- Identify mitochondrial genes (MT- prefix)
- Filter cells: 200 < n_genes < 5000, pct_mt < 20%
- Filter genes: expressed in at least 10 cells
- Document filtering criteria and cell retention rate

Step 2: Normalize and identify highly variable genes
- Normalize to 10,000 counts per cell
- Log-transform data (log1p)
- Store raw counts in adata.raw
- Identify 3,000 highly variable genes
- Regress out technical variation (n_counts, pct_mt)
- Scale to unit variance, clip at 10 standard deviations
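
A minimal Scanpy sketch covering the QC and normalization in Steps 1-2, assuming a single 10x Genomics h5 file at a placeholder path:

```python
import scanpy as sc

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")  # placeholder path
adata.var_names_make_unique()

# QC metrics, including percent mitochondrial counts (MT- prefix)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Cell and gene filters from Step 1
keep = (
    (adata.obs["n_genes_by_counts"] > 200)
    & (adata.obs["n_genes_by_counts"] < 5000)
    & (adata.obs["pct_counts_mt"] < 20)
)
adata = adata[keep].copy()
sc.pp.filter_genes(adata, min_cells=10)

# Normalization, HVG selection, regression, and scaling from Step 2
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
```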

Step 3: Integrate with reference atlas using scVI
- Download reference tumor microenvironment data from Cellxgene Census
- Train scVI model on combined dataset for batch correction
- Use scVI latent representation for downstream analysis
- Generate batch-corrected expression matrix

Step 4: Dimensionality reduction and clustering
- Compute neighborhood graph (n_neighbors=15, n_pcs=50)
- Calculate UMAP embedding for visualization
- Perform Leiden clustering at multiple resolutions (0.3, 0.5, 0.8)
- Select optimal resolution based on silhouette score
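
Continuing the sketch above into Step 4, a standard Scanpy neighbors/UMAP/Leiden pass; when the scVI integration from Step 3 is used, `sc.pp.neighbors` would instead take `use_rep="X_scVI"` (key name assumed) rather than PCA:

```python
import scanpy as sc

sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
sc.tl.umap(adata)

# Leiden clustering at several resolutions, each stored under its own key
for res in (0.3, 0.5, 0.8):
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")

# Marker genes per cluster (Wilcoxon test) for the chosen resolution, feeding Step 5
sc.tl.rank_genes_groups(adata, groupby="leiden_0.5", method="wilcoxon")
sc.pl.umap(adata, color="leiden_0.5", save="_clusters.png")
```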

Step 5: Identify cell type markers
- Run differential expression for each cluster (Wilcoxon test)
- Calculate marker scores (log fold change, p-value, pct expressed)
- Query NCBI Gene for canonical immune cell markers:
  * T cells: CD3D, CD3E, CD4, CD8A
  * B cells: CD19, MS4A1 (CD20), CD79A
  * Myeloid: CD14, CD68, CD163
  * NK cells: NKG7, GNLY, NCAM1
  * Dendritic: CD1C, CLEC9A, LILRA4

Step 6: Annotate cell types
- Assign cell type labels based on marker expression
- Refine annotations with CellTypist or manual curation
- Identify T cell subtypes: CD4+, CD8+, Tregs, exhausted T cells
- Characterize myeloid cells: M1/M2 macrophages, dendritic cells
- Create cell type proportion tables by sample/condition

Step 7: Identify tumor-specific features
- Compare tumor samples vs normal tissue (if available)
- Identify expanded T cell clones (high proliferation markers)
- Detect exhausted T cells (PDCD1, CTLA4, LAG3, HAVCR2)
- Characterize immunosuppressive populations (Tregs, M2 macrophages)

Step 8: Gene regulatory network inference
- Use Arboreto/GRNBoost2 on each major cell type
- Identify transcription factors driving cell states
- Focus on exhaustion TFs: TOX, TCF7, EOMES
- Build regulatory networks for visualization

Step 9: Statistical analysis of cell proportions
- Calculate cell type frequencies per sample
- Test for significant differences between groups (responders vs non-responders)
- Use statistical-analysis skill for appropriate tests (t-test, Mann-Whitney)
- Calculate effect sizes and confidence intervals

Step 10: Biomarker discovery for immunotherapy response
- Correlate cell type abundances with clinical response
- Identify gene signatures associated with response
- Test signatures: T cell exhaustion, antigen presentation, inflammation
- Validate with published immunotherapy response signatures

Step 11: Create comprehensive visualizations
- UMAP plots colored by: cell type, sample, treatment, key genes
- Dot plots of canonical markers across cell types
- Cell type proportion bar plots by condition
- Heatmap of top differentially expressed genes per cell type
- Gene regulatory network diagrams
- Volcano plots for differentially abundant cell types

Step 12: Generate scientific report
- Methods: QC, normalization, batch correction, clustering
- Results: Cell type composition, differential abundance, markers
- Biomarker analysis: Predictive signatures and validation
- High-quality figures suitable for publication
- Export processed h5ad file and PDF report

Expected Output:
- Annotated single-cell atlas with cell type labels
- Cell type composition analysis
- Biomarker signatures for immunotherapy response
- Gene regulatory networks for key cell states
- Comprehensive report with publication-quality figures

Protein Structure & Function

Example 6: Structure-Based Design of Protein-Protein Interaction Inhibitors

Objective: Design small molecules to disrupt a therapeutically relevant protein-protein interaction.

Skills Used:

  • alphafold-database - Protein structures
  • pdb-database - Experimental structures
  • uniprot-database - Protein information
  • biopython - Structure analysis
  • pyrosetta - Protein design (if available)
  • rdkit - Chemical library generation
  • diffdock - Molecular docking
  • zinc-database - Screening library
  • deepchem - Property prediction
  • pymol - Visualization (external)

Workflow:

Step 1: Retrieve protein structures
- Query AlphaFold Database for both proteins in the interaction
- Download PDB files and confidence scores
- If available, get experimental structures from PDB database
- Compare AlphaFold predictions with experimental structures (if any)

Step 2: Analyze protein interaction interface
- Load structures with BioPython
- Identify interface residues (distance < 5Å between proteins)
- Calculate interface area and binding energy contribution
- Identify hot spot residues (key for binding)
- Map to UniProt to get functional annotations
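
A Biopython sketch of the interface-residue scan in Step 2, assuming a placeholder PDB file containing both partners as chains A and B:

```python
from Bio.PDB import PDBParser, NeighborSearch

parser = PDBParser(QUIET=True)
structure = parser.get_structure("complex", "complex.pdb")  # placeholder file
model = structure[0]
chain_a, chain_b = model["A"], model["B"]  # assumed chain IDs

# Interface residues: any residue in chain A with an atom within 5 Å of chain B
ns = NeighborSearch(list(chain_b.get_atoms()))
interface = set()
for residue in chain_a:
    for atom in residue:
        if ns.search(atom.coord, 5.0):
            interface.add((residue.id[1], residue.get_resname()))
            break

for resnum, resname in sorted(interface):
    print(f"Chain A {resname}{resnum} is at the interface")
```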

Step 3: Characterize binding pocket
- Identify cavities at the protein-protein interface
- Calculate pocket volume and surface area
- Assess druggability: depth, hydrophobicity, shape
- Identify hydrogen bond donors/acceptors
- Note any known allosteric sites

Step 4: Query UniProt for known modulators
- Search UniProt for both proteins
- Extract information on known inhibitors or modulators
- Review PTMs that affect interaction
- Check disease-associated mutations in interface

Step 5: Search ZINC15 for fragment library
- Query ZINC for fragments matching pocket criteria:
  * Molecular weight: 150-300 Da
  * LogP: 0-3 (appropriate for PPI inhibitors)
  * Exclude PAINS and aggregators
- Download 1,000-5,000 fragment SMILES

Step 6: Virtual screening with fragment library
- Use DiffDock to dock fragments into interface pocket
- Rank by predicted binding affinity
- Identify fragments binding to hot spot residues
- Select top 50 fragments for elaboration

Step 7: Fragment elaboration with RDKit
- For each fragment hit, generate elaborated molecules:
  * Add substituents to core scaffold
  * Merge fragments binding to adjacent pockets
  * Apply medicinal chemistry filters
- Generate 20-50 analogs per fragment
- Filter by Lipinski's Ro5 and PPI-specific rules (MW 400-700)

Step 8: Second round of virtual screening
- Dock elaborated molecules with DiffDock
- Calculate binding energies and interaction patterns
- Prioritize molecules with:
  * Strong binding to hot spot residues
  * Multiple H-bonds and hydrophobic contacts
  * Favorable predicted ΔG

Step 9: Predict ADMET properties with DeepChem
- Train models on ChEMBL data
- Predict: solubility, permeability, hERG liability
- Filter for drug-like properties
- Rank by overall score (affinity + ADMET)

Step 10: Literature and patent search
- PubMed: "[protein A] AND [protein B] AND inhibitor"
- USPTO: Check for prior art on top scaffolds
- Assess freedom to operate
- Identify any reported PPI inhibitors for this target

Step 11: Prepare molecules for synthesis
- Assess synthetic accessibility (SA score < 4)
- Identify commercial building blocks
- Propose synthetic routes for top 10 candidates
- Calculate estimated synthesis cost

Step 12: Generate comprehensive design report
- Interface analysis with hot spot identification
- Fragment screening results
- Top 10 designed molecules with predicted properties
- Docking poses and interaction diagrams
- Synthetic accessibility assessment
- Comparison to known PPI inhibitors
- Recommendations for experimental validation
- Publication-quality figures and PDF report

Expected Output:
- Interface characterization and hot spot analysis
- Ranked library of designed PPI inhibitors
- Predicted binding modes and affinities
- ADMET property predictions
- Synthetic accessibility assessment
- Comprehensive drug design report

Chemical Safety & Toxicology

Example 7: Predictive Toxicology Assessment

Objective: Assess potential toxicity and safety liabilities of drug candidates before synthesis.

Skills Used:

  • rdkit - Molecular descriptors
  • deepchem - Toxicity prediction
  • chembl-database - Toxicity data
  • pubchem-database - Bioassay data
  • drugbank-database - Known drug toxicities
  • fda-database - Adverse events
  • hmdb-database - Metabolite prediction
  • scikit-learn - Classification models
  • shap - Model interpretability

Workflow:

Step 1: Calculate molecular descriptors
- Load candidate molecules with RDKit
- Calculate physicochemical properties:
  * MW, LogP, TPSA, rotatable bonds, H-bond donors/acceptors
  * Aromatic rings, sp3 fraction, formal charge
- Calculate structural alerts:
  * PAINS patterns
  * Toxic functional groups (nitroaromatics, epoxides, etc.)
  * Genotoxic alerts (Ames mutagenicity)

Step 2: Screen for known toxicophores
- Search for structural alerts using SMARTS patterns:
  * Michael acceptors
  * Aldehyde/ketone reactivity
  * Quinones and quinone-like structures
  * Thioureas and isocyanates
- Flag molecules with high-risk substructures
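
A sketch of the Step 2 toxicophore screen with RDKit SMARTS matching; the patterns below are deliberately simple illustrations, not a validated alert set (a production screen would use curated collections such as the Brenk or PAINS filters):

```python
from rdkit import Chem

ALERTS = {
    "nitro_group": "[N+](=O)[O-]",
    "epoxide": "C1OC1",
    "aldehyde": "[CX3H1]=O",
    "michael_acceptor": "C=CC=O",
    "isocyanate": "N=C=O",
}
ALERT_PATTERNS = {name: Chem.MolFromSmarts(s) for name, s in ALERTS.items()}

def structural_alerts(smiles: str) -> list[str]:
    """Return the alert names matched by a molecule (empty list = clean)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["unparseable"]
    return [name for name, patt in ALERT_PATTERNS.items() if mol.HasSubstructMatch(patt)]

print(structural_alerts("O=CC=Cc1ccc([N+](=O)[O-])cc1"))  # flags nitro, aldehyde, Michael acceptor
```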

Step 3: Query ChEMBL for similar compounds with toxicity data
- Perform similarity search (Tanimoto > 0.7)
- Extract toxicity assay results:
  * Cytotoxicity (IC50 values)
  * Hepatotoxicity markers
  * Cardiotoxicity (hERG inhibition)
  * Genotoxicity (Ames test results)
- Analyze structure-toxicity relationships

Step 4: Search PubChem BioAssays for toxicity screening
- Query relevant assays:
  * Tox21 panel (cell viability, stress response, genotoxicity)
  * Liver toxicity assays
  * hERG channel inhibition
- Extract activity data for similar compounds
- Calculate hit rates for concerning assays

Step 5: Train toxicity prediction models with DeepChem
- Load Tox21 dataset from DeepChem
- Train graph convolutional models for:
  * Nuclear receptor signaling
  * Stress response pathways
  * Genotoxicity endpoints
- Validate models with cross-validation
- Predict toxicity for candidate molecules

Step 6: Predict hERG cardiotoxicity liability
- Train DeepChem model on hERG inhibition data from ChEMBL
- Predict IC50 for hERG channel
- Flag compounds with predicted IC50 < 10 μM
- Identify structural features associated with hERG liability

Step 7: Predict hepatotoxicity risk
- Train models on DILI (drug-induced liver injury) datasets
- Extract features: reactive metabolites, mitochondrial toxicity
- Predict hepatotoxicity risk class (low/medium/high)
- Use SHAP values to explain predictions

Step 8: Predict metabolic stability and metabolites
- Identify sites of metabolism using RDKit SMARTS patterns
- Predict CYP450 interactions
- Query HMDB for potential metabolite structures
- Assess if metabolites contain toxic substructures
- Predict metabolic stability (half-life)

Step 9: Check FDA adverse event database
- Query FAERS for approved drugs similar to candidates
- Extract common adverse events
- Identify target organ toxicities
- Calculate reporting odds ratios for serious events

Step 10: Literature review of toxicity mechanisms
- PubMed search: "[scaffold] AND (toxicity OR hepatotoxicity OR cardiotoxicity)"
- Identify mechanistic studies on similar compounds
- Note any case reports of adverse events
- Review preclinical and clinical safety data

Step 11: Assess ADME liabilities
- Predict solubility, permeability, plasma protein binding
- Identify potential drug-drug interaction risks
- Assess blood-brain barrier penetration (desired for CNS indications, a liability otherwise)
- Evaluate metabolic stability

Step 12: Generate safety assessment report
- Executive summary of safety profile for each candidate
- Red flags: structural alerts, predicted toxicities
- Yellow flags: moderate concerns requiring testing
- Green light: acceptable predicted safety profile
- Comparison table of all candidates
- Recommendations for risk mitigation:
  * Structural modifications to reduce toxicity
  * Priority in vitro assays to run
  * Preclinical study design recommendations
- Comprehensive PDF report with:
  * Toxicophore analysis
  * Prediction model results with confidence
  * SHAP interpretation plots
  * Literature evidence
  * Risk assessment matrix

Expected Output:
- Toxicity predictions for all candidates
- Structural alert analysis
- hERG, hepatotoxicity, and genotoxicity risk scores
- Metabolite predictions
- Prioritized list with safety rankings
- Comprehensive toxicology assessment report

Clinical Trial Analysis

Example 8: Competitive Landscape Analysis for New Indication

Objective: Analyze the clinical trial landscape for a specific indication to inform development strategy.

Skills Used:

  • clinicaltrials-database - Trial registry
  • fda-database - Drug approvals
  • pubmed-database - Published results
  • drugbank-database - Approved drugs
  • opentargets-database - Target validation
  • polars - Data manipulation
  • matplotlib - Visualization
  • seaborn - Statistical plots
  • reportlab - Report generation

Workflow:

Step 1: Search ClinicalTrials.gov for all trials in indication
- Query: "[disease/indication]"
- Filter: All phases, all statuses
- Extract fields:
  * NCT ID, title, phase, status
  * Start date, completion date, enrollment
  * Intervention/drug names
  * Primary/secondary outcomes
  * Sponsor and collaborators
- Export to structured JSON/CSV
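
A sketch of the Step 1 registry pull against the ClinicalTrials.gov v2 API; the endpoint, parameter names, and nested field paths shown here are assumptions to verify against the current API documentation, and the condition string is a placeholder:

```python
import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"
params = {"query.cond": "idiopathic pulmonary fibrosis", "pageSize": 100}  # placeholder condition

records, token = [], None
while True:
    if token:
        params["pageToken"] = token
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    for study in data.get("studies", []):
        proto = study.get("protocolSection", {})
        ident = proto.get("identificationModule", {})
        records.append({
            "nct_id": ident.get("nctId"),
            "title": ident.get("briefTitle"),
            "status": proto.get("statusModule", {}).get("overallStatus"),
            "phases": proto.get("designModule", {}).get("phases"),
        })
    token = data.get("nextPageToken")
    if not token:
        break

print(f"Retrieved {len(records)} trial records")
```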

Step 2: Categorize trials by mechanism of action
- Extract drug names and intervention types
- Query DrugBank for mechanism of action
- Query Open Targets for target information
- Classify into categories:
  * Small molecules vs biologics
  * Target class (kinase inhibitor, antibody, etc.)
  * Novel vs repurposing

Step 3: Analyze trial phase progression
- Calculate success rates by phase (I → II, II → III)
- Identify terminated trials and reasons for termination
- Track time from phase I start to NDA submission
- Calculate median development timelines

Step 4: Search FDA database for recent approvals
- Query FDA drug approvals in the indication (last 10 years)
- Extract approval dates, indications, priority review status
- Note any accelerated approvals or breakthroughs
- Review FDA drug labels for safety information

Step 5: Extract outcome measures
- Compile all primary endpoints used
- Identify most common endpoints:
  * Survival (OS, PFS, DFS)
  * Response rates (ORR, CR, PR)
  * Biomarker endpoints
  * Patient-reported outcomes
- Note emerging or novel endpoints

Step 6: Analyze competitive dynamics
- Identify leading companies and their pipelines
- Map trials by phase for each major competitor
- Note partnership and licensing deals
- Assess crowded vs underserved patient segments

Step 7: Search PubMed for published trial results
- Query: "[NCT ID]" for each completed trial
- Extract published outcomes and conclusions
- Identify trends in efficacy and safety
- Note any unmet needs highlighted in discussions

Step 8: Analyze target validation evidence
- Query Open Targets for target-disease associations
- Extract genetic evidence scores
- Review tractability assessments
- Compare targets being pursued across trials

Step 9: Identify unmet needs and opportunities
- Analyze trial failures for common patterns
- Identify patient populations excluded from trials
- Note resistance mechanisms or limitations mentioned
- Assess gaps in current therapeutic approaches

Step 10: Perform temporal trend analysis
- Plot trial starts over time (by phase, mechanism)
- Identify increasing or decreasing interest in targets
- Correlate with publication trends and scientific advances
- Predict future trends in the space

Step 11: Create comprehensive visualizations
- Timeline of all trials (Gantt chart style)
- Phase distribution pie chart
- Mechanism of action breakdown
- Geographic distribution of trials
- Enrollment trends over time
- Success rate funnels (Phase I → II → III → Approval)
- Sponsor/company market share

Step 12: Generate competitive intelligence report
- Executive summary of competitive landscape
- Total number of active programs by phase
- Key players and their development stage
- Standard of care and approved therapies
- Emerging approaches and novel targets
- Identified opportunities and white space
- Risk analysis (crowded targets, high failure rates)
- Strategic recommendations:
  * Patient population to target
  * Differentiation strategies
  * Partnership opportunities
  * Regulatory pathway considerations
- Export as professional PDF with citations and data tables

Expected Output:
- Comprehensive trial database for indication
- Success rate and timeline statistics
- Competitive landscape mapping
- Unmet need analysis
- Strategic recommendations
- Publication-ready report with visualizations

Metabolomics & Systems Biology

Example 9: Multi-Omics Integration for Metabolic Disease

Objective: Integrate transcriptomics, proteomics, and metabolomics to identify dysregulated pathways in metabolic disease.

Skills Used:

  • pydeseq2 - RNA-seq analysis
  • pyopenms - Mass spectrometry
  • hmdb-database - Metabolite identification
  • metabolomics-workbench-database - Public datasets
  • kegg-database - Pathway mapping
  • reactome-database - Pathway analysis
  • string-database - Protein interactions
  • statsmodels - Multi-omics correlation
  • networkx - Network analysis
  • pymc - Bayesian modeling

Workflow:

Step 1: Process RNA-seq data
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare disease vs control (adjusted p < 0.05, |LFC| > 1)
- Extract gene symbols and fold changes
- Map to KEGG gene IDs

Step 2: Process proteomics data
- Load LC-MS/MS results with PyOpenMS
- Perform peptide identification and quantification
- Normalize protein abundances
- Run statistical testing (t-test or limma)
- Extract significant proteins (p < 0.05, |FC| > 1.5)

Step 3: Process metabolomics data
- Load untargeted metabolomics data (mzML format) with PyOpenMS
- Perform peak detection and alignment
- Match features to HMDB database by accurate mass
- Annotate metabolites with MS/MS fragmentation
- Extract putative identifications (Level 2/3)
- Perform statistical analysis (FDR < 0.05, |FC| > 2)

Step 4: Search Metabolomics Workbench for public data
- Query for same disease or tissue type
- Download relevant studies
- Reprocess for consistency with own data
- Use as validation cohort

Step 5: Map all features to KEGG pathways
- Map genes to KEGG orthology (KO) terms
- Map proteins to KEGG identifiers
- Map metabolites to KEGG compound IDs
- Identify pathways with multi-omics coverage

Step 6: Perform pathway enrichment analysis
- Test for enrichment in KEGG pathways
- Test for enrichment in Reactome pathways
- Apply Fisher's exact test with multiple testing correction
- Focus on pathways with hits in ≥2 omics layers

Step 7: Build protein-metabolite networks
- Query STRING for protein-protein interactions
- Map proteins to KEGG reactions
- Connect enzymes to their substrates/products
- Build integrated network with genes → proteins → metabolites

Step 8: Network topology analysis with NetworkX
- Calculate node centrality (degree, betweenness)
- Identify hub metabolites and key enzymes
- Find bottleneck reactions
- Detect network modules with community detection
- Identify dysregulated subnetworks

Step 9: Correlation analysis across omics layers
- Calculate Spearman correlations between:
  * Gene expression and protein abundance
  * Protein abundance and metabolite levels
  * Gene expression and metabolites (for enzyme-product pairs)
- Use statsmodels for significance testing
- Focus on enzyme-metabolite pairs with expected relationships
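
A sketch of the Step 9 correlation analysis: Spearman correlations per enzyme-metabolite pair with Benjamini-Hochberg correction across all tests. The paired arrays are random placeholders standing in for matched abundances across the same samples:

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Placeholder paired measurements (protein abundance, metabolite level) per pair
pairs = {
    ("HK2", "glucose-6-phosphate"): (rng.normal(size=30), rng.normal(size=30)),
    ("CPT1A", "palmitoylcarnitine"): (rng.normal(size=30), rng.normal(size=30)),
}

names, rhos, pvals = [], [], []
for (enzyme, metabolite), (x, y) in pairs.items():
    rho, p = spearmanr(x, y)
    names.append(f"{enzyme} ~ {metabolite}")
    rhos.append(rho)
    pvals.append(p)

# FDR control across all enzyme-metabolite tests
reject, padj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for name, rho, q, sig in zip(names, rhos, padj, reject):
    print(f"{name}: rho={rho:.2f}, FDR={q:.3g}, significant={sig}")
```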

Step 10: Bayesian network modeling with PyMC
- Build probabilistic graphical model of pathway
- Model causal relationships: gene → protein → metabolite
- Incorporate prior knowledge from KEGG/Reactome
- Perform inference to identify key regulatory nodes
- Estimate effect sizes and uncertainties

Step 11: Identify therapeutic targets
- Prioritize enzymes with:
  * Significant changes in all three omics layers
  * High network centrality
  * Druggable target class (kinases, transporters, etc.)
- Query DrugBank for existing inhibitors
- Search PubMed for validation in disease models

Step 12: Create comprehensive multi-omics report
- Summary statistics for each omics layer
- Venn diagram of overlapping pathway hits
- Pathway enrichment dot plots
- Integrated network visualization (color by fold change)
- Correlation heatmaps (enzyme-metabolite pairs)
- Bayesian network structure
- Table of prioritized therapeutic targets
- Biological interpretation and mechanistic insights
- Generate publication-quality figures
- Export PDF report with all results

Expected Output:
- Integrated multi-omics dataset
- Dysregulated pathway identification
- Multi-omics network model
- Prioritized list of therapeutic targets
- Comprehensive systems biology report

Materials Science & Chemistry

Example 10: High-Throughput Materials Discovery for Battery Applications

Objective: Discover novel solid electrolyte materials for lithium-ion batteries using computational screening.

Skills Used:

  • pymatgen - Materials analysis
  • matminer - Feature engineering
  • scikit-learn - Machine learning
  • pymoo - Multi-objective optimization
  • ase - Atomic simulation
  • sympy - Symbolic math
  • vaex - Large dataset handling
  • matplotlib - Visualization
  • scientific-writing - Report generation

Workflow:

Step 1: Generate candidate materials library
- Use Pymatgen to enumerate compositions:
  * Li-containing compounds (Li₁₋ₓM₁₊ₓX₂)
  * M = transition metals (Zr, Ti, Ta, Nb)
  * X = O, S, Se
- Generate ~10,000 candidate compositions
- Apply charge neutrality constraints

Step 2: Filter by thermodynamic stability
- Query Materials Project database via Pymatgen
- Calculate formation energy from elements
- Calculate energy above convex hull (E_hull)
- Filter: E_hull < 50 meV/atom (likely stable)
- Retain ~2,000 thermodynamically plausible compounds

Step 3: Predict crystal structures
- Use Pymatgen structure predictor
- Generate most likely crystal structures for each composition
- Consider common structure types: LISICON, NASICON, garnet, perovskite
- Calculate structural descriptors

Step 4: Calculate material properties with Pymatgen
- Lattice parameters and volume
- Density
- Packing fraction
- Ionic radii and bond lengths
- Coordination environments

Step 5: Feature engineering with matminer
- Calculate compositional features:
  * Elemental property statistics (electronegativity, ionic radius)
  * Valence electron concentrations
  * Stoichiometric attributes
- Calculate structural features:
  * Pore size distribution
  * Site disorder parameters
  * Partial radial distribution functions

Step 6: Build ML models for Li⁺ conductivity prediction
- Collect training data from literature (experimental conductivities)
- Train ensemble models with scikit-learn:
  * Random Forest
  * Gradient Boosting
  * Neural Network
- Use 5-fold cross-validation
- Predict ionic conductivity for all candidates

Step 7: Predict additional properties
- Electrochemical stability window (ML model)
- Mechanical properties (bulk modulus, shear modulus)
- Interfacial resistance (estimate from structure)
- Synthesis temperature (ML prediction from similar compounds)

Step 8: Multi-objective optimization with PyMOO
Define optimization objectives:
- Maximize: ionic conductivity (>10⁻³ S/cm target)
- Maximize: electrochemical window (>4.5V target)
- Minimize: synthesis temperature (<800°C preferred)
- Minimize: cost (based on elemental abundance)

- Run NSGA-II to find Pareto-optimal solutions (see the sketch below)
- Extract the top 50 candidates from the Pareto front
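
A minimal PyMOO sketch of the NSGA-II run; the decision variables and objective functions below are toy placeholders standing in for the ML property models from Steps 6-7 (PyMOO minimizes, so maximized objectives are negated):

```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import Problem
from pymoo.optimize import minimize

class ElectrolyteProblem(Problem):
    """Toy surrogate: four normalized composition descriptors, three objectives."""

    def __init__(self):
        super().__init__(n_var=4, n_obj=3, xl=0.0, xu=1.0)

    def _evaluate(self, x, out, *args, **kwargs):
        conductivity = -(0.8 * x[:, 0] + 0.2 * x[:, 1])   # maximize (negated)
        stability = -(5.0 * x[:, 2])                       # maximize (negated)
        synthesis_temp = 400.0 + 600.0 * x[:, 3]           # minimize
        out["F"] = np.column_stack([conductivity, stability, synthesis_temp])

res = minimize(ElectrolyteProblem(), NSGA2(pop_size=100), ("n_gen", 200), seed=1, verbose=False)
print(f"{len(res.F)} points on the Pareto front")
```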

Step 9: Analyze Pareto optimal materials
- Identify composition trends (which elements appear frequently)
- Analyze structure-property relationships
- Calculate trade-offs between objectives
- Identify "sweet spot" compositions

Step 10: Validate predictions with DFT calculations
- Select top 10 candidates for detailed study
- Set up DFT calculations (VASP-like, if available via ASE)
- Calculate:
  * Accurate formation energies
  * Li⁺ migration barriers (NEB calculations)
  * Electronic band gap
  * Elastic constants
- Compare DFT results with ML predictions

Step 11: Literature and patent search
- Search for prior art on top candidates
- PubMed and Google Scholar: "[composition] AND electrolyte"
- USPTO: Check for existing patents on similar compositions
- Identify any experimental reports on related materials

Step 12: Generate materials discovery report
- Summary of screening workflow and statistics
- Pareto front visualization (conductivity vs stability vs cost)
- Structure visualization of top candidates
- Property comparison table
- Composition-property trend analysis
- DFT validation results
- Predicted performance vs state-of-art materials
- Synthesis recommendations
- IP landscape summary
- Prioritized list of 5-10 materials for experimental validation
- Export as publication-ready PDF

Expected Output:
- Screened library of 10,000+ materials
- ML models for property prediction
- Pareto-optimal set of 50 candidates
- Detailed analysis of top 10 materials
- DFT validation results
- Comprehensive materials discovery report

Digital Pathology

Example 11: Automated Tumor Detection in Whole Slide Images

Objective: Develop and validate a deep learning model for automated tumor detection in histopathology images.

Skills Used:

  • histolab - Whole slide image processing
  • pathml - Computational pathology
  • pytorch-lightning - Deep learning
  • torchvision - Image models
  • scikit-learn - Model evaluation
  • pydicom - DICOM handling
  • omero-integration - Image management
  • matplotlib - Visualization
  • shap - Model interpretability

Workflow:

Step 1: Load whole slide images with HistoLab
- Load WSI files (SVS, TIFF formats)
- Extract slide metadata and magnification levels
- Visualize slide thumbnails
- Inspect tissue area vs background

Step 2: Tile extraction and preprocessing
- Use HistoLab to extract tiles (256×256 pixels at 20× magnification)
- Filter tiles:
  * Keep tiles with >80% tissue coverage (discard background)
  * Apply color normalization (Macenko or Reinhard method)
  * Filter out artifacts and bubbles
- Extract tiles from all slides (on the order of 100,000 tiles per slide)

Step 3: Create annotations (if training from scratch)
- Load pathologist annotations (if available via OMERO)
- Convert annotations to tile-level labels
- Categories: tumor, stroma, necrosis, normal
- Balance classes through stratified sampling

Step 4: Set up PathML pipeline
- Create PathML SlideData objects
- Define preprocessing pipeline:
  * Stain normalization
  * Color augmentation (HSV jitter)
  * Rotation and flipping
- Split data: 70% train, 15% validation, 15% test

Step 5: Build deep learning model with PyTorch Lightning
- Architecture: ResNet50 or EfficientNet backbone
- Add custom classification head for tissue types
- Define training pipeline:
  * Loss function: Cross-entropy or Focal loss
  * Optimizer: Adam with learning rate scheduling
  * Augmentations: rotation, flip, color jitter, elastic deformation
  * Batch size: 32
  * Mixed precision training
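
A PyTorch Lightning sketch of the Step 5 model, assuming tile batches of (image, label) tensors coming from the PathML pipeline; the pretrained-weights tag and precision flag follow recent torchvision/Lightning releases and may need adjusting for older versions:

```python
import torch
import torch.nn.functional as F
import torchvision
import pytorch_lightning as pl

class TileClassifier(pl.LightningModule):
    """ResNet50 backbone with a 4-class head (tumor, stroma, necrosis, normal)."""

    def __init__(self, n_classes: int = 4, lr: float = 1e-4):
        super().__init__()
        self.save_hyperparameters()
        self.backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        self.backbone.fc = torch.nn.Linear(self.backbone.fc.in_features, n_classes)

    def forward(self, x):
        return self.backbone(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.cross_entropy(self(images), labels)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        logits = self(images)
        self.log("val_loss", F.cross_entropy(logits, labels))
        self.log("val_acc", (logits.argmax(dim=1) == labels).float().mean())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
        return [optimizer], [scheduler]

# trainer = pl.Trainer(max_epochs=50, precision="16-mixed")  # mixed precision training
# trainer.fit(TileClassifier(), train_loader, val_loader)    # loaders from the PathML pipeline
```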

Step 6: Train model
- Train on tile-level labels
- Monitor metrics: accuracy, F1 score, AUC
- Use early stopping on validation loss
- Save best model checkpoint
- Training time: ~6-12 hours on GPU

Step 7: Evaluate model performance
- Test on held-out test set
- Calculate metrics with scikit-learn:
  * Accuracy, precision, recall, F1 per class
  * Confusion matrix
  * ROC curves and AUC
- Compute confidence intervals with bootstrapping

Step 8: Slide-level aggregation
- Apply model to all tiles in each test slide
- Aggregate predictions:
  * Majority voting
  * Weighted average by confidence
  * Spatial smoothing with convolution
- Generate probability heatmaps overlaid on WSI

Step 9: Model interpretability with SHAP
- Apply GradCAM or SHAP to explain predictions
- Visualize which regions contribute to tumor classification
- Generate attention maps showing model focus
- Validate that model attends to relevant histological features

Step 10: Clinical validation
- Compare model predictions with pathologist diagnosis
- Calculate inter-rater agreement (kappa score)
- Identify discordant cases for review
- Analyze error types: false positives, false negatives

Step 11: Integration with OMERO
- Upload processed slides and heatmaps to OMERO server
- Attach model predictions as slide metadata
- Enable pathologist review interface
- Store annotations and corrections for model retraining

Step 12: Generate clinical validation report
- Model architecture and training details
- Performance metrics with confidence intervals
- Slide-level accuracy vs pathologist ground truth
- Heatmap visualizations for representative cases
- Analysis of failure modes
- Comparison with published methods
- Discussion of clinical applicability
- Recommendations for deployment and monitoring
- Export PDF report for regulatory submission (if needed)

Expected Output:
- Trained deep learning model for tumor detection
- Tile-level and slide-level predictions
- Probability heatmaps for visualization
- Performance metrics and validation results
- Model interpretation visualizations
- Clinical validation report

Lab Automation & Protocol Design

Example 12: Automated High-Throughput Screening Protocol

Objective: Design and execute an automated compound screening workflow using liquid handling robots.

Skills Used:

  • pylabrobot - Lab automation
  • opentrons-integration - Opentrons protocol
  • benchling-integration - Sample tracking
  • protocolsio-integration - Protocol documentation
  • simpy - Process simulation
  • polars - Data processing
  • matplotlib - Plate visualization
  • reportlab - Report generation

Workflow:

Step 1: Define screening campaign in Benchling
- Create compound library in Benchling registry
- Register all compounds with structure, concentration, location
- Define plate layouts (384-well format)
- Track compound source plates in inventory
- Set up ELN entry for campaign documentation

Step 2: Design assay protocol
- Define assay steps:
  * Dispense cells (5000 cells/well)
  * Add compounds (dose-response curve, 10 concentrations)
  * Incubate 48 hours at 37°C
  * Add detection reagent (cell viability assay)
  * Read luminescence signal
- Calculate required reagent volumes
- Document protocol in Protocols.io
- Share with team for review

Step 3: Simulate workflow with SimPy
- Model liquid handler, incubator, plate reader as resources
- Simulate timing for 20 plates (7,680 wells)
- Identify bottlenecks (plate reader reads take 5 min/plate)
- Optimize scheduling: stagger plate processing
- Validate that throughput goal is achievable (20 plates/day)

Step 4: Design plate layout
- Use PyLabRobot to generate plate maps:
  * Columns 1-2: positive controls (DMSO)
  * Columns 3-22: compound titrations (10 concentrations in duplicate)
  * Columns 23-24: negative controls (cytotoxic control)
- Randomize compound positions across plates
- Account for edge effects (avoid outer wells for samples)
- Export plate maps to CSV

Step 5: Create Opentrons protocol for cell seeding
- Write Python protocol using Opentrons API 2.0
- Steps:
  * Aspirate cells from reservoir
  * Dispense 40 μL cell suspension per well
  * Tips: use P300 multi-channel for speed
  * Include mixing steps to prevent settling
- Simulate protocol in Opentrons app
- Test on one plate before full run
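
A sketch of the cell-seeding protocol in the Opentrons Python API 2 style; the labware definitions, deck slots, and the two-pass pattern for filling interleaved 384-well rows with an 8-channel pipette are assumptions to adapt to the actual deck layout:

```python
from opentrons import protocol_api

metadata = {"protocolName": "Cell seeding - 384-well assay plates", "apiLevel": "2.13"}

def run(protocol: protocol_api.ProtocolContext):
    plate = protocol.load_labware("corning_384_wellplate_112ul_flat", "1")
    reservoir = protocol.load_labware("nest_12_reservoir_15ml", "2")
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", "3")
    p300 = protocol.load_instrument("p300_multi_gen2", "left", tip_racks=[tips])

    p300.pick_up_tip()
    for column in plate.columns():
        # An 8-channel head addresses every other row of a 384-well plate, so each
        # column needs two dispenses (rows A/C/E... via well 0, rows B/D/F... via well 1)
        for target in (column[0], column[1]):
            p300.mix(2, 150, reservoir["A1"])   # keep cells suspended
            p300.aspirate(40, reservoir["A1"])
            p300.dispense(40, target)
    p300.drop_tip()
```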

Step 6: Create Opentrons protocol for compound addition
- Acoustic liquid handler (Echo) or pin tool for nanoliter transfers
- If using Opentrons:
  * Source: 384-well compound plates
  * Transfer 100 nL compound (in DMSO) to assay plates
  * Use the P20 for low-volume steps (100 nL is below its 1 μL minimum, so prepare an intermediate dilution or use the acoustic handler for the final transfer)
  * Prepare serial dilutions on deck if needed
- Account for DMSO normalization (1% final)

Step 7: Integrate with Benchling for sample tracking
- Use Benchling API to:
  * Retrieve compound information (structure, batch, concentration)
  * Log plate creation in inventory
  * Create transfer records for audit trail
  * Link assay plates to ELN entry

Step 8: Execute automated workflow
- Day 1: Seed cells with Opentrons
- Day 1 (4h later): Add compounds with Opentrons
- Day 3: Add detection reagent (manual or automated)
- Day 3 (2h later): Read plates on plate reader
- Store plates at 4°C between steps

Step 9: Collect and process data
- Export raw luminescence data from plate reader
- Load data with Polars for fast processing
- Normalize data:
  * Subtract background (media-only wells)
  * Calculate % viability relative to DMSO control
  * Apply plate-wise normalization to correct systematic effects
- Quality control:
  * Z' factor calculation (> 0.5 for acceptable assay)
  * Coefficient of variation for controls (< 10%)
  * Flag plates with poor QC metrics

Step 10: Dose-response curve fitting
- Fit 4-parameter logistic curves for each compound
- Calculate IC50, Hill slope, max/min response
- Use scikit-learn or scipy for curve fitting
- Compute 95% confidence intervals
- Flag compounds with poor curve fits (R² < 0.8)
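
A SciPy sketch of the four-parameter logistic fit in Step 10, using placeholder dose-response values; IC50, Hill slope, and R² fall out of the fitted parameters and residuals:

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, ic50, hill):
    """Four-parameter logistic dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

# Placeholder data: concentration in uM, % viability relative to DMSO control
conc = np.array([0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300], dtype=float)
viability = np.array([99, 98, 95, 90, 75, 55, 35, 20, 12, 10], dtype=float)

p0 = [viability.min(), viability.max(), 1.0, 1.0]  # initial guesses
params, cov = curve_fit(four_pl, conc, viability, p0=p0, maxfev=10000)
bottom, top, ic50, hill = params

residuals = viability - four_pl(conc, *params)
r_squared = 1 - np.sum(residuals**2) / np.sum((viability - viability.mean()) ** 2)
print(f"IC50 = {ic50:.2f} uM, Hill = {hill:.2f}, R² = {r_squared:.3f}")
```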

Step 11: Hit identification and triage
- Define hit criteria:
  * IC50 < 10 μM
  * Max inhibition > 50%
  * Curve quality: R² > 0.8
- Prioritize hits by potency
- Check for PAINS patterns with RDKit
- Cross-reference with known aggregators/frequent hitters

Step 12: Visualize results and generate report
- Create plate heatmaps showing % viability
- Dose-response curve plots for hits
- Scatter plot: potency vs max effect
- QC metric summary across plates
- Structure visualization of top 20 hits
- Generate campaign summary report:
  * Screening statistics (compounds tested, hit rate)
  * QC metrics and data quality assessment
  * Hit list with structures and IC50 values
  * Protocol documentation from Protocols.io
  * Raw data files and analysis code
  * Recommendations for confirmation assays
- Update Benchling ELN with results
- Export PDF report for stakeholders

Expected Output:
- Automated screening protocols (Opentrons Python files)
- Executed screen of 384-well plates
- Quality-controlled dose-response data
- Hit list with IC50 values
- Comprehensive screening report

Agricultural Genomics

Example 13: GWAS for Crop Yield Improvement

Objective: Identify genetic markers associated with drought tolerance and yield in a crop species.

Skills Used:

  • biopython - Sequence analysis
  • pysam - VCF processing
  • gwas-database - Public GWAS data
  • ensembl-database - Plant genomics
  • gene-database - Gene annotation
  • scanpy - Population structure (adapted for genetic data)
  • scikit-learn - PCA and clustering
  • statsmodels - Association testing
  • matplotlib - Manhattan plots
  • seaborn - Visualization

Workflow:

Step 1: Load and QC genotype data
- Load VCF file with pysam
- Filter variants:
  * Call rate > 95%
  * Minor allele frequency (MAF) > 5%
  * Hardy-Weinberg equilibrium p > 1e-6
- Convert to numeric genotype matrix (0, 1, 2)
- Retain ~500,000 SNPs after QC

Step 2: Assess population structure
- Calculate genetic relationship matrix
- Perform PCA with scikit-learn (use top 10 PCs)
- Visualize population structure (PC1 vs PC2)
- Identify distinct subpopulations or admixture
- Note: will use PCs as covariates in GWAS

Step 3: Load and process phenotype data
- Drought tolerance score (1-10 scale, measured under stress)
- Grain yield (kg/hectare)
- Days to flowering
- Plant height
- Quality control:
  * Remove outliers (> 3 SD from mean)
  * Transform if needed (log or rank-based for skewed traits)
  * Adjust for environmental covariates (field, year)

Step 4: Calculate kinship matrix
- Compute genetic relatedness matrix
- Account for population structure and relatedness
- Will use in mixed linear model to control for confounding

Step 5: Run genome-wide association study
- For each phenotype, test association with each SNP
- Use mixed linear model (MLM) in statsmodels:
  * Fixed effects: SNP genotype, PCs (top 10)
  * Random effects: kinship matrix
  * Genome-wide significance threshold: p < 5e-8
- Multiple testing correction: Bonferroni or FDR
- Calculate genomic inflation factor (λ) to check for inflation
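
A simplified sketch of the association scan in Step 5 using statsmodels: each SNP is tested with ordinary least squares using the top PCs as fixed covariates, and λ is computed from the median test statistic. A full mixed linear model with the kinship matrix as a random effect would normally come from a dedicated GWAS package; all data here are random placeholders:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_ind, n_snp = 300, 1000
genotypes = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)  # 0/1/2 coding
phenotype = rng.normal(size=n_ind)
pcs = rng.normal(size=(n_ind, 10))  # top 10 PCs from Step 2

pvalues = np.empty(n_snp)
for j in range(n_snp):
    X = sm.add_constant(np.column_stack([genotypes[:, j], pcs]))
    pvalues[j] = sm.OLS(phenotype, X).fit().pvalues[1]  # p-value for the SNP term

# Genomic inflation factor from the median chi-square statistic
chisq = chi2.isf(pvalues, df=1)
lambda_gc = np.median(chisq) / chi2.ppf(0.5, df=1)
hits = np.flatnonzero(pvalues < 5e-8)
print(f"lambda = {lambda_gc:.3f}, genome-wide significant SNPs: {len(hits)}")
```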

Step 6: Identify significant associations
- Extract SNPs passing significance threshold
- Determine lead SNPs (most significant in each locus)
- Define loci: extend ±500 kb around lead SNP
- Identify independent associations via conditional analysis

Step 7: Annotate significant loci
- Map SNPs to genes using Ensembl Plants API
- Identify genic vs intergenic SNPs
- For genic SNPs:
  * Determine consequence (missense, synonymous, intronic, UTR)
  * Extract gene names and descriptions
- Query NCBI Gene for gene function
- Prioritize genes with known roles in stress response or development

Step 8: Search GWAS Catalog for prior reports
- Query GWAS Catalog for similar traits in same or related species
- Check for replication of known loci
- Identify novel vs known associations

Step 9: Functional enrichment analysis
- Extract all genes within significant loci
- Perform GO enrichment analysis
- Test for enrichment in KEGG pathways
- Focus on pathways related to:
  * Drought stress response (ABA signaling, osmotic adjustment)
  * Photosynthesis and carbon fixation
  * Root development

Step 10: Estimate SNP heritability and genetic architecture
- Calculate variance explained by significant SNPs
- Estimate SNP-based heritability (proportion of variance explained)
- Assess genetic architecture: few large-effect vs many small-effect loci

Step 11: Build genomic prediction model
- Train genomic selection model with scikit-learn:
  * Ridge regression (GBLUP equivalent)
  * Elastic net
  * Random Forest
- Use all SNPs (not just significant ones)
- Cross-validate to predict breeding values
- Assess prediction accuracy

Step 12: Generate GWAS report
- Manhattan plots for each trait
- QQ plots to assess test calibration
- Regional association plots for significant loci
- Gene models overlaid on loci
- Table of significant SNPs with annotations
- Functional enrichment results
- Genomic prediction accuracy
- Biological interpretation:
  * Candidate genes for drought tolerance
  * Potential molecular mechanisms
  * Implications for breeding programs
- Recommendations:
  * SNPs to use for marker-assisted selection
  * Genes for functional validation
  * Crosses to generate mapping populations
- Export publication-quality PDF with all results

Expected Output:
- Significant SNP-trait associations
- Annotated candidate genes
- Functional enrichment analysis
- Genomic prediction models
- Comprehensive GWAS report
- Recommendations for breeding programs

Neuroscience & Brain Imaging

Example 14: Brain Connectivity Analysis from fMRI Data

Objective: Analyze resting-state fMRI data to identify altered brain connectivity patterns in disease.

Skills Used:

  • neurokit2 - Neurophysiological signal processing
  • nilearn (external) - Neuroimaging analysis
  • scikit-learn - Classification and clustering
  • networkx - Graph theory analysis
  • statsmodels - Statistical testing
  • torch_geometric - Graph neural networks
  • pymc - Bayesian modeling
  • matplotlib - Brain visualization
  • seaborn - Connectivity matrices

Workflow:

Step 1: Load and preprocess fMRI data
# Note: Use nilearn or similar for fMRI-specific preprocessing
- Load 4D fMRI images (BOLD signal)
- Preprocessing:
  * Motion correction (realignment)
  * Slice timing correction
  * Spatial normalization to MNI space
  * Smoothing (6mm FWHM Gaussian kernel)
  * Temporal filtering (0.01-0.1 Hz bandpass)
  * Nuisance regression (motion, CSF, white matter)

Step 2: Define brain regions (parcellation)
- Apply brain atlas (e.g., AAL, Schaefer 200-region atlas)
- Extract average time series for each region
- Result: 200 time series per subject (one per brain region)

Step 3: Signal cleaning with NeuroKit2
- Denoise time series
- Remove physiological artifacts
- Apply additional bandpass filtering if needed
- Identify and handle outlier time points

Step 4: Calculate functional connectivity
- Compute pairwise Pearson correlations between all regions
- Result: 200×200 connectivity matrix per subject
- Fisher z-transform correlations for group statistics
- Threshold weak connections (|r| < 0.2)

Step 5: Graph theory analysis with NetworkX
- Convert connectivity matrices to graphs
- Calculate global network metrics:
  * Clustering coefficient (local connectivity)
  * Path length (integration)
  * Small-worldness (balance of segregation and integration)
  * Modularity (community structure)
- Calculate node-level metrics:
  * Degree centrality
  * Betweenness centrality
  * Eigenvector centrality
  * Participation coefficient (inter-module connectivity)
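
A sketch covering Steps 4-5 for one subject: Pearson connectivity from region-wise time series, Fisher z-transform, thresholding, and a few NetworkX graph metrics. The time-series matrix is a random placeholder for the parcellated BOLD signals:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(0)
timeseries = rng.normal(size=(200, 240))  # placeholder: 200 regions x 240 volumes

# Functional connectivity and Fisher z-transform (z values feed group statistics)
fc = np.corrcoef(timeseries)
np.fill_diagonal(fc, 0.0)
fc_z = np.arctanh(fc)

# Threshold weak connections (|r| < 0.2) and build a weighted, undirected graph
adj = np.where(np.abs(fc) >= 0.2, np.abs(fc), 0.0)
G = nx.from_numpy_array(adj)

clustering = nx.average_clustering(G, weight="weight")
betweenness = nx.betweenness_centrality(G)
communities = greedy_modularity_communities(G, weight="weight")
print(f"mean clustering = {clustering:.3f}, modules = {len(communities)}")
```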

Step 6: Statistical comparison between groups
- Compare patients vs healthy controls
- Use statsmodels for group comparisons:
  * Paired or unpaired t-tests for connectivity edges
  * FDR correction for multiple comparisons across all edges
  * Identify edges with significantly different connectivity
- Compare global and node-level network metrics
- Calculate effect sizes (Cohen's d)

Step 7: Identify altered subnetworks
- Threshold statistical maps (FDR < 0.05)
- Identify clusters of altered connectivity
- Map to functional brain networks:
  * Default mode network (DMN)
  * Salience network (SN)
  * Central executive network (CEN)
  * Sensorimotor network
- Visualize altered connections on brain surfaces

Step 8: Machine learning classification
- Train classifier to distinguish patients from controls
- Use scikit-learn Random Forest or SVM
- Features: connectivity values or network metrics
- Cross-validation (10-fold)
- Calculate accuracy, sensitivity, specificity, AUC
- Identify most discriminative features (connectivity edges)
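
A sketch of this classification step in scikit-learn; the random feature matrix stands in for vectorized connectivity edges or network metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.randn(60, 500)                 # 60 subjects x 500 connectivity features
y = np.array([1] * 30 + [0] * 30)            # 1 = patient, 0 = control

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold AUC: {auc.mean():.2f} +/- {auc.std():.2f}")

# Most discriminative features from a model refit on all data
clf.fit(X, y)
top_features = np.argsort(clf.feature_importances_)[::-1][:20]
```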

Step 9: Graph neural network analysis with Torch Geometric
- Build graph neural network (GCN or GAT)
- Input: connectivity matrices as adjacency matrices
- Train to predict diagnosis
- Extract learned representations
- Visualize latent space (UMAP)
- Interpret which brain regions are most important

Step 10: Bayesian network modeling with PyMC
- Build directed graphical model of brain networks
- Estimate effective connectivity (directional influence)
- Incorporate prior knowledge about anatomical connections
- Perform posterior inference
- Identify key driver regions in disease

Step 11: Clinical correlation analysis
- Correlate network metrics with clinical scores:
  * Symptom severity
  * Cognitive performance
  * Treatment response
- Use Spearman or Pearson correlation
- Identify brain-behavior relationships

Step 12: Generate comprehensive neuroimaging report
- Brain connectivity matrices (patients vs controls)
- Statistical comparison maps on brain surface
- Network metric comparison bar plots
- Graph visualizations (circular or force-directed layout)
- Machine learning ROC curves
- Brain-behavior correlation plots
- Clinical interpretation:
  * Which networks are disrupted?
  * Relationship to symptoms
  * Potential biomarker utility
- Recommendations:
  * Brain regions for therapeutic targeting (TMS, DBS)
  * Network metrics as treatment response predictors
- Export publication-ready PDF with brain visualizations

Expected Output:
- Functional connectivity matrices for all subjects
- Statistical maps of altered connectivity
- Graph theory metrics
- Machine learning classification model
- Brain-behavior correlations
- Comprehensive neuroimaging report

Environmental Microbiology

Example 15: Metagenomic Analysis of Environmental Samples

Objective: Characterize microbial community composition and functional potential from environmental DNA samples.

Skills Used:

  • biopython - Sequence processing
  • pysam - BAM file handling
  • ena-database - Sequence data
  • uniprot-database - Protein annotation
  • kegg-database - Pathway analysis
  • etetoolkit - Phylogenetic trees
  • scikit-bio - Microbial ecology
  • networkx - Co-occurrence networks
  • statsmodels - Diversity statistics
  • matplotlib - Visualization

Workflow:

Step 1: Load and QC metagenomic reads
- Load FASTQ files with BioPython
- Quality control with FastQC-equivalent:
  * Remove adapters and low-quality bases (Q < 20)
  * Filter short reads (< 50 bp)
  * Remove host contamination (if applicable)
- Subsample to even depth if comparing samples

Step 2: Taxonomic classification
- Use a Kraken2-like approach or query the ENA database
- Classify reads to taxonomic lineages
- Generate abundance table:
  * Rows: taxa (species or OTUs)
  * Columns: samples
  * Values: read counts or relative abundance
- Summarize at different levels: phylum, class, order, family, genus, species

Step 3: Calculate diversity metrics with scikit-bio
- Alpha diversity (within-sample):
  * Richness (number of species)
  * Shannon entropy
  * Simpson diversity
  * Chao1 estimated richness
- Beta diversity (between-sample):
  * Bray-Curtis dissimilarity
  * Weighted/unweighted UniFrac distance
  * Jaccard distance
- Rarefaction curves to assess sampling completeness
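
A sketch of the scikit-bio diversity calculations on a random samples-by-taxa count table; UniFrac additionally requires a phylogenetic tree and is omitted here.

```python
import numpy as np
from skbio.diversity import alpha_diversity, beta_diversity

# Hypothetical abundance table: 6 samples x 120 taxa (integer read counts)
counts = np.random.default_rng(0).integers(0, 500, size=(6, 120))
ids = [f"sample_{i}" for i in range(6)]

shannon = alpha_diversity("shannon", counts, ids=ids)
simpson = alpha_diversity("simpson", counts, ids=ids)
chao1 = alpha_diversity("chao1", counts, ids=ids)

bray_curtis = beta_diversity("braycurtis", counts, ids=ids)   # skbio DistanceMatrix
print(shannon.round(2))
print(bray_curtis.to_data_frame().round(2))
```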

Step 4: Statistical comparison of communities
- Compare diversity between groups (e.g., polluted vs pristine)
- Use scipy/statsmodels and scikit-bio for:
  * Mann-Whitney or Kruskal-Wallis tests on alpha diversity
  * PERMANOVA (adonis-style test) on beta diversity distance matrices
  * LEfSe-style differential abundance testing

- Identify taxa enriched or depleted in each condition

Step 5: Build phylogenetic tree with ETE Toolkit
- Extract 16S rRNA sequences (or marker genes)
- Align sequences (MUSCLE/MAFFT equivalent)
- Build phylogenetic tree (neighbor-joining or maximum likelihood)
- Visualize tree colored by sample or environment
- Root tree with outgroup

Step 6: Co-occurrence network analysis
- Calculate pairwise correlations between taxa
- Use Spearman correlation to identify co-occurrence patterns
- Filter significant correlations (p < 0.01, |r| > 0.6)
- Build co-occurrence network with NetworkX
- Identify modules (communities of co-occurring taxa)
- Calculate network topology metrics
- Visualize network (nodes = taxa, edges = correlations)
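
A sketch of the co-occurrence network construction; the abundance table and taxon names are synthetic placeholders.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
abundance = rng.random((30, 50))                     # 30 samples x 50 taxa
taxa = [f"taxon_{i}" for i in range(abundance.shape[1])]

rho, pval = spearmanr(abundance)                     # pairwise correlations between taxa columns

G = nx.Graph()
G.add_nodes_from(taxa)
for i in range(len(taxa)):
    for j in range(i + 1, len(taxa)):
        if pval[i, j] < 0.01 and abs(rho[i, j]) > 0.6:
            G.add_edge(taxa[i], taxa[j], weight=rho[i, j])

modules = community.greedy_modularity_communities(G) if G.number_of_edges() else []
print(f"{G.number_of_nodes()} taxa, {G.number_of_edges()} significant co-occurrence edges")
```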

Step 7: Functional annotation
- Assemble contigs from reads (if performing assembly)
- Predict genes with Prodigal-like tools
- Annotate genes using UniProt and KEGG
- Map proteins to KEGG pathways
- Generate functional profile:
  * Abundance of metabolic pathways
  * Key enzymes (nitrification, denitrification, methanogenesis)
  * Antibiotic resistance genes
  * Virulence factors

Step 8: Functional diversity analysis
- Compare functional profiles between samples
- Calculate pathway richness and evenness
- Identify enriched pathways with statistical testing
- Link taxonomy to function:
  * Which taxa contribute to which functions?
  * Use shotgun data to assign functions to taxa

Step 9: Search ENA for related environmental samples
- Query ENA for metagenomic studies from similar environments
- Download and compare to own samples
- Place samples in context of global microbiome diversity
- Identify unique vs ubiquitous taxa

Step 10: Environmental parameter correlation
- Correlate community composition with metadata:
  * Temperature, pH, salinity
  * Nutrient concentrations (N, P)
  * Pollutant levels (heavy metals, hydrocarbons)
- Use Mantel test to correlate distance matrices
- Identify environmental drivers of community structure

Step 11: Biomarker discovery
- Identify taxa or pathways that correlate with environmental condition
- Use Random Forest to find predictive features
- Validate biomarkers:
  * Sensitivity and specificity
  * Cross-validation across samples
- Propose taxa as bioindicators of environmental health

Step 12: Generate environmental microbiome report
- Taxonomic composition bar charts (stacked by phylum/class)
- Alpha and beta diversity plots (boxplots, PCoA)
- Phylogenetic tree with environmental context
- Co-occurrence network visualization
- Functional pathway heatmaps
- Environmental correlation plots
- Statistical comparison tables
- Biological interpretation:
  * Dominant taxa and their ecological roles
  * Functional potential of the community
  * Environmental factors shaping the microbiome
  * Biomarker taxa for monitoring
- Recommendations:
  * Biomarkers for environmental monitoring
  * Functional guilds for restoration
  * Further sampling or sequencing strategies
- Export comprehensive PDF report

Expected Output:
- Taxonomic profiles for all samples
- Diversity metrics and statistical comparisons
- Phylogenetic tree
- Co-occurrence network
- Functional annotation and pathway analysis
- Comprehensive microbiome report

Infectious Disease Research

Example 16: Antimicrobial Resistance Surveillance and Prediction

Objective: Track antimicrobial resistance trends and predict resistance phenotypes from genomic data.

Skills Used:

  • biopython - Sequence analysis
  • pysam - Genome assembly analysis
  • ena-database - Public genomic data
  • uniprot-database - Resistance protein annotation
  • gene-database - Resistance gene catalogs
  • etetoolkit - Phylogenetic analysis
  • scikit-learn - Resistance prediction
  • networkx - Transmission networks
  • statsmodels - Trend analysis
  • matplotlib - Epidemiological plots

Workflow:

Step 1: Collect bacterial genome sequences
- Isolates from hospital surveillance program
- Load FASTA assemblies with BioPython
- Basic QC:
  * Assess assembly quality (N50, completeness)
  * Estimate genome size and coverage
  * Remove contaminated assemblies

Step 2: Species identification and MLST typing
- Perform in silico MLST (multi-locus sequence typing)
- Extract housekeeping gene sequences
- Assign sequence types (ST)
- Classify isolates into clonal complexes
- Identify high-risk clones (e.g., ST131 E. coli, ST258 K. pneumoniae)

Step 3: Antimicrobial resistance (AMR) gene detection
- Query NCBI Gene and UniProt for AMR gene databases
- Screen assemblies for resistance genes:
  * Beta-lactamases (blaTEM, blaCTX-M, blaKPC, blaNDM)
  * Aminoglycoside resistance (aac, aph, ant)
  * Fluoroquinolone resistance (gyrA, parC mutations)
  * Colistin resistance (mcr-1 to mcr-10)
  * Efflux pumps
- Calculate gene presence/absence matrix

Step 4: Resistance mechanism annotation
- Map detected genes to resistance classes:
  * Enzymatic modification (e.g., beta-lactamases)
  * Target modification (e.g., ribosomal methylation)
  * Target mutation (e.g., fluoroquinolone resistance)
  * Efflux pumps
- Query UniProt for detailed mechanism descriptions
- Link genes to antibiotic classes affected

Step 5: Build phylogenetic tree with ETE Toolkit
- Extract core genome SNPs
- Concatenate SNP alignments
- Build maximum likelihood tree
- Root with outgroup or midpoint rooting
- Annotate tree with:
  * Resistance profiles
  * Sequence types
  * Collection date and location

Step 6: Genotype-phenotype correlation
- Match genomic data with phenotypic susceptibility testing
- For each antibiotic, correlate:
  * Presence of resistance genes with MIC values
  * Target mutations with resistance phenotype
- Calculate sensitivity/specificity of genetic markers
- Identify discordant cases (false positives/negatives)

Step 7: Machine learning resistance prediction
- Train classification models with scikit-learn:
  * Features: presence/absence of resistance genes + mutations
  * Target: resistance phenotype (susceptible/intermediate/resistant)
  * Models: Logistic Regression, Random Forest, Gradient Boosting
- Train separate models for each antibiotic
- Cross-validate (stratified 5-fold)
- Calculate accuracy, precision, recall, F1 score
- Feature importance: which genes are most predictive?
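
A sketch of one per-antibiotic resistance model; the marker matrix and phenotypes below are random placeholders for the real genotype and susceptibility data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical input: presence/absence of 40 resistance markers across 200 isolates,
# with the measured phenotype for one antibiotic (1 = resistant, 0 = susceptible)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(200, 40)),
                 columns=[f"marker_{i}" for i in range(40)])
y = rng.integers(0, 2, size=200)

clf = GradientBoostingClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "precision", "recall", "f1"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})

# Which markers drive the prediction for this antibiotic?
clf.fit(X, y)
importance = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(10))
```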

Step 8: Temporal trend analysis
- Track resistance rates over time
- Use statsmodels for:
  * Mann-Kendall trend test
  * Joinpoint regression (identify change points)
  * Forecast future resistance rates (ARIMA)
- Analyze trends for each antibiotic class
- Identify emerging resistance mechanisms

Step 9: Transmission network inference
- Identify closely related isolates (< 10 SNPs difference)
- Build transmission network with NetworkX:
  * Nodes: isolates
  * Edges: putative transmission links
- Incorporate temporal and spatial data
- Identify outbreak clusters
- Detect super-spreaders (high degree nodes)
- Analyze network topology
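
A sketch of the transmission-network step; the SNP distance matrix is synthetic and stands in for pairwise core-genome distances.

```python
import numpy as np
import networkx as nx

# Hypothetical pairwise core-genome SNP distances between 50 isolates (symmetric matrix)
rng = np.random.default_rng(2)
d = rng.integers(0, 200, size=(50, 50))
snp_dist = np.triu(d, 1) + np.triu(d, 1).T
isolates = [f"iso_{i:03d}" for i in range(50)]

G = nx.Graph()
G.add_nodes_from(isolates)
threshold = 10                                        # putative transmission link if < 10 SNPs
for i in range(len(isolates)):
    for j in range(i + 1, len(isolates)):
        if snp_dist[i, j] < threshold:
            G.add_edge(isolates[i], isolates[j], snps=int(snp_dist[i, j]))

clusters = [c for c in nx.connected_components(G) if len(c) > 1]      # outbreak clusters
top_degree = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:5]  # possible super-spreaders
print(f"{len(clusters)} clusters; highest-degree isolates: {top_degree}")
```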

Step 10: Search ENA for global context
- Query ENA for same species from other regions/countries
- Download representative genomes
- Integrate into phylogenetic analysis
- Assess whether local isolates are globally distributed clones
- Identify region-specific vs international resistance genes

Step 11: Plasmid and mobile element analysis
- Identify plasmid contigs
- Detect insertion sequences and transposons
- Track mobile genetic elements carrying resistance genes
- Identify conjugative plasmids facilitating horizontal gene transfer
- Build plasmid similarity networks

Step 12: Generate AMR surveillance report
- Summary statistics:
  * Number of isolates by species, ST, location
  * Resistance rates for each antibiotic
- Phylogenetic tree annotated with resistance profiles
- Temporal trend plots (resistance % over time)
- Transmission network visualizations
- Prediction model performance metrics
- Heatmap: resistance genes by isolate
- Geographic distribution map (if spatial data available)
- Interpretation:
  * Predominant resistance mechanisms
  * High-risk clones circulating
  * Temporal trends and emerging threats
  * Transmission clusters and outbreaks
- Recommendations:
  * Infection control measures for clusters
  * Antibiotic stewardship priorities
  * Resistance genes to monitor
  * Laboratories to perform confirmatory testing
- Export comprehensive PDF for public health reporting

Expected Output:
- AMR gene profiles for all isolates
- Phylogenetic tree with resistance annotations
- Temporal trends in resistance rates
- ML models for resistance prediction from genomes
- Transmission networks
- Comprehensive AMR surveillance report for public health

Multi-Omics Integration

Example 17: Integrative Analysis of Cancer Multi-Omics Data

Objective: Integrate genomics, transcriptomics, proteomics, and clinical data to identify cancer subtypes and therapeutic strategies.

Skills Used:

  • pydeseq2 - RNA-seq DE analysis
  • pysam - Variant calling
  • ensembl-database - Gene annotation
  • cosmic-database - Cancer mutations
  • string-database - Protein interactions
  • reactome-database - Pathway analysis
  • opentargets-database - Drug targets
  • scikit-learn - Clustering and classification
  • torch_geometric - Graph neural networks
  • umap-learn - Dimensionality reduction
  • statsmodels - Survival analysis
  • pymoo - Multi-objective optimization

Workflow:

Step 1: Load and preprocess genomic data (WES/WGS)
- Parse VCF files with pysam
- Filter high-quality variants (QUAL > 30, DP > 20)
- Annotate with Ensembl VEP (missense, nonsense, frameshift)
- Query COSMIC for known cancer mutations
- Create mutation matrix: samples × genes (binary: mutated or not)
- Focus on cancer genes from COSMIC Cancer Gene Census

Step 2: Process transcriptomic data (RNA-seq)
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare tumor vs normal (if paired samples available)
- Normalize counts (TPM or FPKM)
- Identify highly variable genes
- Create expression matrix: samples × genes (log2 TPM)

Step 3: Load proteomic data (Mass spec)
- Protein abundance matrix from LC-MS/MS
- Normalize protein abundances (median normalization)
- Log2-transform
- Filter proteins detected in < 50% of samples
- Create protein matrix: samples × proteins

Step 4: Load clinical data
- Demographics: age, sex, race
- Tumor characteristics: stage, grade, histology
- Treatment: surgery, chemo, radiation, targeted therapy
- Outcome: overall survival (OS), progression-free survival (PFS)
- Response: complete/partial response, stable/progressive disease

Step 5: Data integration and harmonization
- Match sample IDs across omics layers
- Ensure consistent gene/protein identifiers
- Handle missing data:
  * Impute with KNN or median (for moderate missingness)
  * Remove features with > 50% missing
- Create multi-omics data structure (dictionary of matrices)

Step 6: Multi-omics dimensionality reduction
- Concatenate all omics features (genes + proteins + mutations)
- Apply UMAP with umap-learn for visualization
- Alternative: PCA or t-SNE
- Visualize samples in 2D space colored by:
  * Histological subtype
  * Stage
  * Survival (high vs low)
- Identify patterns or clusters
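
A sketch of the UMAP embedding with umap-learn; the concatenated multi-omics matrix is a random placeholder.

```python
import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

# Hypothetical concatenated multi-omics matrix: 150 samples x 3000 features
X = np.random.default_rng(0).normal(size=(150, 3000))

X_scaled = StandardScaler().fit_transform(X)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_scaled)
print(embedding.shape)   # (150, 2) coordinates to plot, colored by subtype, stage, or survival
```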

Step 7: Unsupervised clustering to identify subtypes
- Perform consensus clustering with scikit-learn
- Test k = 2 to 10 clusters
- Evaluate cluster stability and optimal k
- Assign samples to clusters (subtypes)
- Visualize clustering in UMAP space

Step 8: Characterize molecular subtypes
For each subtype:
- Differential expression analysis:
  * Compare subtype vs all others with PyDESeq2
  * Extract top differentially expressed genes and proteins
- Mutation enrichment:
  * Fisher's exact test for each gene
  * Identify subtype-specific mutations
- Pathway enrichment:
  * Query Reactome for enriched pathways
  * Query KEGG for metabolic pathway differences
  * Identify hallmark biological processes

Step 9: Build protein-protein interaction networks
- Query STRING database for interactions among:
  * Differentially expressed proteins
  * Products of mutated genes
- Construct PPI network with NetworkX
- Identify network modules (community detection)
- Calculate centrality metrics to find hub proteins
- Overlay fold changes on network for visualization

Step 10: Survival analysis by subtype
- Use statsmodels or lifelines for survival analysis
- Kaplan-Meier curves for each subtype
- Log-rank test for significance
- Cox proportional hazards model:
  * Covariates: subtype, stage, age, treatment
  * Estimate hazard ratios
- Identify prognostic subtypes
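
A sketch of the survival analysis with lifelines on a synthetic clinical table; column names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import multivariate_logrank_test

# Hypothetical clinical table: one row per patient
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "os_months": rng.exponential(30, 150),
    "event": rng.integers(0, 2, 150),          # 1 = death observed, 0 = censored
    "subtype": rng.integers(1, 4, 150),
    "age": rng.normal(62, 8, 150),
    "stage": rng.integers(1, 5, 150),
})

kmf = KaplanMeierFitter()
for subtype, grp in df.groupby("subtype"):
    kmf.fit(grp["os_months"], grp["event"], label=f"subtype {subtype}")
    kmf.plot_survival_function()

logrank = multivariate_logrank_test(df["os_months"], df["subtype"], df["event"])
print(f"log-rank p = {logrank.p_value:.3g}")

cph = CoxPHFitter()
cph.fit(df[["os_months", "event", "age", "stage"]],
        duration_col="os_months", event_col="event")
cph.print_summary()                            # hazard ratios with confidence intervals
```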

Step 11: Predict therapeutic response
- Train machine learning models with scikit-learn:
  * Features: multi-omics data
  * Target: response to specific therapy (responder/non-responder)
  * Models: Random Forest, XGBoost, SVM
- Cross-validation to assess performance
- Identify features predictive of response
- Calculate AUC and feature importance

Step 12: Graph neural network for integrated prediction
- Build heterogeneous graph with Torch Geometric:
  * Nodes: samples, genes, proteins, pathways
  * Edges: gene-protein, protein-protein, gene-pathway
  * Node features: expression, mutation status
- Train GNN to predict:
  * Subtype classification
  * Survival risk
  * Treatment response
- Extract learned embeddings for interpretation

Step 13: Identify therapeutic targets with Open Targets
- For each subtype, query Open Targets:
  * Input: upregulated genes/proteins
  * Extract target-disease associations
  * Prioritize by tractability score
- Search for FDA-approved drugs targeting identified proteins
- Identify clinical trials for relevant targets
- Propose subtype-specific therapeutic strategies

Step 14: Multi-objective optimization of treatment strategies
- Use PyMOO to optimize treatment selection:
  * Objectives:
    1. Maximize predicted response probability
    2. Minimize predicted toxicity
    3. Minimize cost
  * Constraints: patient eligibility, drug availability
- Generate Pareto-optimal treatment strategies
- Personalized treatment recommendations per patient

Step 15: Generate comprehensive multi-omics report
- Sample clustering and subtype assignments
- UMAP visualization colored by subtype, survival, mutations
- Subtype characterization:
  * Molecular signatures (genes, proteins, mutations)
  * Enriched pathways
  * PPI networks
- Kaplan-Meier survival curves by subtype
- ML model performance (AUC, confusion matrices)
- Feature importance plots
- Therapeutic target tables with supporting evidence
- Personalized treatment recommendations
- Clinical implications:
  * Prognostic biomarkers
  * Predictive biomarkers for therapy selection
  * Novel drug targets
- Export publication-quality PDF with all figures and tables

Expected Output:
- Integrated multi-omics dataset
- Cancer subtype classification
- Molecular characterization of subtypes
- Survival analysis and prognostic markers
- Predictive models for treatment response
- Therapeutic target identification
- Personalized treatment strategies
- Comprehensive integrative genomics report

Experimental Physics & Data Analysis

Example 18: Analysis of Particle Physics Detector Data

Objective: Analyze experimental data from particle detector to identify signal events and measure physical constants.

Skills Used:

  • astropy - Units and constants
  • sympy - Symbolic mathematics
  • scipy - Statistical analysis
  • scikit-learn - Classification
  • stable-baselines3 - Reinforcement learning for optimization
  • matplotlib - Visualization
  • seaborn - Statistical plots
  • statsmodels - Hypothesis testing
  • dask - Large-scale data processing
  • vaex - Out-of-core dataframes

Workflow:

Step 1: Load and inspect detector data
- Load ROOT files or HDF5 with raw detector signals
- Use Vaex for out-of-core processing (TBs of data)
- Inspect data structure: event IDs, timestamps, detector channels
- Extract key observables:
  * Energy deposits in calorimeters
  * Particle trajectories from tracking detectors
  * Time-of-flight measurements
  * Trigger information

Step 2: Apply detector calibration and corrections
- Load calibration constants
- Apply energy calibrations to convert ADC to physical units
- Correct for detector efficiency variations
- Apply geometric corrections (alignment)
- Use Astropy units for unit conversions (eV, GeV, MeV)
- Account for dead time and detector acceptance

Step 3: Event reconstruction
- Cluster energy deposits to form particle candidates
- Reconstruct particle trajectories (tracks)
- Match tracks to calorimeter clusters
- Calculate invariant masses for particle identification
- Compute momentum and energy for each particle
- Use Dask for parallel processing across events

Step 4: Event selection and filtering
- Define signal region based on physics hypothesis
- Apply quality cuts:
  * Track quality (chi-squared, number of hits)
  * Fiducial volume cuts
  * Timing cuts (beam window)
  * Particle identification cuts
- Estimate trigger efficiency
- Calculate event weights for corrections

Step 5: Background estimation
- Identify background sources:
  * Cosmic rays
  * Beam-related backgrounds
  * Detector noise
  * Physics backgrounds (non-signal processes)
- Simulate backgrounds using Monte Carlo (if available)
- Estimate background from data in control regions
- Use the sideband-subtraction method

Step 6: Signal extraction
- Fit invariant mass distributions to extract signal
- Use scipy for likelihood fitting:
  * Signal model: Gaussian or Breit-Wigner
  * Background model: polynomial or exponential
  * Combined fit with maximum likelihood
- Calculate signal significance (S/√B or Z-score)
- Estimate systematic uncertainties
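
A simplified sketch of the signal extraction: a least-squares fit of a Gaussian peak on an exponential background with scipy's curve_fit (a real analysis would use a binned maximum-likelihood fit); all numbers are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(m, n_sig, mu, sigma, n_bkg, slope):
    """Expected counts per 1-unit mass bin: Gaussian signal + exponential background."""
    signal = n_sig * np.exp(-0.5 * ((m - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return signal + n_bkg * np.exp(-slope * m)

# Synthetic invariant-mass spectrum with a peak near 91 (arbitrary mass units)
edges = np.linspace(60, 120, 61)
centers = 0.5 * (edges[:-1] + edges[1:])
rng = np.random.default_rng(0)
counts = rng.poisson(model(centers, 500, 91.2, 2.5, 50, 0.02))

popt, pcov = curve_fit(model, centers, counts, p0=[400, 91, 3, 40, 0.02],
                       sigma=np.sqrt(np.clip(counts, 1, None)), absolute_sigma=True)
n_sig, mu, sigma = popt[:3]
in_window = np.abs(centers - mu) < 2 * sigma
n_bkg = model(centers[in_window], 0, mu, sigma, popt[3], popt[4]).sum()
print(f"N_signal = {n_sig:.0f} +/- {np.sqrt(pcov[0, 0]):.0f}, "
      f"significance ~ {n_sig / np.sqrt(n_bkg):.1f} (S/sqrt(B))")
```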

Step 7: Machine learning event classification
- Train classifier with scikit-learn to separate signal from background
- Features: kinematic variables, topology, detector response
- Models: Boosted Decision Trees (XGBoost), Neural Networks
- Cross-validate with k-fold CV
- Optimize selection criteria using ROC curves
- Calculate signal efficiency and background rejection

Step 8: Reinforcement learning for trigger optimization
- Use Stable-Baselines3 to optimize trigger thresholds
- Environment: detector simulator
- Action: adjust trigger thresholds
- Reward: maximize signal efficiency while controlling rate
- Train PPO or SAC agent
- Validate on real data

Step 9: Calculate physical observables
- Measure cross-sections:
  * σ = N_signal / (ε × L × BR)
  * N_signal: number of signal events
  * ε: detection efficiency
  * L: integrated luminosity
  * BR: branching ratio
- Use Sympy for symbolic error propagation
- Calculate with Astropy for proper unit handling
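
A sketch of the symbolic error propagation for the cross-section formula above; the numerical values substituted at the end are purely illustrative.

```python
import sympy as sp

# Cross-section sigma = N_signal / (epsilon * L * BR), with first-order error propagation
N, eps, L, BR = sp.symbols("N_signal epsilon L BR", positive=True)
dN, deps, dL, dBR = sp.symbols("dN deps dL dBR", positive=True)

sigma = N / (eps * L * BR)
dsigma = sp.sqrt(sum((sp.diff(sigma, x) * dx) ** 2
                     for x, dx in [(N, dN), (eps, deps), (L, dL), (BR, dBR)]))

# Purely illustrative numbers (not a real measurement)
vals = {N: 1200, dN: 60, eps: 0.35, deps: 0.02, L: 10.0, dL: 0.25, BR: 0.033, dBR: 0.001}
print(sp.simplify(dsigma / sigma))                      # relative uncertainty, symbolic form
print(float(sigma.subs(vals)), "+/-", float(dsigma.subs(vals)), "(in units of 1/L)")
```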

Step 10: Statistical analysis and hypothesis testing
- Perform hypothesis tests with statsmodels:
  * Likelihood ratio test for signal vs background-only
  * Calculate p-values and significance levels
  * Set confidence limits (CLs method)
- Bayesian analysis for parameter estimation
- Calculate confidence intervals and error bands

Step 11: Systematic uncertainty evaluation
- Identify sources of systematic uncertainty:
  * Detector calibration uncertainties
  * Background estimation uncertainties
  * Theoretical uncertainties (cross-sections, PDFs)
  * Monte Carlo modeling uncertainties
- Propagate uncertainties through analysis chain
- Combine statistical and systematic uncertainties
- Present as error budget

Step 12: Create comprehensive physics report
- Event displays showing candidate signal events
- Kinematic distributions (momentum, energy, angles)
- Invariant mass plots with fitted signal
- ROC curves for ML classifiers
- Cross-section measurements with error bars
- Comparison with theoretical predictions
- Systematic uncertainty breakdown
- Statistical significance calculations
- Interpretation:
  * Consistency with Standard Model
  * Constraints on new physics parameters
  * Discovery potential or exclusion limits
- Recommendations:
  * Detector improvements
  * Additional data needed
  * Future analysis strategies
- Export publication-ready PDF formatted for physics journal

Expected Output:
- Reconstructed physics events
- Signal vs background classification
- Measured cross-sections and branching ratios
- Statistical significance of observations
- Systematic uncertainty analysis
- Comprehensive experimental physics paper

Chemical Engineering & Process Optimization

Example 19: Optimization of Chemical Reactor Design and Operation

Objective: Design and optimize a continuous chemical reactor for maximum yield and efficiency while meeting safety and economic constraints.

Skills Used:

  • sympy - Symbolic equations and reaction kinetics
  • scipy - Numerical integration and optimization
  • pymoo - Multi-objective optimization
  • simpy - Process simulation
  • pymc - Bayesian parameter estimation
  • scikit-learn - Process modeling
  • stable-baselines3 - Real-time control optimization
  • matplotlib - Process diagrams
  • reportlab - Engineering reports

Workflow:

Step 1: Define reaction system and kinetics
- Chemical reaction: A + B → C + D
- Use Sympy to define symbolic rate equations:
  * Arrhenius equation: k = A × exp(-Ea/RT)
  * Rate law: r = k × [A]^α × [B]^β
- Define material and energy balances symbolically
- Include equilibrium constants and thermodynamics
- Account for side reactions and byproducts

Step 2: Develop reactor model
- Select reactor type: CSTR, PFR, batch, or semi-batch
- Write conservation equations:
  * Mass balance: dC/dt = (F_in × C_in - F_out × C)/V + r
  * Energy balance: ρCp × V × dT/dt = Q + F × ρCp × (T_in - T) - ΔH_rxn × r × V
  * Momentum balance (pressure drop)
- Include heat transfer correlations
- Model mixing and mass transfer limitations
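
A sketch of a non-isothermal CSTR model integrated with scipy; every parameter value below is an illustrative placeholder, not a design recommendation.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameters for A + B -> C (placeholders only)
A_pre, Ea, R = 1.0e6, 5.5e4, 8.314        # 1/(M*s), J/mol, J/(mol*K)
V, F = 100.0, 1.0                         # reactor volume (L), volumetric flow (L/s)
CA_in, CB_in, T_in = 2.0, 2.0, 350.0      # feed concentrations (mol/L) and temperature (K)
rho_cp, dH = 4.0e3, -2.0e4                # J/(L*K), J/mol (exothermic)
UA, T_cool = 500.0, 300.0                 # heat-transfer term (W/K), coolant temperature (K)

def cstr(t, y):
    CA, CB, CC, T = y
    k = A_pre * np.exp(-Ea / (R * T))     # Arrhenius rate constant
    r = k * CA * CB                       # second-order rate law
    dCA = F / V * (CA_in - CA) - r        # mole balances
    dCB = F / V * (CB_in - CB) - r
    dCC = -F / V * CC + r
    dT = (F / V * (T_in - T)              # energy balance: flow, reaction heat, cooling
          + (-dH) * r / rho_cp
          + UA * (T_cool - T) / (rho_cp * V))
    return [dCA, dCB, dCC, dT]

sol = solve_ivp(cstr, (0.0, 2000.0), [CA_in, CB_in, 0.0, T_in], method="BDF")
print(f"steady state: C_C ~ {sol.y[2, -1]:.2f} mol/L, T ~ {sol.y[3, -1]:.1f} K")
```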

Step 3: Parameter estimation with PyMC
- Load experimental data from pilot reactor
- Bayesian inference to estimate kinetic parameters:
  * Pre-exponential factor (A)
  * Activation energy (Ea)
  * Reaction orders (α, β)
- Use MCMC sampling with PyMC
- Incorporate prior knowledge from literature
- Calculate posterior distributions and credible intervals
- Assess parameter uncertainty and correlation
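
A sketch of the Bayesian kinetic-parameter fit in PyMC, worked on the log scale with synthetic "pilot reactor" data and ArviZ for the posterior summary; priors, data, and parameter values are all illustrative.

```python
import numpy as np
import pymc as pm
import arviz as az

# Synthetic rate measurements at varying temperature and concentrations (illustrative)
rng = np.random.default_rng(0)
T = rng.uniform(320, 480, 40)                              # K
CA, CB = rng.uniform(0.5, 2.0, 40), rng.uniform(0.5, 2.0, 40)
rate = 1e6 * np.exp(-5.5e4 / (8.314 * T)) * CA * CB * rng.lognormal(0.0, 0.1, 40)

with pm.Model():
    lnA = pm.Normal("lnA", mu=14.0, sigma=3.0)             # ln pre-exponential factor
    Ea = pm.Normal("Ea", mu=6.0e4, sigma=2.0e4)            # activation energy (J/mol)
    alpha = pm.Normal("alpha", mu=1.0, sigma=0.5)          # order in A
    beta = pm.Normal("beta", mu=1.0, sigma=0.5)            # order in B
    noise = pm.HalfNormal("noise", sigma=0.3)

    mu = lnA - Ea / (8.314 * T) + alpha * np.log(CA) + beta * np.log(CB)
    pm.Normal("obs", mu=mu, sigma=noise, observed=np.log(rate))

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(az.summary(idata, var_names=["lnA", "Ea", "alpha", "beta"]))  # posteriors + credible intervals
```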

Step 4: Model validation
- Simulate reactor with estimated parameters using scipy.integrate
- Compare predictions with experimental data
- Calculate goodness of fit (R², RMSE)
- Perform sensitivity analysis:
  * Which parameters most affect yield?
  * Identify critical operating conditions
- Refine model if needed

Step 5: Machine learning surrogate model
- Train fast surrogate model with scikit-learn
- Generate training data from detailed model (1000+ runs)
- Features: T, P, residence time, feed composition, catalyst loading
- Target: yield, selectivity, conversion
- Models: Gaussian Process Regression, Random Forest
- Validate surrogate accuracy (R² > 0.95)
- Use for rapid optimization

Step 6: Single-objective optimization
- Maximize yield with scipy.optimize:
  * Decision variables: T, P, feed ratio, residence time
  * Objective: maximize Y = (moles C produced) / (moles A fed)
  * Constraints:
    - Temperature: 300 K ≤ T ≤ 500 K (safety)
    - Pressure: 1 bar ≤ P ≤ 50 bar (equipment limits)
    - Residence time: 1 min ≤ τ ≤ 60 min
    - Conversion: X_A ≥ 90%
- Use Sequential Least Squares Programming (SLSQP)
- Identify optimal operating point

Step 7: Multi-objective optimization with PyMOO
- Competing objectives:
  1. Maximize product yield
  2. Minimize energy consumption (heating/cooling)
  3. Minimize operating cost (raw materials, utilities)
  4. Maximize reactor productivity (throughput)
- Constraints:
  - Safety: temperature and pressure limits
  - Environmental: waste production limits
  - Economic: minimum profitability
- Run NSGA-II or NSGA-III
- Generate Pareto front of optimal solutions
- Select operating point based on preferences
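
A sketch of an NSGA-II setup in pymoo using two of the objectives above (yield vs. energy); the objective expressions are toy surrogates, not a real reactor model.

```python
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

class ReactorTradeoff(ElementwiseProblem):
    """Decision variables: x[0] = temperature (K), x[1] = residence time (min)."""
    def __init__(self):
        super().__init__(n_var=2, n_obj=2,
                         xl=np.array([300.0, 1.0]), xu=np.array([500.0, 60.0]))

    def _evaluate(self, x, out, *args, **kwargs):
        T, tau = x
        yield_est = 1.0 - np.exp(-1.0e-4 * T * tau)    # toy yield surrogate
        energy = 0.002 * (T - 300.0) * tau             # toy energy-cost surrogate
        out["F"] = [-yield_est, energy]                # negate yield: pymoo minimizes

res = minimize(ReactorTradeoff(), NSGA2(pop_size=60), ("n_gen", 100),
               seed=1, verbose=False)
pareto = np.column_stack([-res.F[:, 0], res.F[:, 1]])  # (yield, energy) on the Pareto front
print(f"{len(pareto)} Pareto-optimal operating points")
```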

Step 8: Dynamic process simulation with SimPy
- Model complete plant:
  * Reactors, separators, heat exchangers
  * Pumps, compressors, valves
  * Storage tanks and buffers
- Simulate startup, steady-state, and shutdown
- Include disturbances:
  * Feed composition variations
  * Equipment failures
  * Demand fluctuations
- Evaluate dynamic stability
- Calculate time to steady state

Step 9: Control system design
- Design feedback control loops:
  * Temperature control (PID controller)
  * Pressure control
  * Flow control
  * Level control
- Tune PID parameters using Ziegler-Nichols or optimization
- Implement cascade control for improved performance
- Add feedforward control for disturbance rejection

Step 10: Reinforcement learning for advanced control
- Use Stable-Baselines3 to train RL agent:
  * Environment: reactor simulation (SimPy-based)
  * State: T, P, concentrations, flow rates
  * Actions: adjust setpoints, flow rates, heating/cooling
  * Reward: product yield minus energy cost minus a penalty for setpoint deviation
- Train PPO or TD3 agent
- Compare with conventional PID control
- Evaluate performance under disturbances
- Implement model-free adaptive control

Step 11: Economic analysis
- Calculate capital costs (CAPEX):
  * Reactor vessel cost (function of size, pressure rating)
  * Heat exchanger costs
  * Pumps and instrumentation
  * Installation costs
- Calculate operating costs (OPEX):
  * Raw materials (A, B, catalyst)
  * Utilities (steam, cooling water, electricity)
  * Labor and maintenance
- Revenue from product sales
- Calculate economic metrics:
  * Net present value (NPV)
  * Internal rate of return (IRR)
  * Payback period
  * Levelized cost of production

Step 12: Safety analysis
- Identify hazards:
  * Exothermic runaway reactions
  * Pressure buildup
  * Toxic or flammable materials
- Perform HAZOP-style analysis
- Calculate safe operating limits:
  * Maximum temperature of synthesis reaction (MTSR)
  * Adiabatic temperature rise
  * Relief valve sizing
- Design emergency shutdown systems
- Implement safety interlocks

Step 13: Uncertainty quantification
- Propagate parameter uncertainties from PyMC:
  * How does kinetic parameter uncertainty affect yield?
  * Monte Carlo simulation with parameter distributions
- Evaluate robustness of optimal design
- Calculate confidence intervals on economic metrics
- Identify critical uncertainties for further study

Step 14: Generate comprehensive engineering report
- Executive summary of project objectives and results
- Process flow diagram (PFD) with material and energy streams
- Reaction kinetics and model equations
- Parameter estimation results with uncertainties
- Optimization results:
  * Pareto front for multi-objective optimization
  * Recommended operating conditions
  * Trade-off analysis
- Dynamic simulation results (startup curves, response to disturbances)
- Control system design and tuning
- Economic analysis with sensitivity to key assumptions
- Safety analysis and hazard mitigation
- Scale-up considerations:
  * Pilot to commercial scale
  * Heat and mass transfer limitations
  * Equipment sizing
- Recommendations:
  * Optimal reactor design (size, type, materials of construction)
  * Operating conditions for maximum profitability
  * Control strategy
  * Further experimental studies needed
- Technical drawings and P&ID (piping and instrumentation diagram)
- Export as professional engineering report (PDF)

Expected Output:
- Validated reactor model with parameter uncertainties
- Optimal reactor design and operating conditions
- Pareto-optimal solutions for multi-objective optimization
- Dynamic process simulation results
- Advanced control strategies (RL-based)
- Economic feasibility analysis
- Safety assessment
- Comprehensive chemical engineering design report

Summary

These examples demonstrate:

  1. Cross-domain applicability: Skills are useful across many scientific fields
  2. Skill integration: Complex workflows combine multiple databases, packages, and analysis methods
  3. Real-world relevance: Examples address actual research questions and clinical needs
  4. End-to-end workflows: From data acquisition to publication-ready reports
  5. Best practices: QC, statistical rigor, visualization, interpretation, and documentation

How to Use These Examples

  1. Adapt to your needs: Modify parameters, datasets, and objectives for your specific research question
  2. Combine skills creatively: Mix and match skills from different categories
  3. Follow the structure: Each example provides a clear step-by-step workflow
  4. Generate comprehensive output: Aim for publication-quality figures and professional reports
  5. Cite your sources: Always verify data and provide proper citations

Additional Notes

  • Always start with: "Always use available 'skills' when possible. Keep the output organized."
  • For complex projects, break into manageable steps and validate intermediate results
  • Save checkpoints and intermediate data files
  • Document parameters and decisions for reproducibility
  • Generate README files explaining methodology
  • Create PDFs for stakeholder communication

These examples showcase the power of combining the skills in this repository to tackle complex, real-world scientific challenges across multiple domains.