Real-World Scientific Examples

This document provides comprehensive, practical examples demonstrating how to combine Claude Scientific Skills to solve real scientific problems across multiple domains.


📋 Table of Contents

  1. Drug Discovery & Medicinal Chemistry
  2. Cancer Genomics & Precision Medicine
  3. Single-Cell Transcriptomics
  4. Protein Structure & Function
  5. Chemical Safety & Toxicology
  6. Clinical Trial Analysis
  7. Metabolomics & Systems Biology
  8. Materials Science & Chemistry
  9. Digital Pathology
  10. Lab Automation & Protocol Design
  11. Agricultural Genomics
  12. Neuroscience & Brain Imaging
  13. Environmental Microbiology
  14. Infectious Disease Research
  15. Multi-Omics Integration
  16. Computational Chemistry & Synthesis
  17. Clinical Research & Real-World Evidence
  18. Experimental Physics & Data Analysis
  19. Chemical Engineering & Process Optimization

Drug Discovery & Medicinal Chemistry

Example 1: Discovery of Novel EGFR Inhibitors for Lung Cancer

Objective: Identify novel small molecule inhibitors of EGFR with improved properties compared to existing drugs.

Skills Used:

  • chembl-database - Query bioactivity data
  • pubchem-database - Search compound libraries
  • rdkit - Analyze molecular properties
  • datamol - Generate analogs
  • diffdock - Molecular docking
  • alphafold-database - Retrieve protein structure
  • pubmed-database - Literature review
  • cosmic-database - Query mutations
  • deepchem - Property prediction
  • scientific-visualization - Create figures

Workflow:

# Always use available 'skills' when possible. Keep the output organized.

Step 1: Query ChEMBL for known EGFR inhibitors with high potency
- Search for compounds targeting EGFR (CHEMBL203)
- Filter: IC50 < 50 nM, pChEMBL value > 7
- Extract SMILES strings and activity data
- Export to DataFrame for analysis

Step 2: Analyze structure-activity relationships
- Load compounds into RDKit
- Calculate molecular descriptors (MW, LogP, TPSA, HBD, HBA)
- Generate Morgan fingerprints (radius=2, 2048 bits)
- Perform hierarchical clustering to identify scaffolds
- Visualize top scaffolds with activity annotations
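
A minimal sketch of the Step 2 calculations with RDKit and SciPy, assuming the ChEMBL hits from Step 1 are available as a list of SMILES strings (`smiles_list` is a placeholder); clustering here groups compounds by Morgan-fingerprint Tanimoto distance as a proxy for scaffold families:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, Descriptors, DataStructs
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder input: SMILES exported from the ChEMBL query in Step 1
smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccc2ncccc2c1", "Nc1ncnc2[nH]cnc12"]
mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles_list) if m is not None]

# Physicochemical descriptors (MW, LogP, TPSA, HBD, HBA)
descriptors = [
    {
        "MW": Descriptors.MolWt(m),
        "LogP": Descriptors.MolLogP(m),
        "TPSA": Descriptors.TPSA(m),
        "HBD": Descriptors.NumHDonors(m),
        "HBA": Descriptors.NumHAcceptors(m),
    }
    for m in mols
]

# Morgan fingerprints (radius=2, 2048 bits) and hierarchical clustering on
# Tanimoto distances (condensed distance vector for scipy's linkage)
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
dists = [
    1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])
    for i in range(len(fps)) for j in range(i + 1, len(fps))
]
clusters = fcluster(linkage(dists, method="average"), t=0.4, criterion="distance")
print(descriptors[0], list(clusters))
```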

Step 3: Identify resistance mutations from COSMIC
- Query COSMIC for EGFR mutations in lung cancer
- Focus on gatekeeper mutations (T790M, C797S)
- Extract mutation frequencies and clinical significance
- Cross-reference with literature in PubMed

Step 4: Retrieve EGFR structure from AlphaFold
- Download AlphaFold prediction for EGFR kinase domain
- Alternatively, use experimental structure from PDB (if available)
- Prepare structure for docking (add hydrogens, optimize)

Step 5: Generate novel analogs using datamol
- Select top 5 scaffolds from ChEMBL analysis
- Use scaffold decoration to generate 100 analogs per scaffold
- Apply Lipinski's Rule of Five filtering
- Ensure synthetic accessibility (SA score < 4)
- Check for PAINS and unwanted substructures
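
The scaffold decoration itself is handled by datamol; the sketch below covers only the filtering half of Step 5 (Lipinski's Rule of Five plus a PAINS substructure check) using RDKit's bundled PAINS catalog. The SA-score filter needs RDKit's sascorer contrib module and is omitted here:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# PAINS catalog shipped with RDKit
params = FilterCatalogParams()
params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
pains = FilterCatalog(params)

def passes_filters(mol):
    """Lipinski Ro5 plus PAINS rejection."""
    ro5_ok = (
        Descriptors.MolWt(mol) <= 500
        and Descriptors.MolLogP(mol) <= 5
        and Descriptors.NumHDonors(mol) <= 5
        and Descriptors.NumHAcceptors(mol) <= 10
    )
    return ro5_ok and not pains.HasMatch(mol)

# 'analog_smiles' stands in for the analogs generated with datamol
analog_smiles = ["CCOc1ccc2nc(S(N)(=O)=O)sc2c1"]
kept = [s for s in analog_smiles
        if (m := Chem.MolFromSmiles(s)) is not None and passes_filters(m)]
print(f"{len(kept)}/{len(analog_smiles)} analogs pass the filters")
```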

Step 6: Predict properties with DeepChem
- Train graph convolutional model on ChEMBL EGFR data
- Predict pIC50 for generated analogs
- Predict ADMET properties (solubility, permeability, hERG)
- Rank candidates by predicted potency and drug-likeness

Step 7: Virtual screening with DiffDock
- Perform molecular docking on top 50 candidates
- Dock into wild-type EGFR and T790M mutant
- Calculate binding energies and interaction patterns
- Identify compounds with favorable binding to both forms

Step 8: Search PubChem for commercial availability
- Query PubChem for top 10 candidates by InChI key
- Check supplier information and purchasing options
- Identify close analogs if exact matches unavailable

Step 9: Literature validation with PubMed
- Search for any prior art on top scaffolds
- Query: "[scaffold_name] AND EGFR AND inhibitor"
- Summarize relevant findings and potential liabilities

Step 10: Create comprehensive report
- Generate 2D structure visualizations of top hits
- Create scatter plots: MW vs LogP, TPSA vs potency
- Produce binding pose figures for top 3 compounds
- Generate table comparing properties to approved drugs (gefitinib, erlotinib)
- Write scientific summary with methodology, results, and recommendations
- Export to PDF with proper citations

Expected Output: 
- Ranked list of 10-20 novel EGFR inhibitor candidates
- Predicted activity and ADMET properties
- Docking poses and binding analysis
- Comprehensive scientific report with publication-quality figures

Example 2: Drug Repurposing for Rare Diseases

Objective: Identify FDA-approved drugs that could be repurposed for treating a rare metabolic disorder.

Skills Used:

  • drugbank-database - Query approved drugs
  • opentargets-database - Target-disease associations
  • string-database - Protein interactions
  • kegg-database - Pathway analysis
  • reactome-database - Pathway enrichment
  • clinicaltrials-database - Check ongoing trials
  • fda-database - Drug approvals and safety
  • networkx - Network analysis
  • literature-review - Systematic review

Workflow:

Step 1: Define disease pathway
- Query KEGG and Reactome for disease-associated pathways
- Identify key proteins and enzymes involved
- Map upstream and downstream pathway components

Step 2: Find protein-protein interactions
- Query STRING database for interaction partners
- Build protein interaction network around key disease proteins
- Identify hub proteins and bottlenecks using NetworkX
- Calculate centrality metrics (betweenness, closeness)
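
A small NetworkX sketch for the hub/bottleneck analysis in Step 2; the edge list is a hypothetical placeholder for interactions returned by the STRING query (confidence scores are ignored here, but could be converted to distances for weighted centrality):

```python
import networkx as nx

# Placeholder edges (protein A, protein B) from a STRING interaction query
string_edges = [
    ("GBA", "SCARB2"), ("GBA", "PSAP"), ("PSAP", "CTSD"),
    ("SCARB2", "CTSD"), ("CTSD", "LAMP1"), ("PSAP", "LAMP1"),
]

G = nx.Graph()
G.add_edges_from(string_edges)

# Centrality metrics used to flag hubs and bottlenecks
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
degree = dict(G.degree())

for node in sorted(G.nodes, key=betweenness.get, reverse=True):
    print(f"{node}: degree={degree[node]}, "
          f"betweenness={betweenness[node]:.3f}, closeness={closeness[node]:.3f}")
```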

Step 3: Query Open Targets for druggable targets
- Search for targets associated with disease phenotype
- Filter by clinical precedence and tractability
- Prioritize targets with existing approved drugs

Step 4: Search DrugBank for drugs targeting identified proteins
- Query for approved drugs and their targets
- Filter by mechanism of action relevant to disease
- Retrieve drug properties and safety information

Step 5: Query FDA databases for safety profiles
- Check FDA adverse event database (FAERS)
- Review drug labels and black box warnings
- Assess risk-benefit for rare disease population

Step 6: Search ClinicalTrials.gov for prior repurposing attempts
- Query for disease name + drug names
- Check for failed trials (and reasons for failure)
- Identify ongoing trials that may compete

Step 7: Perform pathway enrichment analysis
- Map drug targets to disease pathways
- Calculate enrichment scores with Reactome
- Identify drugs affecting multiple pathway nodes

Step 8: Conduct systematic literature review
- Search PubMed for drug name + disease associations
- Include bioRxiv for recent unpublished findings
- Document any case reports or off-label use
- Use literature-review skill to generate comprehensive review

Step 9: Prioritize candidates
- Rank by: pathway relevance, safety profile, existing evidence
- Consider factors: oral availability, blood-brain barrier penetration
- Assess commercial viability and patent status

Step 10: Generate repurposing report
- Create network visualization of drug-target-pathway relationships
- Generate comparison table of top 5 candidates
- Write detailed rationale for each candidate
- Include mechanism of action diagrams
- Provide recommendations for preclinical validation
- Format as professional PDF with citations

Expected Output:
- Ranked list of 5-10 repurposing candidates
- Network analysis of drug-target-disease relationships
- Safety and efficacy evidence summary
- Repurposing strategy report with next steps

Cancer Genomics & Precision Medicine

Example 3: Clinical Variant Interpretation Pipeline

Objective: Analyze a patient's tumor sequencing data to identify actionable mutations and therapeutic recommendations.

Skills Used:

  • pysam - Parse VCF files
  • ensembl-database - Variant annotation
  • clinvar-database - Clinical significance
  • cosmic-database - Somatic mutations
  • gene-database - Gene information
  • uniprot-database - Protein impact
  • drugbank-database - Drug-gene associations
  • clinicaltrials-database - Matching trials
  • opentargets-database - Target validation
  • pubmed-database - Literature evidence
  • reportlab - Generate clinical report

Workflow:

Step 1: Parse and filter VCF file
- Use pysam to read tumor VCF
- Filter for high-quality variants (QUAL > 30, DP > 20)
- Extract variant positions, alleles, and VAF (variant allele frequency)
- Separate SNVs, indels, and structural variants
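
A minimal pysam sketch of the Step 1 filters, assuming a bgzipped, indexed tumor VCF at a placeholder path and taking VAF from the first sample's AD field when it is present:

```python
import pysam

vcf = pysam.VariantFile("tumor.vcf.gz")  # placeholder path

passing = []
for rec in vcf:
    dp = rec.info.get("DP", 0)
    if rec.qual is None or rec.qual <= 30 or dp <= 20:
        continue  # require QUAL > 30 and DP > 20
    sample = next(iter(rec.samples.values()))
    ad = sample["AD"] if "AD" in sample else None
    vaf = ad[1] / sum(ad) if ad and sum(ad) > 0 else None
    alts = rec.alts or ()
    kind = "SNV" if all(len(a) == 1 for a in (rec.ref, *alts)) else "indel/SV"
    passing.append((rec.chrom, rec.pos, rec.ref, alts, vaf, kind))

print(f"{len(passing)} variants pass the quality filters")
```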

Step 2: Annotate variants with Ensembl
- Query Ensembl VEP API for functional consequences
- Classify variants: missense, nonsense, frameshift, splice site
- Extract transcript information and protein changes
- Identify canonical transcripts for each gene

Step 3: Query ClinVar for known pathogenic variants
- Search ClinVar by genomic coordinates
- Extract clinical significance classifications
- Note conflicting interpretations and review status
- Prioritize variants with "Pathogenic" or "Likely Pathogenic" labels

Step 4: Query COSMIC for somatic cancer mutations
- Search COSMIC for each variant
- Extract mutation frequency across cancer types
- Identify hotspot mutations (high recurrence)
- Note drug resistance mutations

Step 5: Retrieve gene information from NCBI Gene
- Get detailed gene descriptions
- Extract associated phenotypes and diseases
- Identify oncogene vs tumor suppressor classification
- Note gene function and biological pathways

Step 6: Assess protein-level impact with UniProt
- Query UniProt for protein domain information
- Map variants to functional domains (kinase domain, binding site)
- Check if variant affects active sites or protein stability
- Retrieve post-translational modification sites

Step 7: Search DrugBank for targetable alterations
- Query for drugs targeting mutated genes
- Filter for FDA-approved and investigational drugs
- Extract mechanism of action and indications
- Prioritize variants with approved targeted therapies

Step 8: Query Open Targets for target-disease associations
- Validate therapeutic hypotheses
- Assess target tractability scores
- Review clinical precedence for each gene-disease pair

Step 9: Search ClinicalTrials.gov for matching trials
- Build query with: cancer type + gene names + variants
- Filter for: recruiting status, phase II/III trials
- Extract trial eligibility criteria
- Note geographic locations and contact information

Step 10: Literature search for clinical evidence
- PubMed query: "[gene] AND [variant] AND [cancer type]"
- Focus on: case reports, clinical outcomes, resistance mechanisms
- Extract relevant prognostic or predictive information

Step 11: Classify variants by actionability
Tier 1: FDA-approved therapy for this variant
Tier 2: Clinical trial available for this variant
Tier 3: Therapy approved for variant in different cancer
Tier 4: Biological evidence but no approved therapy

Step 12: Generate clinical genomics report
- Executive summary of key findings
- Table of actionable variants with evidence levels
- Therapeutic recommendations with supporting evidence
- Clinical trial options with eligibility information
- Prognostic implications based on mutation profile
- References to guidelines (NCCN, ESMO, AMP/ASCO/CAP)
- Generate professional PDF using ReportLab

Expected Output:
- Annotated variant list with clinical significance
- Tiered list of actionable mutations
- Therapeutic recommendations with evidence levels
- Matching clinical trials
- Comprehensive clinical genomics report (PDF)

Example 4: Cancer Subtype Classification from Gene Expression

Objective: Classify breast cancer subtypes using RNA-seq data and identify subtype-specific therapeutic vulnerabilities.

Skills Used:

  • pydeseq2 - Differential expression
  • scanpy - Clustering and visualization
  • scikit-learn - Machine learning classification
  • gene-database - Gene annotation
  • reactome-database - Pathway analysis
  • opentargets-database - Drug targets
  • pubmed-database - Literature validation
  • matplotlib - Visualization
  • seaborn - Heatmaps

Workflow:

Step 1: Load and preprocess RNA-seq data
- Load count matrix (genes × samples)
- Filter low-expression genes (mean counts < 10)
- Normalize with DESeq2 size factors
- Apply variance-stabilizing transformation (VST)

Step 2: Classify samples using PAM50 genes
- Query NCBI Gene for PAM50 classifier gene list
- Extract expression values for PAM50 genes
- Train Random Forest classifier on labeled training data
- Predict subtypes: Luminal A, Luminal B, HER2+, Basal, Normal-like
- Validate with published markers (ESR1, PGR, ERBB2, MKI67)

Step 3: Perform differential expression for each subtype
- Use PyDESeq2 to compare each subtype vs all others
- Apply multiple testing correction (FDR < 0.05)
- Filter by log2 fold change (|LFC| > 1.5)
- Identify subtype-specific signature genes
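
A sketch of the one-vs-rest differential expression in Step 3 with PyDESeq2, assuming the 0.4-style API (`design_factors`; newer releases use a `design` formula instead), raw counts as a samples × genes DataFrame, and a metadata column `subtype` coded as "Basal" vs "Other":

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# Placeholder inputs: integer counts (samples x genes, transpose if needed)
# and per-sample metadata with a 'subtype' column
counts = pd.read_csv("counts.csv", index_col=0)
metadata = pd.read_csv("metadata.csv", index_col=0)

dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="subtype")
dds.deseq2()  # size factors, dispersions, and GLM fitting

stats = DeseqStats(dds, contrast=["subtype", "Basal", "Other"])
stats.summary()
res = stats.results_df

# FDR < 0.05 and |log2FC| > 1.5, as specified above
signature = res[(res["padj"] < 0.05) & (res["log2FoldChange"].abs() > 1.5)]
print(signature.sort_values("padj").head())
```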

Step 4: Annotate differentially expressed genes
- Query NCBI Gene for detailed annotations
- Classify as oncogene, tumor suppressor, or other
- Extract biological process and molecular function terms

Step 5: Pathway enrichment analysis
- Submit gene lists to Reactome API
- Identify enriched pathways for each subtype (p < 0.01)
- Focus on druggable pathways (kinase signaling, metabolism)
- Compare pathway profiles across subtypes

Step 6: Identify therapeutic targets with Open Targets
- Query Open Targets for each upregulated gene
- Filter by tractability score > 5
- Prioritize targets with clinical precedence
- Extract associated drugs and development phase

Step 7: Create comprehensive visualization
- Generate UMAP projection of all samples colored by subtype
- Create heatmap of PAM50 genes across subtypes
- Produce volcano plots for each subtype comparison
- Generate pathway enrichment dot plots
- Create drug target-pathway network diagrams

Step 8: Literature validation
- Search PubMed for each predicted therapeutic target
- Query: "[gene] AND [subtype] AND breast cancer AND therapy"
- Summarize clinical evidence and ongoing trials
- Note any resistance mechanisms reported

Step 9: Generate subtype-specific recommendations
For each subtype:
- List top 5 differentially expressed genes
- Identify enriched biological pathways
- Recommend therapeutic strategies based on vulnerabilities
- Cite supporting evidence from literature

Step 10: Create comprehensive report
- Classification results with confidence scores
- Differential expression tables for each subtype
- Pathway enrichment summaries
- Therapeutic target recommendations
- Publication-quality figures
- Export to PDF with citations

Expected Output:
- Sample classification into molecular subtypes
- Subtype-specific gene signatures
- Pathway enrichment profiles
- Prioritized therapeutic targets for each subtype
- Scientific report with visualizations and recommendations

Single-Cell Transcriptomics

Example 5: Single-Cell Atlas of Tumor Microenvironment

Objective: Characterize immune cell populations in tumor microenvironment and identify immunotherapy biomarkers.

Skills Used:

  • scanpy - Single-cell analysis
  • scvi-tools - Batch correction and integration
  • cellxgene-census - Reference data
  • gene-database - Cell type markers
  • anndata - Data structure
  • arboreto - Gene regulatory networks
  • pytorch-lightning - Deep learning
  • matplotlib - Visualization
  • statistical-analysis - Hypothesis testing

Workflow:

Step 1: Load and QC 10X Genomics data
- Use Scanpy to read 10X h5 files
- Calculate QC metrics: n_genes, n_counts, pct_mitochondrial
- Identify mitochondrial genes (MT- prefix)
- Filter cells: 200 < n_genes < 5000, pct_mt < 20%
- Filter genes: expressed in at least 10 cells
- Document filtering criteria and cell retention rate

Step 2: Normalize and identify highly variable genes
- Normalize to 10,000 counts per cell
- Log-transform data (log1p)
- Store raw counts in adata.raw
- Identify 3,000 highly variable genes
- Regress out technical variation (n_counts, pct_mt)
- Scale to unit variance, clip at 10 standard deviations
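
A minimal Scanpy sketch covering the QC and normalization in Steps 1-2, assuming a single 10x Genomics h5 file at a placeholder path:

```python
import scanpy as sc

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")  # placeholder path
adata.var_names_make_unique()

# QC metrics, including percent mitochondrial counts (MT- prefix)
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Cell and gene filters from Step 1
keep = (
    (adata.obs["n_genes_by_counts"] > 200)
    & (adata.obs["n_genes_by_counts"] < 5000)
    & (adata.obs["pct_counts_mt"] < 20)
)
adata = adata[keep].copy()
sc.pp.filter_genes(adata, min_cells=10)

# Normalization, HVG selection, regression, and scaling from Step 2
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
```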

Step 3: Integrate with reference atlas using scVI
- Download reference tumor microenvironment data from Cellxgene Census
- Train scVI model on combined dataset for batch correction
- Use scVI latent representation for downstream analysis
- Generate batch-corrected expression matrix

Step 4: Dimensionality reduction and clustering
- Compute neighborhood graph (n_neighbors=15, n_pcs=50)
- Calculate UMAP embedding for visualization
- Perform Leiden clustering at multiple resolutions (0.3, 0.5, 0.8)
- Select optimal resolution based on silhouette score
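
Continuing the sketch above into Step 4, a standard Scanpy neighbors/UMAP/Leiden pass; when the scVI integration from Step 3 is used, `sc.pp.neighbors` would instead take `use_rep="X_scVI"` (key name assumed) rather than PCA:

```python
import scanpy as sc

sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)
sc.tl.umap(adata)

# Leiden clustering at several resolutions, each stored under its own key
for res in (0.3, 0.5, 0.8):
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")

# Marker genes per cluster (Wilcoxon test) for the chosen resolution, feeding Step 5
sc.tl.rank_genes_groups(adata, groupby="leiden_0.5", method="wilcoxon")
sc.pl.umap(adata, color="leiden_0.5", save="_clusters.png")
```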

Step 5: Identify cell type markers
- Run differential expression for each cluster (Wilcoxon test)
- Calculate marker scores (log fold change, p-value, pct expressed)
- Query NCBI Gene for canonical immune cell markers:
  * T cells: CD3D, CD3E, CD4, CD8A
  * B cells: CD19, MS4A1 (CD20), CD79A
  * Myeloid: CD14, CD68, CD163
  * NK cells: NKG7, GNLY, NCAM1
  * Dendritic: CD1C, CLEC9A, LILRA4

Step 6: Annotate cell types
- Assign cell type labels based on marker expression
- Refine annotations with CellTypist or manual curation
- Identify T cell subtypes: CD4+, CD8+, Tregs, exhausted T cells
- Characterize myeloid cells: M1/M2 macrophages, dendritic cells
- Create cell type proportion tables by sample/condition

Step 7: Identify tumor-specific features
- Compare tumor samples vs normal tissue (if available)
- Identify expanded T cell clones (high proliferation markers)
- Detect exhausted T cells (PDCD1, CTLA4, LAG3, HAVCR2)
- Characterize immunosuppressive populations (Tregs, M2 macrophages)

Step 8: Gene regulatory network inference
- Use Arboreto/GRNBoost2 on each major cell type
- Identify transcription factors driving cell states
- Focus on exhaustion TFs: TOX, TCF7, EOMES
- Build regulatory networks for visualization

Step 9: Statistical analysis of cell proportions
- Calculate cell type frequencies per sample
- Test for significant differences between groups (responders vs non-responders)
- Use statistical-analysis skill for appropriate tests (t-test, Mann-Whitney)
- Calculate effect sizes and confidence intervals

Step 10: Biomarker discovery for immunotherapy response
- Correlate cell type abundances with clinical response
- Identify gene signatures associated with response
- Test signatures: T cell exhaustion, antigen presentation, inflammation
- Validate with published immunotherapy response signatures

Step 11: Create comprehensive visualizations
- UMAP plots colored by: cell type, sample, treatment, key genes
- Dot plots of canonical markers across cell types
- Cell type proportion bar plots by condition
- Heatmap of top differentially expressed genes per cell type
- Gene regulatory network diagrams
- Volcano plots for differentially abundant cell types

Step 12: Generate scientific report
- Methods: QC, normalization, batch correction, clustering
- Results: Cell type composition, differential abundance, markers
- Biomarker analysis: Predictive signatures and validation
- High-quality figures suitable for publication
- Export processed h5ad file and PDF report

Expected Output:
- Annotated single-cell atlas with cell type labels
- Cell type composition analysis
- Biomarker signatures for immunotherapy response
- Gene regulatory networks for key cell states
- Comprehensive report with publication-quality figures

Protein Structure & Function

Example 6: Structure-Based Design of Protein-Protein Interaction Inhibitors

Objective: Design small molecules to disrupt a therapeutically relevant protein-protein interaction.

Skills Used:

  • alphafold-database - Protein structures
  • pdb-database - Experimental structures
  • uniprot-database - Protein information
  • biopython - Structure analysis
  • pyrosetta - Protein design (if available)
  • rdkit - Chemical library generation
  • diffdock - Molecular docking
  • zinc-database - Screening library
  • deepchem - Property prediction
  • pymol - Visualization (external)

Workflow:

Step 1: Retrieve protein structures
- Query AlphaFold Database for both proteins in the interaction
- Download PDB files and confidence scores
- If available, get experimental structures from PDB database
- Compare AlphaFold predictions with experimental structures (if any)

Step 2: Analyze protein interaction interface
- Load structures with BioPython
- Identify interface residues (distance < 5Å between proteins)
- Calculate interface area and binding energy contribution
- Identify hot spot residues (key for binding)
- Map to UniProt to get functional annotations
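
A Biopython sketch of the interface-residue scan in Step 2, assuming a placeholder PDB file containing both partners as chains A and B:

```python
from Bio.PDB import PDBParser, NeighborSearch

parser = PDBParser(QUIET=True)
structure = parser.get_structure("complex", "complex.pdb")  # placeholder file
model = structure[0]
chain_a, chain_b = model["A"], model["B"]  # assumed chain IDs

# Interface residues: any residue in chain A with an atom within 5 Å of chain B
ns = NeighborSearch(list(chain_b.get_atoms()))
interface = set()
for residue in chain_a:
    for atom in residue:
        if ns.search(atom.coord, 5.0):
            interface.add((residue.id[1], residue.get_resname()))
            break

for resnum, resname in sorted(interface):
    print(f"Chain A {resname}{resnum} is at the interface")
```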

Step 3: Characterize binding pocket
- Identify cavities at the protein-protein interface
- Calculate pocket volume and surface area
- Assess druggability: depth, hydrophobicity, shape
- Identify hydrogen bond donors/acceptors
- Note any known allosteric sites

Step 4: Query UniProt for known modulators
- Search UniProt for both proteins
- Extract information on known inhibitors or modulators
- Review PTMs that affect interaction
- Check disease-associated mutations in interface

Step 5: Search ZINC15 for fragment library
- Query ZINC for fragments matching pocket criteria:
  * Molecular weight: 150-300 Da
  * LogP: 0-3 (appropriate for PPI inhibitors)
  * Exclude PAINS and aggregators
- Download 1,000-5,000 fragment SMILES

Step 6: Virtual screening with fragment library
- Use DiffDock to dock fragments into interface pocket
- Rank by predicted binding affinity
- Identify fragments binding to hot spot residues
- Select top 50 fragments for elaboration

Step 7: Fragment elaboration with RDKit
- For each fragment hit, generate elaborated molecules:
  * Add substituents to core scaffold
  * Merge fragments binding to adjacent pockets
  * Apply medicinal chemistry filters
- Generate 20-50 analogs per fragment
- Filter by Lipinski's Ro5 and PPI-specific rules (MW 400-700)

Step 8: Second round of virtual screening
- Dock elaborated molecules with DiffDock
- Calculate binding energies and interaction patterns
- Prioritize molecules with:
  * Strong binding to hot spot residues
  * Multiple H-bonds and hydrophobic contacts
  * Favorable predicted ΔG

Step 9: Predict ADMET properties with DeepChem
- Train models on ChEMBL data
- Predict: solubility, permeability, hERG liability
- Filter for drug-like properties
- Rank by overall score (affinity + ADMET)

Step 10: Literature and patent search
- PubMed: "[protein A] AND [protein B] AND inhibitor"
- USPTO: Check for prior art on top scaffolds
- Assess freedom to operate
- Identify any reported PPI inhibitors for this target

Step 11: Prepare molecules for synthesis
- Assess synthetic accessibility (SA score < 4)
- Identify commercial building blocks
- Propose synthetic routes for top 10 candidates
- Calculate estimated synthesis cost

Step 12: Generate comprehensive design report
- Interface analysis with hot spot identification
- Fragment screening results
- Top 10 designed molecules with predicted properties
- Docking poses and interaction diagrams
- Synthetic accessibility assessment
- Comparison to known PPI inhibitors
- Recommendations for experimental validation
- Publication-quality figures and PDF report

Expected Output:
- Interface characterization and hot spot analysis
- Ranked library of designed PPI inhibitors
- Predicted binding modes and affinities
- ADMET property predictions
- Synthetic accessibility assessment
- Comprehensive drug design report

Chemical Safety & Toxicology

Example 7: Predictive Toxicology Assessment

Objective: Assess potential toxicity and safety liabilities of drug candidates before synthesis.

Skills Used:

  • rdkit - Molecular descriptors
  • deepchem - Toxicity prediction
  • chembl-database - Toxicity data
  • pubchem-database - Bioassay data
  • drugbank-database - Known drug toxicities
  • fda-database - Adverse events
  • hmdb-database - Metabolite prediction
  • scikit-learn - Classification models
  • shap - Model interpretability

Workflow:

Step 1: Calculate molecular descriptors
- Load candidate molecules with RDKit
- Calculate physicochemical properties:
  * MW, LogP, TPSA, rotatable bonds, H-bond donors/acceptors
  * Aromatic rings, sp3 fraction, formal charge
- Calculate structural alerts:
  * PAINS patterns
  * Toxic functional groups (nitroaromatics, epoxides, etc.)
  * Genotoxic alerts (Ames mutagenicity)

Step 2: Screen for known toxicophores
- Search for structural alerts using SMARTS patterns:
  * Michael acceptors
  * Aldehyde/ketone reactivity
  * Quinones and quinone-like structures
  * Thioureas and isocyanates
- Flag molecules with high-risk substructures
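
A sketch of the Step 2 toxicophore screen with RDKit SMARTS matching; the patterns below are deliberately simple illustrations, not a validated alert set (a production screen would use curated collections such as the Brenk or PAINS filters):

```python
from rdkit import Chem

ALERTS = {
    "nitro_group": "[N+](=O)[O-]",
    "epoxide": "C1OC1",
    "aldehyde": "[CX3H1]=O",
    "michael_acceptor": "C=CC=O",
    "isocyanate": "N=C=O",
}
ALERT_PATTERNS = {name: Chem.MolFromSmarts(s) for name, s in ALERTS.items()}

def structural_alerts(smiles: str) -> list[str]:
    """Return the alert names matched by a molecule (empty list = clean)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["unparseable"]
    return [name for name, patt in ALERT_PATTERNS.items() if mol.HasSubstructMatch(patt)]

print(structural_alerts("O=CC=Cc1ccc([N+](=O)[O-])cc1"))  # flags nitro, aldehyde, Michael acceptor
```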

Step 3: Query ChEMBL for similar compounds with toxicity data
- Perform similarity search (Tanimoto > 0.7)
- Extract toxicity assay results:
  * Cytotoxicity (IC50 values)
  * Hepatotoxicity markers
  * Cardiotoxicity (hERG inhibition)
  * Genotoxicity (Ames test results)
- Analyze structure-toxicity relationships

Step 4: Search PubChem BioAssays for toxicity screening
- Query relevant assays:
  * Tox21 panel (cell viability, stress response, genotoxicity)
  * Liver toxicity assays
  * hERG channel inhibition
- Extract activity data for similar compounds
- Calculate hit rates for concerning assays

Step 5: Train toxicity prediction models with DeepChem
- Load Tox21 dataset from DeepChem
- Train graph convolutional models for:
  * Nuclear receptor signaling
  * Stress response pathways
  * Genotoxicity endpoints
- Validate models with cross-validation
- Predict toxicity for candidate molecules

Step 6: Predict hERG cardiotoxicity liability
- Train DeepChem model on hERG inhibition data from ChEMBL
- Predict IC50 for hERG channel
- Flag compounds with predicted IC50 < 10 μM
- Identify structural features associated with hERG liability

Step 7: Predict hepatotoxicity risk
- Train models on DILI (drug-induced liver injury) datasets
- Extract features: reactive metabolites, mitochondrial toxicity
- Predict hepatotoxicity risk class (low/medium/high)
- Use SHAP values to explain predictions

Step 8: Predict metabolic stability and metabolites
- Identify sites of metabolism using RDKit SMARTS patterns
- Predict CYP450 interactions
- Query HMDB for potential metabolite structures
- Assess if metabolites contain toxic substructures
- Predict metabolic stability (half-life)

Step 9: Check FDA adverse event database
- Query FAERS for approved drugs similar to candidates
- Extract common adverse events
- Identify target organ toxicities
- Calculate reporting odds ratios for serious events

Step 10: Literature review of toxicity mechanisms
- PubMed search: "[scaffold] AND (toxicity OR hepatotoxicity OR cardiotoxicity)"
- Identify mechanistic studies on similar compounds
- Note any case reports of adverse events
- Review preclinical and clinical safety data

Step 11: Assess ADME liabilities
- Predict solubility, permeability, plasma protein binding
- Identify potential drug-drug interaction risks
- Assess blood-brain barrier penetration (desired for CNS indications, a liability otherwise)
- Evaluate metabolic stability

Step 12: Generate safety assessment report
- Executive summary of safety profile for each candidate
- Red flags: structural alerts, predicted toxicities
- Yellow flags: moderate concerns requiring testing
- Green light: acceptable predicted safety profile
- Comparison table of all candidates
- Recommendations for risk mitigation:
  * Structural modifications to reduce toxicity
  * Priority in vitro assays to run
  * Preclinical study design recommendations
- Comprehensive PDF report with:
  * Toxicophore analysis
  * Prediction model results with confidence
  * SHAP interpretation plots
  * Literature evidence
  * Risk assessment matrix

Expected Output:
- Toxicity predictions for all candidates
- Structural alert analysis
- hERG, hepatotoxicity, and genotoxicity risk scores
- Metabolite predictions
- Prioritized list with safety rankings
- Comprehensive toxicology assessment report

Clinical Trial Analysis

Example 8: Competitive Landscape Analysis for New Indication

Objective: Analyze the clinical trial landscape for a specific indication to inform development strategy.

Skills Used:

  • clinicaltrials-database - Trial registry
  • fda-database - Drug approvals
  • pubmed-database - Published results
  • drugbank-database - Approved drugs
  • opentargets-database - Target validation
  • polars - Data manipulation
  • matplotlib - Visualization
  • seaborn - Statistical plots
  • reportlab - Report generation

Workflow:

Step 1: Search ClinicalTrials.gov for all trials in indication
- Query: "[disease/indication]"
- Filter: All phases, all statuses
- Extract fields:
  * NCT ID, title, phase, status
  * Start date, completion date, enrollment
  * Intervention/drug names
  * Primary/secondary outcomes
  * Sponsor and collaborators
- Export to structured JSON/CSV
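
A sketch of the Step 1 registry pull against the ClinicalTrials.gov v2 API; the endpoint, parameter names, and nested field paths shown here are assumptions to verify against the current API documentation, and the condition string is a placeholder:

```python
import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"
params = {"query.cond": "idiopathic pulmonary fibrosis", "pageSize": 100}  # placeholder condition

records, token = [], None
while True:
    if token:
        params["pageToken"] = token
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    for study in data.get("studies", []):
        proto = study.get("protocolSection", {})
        ident = proto.get("identificationModule", {})
        records.append({
            "nct_id": ident.get("nctId"),
            "title": ident.get("briefTitle"),
            "status": proto.get("statusModule", {}).get("overallStatus"),
            "phases": proto.get("designModule", {}).get("phases"),
        })
    token = data.get("nextPageToken")
    if not token:
        break

print(f"Retrieved {len(records)} trial records")
```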

Step 2: Categorize trials by mechanism of action
- Extract drug names and intervention types
- Query DrugBank for mechanism of action
- Query Open Targets for target information
- Classify into categories:
  * Small molecules vs biologics
  * Target class (kinase inhibitor, antibody, etc.)
  * Novel vs repurposing

Step 3: Analyze trial phase progression
- Calculate success rates by phase (I → II, II → III)
- Identify terminated trials and reasons for termination
- Track time from phase I start to NDA submission
- Calculate median development timelines

Step 4: Search FDA database for recent approvals
- Query FDA drug approvals in the indication (last 10 years)
- Extract approval dates, indications, priority review status
- Note any accelerated approvals or breakthroughs
- Review FDA drug labels for safety information

Step 5: Extract outcome measures
- Compile all primary endpoints used
- Identify most common endpoints:
  * Survival (OS, PFS, DFS)
  * Response rates (ORR, CR, PR)
  * Biomarker endpoints
  * Patient-reported outcomes
- Note emerging or novel endpoints

Step 6: Analyze competitive dynamics
- Identify leading companies and their pipelines
- Map trials by phase for each major competitor
- Note partnership and licensing deals
- Assess crowded vs underserved patient segments

Step 7: Search PubMed for published trial results
- Query: "[NCT ID]" for each completed trial
- Extract published outcomes and conclusions
- Identify trends in efficacy and safety
- Note any unmet needs highlighted in discussions

Step 8: Analyze target validation evidence
- Query Open Targets for target-disease associations
- Extract genetic evidence scores
- Review tractability assessments
- Compare targets being pursued across trials

Step 9: Identify unmet needs and opportunities
- Analyze trial failures for common patterns
- Identify patient populations excluded from trials
- Note resistance mechanisms or limitations mentioned
- Assess gaps in current therapeutic approaches

Step 10: Perform temporal trend analysis
- Plot trial starts over time (by phase, mechanism)
- Identify increasing or decreasing interest in targets
- Correlate with publication trends and scientific advances
- Predict future trends in the space

Step 11: Create comprehensive visualizations
- Timeline of all trials (Gantt chart style)
- Phase distribution pie chart
- Mechanism of action breakdown
- Geographic distribution of trials
- Enrollment trends over time
- Success rate funnels (Phase I → II → III → Approval)
- Sponsor/company market share

Step 12: Generate competitive intelligence report
- Executive summary of competitive landscape
- Total number of active programs by phase
- Key players and their development stage
- Standard of care and approved therapies
- Emerging approaches and novel targets
- Identified opportunities and white space
- Risk analysis (crowded targets, high failure rates)
- Strategic recommendations:
  * Patient population to target
  * Differentiation strategies
  * Partnership opportunities
  * Regulatory pathway considerations
- Export as professional PDF with citations and data tables

Expected Output:
- Comprehensive trial database for indication
- Success rate and timeline statistics
- Competitive landscape mapping
- Unmet need analysis
- Strategic recommendations
- Publication-ready report with visualizations

Metabolomics & Systems Biology

Example 9: Multi-Omics Integration for Metabolic Disease

Objective: Integrate transcriptomics, proteomics, and metabolomics to identify dysregulated pathways in metabolic disease.

Skills Used:

  • pydeseq2 - RNA-seq analysis
  • pyopenms - Mass spectrometry
  • hmdb-database - Metabolite identification
  • metabolomics-workbench-database - Public datasets
  • kegg-database - Pathway mapping
  • reactome-database - Pathway analysis
  • string-database - Protein interactions
  • statsmodels - Multi-omics correlation
  • networkx - Network analysis
  • pymc - Bayesian modeling

Workflow:

Step 1: Process RNA-seq data
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare disease vs control (adjusted p < 0.05, |LFC| > 1)
- Extract gene symbols and fold changes
- Map to KEGG gene IDs

Step 2: Process proteomics data
- Load LC-MS/MS results with PyOpenMS
- Perform peptide identification and quantification
- Normalize protein abundances
- Run statistical testing (t-test or limma)
- Extract significant proteins (p < 0.05, |FC| > 1.5)

Step 3: Process metabolomics data
- Load untargeted metabolomics data (mzML format) with PyOpenMS
- Perform peak detection and alignment
- Match features to HMDB database by accurate mass
- Annotate metabolites with MS/MS fragmentation
- Extract putative identifications (Level 2/3)
- Perform statistical analysis (FDR < 0.05, |FC| > 2)

Step 4: Search Metabolomics Workbench for public data
- Query for same disease or tissue type
- Download relevant studies
- Reprocess for consistency with own data
- Use as validation cohort

Step 5: Map all features to KEGG pathways
- Map genes to KEGG orthology (KO) terms
- Map proteins to KEGG identifiers
- Map metabolites to KEGG compound IDs
- Identify pathways with multi-omics coverage

Step 6: Perform pathway enrichment analysis
- Test for enrichment in KEGG pathways
- Test for enrichment in Reactome pathways
- Apply Fisher's exact test with multiple testing correction
- Focus on pathways with hits in ≥2 omics layers

Step 7: Build protein-metabolite networks
- Query STRING for protein-protein interactions
- Map proteins to KEGG reactions
- Connect enzymes to their substrates/products
- Build integrated network with genes → proteins → metabolites

Step 8: Network topology analysis with NetworkX
- Calculate node centrality (degree, betweenness)
- Identify hub metabolites and key enzymes
- Find bottleneck reactions
- Detect network modules with community detection
- Identify dysregulated subnetworks

Step 9: Correlation analysis across omics layers
- Calculate Spearman correlations between:
  * Gene expression and protein abundance
  * Protein abundance and metabolite levels
  * Gene expression and metabolites (for enzyme-product pairs)
- Use statsmodels for significance testing
- Focus on enzyme-metabolite pairs with expected relationships
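
A sketch of the Step 9 correlation analysis: Spearman correlations per enzyme-metabolite pair with Benjamini-Hochberg correction across all tests. The paired arrays are random placeholders standing in for matched abundances across the same samples:

```python
import numpy as np
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
# Placeholder paired measurements (protein abundance, metabolite level) per pair
pairs = {
    ("HK2", "glucose-6-phosphate"): (rng.normal(size=30), rng.normal(size=30)),
    ("CPT1A", "palmitoylcarnitine"): (rng.normal(size=30), rng.normal(size=30)),
}

names, rhos, pvals = [], [], []
for (enzyme, metabolite), (x, y) in pairs.items():
    rho, p = spearmanr(x, y)
    names.append(f"{enzyme} ~ {metabolite}")
    rhos.append(rho)
    pvals.append(p)

# FDR control across all enzyme-metabolite tests
reject, padj, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")
for name, rho, q, sig in zip(names, rhos, padj, reject):
    print(f"{name}: rho={rho:.2f}, FDR={q:.3g}, significant={sig}")
```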

Step 10: Bayesian network modeling with PyMC
- Build probabilistic graphical model of pathway
- Model causal relationships: gene → protein → metabolite
- Incorporate prior knowledge from KEGG/Reactome
- Perform inference to identify key regulatory nodes
- Estimate effect sizes and uncertainties

Step 11: Identify therapeutic targets
- Prioritize enzymes with:
  * Significant changes in all three omics layers
  * High network centrality
  * Druggable target class (kinases, transporters, etc.)
- Query DrugBank for existing inhibitors
- Search PubMed for validation in disease models

Step 12: Create comprehensive multi-omics report
- Summary statistics for each omics layer
- Venn diagram of overlapping pathway hits
- Pathway enrichment dot plots
- Integrated network visualization (color by fold change)
- Correlation heatmaps (enzyme-metabolite pairs)
- Bayesian network structure
- Table of prioritized therapeutic targets
- Biological interpretation and mechanistic insights
- Generate publication-quality figures
- Export PDF report with all results

Expected Output:
- Integrated multi-omics dataset
- Dysregulated pathway identification
- Multi-omics network model
- Prioritized list of therapeutic targets
- Comprehensive systems biology report

Materials Science & Chemistry

Example 10: High-Throughput Materials Discovery for Battery Applications

Objective: Discover novel solid electrolyte materials for lithium-ion batteries using computational screening.

Skills Used:

  • pymatgen - Materials analysis
  • matminer - Feature engineering
  • scikit-learn - Machine learning
  • pymoo - Multi-objective optimization
  • ase - Atomic simulation
  • sympy - Symbolic math
  • vaex - Large dataset handling
  • matplotlib - Visualization
  • scientific-writing - Report generation

Workflow:

Step 1: Generate candidate materials library
- Use Pymatgen to enumerate compositions:
  * Li-containing compounds (Li₁₋ₓM₁₊ₓX₂)
  * M = transition metals (Zr, Ti, Ta, Nb)
  * X = O, S, Se
- Generate ~10,000 candidate compositions
- Apply charge neutrality constraints

Step 2: Filter by thermodynamic stability
- Query Materials Project database via Pymatgen
- Calculate formation energy from elements
- Calculate energy above convex hull (E_hull)
- Filter: E_hull < 50 meV/atom (likely stable)
- Retain ~2,000 thermodynamically plausible compounds

Step 3: Predict crystal structures
- Use Pymatgen structure predictor
- Generate most likely crystal structures for each composition
- Consider common structure types: LISICON, NASICON, garnet, perovskite
- Calculate structural descriptors

Step 4: Calculate material properties with Pymatgen
- Lattice parameters and volume
- Density
- Packing fraction
- Ionic radii and bond lengths
- Coordination environments

Step 5: Feature engineering with matminer
- Calculate compositional features:
  * Elemental property statistics (electronegativity, ionic radius)
  * Valence electron concentrations
  * Stoichiometric attributes
- Calculate structural features:
  * Pore size distribution
  * Site disorder parameters
  * Partial radial distribution functions

Step 6: Build ML models for Li⁺ conductivity prediction
- Collect training data from literature (experimental conductivities)
- Train ensemble models with scikit-learn:
  * Random Forest
  * Gradient Boosting
  * Neural Network
- Use 5-fold cross-validation
- Predict ionic conductivity for all candidates

Step 7: Predict additional properties
- Electrochemical stability window (ML model)
- Mechanical properties (bulk modulus, shear modulus)
- Interfacial resistance (estimate from structure)
- Synthesis temperature (ML prediction from similar compounds)

Step 8: Multi-objective optimization with PyMOO
Define optimization objectives:
- Maximize: ionic conductivity (>10⁻³ S/cm target)
- Maximize: electrochemical window (>4.5V target)
- Minimize: synthesis temperature (<800°C preferred)
- Minimize: cost (based on elemental abundance)

- Run NSGA-II to find Pareto-optimal solutions (see the sketch below)
- Extract the top 50 candidates from the Pareto front
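
A minimal PyMOO sketch of the NSGA-II run; the decision variables and objective functions below are toy placeholders standing in for the ML property models from Steps 6-7 (PyMOO minimizes, so maximized objectives are negated):

```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import Problem
from pymoo.optimize import minimize

class ElectrolyteProblem(Problem):
    """Toy surrogate: four normalized composition descriptors, three objectives."""

    def __init__(self):
        super().__init__(n_var=4, n_obj=3, xl=0.0, xu=1.0)

    def _evaluate(self, x, out, *args, **kwargs):
        conductivity = -(0.8 * x[:, 0] + 0.2 * x[:, 1])   # maximize (negated)
        stability = -(5.0 * x[:, 2])                       # maximize (negated)
        synthesis_temp = 400.0 + 600.0 * x[:, 3]           # minimize
        out["F"] = np.column_stack([conductivity, stability, synthesis_temp])

res = minimize(ElectrolyteProblem(), NSGA2(pop_size=100), ("n_gen", 200), seed=1, verbose=False)
print(f"{len(res.F)} points on the Pareto front")
```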

Step 9: Analyze Pareto optimal materials
- Identify composition trends (which elements appear frequently)
- Analyze structure-property relationships
- Calculate trade-offs between objectives
- Identify "sweet spot" compositions

Step 10: Validate predictions with DFT calculations
- Select top 10 candidates for detailed study
- Set up DFT calculations (VASP-like, if available via ASE)
- Calculate:
  * Accurate formation energies
  * Li⁺ migration barriers (NEB calculations)
  * Electronic band gap
  * Elastic constants
- Compare DFT results with ML predictions

Step 11: Literature and patent search
- Search for prior art on top candidates
- PubMed and Google Scholar: "[composition] AND electrolyte"
- USPTO: Check for existing patents on similar compositions
- Identify any experimental reports on related materials

Step 12: Generate materials discovery report
- Summary of screening workflow and statistics
- Pareto front visualization (conductivity vs stability vs cost)
- Structure visualization of top candidates
- Property comparison table
- Composition-property trend analysis
- DFT validation results
- Predicted performance vs state-of-art materials
- Synthesis recommendations
- IP landscape summary
- Prioritized list of 5-10 materials for experimental validation
- Export as publication-ready PDF

Expected Output:
- Screened library of 10,000+ materials
- ML models for property prediction
- Pareto-optimal set of 50 candidates
- Detailed analysis of top 10 materials
- DFT validation results
- Comprehensive materials discovery report

Digital Pathology

Example 11: Automated Tumor Detection in Whole Slide Images

Objective: Develop and validate a deep learning model for automated tumor detection in histopathology images.

Skills Used:

  • histolab - Whole slide image processing
  • pathml - Computational pathology
  • pytorch-lightning - Deep learning
  • torchvision - Image models
  • scikit-learn - Model evaluation
  • pydicom - DICOM handling
  • omero-integration - Image management
  • matplotlib - Visualization
  • shap - Model interpretability

Workflow:

Step 1: Load whole slide images with HistoLab
- Load WSI files (SVS, TIFF formats)
- Extract slide metadata and magnification levels
- Visualize slide thumbnails
- Inspect tissue area vs background

Step 2: Tile extraction and preprocessing
- Use HistoLab to extract tiles (256×256 pixels at 20× magnification)
- Filter tiles:
  * Keep tiles with >80% tissue coverage (discard background)
  * Apply color normalization (Macenko or Reinhard method)
  * Filter out artifacts and bubbles
- Extract tiles from all slides (on the order of 100,000 tiles per slide)

Step 3: Create annotations (if training from scratch)
- Load pathologist annotations (if available via OMERO)
- Convert annotations to tile-level labels
- Categories: tumor, stroma, necrosis, normal
- Balance classes through stratified sampling

Step 4: Set up PathML pipeline
- Create PathML SlideData objects
- Define preprocessing pipeline:
  * Stain normalization
  * Color augmentation (HSV jitter)
  * Rotation and flipping
- Split data: 70% train, 15% validation, 15% test

Step 5: Build deep learning model with PyTorch Lightning
- Architecture: ResNet50 or EfficientNet backbone
- Add custom classification head for tissue types
- Define training pipeline:
  * Loss function: Cross-entropy or Focal loss
  * Optimizer: Adam with learning rate scheduling
  * Augmentations: rotation, flip, color jitter, elastic deformation
  * Batch size: 32
  * Mixed precision training
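
A PyTorch Lightning sketch of the Step 5 model, assuming tile batches of (image, label) tensors coming from the PathML pipeline; the pretrained-weights tag and precision flag follow recent torchvision/Lightning releases and may need adjusting for older versions:

```python
import torch
import torch.nn.functional as F
import torchvision
import pytorch_lightning as pl

class TileClassifier(pl.LightningModule):
    """ResNet50 backbone with a 4-class head (tumor, stroma, necrosis, normal)."""

    def __init__(self, n_classes: int = 4, lr: float = 1e-4):
        super().__init__()
        self.save_hyperparameters()
        self.backbone = torchvision.models.resnet50(weights="IMAGENET1K_V2")
        self.backbone.fc = torch.nn.Linear(self.backbone.fc.in_features, n_classes)

    def forward(self, x):
        return self.backbone(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.cross_entropy(self(images), labels)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        logits = self(images)
        self.log("val_loss", F.cross_entropy(logits, labels))
        self.log("val_acc", (logits.argmax(dim=1) == labels).float().mean())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=50)
        return [optimizer], [scheduler]

# trainer = pl.Trainer(max_epochs=50, precision="16-mixed")  # mixed precision training
# trainer.fit(TileClassifier(), train_loader, val_loader)    # loaders from the PathML pipeline
```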

Step 6: Train model
- Train on tile-level labels
- Monitor metrics: accuracy, F1 score, AUC
- Use early stopping on validation loss
- Save best model checkpoint
- Training time: ~6-12 hours on GPU

Step 7: Evaluate model performance
- Test on held-out test set
- Calculate metrics with scikit-learn:
  * Accuracy, precision, recall, F1 per class
  * Confusion matrix
  * ROC curves and AUC
- Compute confidence intervals with bootstrapping

Step 8: Slide-level aggregation
- Apply model to all tiles in each test slide
- Aggregate predictions:
  * Majority voting
  * Weighted average by confidence
  * Spatial smoothing with convolution
- Generate probability heatmaps overlaid on WSI

Step 9: Model interpretability with SHAP
- Apply GradCAM or SHAP to explain predictions
- Visualize which regions contribute to tumor classification
- Generate attention maps showing model focus
- Validate that model attends to relevant histological features

Step 10: Clinical validation
- Compare model predictions with pathologist diagnosis
- Calculate inter-rater agreement (kappa score)
- Identify discordant cases for review
- Analyze error types: false positives, false negatives

Step 11: Integration with OMERO
- Upload processed slides and heatmaps to OMERO server
- Attach model predictions as slide metadata
- Enable pathologist review interface
- Store annotations and corrections for model retraining

Step 12: Generate clinical validation report
- Model architecture and training details
- Performance metrics with confidence intervals
- Slide-level accuracy vs pathologist ground truth
- Heatmap visualizations for representative cases
- Analysis of failure modes
- Comparison with published methods
- Discussion of clinical applicability
- Recommendations for deployment and monitoring
- Export PDF report for regulatory submission (if needed)

Expected Output:
- Trained deep learning model for tumor detection
- Tile-level and slide-level predictions
- Probability heatmaps for visualization
- Performance metrics and validation results
- Model interpretation visualizations
- Clinical validation report

Lab Automation & Protocol Design

Example 12: Automated High-Throughput Screening Protocol

Objective: Design and execute an automated compound screening workflow using liquid handling robots.

Skills Used:

  • pylabrobot - Lab automation
  • opentrons-integration - Opentrons protocol
  • benchling-integration - Sample tracking
  • protocolsio-integration - Protocol documentation
  • simpy - Process simulation
  • polars - Data processing
  • matplotlib - Plate visualization
  • reportlab - Report generation

Workflow:

Step 1: Define screening campaign in Benchling
- Create compound library in Benchling registry
- Register all compounds with structure, concentration, location
- Define plate layouts (384-well format)
- Track compound source plates in inventory
- Set up ELN entry for campaign documentation

Step 2: Design assay protocol
- Define assay steps:
  * Dispense cells (5000 cells/well)
  * Add compounds (dose-response curve, 10 concentrations)
  * Incubate 48 hours at 37°C
  * Add detection reagent (cell viability assay)
  * Read luminescence signal
- Calculate required reagent volumes
- Document protocol in Protocols.io
- Share with team for review

Step 3: Simulate workflow with SimPy
- Model liquid handler, incubator, plate reader as resources
- Simulate timing for 20 plates (7,680 wells)
- Identify bottlenecks (plate reader reads take 5 min/plate)
- Optimize scheduling: stagger plate processing
- Validate that throughput goal is achievable (20 plates/day)

Step 4: Design plate layout
- Use PyLabRobot to generate plate maps:
  * Columns 1-2: positive controls (DMSO)
  * Columns 3-22: compound titrations (10 concentrations in duplicate)
  * Columns 23-24: negative controls (cytotoxic control)
- Randomize compound positions across plates
- Account for edge effects (avoid outer wells for samples)
- Export plate maps to CSV

Step 5: Create Opentrons protocol for cell seeding
- Write Python protocol using Opentrons API 2.0
- Steps:
  * Aspirate cells from reservoir
  * Dispense 40 μL cell suspension per well
  * Tips: use P300 multi-channel for speed
  * Include mixing steps to prevent settling
- Simulate protocol in Opentrons app
- Test on one plate before full run
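
A sketch of the cell-seeding protocol in the Opentrons Python API 2 style; the labware definitions, deck slots, and the two-pass pattern for filling interleaved 384-well rows with an 8-channel pipette are assumptions to adapt to the actual deck layout:

```python
from opentrons import protocol_api

metadata = {"protocolName": "Cell seeding - 384-well assay plates", "apiLevel": "2.13"}

def run(protocol: protocol_api.ProtocolContext):
    plate = protocol.load_labware("corning_384_wellplate_112ul_flat", "1")
    reservoir = protocol.load_labware("nest_12_reservoir_15ml", "2")
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", "3")
    p300 = protocol.load_instrument("p300_multi_gen2", "left", tip_racks=[tips])

    p300.pick_up_tip()
    for column in plate.columns():
        # An 8-channel head addresses every other row of a 384-well plate, so each
        # column needs two dispenses (rows A/C/E... via well 0, rows B/D/F... via well 1)
        for target in (column[0], column[1]):
            p300.mix(2, 150, reservoir["A1"])   # keep cells suspended
            p300.aspirate(40, reservoir["A1"])
            p300.dispense(40, target)
    p300.drop_tip()
```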

Step 6: Create Opentrons protocol for compound addition
- Acoustic liquid handler (Echo) or pin tool for nanoliter transfers
- If using Opentrons:
  * Source: 384-well compound plates
  * Transfer 100 nL compound (in DMSO) to assay plates
  * Use the P20 for low-volume steps (100 nL is below its 1 μL minimum, so prepare an intermediate dilution or use the acoustic handler for the final transfer)
  * Prepare serial dilutions on deck if needed
- Account for DMSO normalization (1% final)

Step 7: Integrate with Benchling for sample tracking
- Use Benchling API to:
  * Retrieve compound information (structure, batch, concentration)
  * Log plate creation in inventory
  * Create transfer records for audit trail
  * Link assay plates to ELN entry

Step 8: Execute automated workflow
- Day 1: Seed cells with Opentrons
- Day 1 (4h later): Add compounds with Opentrons
- Day 3: Add detection reagent (manual or automated)
- Day 3 (2h later): Read plates on plate reader
- Store plates at 4°C between steps

Step 9: Collect and process data
- Export raw luminescence data from plate reader
- Load data with Polars for fast processing
- Normalize data:
  * Subtract background (media-only wells)
  * Calculate % viability relative to DMSO control
  * Apply plate-wise normalization to correct systematic effects
- Quality control:
  * Z' factor calculation (> 0.5 for acceptable assay)
  * Coefficient of variation for controls (< 10%)
  * Flag plates with poor QC metrics

Step 10: Dose-response curve fitting
- Fit 4-parameter logistic curves for each compound
- Calculate IC50, Hill slope, max/min response
- Use scikit-learn or scipy for curve fitting
- Compute 95% confidence intervals
- Flag compounds with poor curve fits (R² < 0.8)
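
A SciPy sketch of the four-parameter logistic fit in Step 10, using placeholder dose-response values; IC50, Hill slope, and R² fall out of the fitted parameters and residuals:

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, ic50, hill):
    """Four-parameter logistic dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

# Placeholder data: concentration in uM, % viability relative to DMSO control
conc = np.array([0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300], dtype=float)
viability = np.array([99, 98, 95, 90, 75, 55, 35, 20, 12, 10], dtype=float)

p0 = [viability.min(), viability.max(), 1.0, 1.0]  # initial guesses
params, cov = curve_fit(four_pl, conc, viability, p0=p0, maxfev=10000)
bottom, top, ic50, hill = params

residuals = viability - four_pl(conc, *params)
r_squared = 1 - np.sum(residuals**2) / np.sum((viability - viability.mean()) ** 2)
print(f"IC50 = {ic50:.2f} uM, Hill = {hill:.2f}, R² = {r_squared:.3f}")
```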

Step 11: Hit identification and triage
- Define hit criteria:
  * IC50 < 10 μM
  * Max inhibition > 50%
  * Curve quality: R² > 0.8
- Prioritize hits by potency
- Check for PAINS patterns with RDKit
- Cross-reference with known aggregators/frequent hitters

Step 12: Visualize results and generate report
- Create plate heatmaps showing % viability
- Dose-response curve plots for hits
- Scatter plot: potency vs max effect
- QC metric summary across plates
- Structure visualization of top 20 hits
- Generate campaign summary report:
  * Screening statistics (compounds tested, hit rate)
  * QC metrics and data quality assessment
  * Hit list with structures and IC50 values
  * Protocol documentation from Protocols.io
  * Raw data files and analysis code
  * Recommendations for confirmation assays
- Update Benchling ELN with results
- Export PDF report for stakeholders

Expected Output:
- Automated screening protocols (Opentrons Python files)
- Executed screen of 384-well plates
- Quality-controlled dose-response data
- Hit list with IC50 values
- Comprehensive screening report

Agricultural Genomics

Example 13: GWAS for Crop Yield Improvement

Objective: Identify genetic markers associated with drought tolerance and yield in a crop species.

Skills Used:

  • biopython - Sequence analysis
  • pysam - VCF processing
  • gwas-database - Public GWAS data
  • ensembl-database - Plant genomics
  • gene-database - Gene annotation
  • scanpy - Population structure (adapted for genetic data)
  • scikit-learn - PCA and clustering
  • statsmodels - Association testing
  • matplotlib - Manhattan plots
  • seaborn - Visualization

Workflow:

Step 1: Load and QC genotype data
- Load VCF file with pysam
- Filter variants:
  * Call rate > 95%
  * Minor allele frequency (MAF) > 5%
  * Hardy-Weinberg equilibrium p > 1e-6
- Convert to numeric genotype matrix (0, 1, 2)
- Retain ~500,000 SNPs after QC

Step 2: Assess population structure
- Calculate genetic relationship matrix
- Perform PCA with scikit-learn (use top 10 PCs)
- Visualize population structure (PC1 vs PC2)
- Identify distinct subpopulations or admixture
- Note: will use PCs as covariates in GWAS

Step 3: Load and process phenotype data
- Drought tolerance score (1-10 scale, measured under stress)
- Grain yield (kg/hectare)
- Days to flowering
- Plant height
- Quality control:
  * Remove outliers (> 3 SD from mean)
  * Transform if needed (log or rank-based for skewed traits)
  * Adjust for environmental covariates (field, year)

Step 4: Calculate kinship matrix
- Compute genetic relatedness matrix
- Account for population structure and relatedness
- Will use in mixed linear model to control for confounding

Step 5: Run genome-wide association study
- For each phenotype, test association with each SNP
- Use mixed linear model (MLM) in statsmodels:
  * Fixed effects: SNP genotype, PCs (top 10)
  * Random effects: kinship matrix
  * Genome-wide significance threshold: p < 5e-8
- Multiple testing correction: Bonferroni or FDR
- Calculate genomic inflation factor (λ) to check for inflation
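
A simplified sketch of the association scan in Step 5 using statsmodels: each SNP is tested with ordinary least squares using the top PCs as fixed covariates, and λ is computed from the median test statistic. A full mixed linear model with the kinship matrix as a random effect would normally come from a dedicated GWAS package; all data here are random placeholders:

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

rng = np.random.default_rng(0)
n_ind, n_snp = 300, 1000
genotypes = rng.integers(0, 3, size=(n_ind, n_snp)).astype(float)  # 0/1/2 coding
phenotype = rng.normal(size=n_ind)
pcs = rng.normal(size=(n_ind, 10))  # top 10 PCs from Step 2

pvalues = np.empty(n_snp)
for j in range(n_snp):
    X = sm.add_constant(np.column_stack([genotypes[:, j], pcs]))
    pvalues[j] = sm.OLS(phenotype, X).fit().pvalues[1]  # p-value for the SNP term

# Genomic inflation factor from the median chi-square statistic
chisq = chi2.isf(pvalues, df=1)
lambda_gc = np.median(chisq) / chi2.ppf(0.5, df=1)
hits = np.flatnonzero(pvalues < 5e-8)
print(f"lambda = {lambda_gc:.3f}, genome-wide significant SNPs: {len(hits)}")
```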

Step 6: Identify significant associations
- Extract SNPs passing significance threshold
- Determine lead SNPs (most significant in each locus)
- Define loci: extend ±500 kb around lead SNP
- Identify independent associations via conditional analysis

Step 7: Annotate significant loci
- Map SNPs to genes using Ensembl Plants API
- Identify genic vs intergenic SNPs
- For genic SNPs:
  * Determine consequence (missense, synonymous, intronic, UTR)
  * Extract gene names and descriptions
- Query NCBI Gene for gene function
- Prioritize genes with known roles in stress response or development

Step 8: Search GWAS Catalog for prior reports
- Query GWAS Catalog for similar traits in same or related species
- Check for replication of known loci
- Identify novel vs known associations

Step 9: Functional enrichment analysis
- Extract all genes within significant loci
- Perform GO enrichment analysis
- Test for enrichment in KEGG pathways
- Focus on pathways related to:
  * Drought stress response (ABA signaling, osmotic adjustment)
  * Photosynthesis and carbon fixation
  * Root development

Step 10: Estimate SNP heritability and genetic architecture
- Calculate variance explained by significant SNPs
- Estimate SNP-based heritability (proportion of variance explained)
- Assess genetic architecture: few large-effect vs many small-effect loci

Step 11: Build genomic prediction model
- Train genomic selection model with scikit-learn:
  * Ridge regression (GBLUP equivalent)
  * Elastic net
  * Random Forest
- Use all SNPs (not just significant ones)
- Cross-validate to predict breeding values
- Assess prediction accuracy

Step 12: Generate GWAS report
- Manhattan plots for each trait
- QQ plots to assess test calibration
- Regional association plots for significant loci
- Gene models overlaid on loci
- Table of significant SNPs with annotations
- Functional enrichment results
- Genomic prediction accuracy
- Biological interpretation:
  * Candidate genes for drought tolerance
  * Potential molecular mechanisms
  * Implications for breeding programs
- Recommendations:
  * SNPs to use for marker-assisted selection
  * Genes for functional validation
  * Crosses to generate mapping populations
- Export publication-quality PDF with all results

Expected Output:
- Significant SNP-trait associations
- Annotated candidate genes
- Functional enrichment analysis
- Genomic prediction models
- Comprehensive GWAS report
- Recommendations for breeding programs

Neuroscience & Brain Imaging

Example 14: Brain Connectivity Analysis from fMRI Data

Objective: Analyze resting-state fMRI data to identify altered brain connectivity patterns in disease.

Skills Used:

  • neurokit2 - Neurophysiological signal processing
  • nilearn (external) - Neuroimaging analysis
  • scikit-learn - Classification and clustering
  • networkx - Graph theory analysis
  • statsmodels - Statistical testing
  • torch_geometric - Graph neural networks
  • pymc - Bayesian modeling
  • matplotlib - Brain visualization
  • seaborn - Connectivity matrices

Workflow:

Step 1: Load and preprocess fMRI data
# Note: Use nilearn or similar for fMRI-specific preprocessing
- Load 4D fMRI images (BOLD signal)
- Preprocessing:
  * Motion correction (realignment)
  * Slice timing correction
  * Spatial normalization to MNI space
  * Smoothing (6mm FWHM Gaussian kernel)
  * Temporal filtering (0.01-0.1 Hz bandpass)
  * Nuisance regression (motion, CSF, white matter)

Step 2: Define brain regions (parcellation)
- Apply brain atlas (e.g., AAL, Schaefer 200-region atlas)
- Extract average time series for each region
- Result: 200 time series per subject (one per brain region)

Step 3: Signal cleaning with NeuroKit2
- Denoise time series
- Remove physiological artifacts
- Apply additional bandpass filtering if needed
- Identify and handle outlier time points

Step 4: Calculate functional connectivity
- Compute pairwise Pearson correlations between all regions
- Result: 200×200 connectivity matrix per subject
- Fisher z-transform correlations for group statistics
- Threshold weak connections (|r| < 0.2)

Step 5: Graph theory analysis with NetworkX
- Convert connectivity matrices to graphs
- Calculate global network metrics:
  * Clustering coefficient (local connectivity)
  * Path length (integration)
  * Small-worldness (balance of segregation and integration)
  * Modularity (community structure)
- Calculate node-level metrics:
  * Degree centrality
  * Betweenness centrality
  * Eigenvector centrality
  * Participation coefficient (inter-module connectivity)
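
A sketch covering Steps 4-5 for one subject: Pearson connectivity from region-wise time series, Fisher z-transform, thresholding, and a few NetworkX graph metrics. The time-series matrix is a random placeholder for the parcellated BOLD signals:

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

rng = np.random.default_rng(0)
timeseries = rng.normal(size=(200, 240))  # placeholder: 200 regions x 240 volumes

# Functional connectivity and Fisher z-transform (z values feed group statistics)
fc = np.corrcoef(timeseries)
np.fill_diagonal(fc, 0.0)
fc_z = np.arctanh(fc)

# Threshold weak connections (|r| < 0.2) and build a weighted, undirected graph
adj = np.where(np.abs(fc) >= 0.2, np.abs(fc), 0.0)
G = nx.from_numpy_array(adj)

clustering = nx.average_clustering(G, weight="weight")
betweenness = nx.betweenness_centrality(G)
communities = greedy_modularity_communities(G, weight="weight")
print(f"mean clustering = {clustering:.3f}, modules = {len(communities)}")
```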

Step 6: Statistical comparison between groups
- Compare patients vs healthy controls
- Use statsmodels for group comparisons:
  * Paired or unpaired t-tests for connectivity edges
  * FDR correction for multiple comparisons across all edges
  * Identify edges with significantly different connectivity
- Compare global and node-level network metrics
- Calculate effect sizes (Cohen's d)

Step 7: Identify altered subnetworks
- Threshold statistical maps (FDR < 0.05)
- Identify clusters of altered connectivity
- Map to functional brain networks:
  * Default mode network (DMN)
  * Salience network (SN)
  * Central executive network (CEN)
  * Sensorimotor network
- Visualize altered connections on brain surfaces

Step 8: Machine learning classification
- Train classifier to distinguish patients from controls
- Use scikit-learn Random Forest or SVM
- Features: connectivity values or network metrics
- Cross-validation (10-fold)
- Calculate accuracy, sensitivity, specificity, AUC
- Identify most discriminative features (connectivity edges)
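
A sketch of this classification step in scikit-learn; the random feature matrix stands in for vectorized connectivity edges or network metrics.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.randn(60, 500)                 # 60 subjects x 500 connectivity features
y = np.array([1] * 30 + [0] * 30)            # 1 = patient, 0 = control

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
auc = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold AUC: {auc.mean():.2f} +/- {auc.std():.2f}")

# Most discriminative features from a model refit on all data
clf.fit(X, y)
top_features = np.argsort(clf.feature_importances_)[::-1][:20]
```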

Step 9: Graph neural network analysis with Torch Geometric
- Build graph neural network (GCN or GAT)
- Input: connectivity matrices as adjacency matrices
- Train to predict diagnosis
- Extract learned representations
- Visualize latent space (UMAP)
- Interpret which brain regions are most important

Step 10: Bayesian network modeling with PyMC
- Build directed graphical model of brain networks
- Estimate effective connectivity (directional influence)
- Incorporate prior knowledge about anatomical connections
- Perform posterior inference
- Identify key driver regions in disease

Step 11: Clinical correlation analysis
- Correlate network metrics with clinical scores:
  * Symptom severity
  * Cognitive performance
  * Treatment response
- Use Spearman or Pearson correlation
- Identify brain-behavior relationships

Step 12: Generate comprehensive neuroimaging report
- Brain connectivity matrices (patients vs controls)
- Statistical comparison maps on brain surface
- Network metric comparison bar plots
- Graph visualizations (circular or force-directed layout)
- Machine learning ROC curves
- Brain-behavior correlation plots
- Clinical interpretation:
  * Which networks are disrupted?
  * Relationship to symptoms
  * Potential biomarker utility
- Recommendations:
  * Brain regions for therapeutic targeting (TMS, DBS)
  * Network metrics as treatment response predictors
- Export publication-ready PDF with brain visualizations

Expected Output:
- Functional connectivity matrices for all subjects
- Statistical maps of altered connectivity
- Graph theory metrics
- Machine learning classification model
- Brain-behavior correlations
- Comprehensive neuroimaging report

Environmental Microbiology

Example 15: Metagenomic Analysis of Environmental Samples

Objective: Characterize microbial community composition and functional potential from environmental DNA samples.

Skills Used:

  • biopython - Sequence processing
  • pysam - BAM file handling
  • ena-database - Sequence data
  • uniprot-database - Protein annotation
  • kegg-database - Pathway analysis
  • etetoolkit - Phylogenetic trees
  • scikit-bio - Microbial ecology
  • networkx - Co-occurrence networks
  • statsmodels - Diversity statistics
  • matplotlib - Visualization

Workflow:

Step 1: Load and QC metagenomic reads
- Load FASTQ files with BioPython
- Quality control with FastQC-equivalent:
  * Remove adapters and low-quality bases (Q < 20)
  * Filter short reads (< 50 bp)
  * Remove host contamination (if applicable)
- Subsample to even depth if comparing samples

Step 2: Taxonomic classification
- Use a Kraken2-like approach or query the ENA database
- Classify reads to taxonomic lineages
- Generate abundance table:
  * Rows: taxa (species or OTUs)
  * Columns: samples
  * Values: read counts or relative abundance
- Summarize at different levels: phylum, class, order, family, genus, species

Step 3: Calculate diversity metrics with scikit-bio
- Alpha diversity (within-sample):
  * Richness (number of species)
  * Shannon entropy
  * Simpson diversity
  * Chao1 estimated richness
- Beta diversity (between-sample):
  * Bray-Curtis dissimilarity
  * Weighted/unweighted UniFrac distance
  * Jaccard distance
- Rarefaction curves to assess sampling completeness
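
A sketch of the scikit-bio diversity calculations on a random samples-by-taxa count table; UniFrac additionally requires a phylogenetic tree and is omitted here.

```python
import numpy as np
from skbio.diversity import alpha_diversity, beta_diversity

# Hypothetical abundance table: 6 samples x 120 taxa (integer read counts)
counts = np.random.default_rng(0).integers(0, 500, size=(6, 120))
ids = [f"sample_{i}" for i in range(6)]

shannon = alpha_diversity("shannon", counts, ids=ids)
simpson = alpha_diversity("simpson", counts, ids=ids)
chao1 = alpha_diversity("chao1", counts, ids=ids)

bray_curtis = beta_diversity("braycurtis", counts, ids=ids)   # skbio DistanceMatrix
print(shannon.round(2))
print(bray_curtis.to_data_frame().round(2))
```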

Step 4: Statistical comparison of communities
- Compare diversity between groups (e.g., polluted vs pristine)
- Use scipy/statsmodels and scikit-bio for:
  * Mann-Whitney or Kruskal-Wallis tests on alpha diversity
  * PERMANOVA (adonis-style test) on beta diversity distance matrices
  * LEfSe-style differential abundance testing

- Identify taxa enriched or depleted in each condition

Step 5: Build phylogenetic tree with ETE Toolkit
- Extract 16S rRNA sequences (or marker genes)
- Align sequences (MUSCLE/MAFFT equivalent)
- Build phylogenetic tree (neighbor-joining or maximum likelihood)
- Visualize tree colored by sample or environment
- Root tree with outgroup

Step 6: Co-occurrence network analysis
- Calculate pairwise correlations between taxa
- Use Spearman correlation to identify co-occurrence patterns
- Filter significant correlations (p < 0.01, |r| > 0.6)
- Build co-occurrence network with NetworkX
- Identify modules (communities of co-occurring taxa)
- Calculate network topology metrics
- Visualize network (nodes = taxa, edges = correlations)
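
A sketch of the co-occurrence network construction; the abundance table and taxon names are synthetic placeholders.

```python
import numpy as np
import networkx as nx
from networkx.algorithms import community
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
abundance = rng.random((30, 50))                     # 30 samples x 50 taxa
taxa = [f"taxon_{i}" for i in range(abundance.shape[1])]

rho, pval = spearmanr(abundance)                     # pairwise correlations between taxa columns

G = nx.Graph()
G.add_nodes_from(taxa)
for i in range(len(taxa)):
    for j in range(i + 1, len(taxa)):
        if pval[i, j] < 0.01 and abs(rho[i, j]) > 0.6:
            G.add_edge(taxa[i], taxa[j], weight=rho[i, j])

modules = community.greedy_modularity_communities(G) if G.number_of_edges() else []
print(f"{G.number_of_nodes()} taxa, {G.number_of_edges()} significant co-occurrence edges")
```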

Step 7: Functional annotation
- Assemble contigs from reads (if performing assembly)
- Predict genes with Prodigal-like tools
- Annotate genes using UniProt and KEGG
- Map proteins to KEGG pathways
- Generate functional profile:
  * Abundance of metabolic pathways
  * Key enzymes (nitrification, denitrification, methanogenesis)
  * Antibiotic resistance genes
  * Virulence factors

Step 8: Functional diversity analysis
- Compare functional profiles between samples
- Calculate pathway richness and evenness
- Identify enriched pathways with statistical testing
- Link taxonomy to function:
  * Which taxa contribute to which functions?
  * Use shotgun data to assign functions to taxa

Step 9: Search ENA for related environmental samples
- Query ENA for metagenomic studies from similar environments
- Download and compare to own samples
- Place samples in context of global microbiome diversity
- Identify unique vs ubiquitous taxa

Step 10: Environmental parameter correlation
- Correlate community composition with metadata:
  * Temperature, pH, salinity
  * Nutrient concentrations (N, P)
  * Pollutant levels (heavy metals, hydrocarbons)
- Use Mantel test to correlate distance matrices
- Identify environmental drivers of community structure

Step 11: Biomarker discovery
- Identify taxa or pathways that correlate with environmental condition
- Use Random Forest to find predictive features
- Validate biomarkers:
  * Sensitivity and specificity
  * Cross-validation across samples
- Propose taxa as bioindicators of environmental health

Step 12: Generate environmental microbiome report
- Taxonomic composition bar charts (stacked by phylum/class)
- Alpha and beta diversity plots (boxplots, PCoA)
- Phylogenetic tree with environmental context
- Co-occurrence network visualization
- Functional pathway heatmaps
- Environmental correlation plots
- Statistical comparison tables
- Biological interpretation:
  * Dominant taxa and their ecological roles
  * Functional potential of the community
  * Environmental factors shaping the microbiome
  * Biomarker taxa for monitoring
- Recommendations:
  * Biomarkers for environmental monitoring
  * Functional guilds for restoration
  * Further sampling or sequencing strategies
- Export comprehensive PDF report

Expected Output:
- Taxonomic profiles for all samples
- Diversity metrics and statistical comparisons
- Phylogenetic tree
- Co-occurrence network
- Functional annotation and pathway analysis
- Comprehensive microbiome report

Infectious Disease Research

Example 16: Antimicrobial Resistance Surveillance and Prediction

Objective: Track antimicrobial resistance trends and predict resistance phenotypes from genomic data.

Skills Used:

  • biopython - Sequence analysis
  • pysam - Genome assembly analysis
  • ena-database - Public genomic data
  • uniprot-database - Resistance protein annotation
  • gene-database - Resistance gene catalogs
  • etetoolkit - Phylogenetic analysis
  • scikit-learn - Resistance prediction
  • networkx - Transmission networks
  • statsmodels - Trend analysis
  • matplotlib - Epidemiological plots

Workflow:

Step 1: Collect bacterial genome sequences
- Isolates from hospital surveillance program
- Load FASTA assemblies with BioPython
- Basic QC:
  * Assess assembly quality (N50, completeness)
  * Estimate genome size and coverage
  * Remove contaminated assemblies

Step 2: Species identification and MLST typing
- Perform in silico MLST (multi-locus sequence typing)
- Extract housekeeping gene sequences
- Assign sequence types (ST)
- Classify isolates into clonal complexes
- Identify high-risk clones (e.g., ST131 E. coli, ST258 K. pneumoniae)

Step 3: Antimicrobial resistance (AMR) gene detection
- Query NCBI Gene and UniProt for AMR gene databases
- Screen assemblies for resistance genes:
  * Beta-lactamases (blaTEM, blaCTX-M, blaKPC, blaNDM)
  * Aminoglycoside resistance (aac, aph, ant)
  * Fluoroquinolone resistance (gyrA, parC mutations)
  * Colistin resistance (mcr-1 to mcr-10)
  * Efflux pumps
- Calculate gene presence/absence matrix

Step 4: Resistance mechanism annotation
- Map detected genes to resistance classes:
  * Enzymatic modification (e.g., beta-lactamases)
  * Target modification (e.g., ribosomal methylation)
  * Target mutation (e.g., fluoroquinolone resistance)
  * Efflux pumps
- Query UniProt for detailed mechanism descriptions
- Link genes to antibiotic classes affected

Step 5: Build phylogenetic tree with ETE Toolkit
- Extract core genome SNPs
- Concatenate SNP alignments
- Build maximum likelihood tree
- Root with outgroup or midpoint rooting
- Annotate tree with:
  * Resistance profiles
  * Sequence types
  * Collection date and location

Step 6: Genotype-phenotype correlation
- Match genomic data with phenotypic susceptibility testing
- For each antibiotic, correlate:
  * Presence of resistance genes with MIC values
  * Target mutations with resistance phenotype
- Calculate sensitivity/specificity of genetic markers
- Identify discordant cases (false positives/negatives)

Step 7: Machine learning resistance prediction
- Train classification models with scikit-learn:
  * Features: presence/absence of resistance genes + mutations
  * Target: resistance phenotype (susceptible/intermediate/resistant)
  * Models: Logistic Regression, Random Forest, Gradient Boosting
- Train separate models for each antibiotic
- Cross-validate (stratified 5-fold)
- Calculate accuracy, precision, recall, F1 score
- Feature importance: which genes are most predictive?
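
A sketch of one per-antibiotic resistance model; the marker matrix and phenotypes below are random placeholders for the real genotype and susceptibility data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Hypothetical input: presence/absence of 40 resistance markers across 200 isolates,
# with the measured phenotype for one antibiotic (1 = resistant, 0 = susceptible)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(200, 40)),
                 columns=[f"marker_{i}" for i in range(40)])
y = rng.integers(0, 2, size=200)

clf = GradientBoostingClassifier(random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(clf, X, y, cv=cv, scoring=["accuracy", "precision", "recall", "f1"])
print({k: round(v.mean(), 3) for k, v in scores.items() if k.startswith("test_")})

# Which markers drive the prediction for this antibiotic?
clf.fit(X, y)
importance = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(10))
```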

Step 8: Temporal trend analysis
- Track resistance rates over time
- Use statsmodels for:
  * Mann-Kendall trend test
  * Joinpoint regression (identify change points)
  * Forecast future resistance rates (ARIMA)
- Analyze trends for each antibiotic class
- Identify emerging resistance mechanisms

Step 9: Transmission network inference
- Identify closely related isolates (< 10 SNPs difference)
- Build transmission network with NetworkX:
  * Nodes: isolates
  * Edges: putative transmission links
- Incorporate temporal and spatial data
- Identify outbreak clusters
- Detect super-spreaders (high degree nodes)
- Analyze network topology
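
A sketch of the transmission-network step; the SNP distance matrix is synthetic and stands in for pairwise core-genome distances.

```python
import numpy as np
import networkx as nx

# Hypothetical pairwise core-genome SNP distances between 50 isolates (symmetric matrix)
rng = np.random.default_rng(2)
d = rng.integers(0, 200, size=(50, 50))
snp_dist = np.triu(d, 1) + np.triu(d, 1).T
isolates = [f"iso_{i:03d}" for i in range(50)]

G = nx.Graph()
G.add_nodes_from(isolates)
threshold = 10                                        # putative transmission link if < 10 SNPs
for i in range(len(isolates)):
    for j in range(i + 1, len(isolates)):
        if snp_dist[i, j] < threshold:
            G.add_edge(isolates[i], isolates[j], snps=int(snp_dist[i, j]))

clusters = [c for c in nx.connected_components(G) if len(c) > 1]      # outbreak clusters
top_degree = sorted(G.degree, key=lambda kv: kv[1], reverse=True)[:5]  # possible super-spreaders
print(f"{len(clusters)} clusters; highest-degree isolates: {top_degree}")
```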

Step 10: Search ENA for global context
- Query ENA for same species from other regions/countries
- Download representative genomes
- Integrate into phylogenetic analysis
- Assess whether local isolates are globally distributed clones
- Identify region-specific vs international resistance genes

Step 11: Plasmid and mobile element analysis
- Identify plasmid contigs
- Detect insertion sequences and transposons
- Track mobile genetic elements carrying resistance genes
- Identify conjugative plasmids facilitating horizontal gene transfer
- Build plasmid similarity networks

Step 12: Generate AMR surveillance report
- Summary statistics:
  * Number of isolates by species, ST, location
  * Resistance rates for each antibiotic
- Phylogenetic tree annotated with resistance profiles
- Temporal trend plots (resistance % over time)
- Transmission network visualizations
- Prediction model performance metrics
- Heatmap: resistance genes by isolate
- Geographic distribution map (if spatial data available)
- Interpretation:
  * Predominant resistance mechanisms
  * High-risk clones circulating
  * Temporal trends and emerging threats
  * Transmission clusters and outbreaks
- Recommendations:
  * Infection control measures for clusters
  * Antibiotic stewardship priorities
  * Resistance genes to monitor
  * Laboratories to perform confirmatory testing
- Export comprehensive PDF for public health reporting

Expected Output:
- AMR gene profiles for all isolates
- Phylogenetic tree with resistance annotations
- Temporal trends in resistance rates
- ML models for resistance prediction from genomes
- Transmission networks
- Comprehensive AMR surveillance report for public health

Multi-Omics Integration

Example 17: Integrative Analysis of Cancer Multi-Omics Data

Objective: Integrate genomics, transcriptomics, proteomics, and clinical data to identify cancer subtypes and therapeutic strategies.

Skills Used:

  • pydeseq2 - RNA-seq DE analysis
  • pysam - Variant calling
  • ensembl-database - Gene annotation
  • cosmic-database - Cancer mutations
  • string-database - Protein interactions
  • reactome-database - Pathway analysis
  • opentargets-database - Drug targets
  • scikit-learn - Clustering and classification
  • torch_geometric - Graph neural networks
  • umap-learn - Dimensionality reduction
  • statsmodels - Survival analysis
  • pymoo - Multi-objective optimization

Workflow:

Step 1: Load and preprocess genomic data (WES/WGS)
- Parse VCF files with pysam
- Filter high-quality variants (QUAL > 30, DP > 20)
- Annotate with Ensembl VEP (missense, nonsense, frameshift)
- Query COSMIC for known cancer mutations
- Create mutation matrix: samples × genes (binary: mutated or not)
- Focus on cancer genes from COSMIC Cancer Gene Census

Step 2: Process transcriptomic data (RNA-seq)
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare tumor vs normal (if paired samples available)
- Normalize counts (TPM or FPKM)
- Identify highly variable genes
- Create expression matrix: samples × genes (log2 TPM)

Step 3: Load proteomic data (Mass spec)
- Protein abundance matrix from LC-MS/MS
- Normalize protein abundances (median normalization)
- Log2-transform
- Filter proteins detected in < 50% of samples
- Create protein matrix: samples × proteins

Step 4: Load clinical data
- Demographics: age, sex, race
- Tumor characteristics: stage, grade, histology
- Treatment: surgery, chemo, radiation, targeted therapy
- Outcome: overall survival (OS), progression-free survival (PFS)
- Response: complete/partial response, stable/progressive disease

Step 5: Data integration and harmonization
- Match sample IDs across omics layers
- Ensure consistent gene/protein identifiers
- Handle missing data:
  * Impute with KNN or median (for moderate missingness)
  * Remove features with > 50% missing
- Create multi-omics data structure (dictionary of matrices)

Step 6: Multi-omics dimensionality reduction
- Concatenate all omics features (genes + proteins + mutations)
- Apply UMAP with umap-learn for visualization
- Alternative: PCA or t-SNE
- Visualize samples in 2D space colored by:
  * Histological subtype
  * Stage
  * Survival (high vs low)
- Identify patterns or clusters
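
A sketch of the UMAP embedding with umap-learn; the concatenated multi-omics matrix is a random placeholder.

```python
import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

# Hypothetical concatenated multi-omics matrix: 150 samples x 3000 features
X = np.random.default_rng(0).normal(size=(150, 3000))

X_scaled = StandardScaler().fit_transform(X)
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X_scaled)
print(embedding.shape)   # (150, 2) coordinates to plot, colored by subtype, stage, or survival
```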

Step 7: Unsupervised clustering to identify subtypes
- Perform consensus clustering with scikit-learn
- Test k = 2 to 10 clusters
- Evaluate cluster stability and optimal k
- Assign samples to clusters (subtypes)
- Visualize clustering in UMAP space

Step 8: Characterize molecular subtypes
For each subtype:
- Differential expression analysis:
  * Compare subtype vs all others with PyDESeq2
  * Extract top differentially expressed genes and proteins
- Mutation enrichment:
  * Fisher's exact test for each gene
  * Identify subtype-specific mutations
- Pathway enrichment:
  * Query Reactome for enriched pathways
  * Query KEGG for metabolic pathway differences
  * Identify hallmark biological processes

Step 9: Build protein-protein interaction networks
- Query STRING database for interactions among:
  * Differentially expressed proteins
  * Products of mutated genes
- Construct PPI network with NetworkX
- Identify network modules (community detection)
- Calculate centrality metrics to find hub proteins
- Overlay fold changes on network for visualization

Step 10: Survival analysis by subtype
- Use statsmodels or lifelines for survival analysis
- Kaplan-Meier curves for each subtype
- Log-rank test for significance
- Cox proportional hazards model:
  * Covariates: subtype, stage, age, treatment
  * Estimate hazard ratios
- Identify prognostic subtypes
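
A sketch of the survival analysis with lifelines on a synthetic clinical table; column names and values are illustrative only.

```python
import numpy as np
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import multivariate_logrank_test

# Hypothetical clinical table: one row per patient
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "os_months": rng.exponential(30, 150),
    "event": rng.integers(0, 2, 150),          # 1 = death observed, 0 = censored
    "subtype": rng.integers(1, 4, 150),
    "age": rng.normal(62, 8, 150),
    "stage": rng.integers(1, 5, 150),
})

kmf = KaplanMeierFitter()
for subtype, grp in df.groupby("subtype"):
    kmf.fit(grp["os_months"], grp["event"], label=f"subtype {subtype}")
    kmf.plot_survival_function()

logrank = multivariate_logrank_test(df["os_months"], df["subtype"], df["event"])
print(f"log-rank p = {logrank.p_value:.3g}")

cph = CoxPHFitter()
cph.fit(df[["os_months", "event", "age", "stage"]],
        duration_col="os_months", event_col="event")
cph.print_summary()                            # hazard ratios with confidence intervals
```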

Step 11: Predict therapeutic response
- Train machine learning models with scikit-learn:
  * Features: multi-omics data
  * Target: response to specific therapy (responder/non-responder)
  * Models: Random Forest, XGBoost, SVM
- Cross-validation to assess performance
- Identify features predictive of response
- Calculate AUC and feature importance

Step 12: Graph neural network for integrated prediction
- Build heterogeneous graph with Torch Geometric:
  * Nodes: samples, genes, proteins, pathways
  * Edges: gene-protein, protein-protein, gene-pathway
  * Node features: expression, mutation status
- Train GNN to predict:
  * Subtype classification
  * Survival risk
  * Treatment response
- Extract learned embeddings for interpretation

Step 13: Identify therapeutic targets with Open Targets
- For each subtype, query Open Targets:
  * Input: upregulated genes/proteins
  * Extract target-disease associations
  * Prioritize by tractability score
- Search for FDA-approved drugs targeting identified proteins
- Identify clinical trials for relevant targets
- Propose subtype-specific therapeutic strategies

Step 14: Multi-objective optimization of treatment strategies
- Use PyMOO to optimize treatment selection:
  * Objectives:
    1. Maximize predicted response probability
    2. Minimize predicted toxicity
    3. Minimize cost
  * Constraints: patient eligibility, drug availability
- Generate Pareto-optimal treatment strategies
- Personalized treatment recommendations per patient

Step 15: Generate comprehensive multi-omics report
- Sample clustering and subtype assignments
- UMAP visualization colored by subtype, survival, mutations
- Subtype characterization:
  * Molecular signatures (genes, proteins, mutations)
  * Enriched pathways
  * PPI networks
- Kaplan-Meier survival curves by subtype
- ML model performance (AUC, confusion matrices)
- Feature importance plots
- Therapeutic target tables with supporting evidence
- Personalized treatment recommendations
- Clinical implications:
  * Prognostic biomarkers
  * Predictive biomarkers for therapy selection
  * Novel drug targets
- Export publication-quality PDF with all figures and tables

Expected Output:
- Integrated multi-omics dataset
- Cancer subtype classification
- Molecular characterization of subtypes
- Survival analysis and prognostic markers
- Predictive models for treatment response
- Therapeutic target identification
- Personalized treatment strategies
- Comprehensive integrative genomics report

Experimental Physics & Data Analysis

Example 18: Analysis of Particle Physics Detector Data

Objective: Analyze experimental data from particle detector to identify signal events and measure physical constants.

Skills Used:

  • astropy - Units and constants
  • sympy - Symbolic mathematics
  • scipy - Statistical analysis
  • scikit-learn - Classification
  • stable-baselines3 - Reinforcement learning for optimization
  • matplotlib - Visualization
  • seaborn - Statistical plots
  • statsmodels - Hypothesis testing
  • dask - Large-scale data processing
  • vaex - Out-of-core dataframes

Workflow:

Step 1: Load and inspect detector data
- Load ROOT files or HDF5 with raw detector signals
- Use Vaex for out-of-core processing (TBs of data)
- Inspect data structure: event IDs, timestamps, detector channels
- Extract key observables:
  * Energy deposits in calorimeters
  * Particle trajectories from tracking detectors
  * Time-of-flight measurements
  * Trigger information

Step 2: Apply detector calibration and corrections
- Load calibration constants
- Apply energy calibrations to convert ADC to physical units
- Correct for detector efficiency variations
- Apply geometric corrections (alignment)
- Use Astropy units for unit conversions (eV, GeV, MeV)
- Account for dead time and detector acceptance

Step 3: Event reconstruction
- Cluster energy deposits to form particle candidates
- Reconstruct particle trajectories (tracks)
- Match tracks to calorimeter clusters
- Calculate invariant masses for particle identification
- Compute momentum and energy for each particle
- Use Dask for parallel processing across events

Step 4: Event selection and filtering
- Define signal region based on physics hypothesis
- Apply quality cuts:
  * Track quality (chi-squared, number of hits)
  * Fiducial volume cuts
  * Timing cuts (beam window)
  * Particle identification cuts
- Estimate trigger efficiency
- Calculate event weights for corrections

Step 5: Background estimation
- Identify background sources:
  * Cosmic rays
  * Beam-related backgrounds
  * Detector noise
  * Physics backgrounds (non-signal processes)
- Simulate backgrounds using Monte Carlo (if available)
- Estimate background from data in control regions
- Use the sideband-subtraction method

Step 6: Signal extraction
- Fit invariant mass distributions to extract signal
- Use scipy for likelihood fitting:
  * Signal model: Gaussian or Breit-Wigner
  * Background model: polynomial or exponential
  * Combined fit with maximum likelihood
- Calculate signal significance (S/√B or Z-score)
- Estimate systematic uncertainties
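
A simplified sketch of the signal extraction: a least-squares fit of a Gaussian peak on an exponential background with scipy's curve_fit (a real analysis would use a binned maximum-likelihood fit); all numbers are synthetic.

```python
import numpy as np
from scipy.optimize import curve_fit

def model(m, n_sig, mu, sigma, n_bkg, slope):
    """Expected counts per 1-unit mass bin: Gaussian signal + exponential background."""
    signal = n_sig * np.exp(-0.5 * ((m - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return signal + n_bkg * np.exp(-slope * m)

# Synthetic invariant-mass spectrum with a peak near 91 (arbitrary mass units)
edges = np.linspace(60, 120, 61)
centers = 0.5 * (edges[:-1] + edges[1:])
rng = np.random.default_rng(0)
counts = rng.poisson(model(centers, 500, 91.2, 2.5, 50, 0.02))

popt, pcov = curve_fit(model, centers, counts, p0=[400, 91, 3, 40, 0.02],
                       sigma=np.sqrt(np.clip(counts, 1, None)), absolute_sigma=True)
n_sig, mu, sigma = popt[:3]
in_window = np.abs(centers - mu) < 2 * sigma
n_bkg = model(centers[in_window], 0, mu, sigma, popt[3], popt[4]).sum()
print(f"N_signal = {n_sig:.0f} +/- {np.sqrt(pcov[0, 0]):.0f}, "
      f"significance ~ {n_sig / np.sqrt(n_bkg):.1f} (S/sqrt(B))")
```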

Step 7: Machine learning event classification
- Train classifier with scikit-learn to separate signal from background
- Features: kinematic variables, topology, detector response
- Models: Boosted Decision Trees (XGBoost), Neural Networks
- Cross-validate with k-fold CV
- Optimize selection criteria using ROC curves
- Calculate signal efficiency and background rejection

Step 8: Reinforcement learning for trigger optimization
- Use Stable-Baselines3 to optimize trigger thresholds
- Environment: detector simulator
- Action: adjust trigger thresholds
- Reward: maximize signal efficiency while controlling rate
- Train PPO or SAC agent
- Validate on real data

Step 9: Calculate physical observables
- Measure cross-sections:
  * σ = N_signal / (ε × L × BR)
  * N_signal: number of signal events
  * ε: detection efficiency
  * L: integrated luminosity
  * BR: branching ratio
- Use Sympy for symbolic error propagation
- Calculate with Astropy for proper unit handling
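
A sketch of the symbolic error propagation for the cross-section formula above; the numerical values substituted at the end are purely illustrative.

```python
import sympy as sp

# Cross-section sigma = N_signal / (epsilon * L * BR), with first-order error propagation
N, eps, L, BR = sp.symbols("N_signal epsilon L BR", positive=True)
dN, deps, dL, dBR = sp.symbols("dN deps dL dBR", positive=True)

sigma = N / (eps * L * BR)
dsigma = sp.sqrt(sum((sp.diff(sigma, x) * dx) ** 2
                     for x, dx in [(N, dN), (eps, deps), (L, dL), (BR, dBR)]))

# Purely illustrative numbers (not a real measurement)
vals = {N: 1200, dN: 60, eps: 0.35, deps: 0.02, L: 10.0, dL: 0.25, BR: 0.033, dBR: 0.001}
print(sp.simplify(dsigma / sigma))                      # relative uncertainty, symbolic form
print(float(sigma.subs(vals)), "+/-", float(dsigma.subs(vals)), "(in units of 1/L)")
```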

Step 10: Statistical analysis and hypothesis testing
- Perform hypothesis tests with statsmodels:
  * Likelihood ratio test for signal vs background-only
  * Calculate p-values and significance levels
  * Set confidence limits (CLs method)
- Bayesian analysis for parameter estimation
- Calculate confidence intervals and error bands

Step 11: Systematic uncertainty evaluation
- Identify sources of systematic uncertainty:
  * Detector calibration uncertainties
  * Background estimation uncertainties
  * Theoretical uncertainties (cross-sections, PDFs)
  * Monte Carlo modeling uncertainties
- Propagate uncertainties through analysis chain
- Combine statistical and systematic uncertainties
- Present as error budget

Step 12: Create comprehensive physics report
- Event displays showing candidate signal events
- Kinematic distributions (momentum, energy, angles)
- Invariant mass plots with fitted signal
- ROC curves for ML classifiers
- Cross-section measurements with error bars
- Comparison with theoretical predictions
- Systematic uncertainty breakdown
- Statistical significance calculations
- Interpretation:
  * Consistency with Standard Model
  * Constraints on new physics parameters
  * Discovery potential or exclusion limits
- Recommendations:
  * Detector improvements
  * Additional data needed
  * Future analysis strategies
- Export publication-ready PDF formatted for physics journal

Expected Output:
- Reconstructed physics events
- Signal vs background classification
- Measured cross-sections and branching ratios
- Statistical significance of observations
- Systematic uncertainty analysis
- Comprehensive experimental physics paper

Chemical Engineering & Process Optimization

Example 19: Optimization of Chemical Reactor Design and Operation

Objective: Design and optimize a continuous chemical reactor for maximum yield and efficiency while meeting safety and economic constraints.

Skills Used:

  • sympy - Symbolic equations and reaction kinetics
  • scipy - Numerical integration and optimization
  • pymoo - Multi-objective optimization
  • simpy - Process simulation
  • pymc - Bayesian parameter estimation
  • scikit-learn - Process modeling
  • stable-baselines3 - Real-time control optimization
  • matplotlib - Process diagrams
  • reportlab - Engineering reports

Workflow:

Step 1: Define reaction system and kinetics
- Chemical reaction: A + B → C + D
- Use Sympy to define symbolic rate equations:
  * Arrhenius equation: k = A × exp(-Ea/RT)
  * Rate law: r = k × [A]^α × [B]^β
- Define material and energy balances symbolically
- Include equilibrium constants and thermodynamics
- Account for side reactions and byproducts

Step 2: Develop reactor model
- Select reactor type: CSTR, PFR, batch, or semi-batch
- Write conservation equations:
  * Mass balance: dC/dt = (F_in × C_in - F_out × C)/V + r
  * Energy balance: ρCp × V × dT/dt = Q + F × ρCp × (T_in - T) - ΔH_rxn × r × V
  * Momentum balance (pressure drop)
- Include heat transfer correlations
- Model mixing and mass transfer limitations
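
A sketch of a non-isothermal CSTR model integrated with scipy; every parameter value below is an illustrative placeholder, not a design recommendation.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Illustrative parameters for A + B -> C (placeholders only)
A_pre, Ea, R = 1.0e6, 5.5e4, 8.314        # 1/(M*s), J/mol, J/(mol*K)
V, F = 100.0, 1.0                         # reactor volume (L), volumetric flow (L/s)
CA_in, CB_in, T_in = 2.0, 2.0, 350.0      # feed concentrations (mol/L) and temperature (K)
rho_cp, dH = 4.0e3, -2.0e4                # J/(L*K), J/mol (exothermic)
UA, T_cool = 500.0, 300.0                 # heat-transfer term (W/K), coolant temperature (K)

def cstr(t, y):
    CA, CB, CC, T = y
    k = A_pre * np.exp(-Ea / (R * T))     # Arrhenius rate constant
    r = k * CA * CB                       # second-order rate law
    dCA = F / V * (CA_in - CA) - r        # mole balances
    dCB = F / V * (CB_in - CB) - r
    dCC = -F / V * CC + r
    dT = (F / V * (T_in - T)              # energy balance: flow, reaction heat, cooling
          + (-dH) * r / rho_cp
          + UA * (T_cool - T) / (rho_cp * V))
    return [dCA, dCB, dCC, dT]

sol = solve_ivp(cstr, (0.0, 2000.0), [CA_in, CB_in, 0.0, T_in], method="BDF")
print(f"steady state: C_C ~ {sol.y[2, -1]:.2f} mol/L, T ~ {sol.y[3, -1]:.1f} K")
```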

Step 3: Parameter estimation with PyMC
- Load experimental data from pilot reactor
- Bayesian inference to estimate kinetic parameters:
  * Pre-exponential factor (A)
  * Activation energy (Ea)
  * Reaction orders (α, β)
- Use MCMC sampling with PyMC
- Incorporate prior knowledge from literature
- Calculate posterior distributions and credible intervals
- Assess parameter uncertainty and correlation
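
A sketch of the Bayesian kinetic-parameter fit in PyMC, worked on the log scale with synthetic "pilot reactor" data and ArviZ for the posterior summary; priors, data, and parameter values are all illustrative.

```python
import numpy as np
import pymc as pm
import arviz as az

# Synthetic rate measurements at varying temperature and concentrations (illustrative)
rng = np.random.default_rng(0)
T = rng.uniform(320, 480, 40)                              # K
CA, CB = rng.uniform(0.5, 2.0, 40), rng.uniform(0.5, 2.0, 40)
rate = 1e6 * np.exp(-5.5e4 / (8.314 * T)) * CA * CB * rng.lognormal(0.0, 0.1, 40)

with pm.Model():
    lnA = pm.Normal("lnA", mu=14.0, sigma=3.0)             # ln pre-exponential factor
    Ea = pm.Normal("Ea", mu=6.0e4, sigma=2.0e4)            # activation energy (J/mol)
    alpha = pm.Normal("alpha", mu=1.0, sigma=0.5)          # order in A
    beta = pm.Normal("beta", mu=1.0, sigma=0.5)            # order in B
    noise = pm.HalfNormal("noise", sigma=0.3)

    mu = lnA - Ea / (8.314 * T) + alpha * np.log(CA) + beta * np.log(CB)
    pm.Normal("obs", mu=mu, sigma=noise, observed=np.log(rate))

    idata = pm.sample(1000, tune=1000, chains=2, random_seed=0)

print(az.summary(idata, var_names=["lnA", "Ea", "alpha", "beta"]))  # posteriors + credible intervals
```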

Step 4: Model validation
- Simulate reactor with estimated parameters using scipy.integrate
- Compare predictions with experimental data
- Calculate goodness of fit (R², RMSE)
- Perform sensitivity analysis:
  * Which parameters most affect yield?
  * Identify critical operating conditions
- Refine model if needed

Step 5: Machine learning surrogate model
- Train fast surrogate model with scikit-learn
- Generate training data from detailed model (1000+ runs)
- Features: T, P, residence time, feed composition, catalyst loading
- Target: yield, selectivity, conversion
- Models: Gaussian Process Regression, Random Forest
- Validate surrogate accuracy (R² > 0.95)
- Use for rapid optimization

Step 6: Single-objective optimization
- Maximize yield with scipy.optimize:
  * Decision variables: T, P, feed ratio, residence time
  * Objective: maximize Y = (moles C produced) / (moles A fed)
  * Constraints:
    - Temperature: 300 K ≤ T ≤ 500 K (safety)
    - Pressure: 1 bar ≤ P ≤ 50 bar (equipment limits)
    - Residence time: 1 min ≤ τ ≤ 60 min
    - Conversion: X_A ≥ 90%
- Use Sequential Least Squares Programming (SLSQP)
- Identify optimal operating point

Step 7: Multi-objective optimization with PyMOO
- Competing objectives:
  1. Maximize product yield
  2. Minimize energy consumption (heating/cooling)
  3. Minimize operating cost (raw materials, utilities)
  4. Maximize reactor productivity (throughput)
- Constraints:
  - Safety: temperature and pressure limits
  - Environmental: waste production limits
  - Economic: minimum profitability
- Run NSGA-II or NSGA-III
- Generate Pareto front of optimal solutions
- Select operating point based on preferences
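
A sketch of an NSGA-II setup in pymoo using two of the objectives above (yield vs. energy); the objective expressions are toy surrogates, not a real reactor model.

```python
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

class ReactorTradeoff(ElementwiseProblem):
    """Decision variables: x[0] = temperature (K), x[1] = residence time (min)."""
    def __init__(self):
        super().__init__(n_var=2, n_obj=2,
                         xl=np.array([300.0, 1.0]), xu=np.array([500.0, 60.0]))

    def _evaluate(self, x, out, *args, **kwargs):
        T, tau = x
        yield_est = 1.0 - np.exp(-1.0e-4 * T * tau)    # toy yield surrogate
        energy = 0.002 * (T - 300.0) * tau             # toy energy-cost surrogate
        out["F"] = [-yield_est, energy]                # negate yield: pymoo minimizes

res = minimize(ReactorTradeoff(), NSGA2(pop_size=60), ("n_gen", 100),
               seed=1, verbose=False)
pareto = np.column_stack([-res.F[:, 0], res.F[:, 1]])  # (yield, energy) on the Pareto front
print(f"{len(pareto)} Pareto-optimal operating points")
```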

Step 8: Dynamic process simulation with SimPy
- Model complete plant:
  * Reactors, separators, heat exchangers
  * Pumps, compressors, valves
  * Storage tanks and buffers
- Simulate startup, steady-state, and shutdown
- Include disturbances:
  * Feed composition variations
  * Equipment failures
  * Demand fluctuations
- Evaluate dynamic stability
- Calculate time to steady state

Step 9: Control system design
- Design feedback control loops:
  * Temperature control (PID controller)
  * Pressure control
  * Flow control
  * Level control
- Tune PID parameters using Ziegler-Nichols or optimization
- Implement cascade control for improved performance
- Add feedforward control for disturbance rejection

Step 10: Reinforcement learning for advanced control
- Use Stable-Baselines3 to train RL agent:
  * Environment: reactor simulation (SimPy-based)
  * State: T, P, concentrations, flow rates
  * Actions: adjust setpoints, flow rates, heating/cooling
  * Reward: product yield minus energy cost minus a penalty for setpoint deviation
- Train PPO or TD3 agent
- Compare with conventional PID control
- Evaluate performance under disturbances
- Implement model-free adaptive control

Step 11: Economic analysis
- Calculate capital costs (CAPEX):
  * Reactor vessel cost (function of size, pressure rating)
  * Heat exchanger costs
  * Pumps and instrumentation
  * Installation costs
- Calculate operating costs (OPEX):
  * Raw materials (A, B, catalyst)
  * Utilities (steam, cooling water, electricity)
  * Labor and maintenance
- Revenue from product sales
- Calculate economic metrics:
  * Net present value (NPV)
  * Internal rate of return (IRR)
  * Payback period
  * Levelized cost of production

Step 12: Safety analysis
- Identify hazards:
  * Exothermic runaway reactions
  * Pressure buildup
  * Toxic or flammable materials
- Perform HAZOP-style analysis
- Calculate safe operating limits:
  * Maximum temperature of synthesis reaction (MTSR)
  * Adiabatic temperature rise
  * Relief valve sizing
- Design emergency shutdown systems
- Implement safety interlocks

Step 13: Uncertainty quantification
- Propagate parameter uncertainties from PyMC:
  * How does kinetic parameter uncertainty affect yield?
  * Monte Carlo simulation with parameter distributions
- Evaluate robustness of optimal design
- Calculate confidence intervals on economic metrics
- Identify critical uncertainties for further study

Step 14: Generate comprehensive engineering report
- Executive summary of project objectives and results
- Process flow diagram (PFD) with material and energy streams
- Reaction kinetics and model equations
- Parameter estimation results with uncertainties
- Optimization results:
  * Pareto front for multi-objective optimization
  * Recommended operating conditions
  * Trade-off analysis
- Dynamic simulation results (startup curves, response to disturbances)
- Control system design and tuning
- Economic analysis with sensitivity to key assumptions
- Safety analysis and hazard mitigation
- Scale-up considerations:
  * Pilot to commercial scale
  * Heat and mass transfer limitations
  * Equipment sizing
- Recommendations:
  * Optimal reactor design (size, type, materials of construction)
  * Operating conditions for maximum profitability
  * Control strategy
  * Further experimental studies needed
- Technical drawings and P&ID (piping and instrumentation diagram)
- Export as professional engineering report (PDF)

Expected Output:
- Validated reactor model with parameter uncertainties
- Optimal reactor design and operating conditions
- Pareto-optimal solutions for multi-objective optimization
- Dynamic process simulation results
- Advanced control strategies (RL-based)
- Economic feasibility analysis
- Safety assessment
- Comprehensive chemical engineering design report

Summary

These examples demonstrate:

  1. Cross-domain applicability: Skills are useful across many scientific fields
  2. Skill integration: Complex workflows combine multiple databases, packages, and analysis methods
  3. Real-world relevance: Examples address actual research questions and clinical needs
  4. End-to-end workflows: From data acquisition to publication-ready reports
  5. Best practices: QC, statistical rigor, visualization, interpretation, and documentation

How to Use These Examples

  1. Adapt to your needs: Modify parameters, datasets, and objectives for your specific research question
  2. Combine skills creatively: Mix and match skills from different categories
  3. Follow the structure: Each example provides a clear step-by-step workflow
  4. Generate comprehensive output: Aim for publication-quality figures and professional reports
  5. Cite your sources: Always verify data and provide proper citations

Additional Notes

  • Always start with: "Always use available 'skills' when possible. Keep the output organized."
  • For complex projects, break into manageable steps and validate intermediate results
  • Save checkpoints and intermediate data files
  • Document parameters and decisions for reproducibility
  • Generate README files explaining methodology
  • Create PDFs for stakeholder communication

These examples showcase the power of combining the skills in this repository to tackle complex, real-world scientific challenges across multiple domains.