Real-World Scientific Examples
This document provides comprehensive, practical examples demonstrating how to combine Claude Scientific Skills to solve real scientific problems across multiple domains.
📋 Table of Contents
- Drug Discovery & Medicinal Chemistry
- Cancer Genomics & Precision Medicine
- Single-Cell Transcriptomics
- Protein Structure & Function
- Chemical Safety & Toxicology
- Clinical Trial Analysis
- Metabolomics & Systems Biology
- Materials Science & Chemistry
- Digital Pathology
- Lab Automation & Protocol Design
- Agricultural Genomics
- Neuroscience & Brain Imaging
- Environmental Microbiology
- Infectious Disease Research
- Multi-Omics Integration
- Computational Chemistry & Synthesis
- Clinical Research & Real-World Evidence
- Experimental Physics & Data Analysis
- Chemical Engineering & Process Optimization
- Scientific Illustration & Visual Communication
Drug Discovery & Medicinal Chemistry
Example 1: Discovery of Novel EGFR Inhibitors for Lung Cancer
Objective: Identify novel small molecule inhibitors of EGFR with improved properties compared to existing drugs.
Skills Used:
- chembl-database - Query bioactivity data
- pubchem-database - Search compound libraries
- rdkit - Analyze molecular properties
- datamol - Generate analogs
- diffdock - Molecular docking
- alphafold-database - Retrieve protein structure
- pubmed-database - Literature review
- cosmic-database - Query mutations
- deepchem - Property prediction
- scientific-visualization - Create figures
Workflow:
# Always use available 'skills' when possible. Keep the output organized.
Step 1: Query ChEMBL for known EGFR inhibitors with high potency
- Search for compounds targeting EGFR (CHEMBL203)
- Filter: IC50 < 50 nM, pChEMBL value > 7
- Extract SMILES strings and activity data
- Export to DataFrame for analysis
Step 2: Analyze structure-activity relationships
- Load compounds into RDKit
- Calculate molecular descriptors (MW, LogP, TPSA, HBD, HBA)
- Generate Morgan fingerprints (radius=2, 2048 bits)
- Perform hierarchical clustering to identify scaffolds
- Visualize top scaffolds with activity annotations
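To make Step 2 concrete, here is a minimal RDKit sketch of the descriptor, fingerprint, and clustering calculations; the SMILES list is a placeholder standing in for the ChEMBL export.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs, Descriptors
from scipy.cluster.hierarchy import fcluster, linkage

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "c1ccc(Nc2ncnc3ccccc23)cc1"]  # placeholder SMILES
mols = [Chem.MolFromSmiles(s) for s in smiles]

# Physicochemical descriptors: MW, LogP, TPSA, HBD, HBA
descriptors = [
    {
        "MW": Descriptors.MolWt(m),
        "LogP": Descriptors.MolLogP(m),
        "TPSA": Descriptors.TPSA(m),
        "HBD": Descriptors.NumHDonors(m),
        "HBA": Descriptors.NumHAcceptors(m),
    }
    for m in mols
]

# Morgan fingerprints (radius 2, 2048 bits) as a dense matrix
rows = []
for m in mols:
    fp = AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
    arr = np.zeros((2048,), dtype=np.int32)
    DataStructs.ConvertToNumpyArray(fp, arr)
    rows.append(arr)
fp_matrix = np.vstack(rows)

# Hierarchical clustering on Jaccard (Tanimoto) distance to group scaffolds
Z = linkage(fp_matrix, method="average", metric="jaccard")
clusters = fcluster(Z, t=0.6, criterion="distance")
```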
Step 3: Identify resistance mutations from COSMIC
- Query COSMIC for EGFR mutations in lung cancer
- Focus on gatekeeper mutations (T790M, C797S)
- Extract mutation frequencies and clinical significance
- Cross-reference with literature in PubMed
Step 4: Retrieve EGFR structure from AlphaFold
- Download AlphaFold prediction for EGFR kinase domain
- Alternatively, use experimental structure from PDB (if available)
- Prepare structure for docking (add hydrogens, optimize)
Step 5: Generate novel analogs using datamol
- Select top 5 scaffolds from ChEMBL analysis
- Use scaffold decoration to generate 100 analogs per scaffold
- Apply Lipinski's Rule of Five filtering
- Ensure synthetic accessibility (SA score < 4)
- Check for PAINS and unwanted substructures
Step 6: Predict properties with DeepChem
- Train graph convolutional model on ChEMBL EGFR data
- Predict pIC50 for generated analogs
- Predict ADMET properties (solubility, permeability, hERG)
- Rank candidates by predicted potency and drug-likeness
Step 7: Virtual screening with DiffDock
- Perform molecular docking on top 50 candidates
- Dock into wild-type EGFR and T790M mutant
- Calculate binding energies and interaction patterns
- Identify compounds with favorable binding to both forms
Step 8: Search PubChem for commercial availability
- Query PubChem for top 10 candidates by InChI key
- Check supplier information and purchasing options
- Identify close analogs if exact matches unavailable
Step 9: Literature validation with PubMed
- Search for any prior art on top scaffolds
- Query: "[scaffold_name] AND EGFR AND inhibitor"
- Summarize relevant findings and potential liabilities
Step 10: Create comprehensive report
- Generate 2D structure visualizations of top hits
- Create scatter plots: MW vs LogP, TPSA vs potency
- Produce binding pose figures for top 3 compounds
- Generate table comparing properties to approved drugs (gefitinib, erlotinib)
- Write scientific summary with methodology, results, and recommendations
- Export to PDF with proper citations
Expected Output:
- Ranked list of 10-20 novel EGFR inhibitor candidates
- Predicted activity and ADMET properties
- Docking poses and binding analysis
- Comprehensive scientific report with publication-quality figures
Example 2: Drug Repurposing for Rare Diseases
Objective: Identify FDA-approved drugs that could be repurposed for treating a rare metabolic disorder.
Skills Used:
- drugbank-database - Query approved drugs
- opentargets-database - Target-disease associations
- string-database - Protein interactions
- kegg-database - Pathway analysis
- reactome-database - Pathway enrichment
- clinicaltrials-database - Check ongoing trials
- fda-database - Drug approvals and safety
- networkx - Network analysis
- literature-review - Systematic review
Workflow:
Step 1: Define disease pathway
- Query KEGG and Reactome for disease-associated pathways
- Identify key proteins and enzymes involved
- Map upstream and downstream pathway components
Step 2: Find protein-protein interactions
- Query STRING database for interaction partners
- Build protein interaction network around key disease proteins
- Identify hub proteins and bottlenecks using NetworkX
- Calculate centrality metrics (betweenness, closeness)
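As an illustration of Step 2, a short NetworkX sketch that builds the interaction graph and ranks hubs and bottlenecks; the protein pairs and confidence scores are hypothetical placeholders, not real STRING output.

```python
import networkx as nx

# Hypothetical STRING edges: (protein_a, protein_b, combined_score in [0, 1])
interactions = [
    ("GBA", "SCARB2", 0.95),
    ("GBA", "PSAP", 0.92),
    ("PSAP", "CTSD", 0.80),
    ("SCARB2", "CTSD", 0.72),
]

G = nx.Graph()
for a, b, score in interactions:
    if score >= 0.7:  # keep high-confidence edges only
        G.add_edge(a, b, score=score, distance=1.0 - score)

# Centrality metrics for hub / bottleneck identification
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G, weight="distance")
closeness = nx.closeness_centrality(G, distance="distance")

hubs = sorted(degree, key=degree.get, reverse=True)[:10]
bottlenecks = sorted(betweenness, key=betweenness.get, reverse=True)[:10]
print(hubs, bottlenecks)
```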
Step 3: Query Open Targets for druggable targets
- Search for targets associated with disease phenotype
- Filter by clinical precedence and tractability
- Prioritize targets with existing approved drugs
Step 4: Search DrugBank for drugs targeting identified proteins
- Query for approved drugs and their targets
- Filter by mechanism of action relevant to disease
- Retrieve drug properties and safety information
Step 5: Query FDA databases for safety profiles
- Check FDA adverse event database (FAERS)
- Review drug labels and black box warnings
- Assess risk-benefit for rare disease population
Step 6: Search ClinicalTrials.gov for prior repurposing attempts
- Query for disease name + drug names
- Check for failed trials (and reasons for failure)
- Identify ongoing trials that may compete
Step 7: Perform pathway enrichment analysis
- Map drug targets to disease pathways
- Calculate enrichment scores with Reactome
- Identify drugs affecting multiple pathway nodes
Step 8: Conduct systematic literature review
- Search PubMed for drug name + disease associations
- Include bioRxiv for recent unpublished findings
- Document any case reports or off-label use
- Use literature-review skill to generate comprehensive review
Step 9: Prioritize candidates
- Rank by: pathway relevance, safety profile, existing evidence
- Consider factors: oral availability, blood-brain barrier penetration
- Assess commercial viability and patent status
Step 10: Generate repurposing report
- Create network visualization of drug-target-pathway relationships
- Generate comparison table of top 5 candidates
- Write detailed rationale for each candidate
- Include mechanism of action diagrams
- Provide recommendations for preclinical validation
- Format as professional PDF with citations
Expected Output:
- Ranked list of 5-10 repurposing candidates
- Network analysis of drug-target-disease relationships
- Safety and efficacy evidence summary
- Repurposing strategy report with next steps
Cancer Genomics & Precision Medicine
Example 3: Clinical Variant Interpretation Pipeline
Objective: Analyze a patient's tumor sequencing data to identify actionable mutations and therapeutic recommendations.
Skills Used:
- pysam - Parse VCF files
- ensembl-database - Variant annotation
- clinvar-database - Clinical significance
- cosmic-database - Somatic mutations
- gene-database - Gene information
- uniprot-database - Protein impact
- drugbank-database - Drug-gene associations
- clinicaltrials-database - Matching trials
- opentargets-database - Target validation
- pubmed-database - Literature evidence
- reportlab - Generate clinical report
Workflow:
Step 1: Parse and filter VCF file
- Use pysam to read tumor VCF
- Filter for high-quality variants (QUAL > 30, DP > 20)
- Extract variant positions, alleles, and VAF (variant allele frequency)
- Separate SNVs, indels, and structural variants
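A minimal pysam sketch of the Step 1 filtering; the file path is a placeholder, and the DP/AD field locations depend on the variant caller, so treat those field names as assumptions.

```python
import pysam

vcf = pysam.VariantFile("tumor.vcf.gz")  # placeholder path
snvs, indels = [], []

for rec in vcf:
    if rec.alts is None:
        continue
    depth = rec.info.get("DP", 0)
    if rec.qual is None or rec.qual <= 30 or depth <= 20:
        continue
    sample = next(iter(rec.samples.values()))  # first (tumor) sample
    ad = sample.get("AD")                      # allelic depths; caller-dependent FORMAT field
    vaf = None
    if ad and all(v is not None for v in ad) and sum(ad) > 0:
        vaf = ad[1] / sum(ad)
    variant = (rec.chrom, rec.pos, rec.ref, rec.alts[0], vaf)
    if len(rec.ref) == 1 and len(rec.alts[0]) == 1:
        snvs.append(variant)
    else:
        indels.append(variant)
```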
Step 2: Annotate variants with Ensembl
- Query Ensembl VEP API for functional consequences
- Classify variants: missense, nonsense, frameshift, splice site
- Extract transcript information and protein changes
- Identify canonical transcripts for each gene
Step 3: Query ClinVar for known pathogenic variants
- Search ClinVar by genomic coordinates
- Extract clinical significance classifications
- Note conflicting interpretations and review status
- Prioritize variants with "Pathogenic" or "Likely Pathogenic" labels
Step 4: Query COSMIC for somatic cancer mutations
- Search COSMIC for each variant
- Extract mutation frequency across cancer types
- Identify hotspot mutations (high recurrence)
- Note drug resistance mutations
Step 5: Retrieve gene information from NCBI Gene
- Get detailed gene descriptions
- Extract associated phenotypes and diseases
- Identify oncogene vs tumor suppressor classification
- Note gene function and biological pathways
Step 6: Assess protein-level impact with UniProt
- Query UniProt for protein domain information
- Map variants to functional domains (kinase domain, binding site)
- Check if variant affects active sites or protein stability
- Retrieve post-translational modification sites
Step 7: Search DrugBank for targetable alterations
- Query for drugs targeting mutated genes
- Filter for FDA-approved and investigational drugs
- Extract mechanism of action and indications
- Prioritize variants with approved targeted therapies
Step 8: Query Open Targets for target-disease associations
- Validate therapeutic hypotheses
- Assess target tractability scores
- Review clinical precedence for each gene-disease pair
Step 9: Search ClinicalTrials.gov for matching trials
- Build query with: cancer type + gene names + variants
- Filter for: recruiting status, phase II/III trials
- Extract trial eligibility criteria
- Note geographic locations and contact information
Step 10: Literature search for clinical evidence
- PubMed query: "[gene] AND [variant] AND [cancer type]"
- Focus on: case reports, clinical outcomes, resistance mechanisms
- Extract relevant prognostic or predictive information
Step 11: Classify variants by actionability
Tier 1: FDA-approved therapy for this variant
Tier 2: Clinical trial available for this variant
Tier 3: Therapy approved for variant in different cancer
Tier 4: Biological evidence but no approved therapy
Step 12: Generate clinical genomics report
- Executive summary of key findings
- Table of actionable variants with evidence levels
- Therapeutic recommendations with supporting evidence
- Clinical trial options with eligibility information
- Prognostic implications based on mutation profile
- References to guidelines (NCCN, ESMO, AMP/ASCO/CAP)
- Generate professional PDF using ReportLab
Expected Output:
- Annotated variant list with clinical significance
- Tiered list of actionable mutations
- Therapeutic recommendations with evidence levels
- Matching clinical trials
- Comprehensive clinical genomics report (PDF)
Example 4: Cancer Subtype Classification from Gene Expression
Objective: Classify breast cancer subtypes using RNA-seq data and identify subtype-specific therapeutic vulnerabilities.
Skills Used:
- pydeseq2 - Differential expression
- scanpy - Clustering and visualization
- scikit-learn - Machine learning classification
- gene-database - Gene annotation
- reactome-database - Pathway analysis
- opentargets-database - Drug targets
- pubmed-database - Literature validation
- matplotlib - Visualization
- seaborn - Heatmaps
Workflow:
Step 1: Load and preprocess RNA-seq data
- Load count matrix (genes × samples)
- Filter low-expression genes (mean counts < 10)
- Normalize with DESeq2 size factors
- Apply variance-stabilizing transformation (VST)
Step 2: Classify samples using PAM50 genes
- Query NCBI Gene for PAM50 classifier gene list
- Extract expression values for PAM50 genes
- Train Random Forest classifier on labeled training data
- Predict subtypes: Luminal A, Luminal B, HER2+, Basal, Normal-like
- Validate with published markers (ESR1, PGR, ERBB2, MKI67)
Step 3: Perform differential expression for each subtype
- Use PyDESeq2 to compare each subtype vs all others
- Apply multiple testing correction (FDR < 0.05)
- Filter by log2 fold change (|LFC| > 1.5)
- Identify subtype-specific signature genes
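A sketch of the Step 3 one-vs-rest contrast with PyDESeq2 for a single subtype; the file path and sample lists are placeholders, and the constructor argument names (design_factors vs design) differ between pydeseq2 versions, so check the installed release.

```python
import pandas as pd
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

counts = pd.read_csv("counts_matrix.csv", index_col=0)   # placeholder path; samples x genes
luma_samples = {"TCGA-A1", "TCGA-A2"}                     # placeholder: samples called Luminal A in Step 2
metadata = pd.DataFrame(
    {"group": ["LumA" if s in luma_samples else "rest" for s in counts.index]},
    index=counts.index,
)

dds = DeseqDataSet(counts=counts, metadata=metadata, design_factors="group")
dds.deseq2()                                   # size factors, dispersions, log fold changes

stat_res = DeseqStats(dds, contrast=["group", "LumA", "rest"])
stat_res.summary()                             # populates stat_res.results_df

res = stat_res.results_df
signature = res[(res["padj"] < 0.05) & (res["log2FoldChange"].abs() > 1.5)]
```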
Step 4: Annotate differentially expressed genes
- Query NCBI Gene for detailed annotations
- Classify as oncogene, tumor suppressor, or other
- Extract biological process and molecular function terms
Step 5: Pathway enrichment analysis
- Submit gene lists to Reactome API
- Identify enriched pathways for each subtype (p < 0.01)
- Focus on druggable pathways (kinase signaling, metabolism)
- Compare pathway profiles across subtypes
Step 6: Identify therapeutic targets with Open Targets
- Query Open Targets for each upregulated gene
- Filter by tractability score > 5
- Prioritize targets with clinical precedence
- Extract associated drugs and development phase
Step 7: Create comprehensive visualization
- Generate UMAP projection of all samples colored by subtype
- Create heatmap of PAM50 genes across subtypes
- Produce volcano plots for each subtype comparison
- Generate pathway enrichment dot plots
- Create drug target-pathway network diagrams
Step 8: Literature validation
- Search PubMed for each predicted therapeutic target
- Query: "[gene] AND [subtype] AND breast cancer AND therapy"
- Summarize clinical evidence and ongoing trials
- Note any resistance mechanisms reported
Step 9: Generate subtype-specific recommendations
For each subtype:
- List top 5 differentially expressed genes
- Identify enriched biological pathways
- Recommend therapeutic strategies based on vulnerabilities
- Cite supporting evidence from literature
Step 10: Create comprehensive report
- Classification results with confidence scores
- Differential expression tables for each subtype
- Pathway enrichment summaries
- Therapeutic target recommendations
- Publication-quality figures
- Export to PDF with citations
Expected Output:
- Sample classification into molecular subtypes
- Subtype-specific gene signatures
- Pathway enrichment profiles
- Prioritized therapeutic targets for each subtype
- Scientific report with visualizations and recommendations
Single-Cell Transcriptomics
Example 5: Single-Cell Atlas of Tumor Microenvironment
Objective: Characterize immune cell populations in tumor microenvironment and identify immunotherapy biomarkers.
Skills Used:
- scanpy - Single-cell analysis
- scvi-tools - Batch correction and integration
- cellxgene-census - Reference data
- gene-database - Cell type markers
- anndata - Data structure
- arboreto - Gene regulatory networks
- pytorch-lightning - Deep learning
- matplotlib - Visualization
- statistical-analysis - Hypothesis testing
Workflow:
Step 1: Load and QC 10X Genomics data
- Use Scanpy to read 10X h5 files
- Calculate QC metrics: n_genes, n_counts, pct_mitochondrial
- Identify mitochondrial genes (MT- prefix)
- Filter cells: 200 < n_genes < 5000, pct_mt < 20%
- Filter genes: expressed in at least 10 cells
- Document filtering criteria and cell retention rate
Step 2: Normalize and identify highly variable genes
- Normalize to 10,000 counts per cell
- Log-transform data (log1p)
- Store raw counts in adata.raw
- Identify 3,000 highly variable genes
- Regress out technical variation (n_counts, pct_mt)
- Scale to unit variance, clip at 10 standard deviations
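Steps 1 and 2 condensed into a Scanpy sketch; the input path is a placeholder and the thresholds mirror the values listed above.

```python
import scanpy as sc

adata = sc.read_10x_h5("tumor_filtered_feature_bc_matrix.h5")   # placeholder path
adata.var_names_make_unique()

# QC metrics, including percent mitochondrial counts
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)

# Cell and gene filters from Step 1
adata = adata[(adata.obs.n_genes_by_counts > 200) &
              (adata.obs.n_genes_by_counts < 5000) &
              (adata.obs.pct_counts_mt < 20)].copy()
sc.pp.filter_genes(adata, min_cells=10)

# Normalization, HVG selection, regression, and scaling from Step 2
adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000, subset=True)
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
```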
Step 3: Integrate with reference atlas using scVI
- Download reference tumor microenvironment data from Cellxgene Census
- Train scVI model on combined dataset for batch correction
- Use scVI latent representation for downstream analysis
- Generate batch-corrected expression matrix
Step 4: Dimensionality reduction and clustering
- Compute neighborhood graph (n_neighbors=15, n_pcs=50)
- Calculate UMAP embedding for visualization
- Perform Leiden clustering at multiple resolutions (0.3, 0.5, 0.8)
- Select optimal resolution based on silhouette score
Step 5: Identify cell type markers
- Run differential expression for each cluster (Wilcoxon test)
- Calculate marker scores (log fold change, p-value, pct expressed)
- Query NCBI Gene for canonical immune cell markers:
* T cells: CD3D, CD3E, CD4, CD8A
* B cells: CD19, MS4A1 (CD20), CD79A
* Myeloid: CD14, CD68, CD163
* NK cells: NKG7, GNLY, NCAM1
* Dendritic: CD1C, CLEC9A, LILRA4
Step 6: Annotate cell types
- Assign cell type labels based on marker expression
- Refine annotations with CellTypist or manual curation
- Identify T cell subtypes: CD4+, CD8+, Tregs, exhausted T cells
- Characterize myeloid cells: M1/M2 macrophages, dendritic cells
- Create cell type proportion tables by sample/condition
Step 7: Identify tumor-specific features
- Compare tumor samples vs normal tissue (if available)
- Identify expanded T cell clones (high proliferation markers)
- Detect exhausted T cells (PDCD1, CTLA4, LAG3, HAVCR2)
- Characterize immunosuppressive populations (Tregs, M2 macrophages)
Step 8: Gene regulatory network inference
- Use Arboreto/GRNBoost2 on each major cell type
- Identify transcription factors driving cell states
- Focus on exhaustion TFs: TOX, TCF7, EOMES
- Build regulatory networks for visualization
Step 9: Statistical analysis of cell proportions
- Calculate cell type frequencies per sample
- Test for significant differences between groups (responders vs non-responders)
- Use statistical-analysis skill for appropriate tests (t-test, Mann-Whitney)
- Calculate effect sizes and confidence intervals
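A small sketch of the Step 9 comparison using scipy; the frequency table and response labels are placeholder values.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Placeholder per-sample cell-type fractions and response labels
freq = pd.DataFrame(
    {"CD8_T": [0.21, 0.18, 0.09, 0.07], "Treg": [0.04, 0.05, 0.11, 0.13]},
    index=["P1", "P2", "P3", "P4"],
)
response = pd.Series(["R", "R", "NR", "NR"], index=freq.index)

results = []
for cell_type in freq.columns:
    r = freq.loc[response == "R", cell_type]
    nr = freq.loc[response == "NR", cell_type]
    stat, p = stats.mannwhitneyu(r, nr, alternative="two-sided")
    # Cohen's d as a simple effect-size summary
    pooled_sd = np.sqrt((r.var(ddof=1) + nr.var(ddof=1)) / 2)
    d = (r.mean() - nr.mean()) / pooled_sd
    results.append({"cell_type": cell_type, "p_value": p, "cohens_d": d})

summary = pd.DataFrame(results)
```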
Step 10: Biomarker discovery for immunotherapy response
- Correlate cell type abundances with clinical response
- Identify gene signatures associated with response
- Test signatures: T cell exhaustion, antigen presentation, inflammation
- Validate with published immunotherapy response signatures
Step 11: Create comprehensive visualizations
- UMAP plots colored by: cell type, sample, treatment, key genes
- Dot plots of canonical markers across cell types
- Cell type proportion bar plots by condition
- Heatmap of top differentially expressed genes per cell type
- Gene regulatory network diagrams
- Volcano plots for differentially abundant cell types
Step 12: Generate scientific report
- Methods: QC, normalization, batch correction, clustering
- Results: Cell type composition, differential abundance, markers
- Biomarker analysis: Predictive signatures and validation
- High-quality figures suitable for publication
- Export processed h5ad file and PDF report
Expected Output:
- Annotated single-cell atlas with cell type labels
- Cell type composition analysis
- Biomarker signatures for immunotherapy response
- Gene regulatory networks for key cell states
- Comprehensive report with publication-quality figures
Protein Structure & Function
Example 6: Structure-Based Design of Protein-Protein Interaction Inhibitors
Objective: Design small molecules to disrupt a therapeutically relevant protein-protein interaction.
Skills Used:
- alphafold-database - Protein structures
- pdb-database - Experimental structures
- uniprot-database - Protein information
- biopython - Structure analysis
- pyrosetta - Protein design (if available)
- rdkit - Chemical library generation
- diffdock - Molecular docking
- zinc-database - Screening library
- deepchem - Property prediction
- pymol - Visualization (external)
Workflow:
Step 1: Retrieve protein structures
- Query AlphaFold Database for both proteins in the interaction
- Download PDB files and confidence scores
- If available, get experimental structures from PDB database
- Compare AlphaFold predictions with experimental structures (if any)
Step 2: Analyze protein interaction interface
- Load structures with BioPython
- Identify interface residues (distance < 5Å between proteins)
- Calculate interface area and binding energy contribution
- Identify hot spot residues (key for binding)
- Map to UniProt to get functional annotations
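A Bio.PDB sketch of the Step 2 interface-residue detection, assuming the two partners sit on chains A and B of a single complex file (both placeholders).

```python
from Bio.PDB import NeighborSearch, PDBParser

parser = PDBParser(QUIET=True)
structure = parser.get_structure("complex", "complex.pdb")   # placeholder path
chain_a, chain_b = structure[0]["A"], structure[0]["B"]      # placeholder chain IDs

# Any residue in chain A with an atom within 5 Å of chain B counts as interface
search = NeighborSearch(list(chain_b.get_atoms()))
interface_a = set()
for residue in chain_a:
    for atom in residue:
        if search.search(atom.coord, 5.0, level="R"):
            interface_a.add((residue.get_resname(), residue.id[1]))
            break

print(sorted(interface_a, key=lambda r: r[1]))
```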
Step 3: Characterize binding pocket
- Identify cavities at the protein-protein interface
- Calculate pocket volume and surface area
- Assess druggability: depth, hydrophobicity, shape
- Identify hydrogen bond donors/acceptors
- Note any known allosteric sites
Step 4: Query UniProt for known modulators
- Search UniProt for both proteins
- Extract information on known inhibitors or modulators
- Review PTMs that affect interaction
- Check disease-associated mutations in interface
Step 5: Search ZINC15 for fragment library
- Query ZINC for fragments matching pocket criteria:
* Molecular weight: 150-300 Da
* LogP: 0-3 (appropriate for PPI inhibitors)
* Exclude PAINS and aggregators
- Download 1,000-5,000 fragment SMILES
Step 6: Virtual screening with fragment library
- Use DiffDock to dock fragments into interface pocket
- Rank by predicted binding affinity
- Identify fragments binding to hot spot residues
- Select top 50 fragments for elaboration
Step 7: Fragment elaboration with RDKit
- For each fragment hit, generate elaborated molecules:
* Add substituents to core scaffold
* Merge fragments binding to adjacent pockets
* Apply medicinal chemistry filters
- Generate 20-50 analogs per fragment
- Filter with PPI-adapted property rules (MW 400-700), which relax the strict Lipinski Ro5 cutoffs
Step 8: Second round of virtual screening
- Dock elaborated molecules with DiffDock
- Calculate binding energies and interaction patterns
- Prioritize molecules with:
* Strong binding to hot spot residues
* Multiple H-bonds and hydrophobic contacts
* Favorable predicted ΔG
Step 9: Predict ADMET properties with DeepChem
- Train models on ChEMBL data
- Predict: solubility, permeability, hERG liability
- Filter for drug-like properties
- Rank by overall score (affinity + ADMET)
Step 10: Literature and patent search
- PubMed: "[protein A] AND [protein B] AND inhibitor"
- USPTO: Check for prior art on top scaffolds
- Assess freedom to operate
- Identify any reported PPI inhibitors for this target
Step 11: Prepare molecules for synthesis
- Assess synthetic accessibility (SA score < 4)
- Identify commercial building blocks
- Propose synthetic routes for top 10 candidates
- Calculate estimated synthesis cost
Step 12: Generate comprehensive design report
- Interface analysis with hot spot identification
- Fragment screening results
- Top 10 designed molecules with predicted properties
- Docking poses and interaction diagrams
- Synthetic accessibility assessment
- Comparison to known PPI inhibitors
- Recommendations for experimental validation
- Publication-quality figures and PDF report
Expected Output:
- Interface characterization and hot spot analysis
- Ranked library of designed PPI inhibitors
- Predicted binding modes and affinities
- ADMET property predictions
- Synthetic accessibility assessment
- Comprehensive drug design report
Chemical Safety & Toxicology
Example 7: Predictive Toxicology Assessment
Objective: Assess potential toxicity and safety liabilities of drug candidates before synthesis.
Skills Used:
- rdkit - Molecular descriptors
- deepchem - Toxicity prediction
- chembl-database - Toxicity data
- pubchem-database - Bioassay data
- drugbank-database - Known drug toxicities
- fda-database - Adverse events
- hmdb-database - Metabolite prediction
- scikit-learn - Classification models
- shap - Model interpretability
Workflow:
Step 1: Calculate molecular descriptors
- Load candidate molecules with RDKit
- Calculate physicochemical properties:
* MW, LogP, TPSA, rotatable bonds, H-bond donors/acceptors
* Aromatic rings, sp3 fraction, formal charge
- Calculate structural alerts:
* PAINS patterns
* Toxic functional groups (nitroaromatics, epoxides, etc.)
* Genotoxic alerts (Ames mutagenicity)
Step 2: Screen for known toxicophores
- Search for structural alerts using SMARTS patterns:
* Michael acceptors
* Aldehyde/ketone reactivity
* Quinones and quinone-like structures
* Thioureas and isocyanates
- Flag molecules with high-risk substructures
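A sketch of the Step 2 substructure screen with RDKit; the SMARTS below are simplified illustrations rather than a validated alert set such as Brenk or PAINS.

```python
from rdkit import Chem

# Illustrative (simplified) structural-alert SMARTS; a production screen
# would use a curated collection such as the Brenk or PAINS filters.
alerts = {
    "michael_acceptor": "[CX3]=[CX3][CX3]=[OX1]",
    "aldehyde": "[CX3H1](=O)[#6]",
    "quinone": "O=C1C=CC(=O)C=C1",
    "isocyanate": "N=C=O",
    "nitroaromatic": "[cX3][N+](=O)[O-]",
}
alert_mols = {name: Chem.MolFromSmarts(s) for name, s in alerts.items()}

def flag_toxicophores(smiles: str) -> list[str]:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return ["invalid_smiles"]
    return [name for name, patt in alert_mols.items() if mol.HasSubstructMatch(patt)]

# Benzoquinone trips both the Michael acceptor and quinone alerts
print(flag_toxicophores("O=C1C=CC(=O)C=C1"))   # -> ['michael_acceptor', 'quinone']
```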
Step 3: Query ChEMBL for similar compounds with toxicity data
- Perform similarity search (Tanimoto > 0.7)
- Extract toxicity assay results:
* Cytotoxicity (IC50 values)
* Hepatotoxicity markers
* Cardiotoxicity (hERG inhibition)
* Genotoxicity (Ames test results)
- Analyze structure-toxicity relationships
Step 4: Search PubChem BioAssays for toxicity screening
- Query relevant assays:
* Tox21 panel (cell viability, stress response, genotoxicity)
* Liver toxicity assays
* hERG channel inhibition
- Extract activity data for similar compounds
- Calculate hit rates for concerning assays
Step 5: Train toxicity prediction models with DeepChem
- Load Tox21 dataset from DeepChem
- Train graph convolutional models for:
* Nuclear receptor signaling
* Stress response pathways
* Genotoxicity endpoints
- Validate models with cross-validation
- Predict toxicity for candidate molecules
Step 6: Predict hERG cardiotoxicity liability
- Train DeepChem model on hERG inhibition data from ChEMBL
- Predict IC50 for hERG channel
- Flag compounds with predicted IC50 < 10 μM
- Identify structural features associated with hERG liability
Step 7: Predict hepatotoxicity risk
- Train models on DILI (drug-induced liver injury) datasets
- Extract features: reactive metabolites, mitochondrial toxicity
- Predict hepatotoxicity risk class (low/medium/high)
- Use SHAP values to explain predictions
Step 8: Predict metabolic stability and metabolites
- Identify sites of metabolism using RDKit SMARTS patterns
- Predict CYP450 interactions
- Query HMDB for potential metabolite structures
- Assess if metabolites contain toxic substructures
- Predict metabolic stability (half-life)
Step 9: Check FDA adverse event database
- Query FAERS for approved drugs similar to candidates
- Extract common adverse events
- Identify target organ toxicities
- Calculate reporting odds ratios for serious events
Step 10: Literature review of toxicity mechanisms
- PubMed search: "[scaffold] AND (toxicity OR hepatotoxicity OR cardiotoxicity)"
- Identify mechanistic studies on similar compounds
- Note any case reports of adverse events
- Review preclinical and clinical safety data
Step 11: Assess ADME liabilities
- Predict solubility, permeability, plasma protein binding
- Identify potential drug-drug interaction risks
- Assess blood-brain barrier penetration (desirable for CNS targets, a liability for peripherally acting drugs)
- Evaluate metabolic stability
Step 12: Generate safety assessment report
- Executive summary of safety profile for each candidate
- Red flags: structural alerts, predicted toxicities
- Yellow flags: moderate concerns requiring testing
- Green light: acceptable predicted safety profile
- Comparison table of all candidates
- Recommendations for risk mitigation:
* Structural modifications to reduce toxicity
* Priority in vitro assays to run
* Preclinical study design recommendations
- Comprehensive PDF report with:
* Toxicophore analysis
* Prediction model results with confidence
* SHAP interpretation plots
* Literature evidence
* Risk assessment matrix
Expected Output:
- Toxicity predictions for all candidates
- Structural alert analysis
- hERG, hepatotoxicity, and genotoxicity risk scores
- Metabolite predictions
- Prioritized list with safety rankings
- Comprehensive toxicology assessment report
Clinical Trial Analysis
Example 8: Competitive Landscape Analysis for New Indication
Objective: Analyze the clinical trial landscape for a specific indication to inform development strategy.
Skills Used:
- clinicaltrials-database - Trial registry
- fda-database - Drug approvals
- pubmed-database - Published results
- drugbank-database - Approved drugs
- opentargets-database - Target validation
- polars - Data manipulation
- matplotlib - Visualization
- seaborn - Statistical plots
- reportlab - Report generation
Workflow:
Step 1: Search ClinicalTrials.gov for all trials in indication
- Query: "[disease/indication]"
- Filter: All phases, all statuses
- Extract fields:
* NCT ID, title, phase, status
* Start date, completion date, enrollment
* Intervention/drug names
* Primary/secondary outcomes
* Sponsor and collaborators
- Export to structured JSON/CSV
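As a sketch of Step 1, a paginated pull from the ClinicalTrials.gov v2 REST API; the endpoint, parameter names, and response keys reflect the v2 API as currently understood and should be verified against the official documentation, and the condition string is a placeholder.

```python
import requests

BASE = "https://clinicaltrials.gov/api/v2/studies"      # v2 endpoint (verify against current docs)
params = {
    "query.cond": "idiopathic pulmonary fibrosis",       # placeholder indication
    "pageSize": 100,
}

studies = []
while True:
    resp = requests.get(BASE, params=params, timeout=30)
    resp.raise_for_status()
    payload = resp.json()
    studies.extend(payload.get("studies", []))
    token = payload.get("nextPageToken")
    if not token:
        break
    params["pageToken"] = token                          # follow pagination until exhausted

print(f"Retrieved {len(studies)} study records")
```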
Step 2: Categorize trials by mechanism of action
- Extract drug names and intervention types
- Query DrugBank for mechanism of action
- Query Open Targets for target information
- Classify into categories:
* Small molecules vs biologics
* Target class (kinase inhibitor, antibody, etc.)
* Novel vs repurposing
Step 3: Analyze trial phase progression
- Calculate success rates by phase (I → II, II → III)
- Identify terminated trials and reasons for termination
- Track time from phase I start to NDA submission
- Calculate median development timelines
Step 4: Search FDA database for recent approvals
- Query FDA drug approvals in the indication (last 10 years)
- Extract approval dates, indications, priority review status
- Note any accelerated approvals or breakthrough therapy designations
- Review FDA drug labels for safety information
Step 5: Extract outcome measures
- Compile all primary endpoints used
- Identify most common endpoints:
* Survival (OS, PFS, DFS)
* Response rates (ORR, CR, PR)
* Biomarker endpoints
* Patient-reported outcomes
- Note emerging or novel endpoints
Step 6: Analyze competitive dynamics
- Identify leading companies and their pipelines
- Map trials by phase for each major competitor
- Note partnership and licensing deals
- Assess crowded vs underserved patient segments
Step 7: Search PubMed for published trial results
- Query: "[NCT ID]" for each completed trial
- Extract published outcomes and conclusions
- Identify trends in efficacy and safety
- Note any unmet needs highlighted in discussions
Step 8: Analyze target validation evidence
- Query Open Targets for target-disease associations
- Extract genetic evidence scores
- Review tractability assessments
- Compare targets being pursued across trials
Step 9: Identify unmet needs and opportunities
- Analyze trial failures for common patterns
- Identify patient populations excluded from trials
- Note resistance mechanisms or limitations mentioned
- Assess gaps in current therapeutic approaches
Step 10: Perform temporal trend analysis
- Plot trial starts over time (by phase, mechanism)
- Identify increasing or decreasing interest in targets
- Correlate with publication trends and scientific advances
- Predict future trends in the space
Step 11: Create comprehensive visualizations
- Timeline of all trials (Gantt chart style)
- Phase distribution pie chart
- Mechanism of action breakdown
- Geographic distribution of trials
- Enrollment trends over time
- Success rate funnels (Phase I → II → III → Approval)
- Sponsor/company market share
Step 12: Generate competitive intelligence report
- Executive summary of competitive landscape
- Total number of active programs by phase
- Key players and their development stage
- Standard of care and approved therapies
- Emerging approaches and novel targets
- Identified opportunities and white space
- Risk analysis (crowded targets, high failure rates)
- Strategic recommendations:
* Patient population to target
* Differentiation strategies
* Partnership opportunities
* Regulatory pathway considerations
- Export as professional PDF with citations and data tables
Expected Output:
- Comprehensive trial database for indication
- Success rate and timeline statistics
- Competitive landscape mapping
- Unmet need analysis
- Strategic recommendations
- Publication-ready report with visualizations
Metabolomics & Systems Biology
Example 9: Multi-Omics Integration for Metabolic Disease
Objective: Integrate transcriptomics, proteomics, and metabolomics to identify dysregulated pathways in metabolic disease.
Skills Used:
- pydeseq2 - RNA-seq analysis
- pyopenms - Mass spectrometry
- hmdb-database - Metabolite identification
- metabolomics-workbench-database - Public datasets
- kegg-database - Pathway mapping
- reactome-database - Pathway analysis
- string-database - Protein interactions
- statsmodels - Multi-omics correlation
- networkx - Network analysis
- pymc - Bayesian modeling
Workflow:
Step 1: Process RNA-seq data
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare disease vs control (adjusted p < 0.05, |LFC| > 1)
- Extract gene symbols and fold changes
- Map to KEGG gene IDs
Step 2: Process proteomics data
- Load LC-MS/MS results with PyOpenMS
- Perform peptide identification and quantification
- Normalize protein abundances
- Run statistical testing (t-test or limma)
- Extract significant proteins (p < 0.05, |FC| > 1.5)
Step 3: Process metabolomics data
- Load untargeted metabolomics data (mzML format) with PyOpenMS
- Perform peak detection and alignment
- Match features to HMDB database by accurate mass
- Annotate metabolites with MS/MS fragmentation
- Extract putative identifications (Level 2/3)
- Perform statistical analysis (FDR < 0.05, |FC| > 2)
Step 4: Search Metabolomics Workbench for public data
- Query for same disease or tissue type
- Download relevant studies
- Reprocess for consistency with own data
- Use as validation cohort
Step 5: Map all features to KEGG pathways
- Map genes to KEGG orthology (KO) terms
- Map proteins to KEGG identifiers
- Map metabolites to KEGG compound IDs
- Identify pathways with multi-omics coverage
Step 6: Perform pathway enrichment analysis
- Test for enrichment in KEGG pathways
- Test for enrichment in Reactome pathways
- Apply Fisher's exact test with multiple testing correction
- Focus on pathways with hits in ≥2 omics layers
Step 7: Build protein-metabolite networks
- Query STRING for protein-protein interactions
- Map proteins to KEGG reactions
- Connect enzymes to their substrates/products
- Build integrated network with genes → proteins → metabolites
Step 8: Network topology analysis with NetworkX
- Calculate node centrality (degree, betweenness)
- Identify hub metabolites and key enzymes
- Find bottleneck reactions
- Detect network modules with community detection
- Identify dysregulated subnetworks
Step 9: Correlation analysis across omics layers
- Calculate Spearman correlations between:
* Gene expression and protein abundance
* Protein abundance and metabolite levels
* Gene expression and metabolites (for enzyme-product pairs)
- Use statsmodels for significance testing
- Focus on enzyme-metabolite pairs with expected relationships
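A sketch of the Step 9 enzyme-metabolite correlation analysis with BH correction; the abundance tables and enzyme-product pairs are random placeholders standing in for the matched omics data.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

# Placeholder paired tables: samples x features, already matched by sample ID
rng = np.random.default_rng(0)
protein_abund = pd.DataFrame(rng.random((20, 3)), columns=["HK2", "PFKL", "GPI"])
metabolite_abund = pd.DataFrame(rng.random((20, 3)), columns=["G6P", "F16BP", "F6P"])

# Enzyme -> product pairs expected from KEGG reaction annotations (placeholders)
pairs = [("HK2", "G6P"), ("PFKL", "F16BP"), ("GPI", "F6P")]

rows = []
for enzyme, metabolite in pairs:
    rho, p = spearmanr(protein_abund[enzyme], metabolite_abund[metabolite])
    rows.append({"enzyme": enzyme, "metabolite": metabolite, "rho": rho, "p": p})

res = pd.DataFrame(rows)
res["p_adj"] = multipletests(res["p"], method="fdr_bh")[1]   # Benjamini-Hochberg FDR
```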
Step 10: Bayesian network modeling with PyMC
- Build probabilistic graphical model of pathway
- Model causal relationships: gene → protein → metabolite
- Incorporate prior knowledge from KEGG/Reactome
- Perform inference to identify key regulatory nodes
- Estimate effect sizes and uncertainties
Step 11: Identify therapeutic targets
- Prioritize enzymes with:
* Significant changes in all three omics layers
* High network centrality
* Druggable target class (kinases, transporters, etc.)
- Query DrugBank for existing inhibitors
- Search PubMed for validation in disease models
Step 12: Create comprehensive multi-omics report
- Summary statistics for each omics layer
- Venn diagram of overlapping pathway hits
- Pathway enrichment dot plots
- Integrated network visualization (color by fold change)
- Correlation heatmaps (enzyme-metabolite pairs)
- Bayesian network structure
- Table of prioritized therapeutic targets
- Biological interpretation and mechanistic insights
- Generate publication-quality figures
- Export PDF report with all results
Expected Output:
- Integrated multi-omics dataset
- Dysregulated pathway identification
- Multi-omics network model
- Prioritized list of therapeutic targets
- Comprehensive systems biology report
Materials Science & Chemistry
Example 10: High-Throughput Materials Discovery for Battery Applications
Objective: Discover novel solid electrolyte materials for lithium-ion batteries using computational screening.
Skills Used:
- pymatgen - Materials analysis
- matminer - Feature engineering
- scikit-learn - Machine learning
- pymoo - Multi-objective optimization
- ase - Atomic simulation
- sympy - Symbolic math
- vaex - Large dataset handling
- matplotlib - Visualization
- scientific-writing - Report generation
Workflow:
Step 1: Generate candidate materials library
- Use Pymatgen to enumerate compositions:
* Li-containing compounds (Li₁₋ₓM₁₊ₓX₂)
* M = transition metals (Zr, Ti, Ta, Nb)
* X = O, S, Se
- Generate ~10,000 candidate compositions
- Apply charge neutrality constraints
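A pymatgen sketch of the Step 1 enumeration, using oxidation-state guessing as the charge-neutrality check; the substitution levels are scaled to integer stoichiometries and the x values shown are illustrative.

```python
from itertools import product

from pymatgen.core import Composition

metals = ["Zr", "Ti", "Ta", "Nb"]
anions = ["O", "S", "Se"]
x_values = [0.0, 0.25, 0.5]          # illustrative Li(1-x) M(1+x) X2 substitution levels

candidates = []
for metal, anion, x in product(metals, anions, x_values):
    # Scale Li(1-x) M(1+x) X2 by 4 to keep integer stoichiometry
    comp = Composition({"Li": round(4 * (1 - x)), metal: round(4 * (1 + x)), anion: 8})
    # Keep compositions admitting at least one charge-neutral oxidation-state assignment
    if comp.oxi_state_guesses():
        candidates.append(comp.reduced_formula)

print(len(candidates), candidates[:5])
```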
Step 2: Filter by thermodynamic stability
- Query Materials Project database via Pymatgen
- Calculate formation energy from elements
- Calculate energy above convex hull (E_hull)
- Filter: E_hull < 50 meV/atom (likely stable)
- Retain ~2,000 thermodynamically plausible compounds
Step 3: Predict crystal structures
- Use Pymatgen structure predictor
- Generate most likely crystal structures for each composition
- Consider common structure types: LISICON, NASICON, garnet, perovskite
- Calculate structural descriptors
Step 4: Calculate material properties with Pymatgen
- Lattice parameters and volume
- Density
- Packing fraction
- Ionic radii and bond lengths
- Coordination environments
Step 5: Feature engineering with matminer
- Calculate compositional features:
* Elemental property statistics (electronegativity, ionic radius)
* Valence electron concentrations
* Stoichiometric attributes
- Calculate structural features:
* Pore size distribution
* Site disorder parameters
* Partial radial distribution functions
Step 6: Build ML models for Li⁺ conductivity prediction
- Collect training data from literature (experimental conductivities)
- Train ensemble models with scikit-learn:
* Random Forest
* Gradient Boosting
* Neural Network
- Use 5-fold cross-validation
- Predict ionic conductivity for all candidates
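A scikit-learn sketch of the Step 6 training and screening, showing two of the ensemble models; the feature matrices and conductivity labels are random placeholders for the matminer features and literature data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Placeholders: matminer features for literature compounds and log10 conductivity labels
rng = np.random.default_rng(0)
X_train, y_train = rng.random((200, 40)), rng.normal(-4, 1, 200)
X_candidates = rng.random((2000, 40))

models = {
    "random_forest": RandomForestRegressor(n_estimators=500, random_state=0),
    "gradient_boosting": GradientBoostingRegressor(random_state=0),
}

predictions = {}
for name, model in models.items():
    # 5-fold cross-validation on the literature training set
    cv_r2 = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_candidates)
    print(f"{name}: mean CV R^2 = {cv_r2.mean():.2f}")

# Ensemble prediction = mean over models (log10 S/cm)
log_sigma_pred = np.mean(list(predictions.values()), axis=0)
```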
Step 7: Predict additional properties
- Electrochemical stability window (ML model)
- Mechanical properties (bulk modulus, shear modulus)
- Interfacial resistance (estimate from structure)
- Synthesis temperature (ML prediction from similar compounds)
Step 8: Multi-objective optimization with PyMOO
Define optimization objectives:
- Maximize: ionic conductivity (>10⁻³ S/cm target)
- Maximize: electrochemical window (>4.5V target)
- Minimize: synthesis temperature (<800°C preferred)
- Minimize: cost (based on elemental abundance)
Run NSGA-II to find Pareto optimal solutions
Extract top 50 candidates from Pareto front
Step 9: Analyze Pareto optimal materials
- Identify composition trends (which elements appear frequently)
- Analyze structure-property relationships
- Calculate trade-offs between objectives
- Identify "sweet spot" compositions
Step 10: Validate predictions with DFT calculations
- Select top 10 candidates for detailed study
- Set up DFT calculations (VASP-like, if available via ASE)
- Calculate:
* Accurate formation energies
* Li⁺ migration barriers (NEB calculations)
* Electronic band gap
* Elastic constants
- Compare DFT results with ML predictions
Step 11: Literature and patent search
- Search for prior art on top candidates
- PubMed and Google Scholar: "[composition] AND electrolyte"
- USPTO: Check for existing patents on similar compositions
- Identify any experimental reports on related materials
Step 12: Generate materials discovery report
- Summary of screening workflow and statistics
- Pareto front visualization (conductivity vs stability vs cost)
- Structure visualization of top candidates
- Property comparison table
- Composition-property trend analysis
- DFT validation results
- Predicted performance vs state-of-art materials
- Synthesis recommendations
- IP landscape summary
- Prioritized list of 5-10 materials for experimental validation
- Export as publication-ready PDF
Expected Output:
- Screened library of 10,000+ materials
- ML models for property prediction
- Pareto-optimal set of 50 candidates
- Detailed analysis of top 10 materials
- DFT validation results
- Comprehensive materials discovery report
Digital Pathology
Example 11: Automated Tumor Detection in Whole Slide Images
Objective: Develop and validate a deep learning model for automated tumor detection in histopathology images.
Skills Used:
- histolab - Whole slide image processing
- pathml - Computational pathology
- pytorch-lightning - Deep learning
- torchvision - Image models
- scikit-learn - Model evaluation
- pydicom - DICOM handling
- omero-integration - Image management
- matplotlib - Visualization
- shap - Model interpretability
Workflow:
Step 1: Load whole slide images with HistoLab
- Load WSI files (SVS, TIFF formats)
- Extract slide metadata and magnification levels
- Visualize slide thumbnails
- Inspect tissue area vs background
Step 2: Tile extraction and preprocessing
- Use HistoLab to extract tiles (256×256 pixels at 20× magnification)
- Filter tiles:
* Remove background (tissue percentage > 80%)
* Apply color normalization (Macenko or Reinhard method)
* Filter out artifacts and bubbles
- Extract ~100,000 tiles per slide across all slides
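A HistoLab sketch of Steps 1 and 2; the slide path, pyramid level, and GridTiler arguments are assumptions to check against the installed histolab version.

```python
from histolab.slide import Slide
from histolab.tiler import GridTiler

slide = Slide("case_001.svs", processed_path="tiles/case_001")   # placeholder paths
print(slide.dimensions, slide.levels)

# 256x256 tiles at a fixed pyramid level, keeping only tiles that are mostly tissue
tiler = GridTiler(
    tile_size=(256, 256),
    level=1,               # hypothetical level approximating 20x for this scanner
    check_tissue=True,
    tissue_percent=80.0,
    suffix=".png",
)
tiler.extract(slide)
```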
Step 3: Create annotations (if training from scratch)
- Load pathologist annotations (if available via OMERO)
- Convert annotations to tile-level labels
- Categories: tumor, stroma, necrosis, normal
- Balance classes through stratified sampling
Step 4: Set up PathML pipeline
- Create PathML SlideData objects
- Define preprocessing pipeline:
* Stain normalization
* Color augmentation (HSV jitter)
* Rotation and flipping
- Split data: 70% train, 15% validation, 15% test
Step 5: Build deep learning model with PyTorch Lightning
- Architecture: ResNet50 or EfficientNet backbone
- Add custom classification head for tissue types
- Define training pipeline:
* Loss function: Cross-entropy or Focal loss
* Optimizer: Adam with learning rate scheduling
* Augmentations: rotation, flip, color jitter, elastic deformation
* Batch size: 32
* Mixed precision training
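A condensed PyTorch Lightning sketch of the Step 5 model with a torchvision ResNet50 backbone; the class count, learning rate, and scheduler are illustrative choices.

```python
import pytorch_lightning as pl
import torch
import torch.nn.functional as F
from torchvision import models

class TileClassifier(pl.LightningModule):
    def __init__(self, n_classes: int = 4, lr: float = 1e-4):
        super().__init__()
        self.save_hyperparameters()
        self.backbone = models.resnet50(weights="IMAGENET1K_V2")
        self.backbone.fc = torch.nn.Linear(self.backbone.fc.in_features, n_classes)

    def forward(self, x):
        return self.backbone(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = F.cross_entropy(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self(x)
        self.log("val_loss", F.cross_entropy(logits, y))
        self.log("val_acc", (logits.argmax(dim=1) == y).float().mean())

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=self.hparams.lr)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=3)
        return {"optimizer": optimizer,
                "lr_scheduler": {"scheduler": scheduler, "monitor": "val_loss"}}

# trainer = pl.Trainer(max_epochs=30, precision="16-mixed")
# trainer.fit(TileClassifier(), train_dataloader, val_dataloader)
```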
Step 6: Train model
- Train on tile-level labels
- Monitor metrics: accuracy, F1 score, AUC
- Use early stopping on validation loss
- Save best model checkpoint
- Training time: ~6-12 hours on GPU
Step 7: Evaluate model performance
- Test on held-out test set
- Calculate metrics with scikit-learn:
* Accuracy, precision, recall, F1 per class
* Confusion matrix
* ROC curves and AUC
- Compute confidence intervals with bootstrapping
Step 8: Slide-level aggregation
- Apply model to all tiles in each test slide
- Aggregate predictions:
* Majority voting
* Weighted average by confidence
* Spatial smoothing with convolution
- Generate probability heatmaps overlaid on WSI
Step 9: Model interpretability with SHAP
- Apply GradCAM or SHAP to explain predictions
- Visualize which regions contribute to tumor classification
- Generate attention maps showing model focus
- Validate that model attends to relevant histological features
Step 10: Clinical validation
- Compare model predictions with pathologist diagnosis
- Calculate inter-rater agreement (kappa score)
- Identify discordant cases for review
- Analyze error types: false positives, false negatives
Step 11: Integration with OMERO
- Upload processed slides and heatmaps to OMERO server
- Attach model predictions as slide metadata
- Enable pathologist review interface
- Store annotations and corrections for model retraining
Step 12: Generate clinical validation report
- Model architecture and training details
- Performance metrics with confidence intervals
- Slide-level accuracy vs pathologist ground truth
- Heatmap visualizations for representative cases
- Analysis of failure modes
- Comparison with published methods
- Discussion of clinical applicability
- Recommendations for deployment and monitoring
- Export PDF report for regulatory submission (if needed)
Expected Output:
- Trained deep learning model for tumor detection
- Tile-level and slide-level predictions
- Probability heatmaps for visualization
- Performance metrics and validation results
- Model interpretation visualizations
- Clinical validation report
Lab Automation & Protocol Design
Example 12: Automated High-Throughput Screening Protocol
Objective: Design and execute an automated compound screening workflow using liquid handling robots.
Skills Used:
- pylabrobot - Lab automation
- opentrons-integration - Opentrons protocol
- benchling-integration - Sample tracking
- protocolsio-integration - Protocol documentation
- simpy - Process simulation
- polars - Data processing
- matplotlib - Plate visualization
- reportlab - Report generation
Workflow:
Step 1: Define screening campaign in Benchling
- Create compound library in Benchling registry
- Register all compounds with structure, concentration, location
- Define plate layouts (384-well format)
- Track compound source plates in inventory
- Set up ELN entry for campaign documentation
Step 2: Design assay protocol
- Define assay steps:
* Dispense cells (5000 cells/well)
* Add compounds (dose-response curve, 10 concentrations)
* Incubate 48 hours at 37°C
* Add detection reagent (cell viability assay)
* Read luminescence signal
- Calculate required reagent volumes
- Document protocol in Protocols.io
- Share with team for review
Step 3: Simulate workflow with SimPy
- Model liquid handler, incubator, plate reader as resources
- Simulate timing for 20 plates (7,680 wells)
- Identify bottlenecks (plate reader reads take 5 min/plate)
- Optimize scheduling: stagger plate processing
- Validate that throughput goal is achievable (20 plates/day)
Step 4: Design plate layout
- Use PyLabRobot to generate plate maps:
* Columns 1-2: positive controls (DMSO)
* Columns 3-22: compound titrations (10 concentrations in duplicate)
* Columns 23-24: negative controls (cytotoxic control)
- Randomize compound positions across plates
- Account for edge effects (avoid outer wells for samples)
- Export plate maps to CSV
Step 5: Create Opentrons protocol for cell seeding
- Write Python protocol using Opentrons API 2.0
- Steps:
* Aspirate cells from reservoir
* Dispense 40 μL cell suspension per well
* Tips: use P300 multi-channel for speed
* Include mixing steps to prevent settling
- Simulate protocol in Opentrons app
- Test on one plate before full run
Step 6: Create Opentrons protocol for compound addition
- Acoustic liquid handler (Echo) or pin tool for nanoliter transfers
- If using Opentrons:
* Source: 384-well compound plates
* Transfer 100 nL compound (in DMSO) to assay plates
* Use P20 for precision
* Prepare serial dilutions on deck if needed
- Account for DMSO normalization (1% final)
Step 7: Integrate with Benchling for sample tracking
- Use Benchling API to:
* Retrieve compound information (structure, batch, concentration)
* Log plate creation in inventory
* Create transfer records for audit trail
* Link assay plates to ELN entry
Step 8: Execute automated workflow
- Day 1: Seed cells with Opentrons
- Day 1 (4h later): Add compounds with Opentrons
- Day 3: Add detection reagent (manual or automated)
- Day 3 (2h later): Read plates on plate reader
- Store plates at 4°C between steps
Step 9: Collect and process data
- Export raw luminescence data from plate reader
- Load data with Polars for fast processing
- Normalize data:
* Subtract background (media-only wells)
* Calculate % viability relative to DMSO control
* Apply plate-wise normalization to correct systematic effects
- Quality control:
* Z' factor calculation (> 0.5 for acceptable assay)
* Coefficient of variation for controls (< 10%)
* Flag plates with poor QC metrics
Step 10: Dose-response curve fitting
- Fit 4-parameter logistic curves for each compound
- Calculate IC50, Hill slope, max/min response
- Use scikit-learn or scipy for curve fitting
- Compute 95% confidence intervals
- Flag compounds with poor curve fits (R² < 0.8)
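A scipy sketch of the Step 10 four-parameter logistic fit; the titration values are placeholders.

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(conc, bottom, top, ic50, hill):
    """Four-parameter logistic: % viability as a function of concentration."""
    return bottom + (top - bottom) / (1.0 + (conc / ic50) ** hill)

# Placeholder 10-point titration (µM) and normalized % viability
conc = np.array([0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30])
viability = np.array([98, 97, 95, 90, 78, 55, 32, 18, 10, 7], dtype=float)

p0 = [viability.min(), viability.max(), np.median(conc), 1.0]   # initial guesses
params, cov = curve_fit(four_pl, conc, viability, p0=p0, maxfev=10000)
bottom, top, ic50, hill = params

residuals = viability - four_pl(conc, *params)
r_squared = 1 - residuals.var() / viability.var()
ic50_se = np.sqrt(np.diag(cov))[2]   # standard error, for a rough 95% CI of ±1.96*SE
print(f"IC50 = {ic50:.2f} µM, Hill = {hill:.2f}, R² = {r_squared:.3f}")
```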
Step 11: Hit identification and triage
- Define hit criteria:
* IC50 < 10 μM
* Max inhibition > 50%
* Curve quality: R² > 0.8
- Prioritize hits by potency
- Check for PAINS patterns with RDKit
- Cross-reference with known aggregators/frequent hitters
Step 12: Visualize results and generate report
- Create plate heatmaps showing % viability
- Dose-response curve plots for hits
- Scatter plot: potency vs max effect
- QC metric summary across plates
- Structure visualization of top 20 hits
- Generate campaign summary report:
* Screening statistics (compounds tested, hit rate)
* QC metrics and data quality assessment
* Hit list with structures and IC50 values
* Protocol documentation from Protocols.io
* Raw data files and analysis code
* Recommendations for confirmation assays
- Update Benchling ELN with results
- Export PDF report for stakeholders
Expected Output:
- Automated screening protocols (Opentrons Python files)
- Executed screen of 384-well plates
- Quality-controlled dose-response data
- Hit list with IC50 values
- Comprehensive screening report
Agricultural Genomics
Example 13: GWAS for Crop Yield Improvement
Objective: Identify genetic markers associated with drought tolerance and yield in a crop species.
Skills Used:
- biopython - Sequence analysis
- pysam - VCF processing
- gwas-database - Public GWAS data
- ensembl-database - Plant genomics
- gene-database - Gene annotation
- scanpy - Population structure (adapted for genetic data)
- scikit-learn - PCA and clustering
- statsmodels - Association testing
- matplotlib - Manhattan plots
- seaborn - Visualization
Workflow:
Step 1: Load and QC genotype data
- Load VCF file with pysam
- Filter variants:
* Call rate > 95%
* Minor allele frequency (MAF) > 5%
* Hardy-Weinberg equilibrium p > 1e-6
- Convert to numeric genotype matrix (0, 1, 2)
- Retain ~500,000 SNPs after QC
Step 2: Assess population structure
- Calculate genetic relationship matrix
- Perform PCA with scikit-learn (use top 10 PCs)
- Visualize population structure (PC1 vs PC2)
- Identify distinct subpopulations or admixture
- Note: will use PCs as covariates in GWAS
Step 3: Load and process phenotype data
- Drought tolerance score (1-10 scale, measured under stress)
- Grain yield (kg/hectare)
- Days to flowering
- Plant height
- Quality control:
* Remove outliers (> 3 SD from mean)
* Transform if needed (log or rank-based for skewed traits)
* Adjust for environmental covariates (field, year)
Step 4: Calculate kinship matrix
- Compute genetic relatedness matrix
- Account for population structure and relatedness
- Will use in mixed linear model to control for confounding
Step 5: Run genome-wide association study
- For each phenotype, test association with each SNP
- Use mixed linear model (MLM) in statsmodels:
* Fixed effects: SNP genotype, PCs (top 10)
* Random effects: kinship matrix
* Bonferroni threshold: p < 5e-8 (genome-wide significance)
- Multiple testing correction: Bonferroni or FDR
- Calculate genomic inflation factor (λ) to check for inflation
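A simplified sketch of the Step 5 scan: per-SNP fixed-effects regression with PC covariates in statsmodels (not the full kinship mixed model, which needs a dedicated MLM implementation); all arrays are random placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2

# Placeholders: genotypes coded 0/1/2 (samples x SNPs), phenotype, and top PCs
rng = np.random.default_rng(1)
genotypes = pd.DataFrame(rng.integers(0, 3, size=(300, 1000)))
phenotype = pd.Series(rng.normal(size=300))
pcs = pd.DataFrame(rng.normal(size=(300, 10)), columns=[f"PC{i+1}" for i in range(10)])

results = []
for snp in genotypes.columns:
    X = sm.add_constant(pd.concat([genotypes[snp].rename("snp"), pcs], axis=1))
    fit = sm.OLS(phenotype, X).fit()
    results.append({"snp": snp, "beta": fit.params["snp"], "p": fit.pvalues["snp"]})

gwas = pd.DataFrame(results)

# Genomic inflation factor lambda from the median chi-square statistic
lam = np.median(chi2.isf(gwas["p"], df=1)) / chi2.isf(0.5, df=1)
print(f"lambda_GC = {lam:.3f}; genome-wide hits: {(gwas['p'] < 5e-8).sum()}")
```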
Step 6: Identify significant associations
- Extract SNPs passing significance threshold
- Determine lead SNPs (most significant in each locus)
- Define loci: extend ±500 kb around lead SNP
- Identify independent associations via conditional analysis
Step 7: Annotate significant loci
- Map SNPs to genes using Ensembl Plants API
- Identify genic vs intergenic SNPs
- For genic SNPs:
* Determine consequence (missense, synonymous, intronic, UTR)
* Extract gene names and descriptions
- Query NCBI Gene for gene function
- Prioritize genes with known roles in stress response or development
Step 8: Search GWAS Catalog for prior reports
- Query GWAS Catalog for similar traits in same or related species
- Check for replication of known loci
- Identify novel vs known associations
Step 9: Functional enrichment analysis
- Extract all genes within significant loci
- Perform GO enrichment analysis
- Test for enrichment in KEGG pathways
- Focus on pathways related to:
* Drought stress response (ABA signaling, osmotic adjustment)
* Photosynthesis and carbon fixation
* Root development
Step 10: Estimate SNP heritability and genetic architecture
- Calculate variance explained by significant SNPs
- Estimate SNP-based heritability (proportion of variance explained)
- Assess genetic architecture: few large-effect vs many small-effect loci
Step 11: Build genomic prediction model
- Train genomic selection model with scikit-learn:
* Ridge regression (GBLUP equivalent)
* Elastic net
* Random Forest
- Use all SNPs (not just significant ones)
- Cross-validate to predict breeding values
- Assess prediction accuracy
Step 12: Generate GWAS report
- Manhattan plots for each trait
- QQ plots to assess test calibration
- Regional association plots for significant loci
- Gene models overlaid on loci
- Table of significant SNPs with annotations
- Functional enrichment results
- Genomic prediction accuracy
- Biological interpretation:
* Candidate genes for drought tolerance
* Potential molecular mechanisms
* Implications for breeding programs
- Recommendations:
* SNPs to use for marker-assisted selection
* Genes for functional validation
* Crosses to generate mapping populations
- Export publication-quality PDF with all results
Expected Output:
- Significant SNP-trait associations
- Annotated candidate genes
- Functional enrichment analysis
- Genomic prediction models
- Comprehensive GWAS report
- Recommendations for breeding programs
Neuroscience & Brain Imaging
Example 14: Brain Connectivity Analysis from fMRI Data
Objective: Analyze resting-state fMRI data to identify altered brain connectivity patterns in disease.
Skills Used:
- neurokit2 - Neurophysiological signal processing
- nilearn (external) - Neuroimaging analysis
- scikit-learn - Classification and clustering
- networkx - Graph theory analysis
- statsmodels - Statistical testing
- torch_geometric - Graph neural networks
- pymc - Bayesian modeling
- matplotlib - Brain visualization
- seaborn - Connectivity matrices
Workflow:
Step 1: Load and preprocess fMRI data
# Note: Use nilearn or similar for fMRI-specific preprocessing
- Load 4D fMRI images (BOLD signal)
- Preprocessing:
* Motion correction (realignment)
* Slice timing correction
* Spatial normalization to MNI space
* Smoothing (6mm FWHM Gaussian kernel)
* Temporal filtering (0.01-0.1 Hz bandpass)
* Nuisance regression (motion, CSF, white matter)
Step 2: Define brain regions (parcellation)
- Apply brain atlas (e.g., AAL, Schaefer 200-region atlas)
- Extract average time series for each region
- Result: 200 time series per subject (one per brain region)
Step 3: Signal cleaning with NeuroKit2
- Denoise time series
- Remove physiological artifacts
- Apply additional bandpass filtering if needed
- Identify and handle outlier time points
Step 4: Calculate functional connectivity
- Compute pairwise Pearson correlations between all regions
- Result: 200×200 connectivity matrix per subject
- Fisher z-transform correlations for group statistics
- Threshold weak connections (|r| < 0.2)
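A numpy sketch of Step 4 for one subject, assuming the parcellated time series are stored as a timepoints x regions array (placeholder data).

```python
import numpy as np

# Placeholder: 200 regional time series with 240 timepoints for one subject
rng = np.random.default_rng(0)
timeseries = rng.normal(size=(240, 200))

# Pairwise Pearson correlations -> 200 x 200 connectivity matrix
conn = np.corrcoef(timeseries, rowvar=False)
np.fill_diagonal(conn, 0.0)

# Fisher z-transform for group-level statistics, then threshold weak edges
conn_z = np.arctanh(np.clip(conn, -0.999999, 0.999999))
conn_z[np.abs(conn) < 0.2] = 0.0
```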
Step 5: Graph theory analysis with NetworkX
- Convert connectivity matrices to graphs
- Calculate global network metrics:
* Clustering coefficient (local connectivity)
* Path length (integration)
* Small-worldness (balance of segregation and integration)
* Modularity (community structure)
- Calculate node-level metrics:
* Degree centrality
* Betweenness centrality
* Eigenvector centrality
* Participation coefficient (inter-module connectivity)
Step 6: Statistical comparison between groups
- Compare patients vs healthy controls
- Use statsmodels for group comparisons:
* Paired or unpaired t-tests for connectivity edges
* FDR correction for multiple comparisons across all edges
* Identify edges with significantly different connectivity
- Compare global and node-level network metrics
- Calculate effect sizes (Cohen's d)
Step 7: Identify altered subnetworks
- Threshold statistical maps (FDR < 0.05)
- Identify clusters of altered connectivity
- Map to functional brain networks:
* Default mode network (DMN)
* Salience network (SN)
* Central executive network (CEN)
* Sensorimotor network
- Visualize altered connections on brain surfaces
Step 8: Machine learning classification
- Train classifier to distinguish patients from controls
- Use scikit-learn Random Forest or SVM
- Features: connectivity values or network metrics
- Cross-validation (10-fold)
- Calculate accuracy, sensitivity, specificity, AUC
- Identify most discriminative features (connectivity edges)
Step 9: Graph neural network analysis with Torch Geometric
- Build graph neural network (GCN or GAT)
- Input: connectivity matrices as adjacency matrices
- Train to predict diagnosis
- Extract learned representations
- Visualize latent space (UMAP)
- Interpret which brain regions are most important
Step 10: Bayesian network modeling with PyMC
- Build directed graphical model of brain networks
- Estimate effective connectivity (directional influence)
- Incorporate prior knowledge about anatomical connections
- Perform posterior inference
- Identify key driver regions in disease
Step 11: Clinical correlation analysis
- Correlate network metrics with clinical scores:
* Symptom severity
* Cognitive performance
* Treatment response
- Use Spearman or Pearson correlation
- Identify brain-behavior relationships
Step 12: Generate comprehensive neuroimaging report
- Brain connectivity matrices (patients vs controls)
- Statistical comparison maps on brain surface
- Network metric comparison bar plots
- Graph visualizations (circular or force-directed layout)
- Machine learning ROC curves
- Brain-behavior correlation plots
- Clinical interpretation:
* Which networks are disrupted?
* Relationship to symptoms
* Potential biomarker utility
- Recommendations:
* Brain regions for therapeutic targeting (TMS, DBS)
* Network metrics as treatment response predictors
- Export publication-ready PDF with brain visualizations
Expected Output:
- Functional connectivity matrices for all subjects
- Statistical maps of altered connectivity
- Graph theory metrics
- Machine learning classification model
- Brain-behavior correlations
- Comprehensive neuroimaging report
Environmental Microbiology
Example 15: Metagenomic Analysis of Environmental Samples
Objective: Characterize microbial community composition and functional potential from environmental DNA samples.
Skills Used:
- biopython - Sequence processing
- pysam - BAM file handling
- ena-database - Sequence data
- uniprot-database - Protein annotation
- kegg-database - Pathway analysis
- etetoolkit - Phylogenetic trees
- scikit-bio - Microbial ecology
- networkx - Co-occurrence networks
- statsmodels - Diversity statistics
- matplotlib - Visualization

Workflow:
Step 1: Load and QC metagenomic reads
- Load FASTQ files with BioPython
- Quality control with FastQC-equivalent:
* Remove adapters and low-quality bases (Q < 20)
* Filter short reads (< 50 bp)
* Remove host contamination (if applicable)
- Subsample to even depth if comparing samples
Step 2: Taxonomic classification
- Use Kraken2-like approach or query ENA database
- Classify reads to taxonomic lineages
- Generate abundance table:
* Rows: taxa (species or OTUs)
* Columns: samples
* Values: read counts or relative abundance
- Summarize at different levels: phylum, class, order, family, genus, species
Step 3: Calculate diversity metrics with scikit-bio
- Alpha diversity (within-sample):
* Richness (number of species)
* Shannon entropy
* Simpson diversity
* Chao1 estimated richness
- Beta diversity (between-sample):
* Bray-Curtis dissimilarity
* Weighted/unweighted UniFrac distance
* Jaccard distance
- Rarefaction curves to assess sampling completeness
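A minimal sketch of the diversity calculations with scikit-bio, assuming a hypothetical abundance_table.csv with samples as rows and integer read counts per taxon as columns:
```python
import pandas as pd
from skbio.diversity import alpha_diversity, beta_diversity

counts = pd.read_csv("abundance_table.csv", index_col=0)   # hypothetical count table

# Alpha diversity (within-sample)
shannon = alpha_diversity("shannon", counts.values, ids=counts.index)
chao1   = alpha_diversity("chao1",   counts.values, ids=counts.index)

# Beta diversity (between-sample) as a Bray-Curtis distance matrix
bray = beta_diversity("braycurtis", counts.values, ids=counts.index)

print(shannon.head())
print(bray)            # skbio DistanceMatrix, reusable for PCoA and PERMANOVA
```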
Step 4: Statistical comparison of communities
- Compare diversity between groups (e.g., polluted vs pristine)
- Use statsmodels and scikit-bio for:
  * Mann-Whitney or Kruskal-Wallis tests (alpha diversity)
  * PERMANOVA for beta diversity (scikit-bio's permanova, analogous to adonis)
  * LEfSe-style differential abundance testing
- Identify taxa enriched or depleted in each condition
Step 5: Build phylogenetic tree with ETE Toolkit
- Extract 16S rRNA sequences (or marker genes)
- Align sequences (MUSCLE/MAFFT equivalent)
- Build phylogenetic tree (neighbor-joining or maximum likelihood)
- Visualize tree colored by sample or environment
- Root tree with outgroup
Step 6: Co-occurrence network analysis
- Calculate pairwise correlations between taxa
- Use Spearman correlation to identify co-occurrence patterns
- Filter significant correlations (p < 0.01, |r| > 0.6)
- Build co-occurrence network with NetworkX
- Identify modules (communities of co-occurring taxa)
- Calculate network topology metrics
- Visualize network (nodes = taxa, edges = correlations)
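A sketch of the co-occurrence network construction, assuming the same samples × taxa abundance DataFrame as above; the correlation and p-value cutoffs match the step description:
```python
import networkx as nx
from networkx.algorithms import community
from scipy.stats import spearmanr

def cooccurrence_network(abundance, p_cut=0.01, r_cut=0.6):
    """Build a taxon co-occurrence network from a samples x taxa abundance table."""
    rho, pval = spearmanr(abundance.values)        # taxa x taxa correlation and p-value matrices
    taxa = list(abundance.columns)
    g = nx.Graph()
    g.add_nodes_from(taxa)
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            if abs(rho[i, j]) > r_cut and pval[i, j] < p_cut:
                g.add_edge(taxa[i], taxa[j], weight=rho[i, j])
    modules = community.greedy_modularity_communities(g)   # co-occurring taxon modules
    return g, modules
```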
Step 7: Functional annotation
- Assemble contigs from reads (if performing assembly)
- Predict genes with Prodigal-like tools
- Annotate genes using UniProt and KEGG
- Map proteins to KEGG pathways
- Generate functional profile:
* Abundance of metabolic pathways
* Key enzymes (nitrification, denitrification, methanogenesis)
* Antibiotic resistance genes
* Virulence factors
Step 8: Functional diversity analysis
- Compare functional profiles between samples
- Calculate pathway richness and evenness
- Identify enriched pathways with statistical testing
- Link taxonomy to function:
* Which taxa contribute to which functions?
* Use shotgun data to assign functions to taxa
Step 9: Search ENA for related environmental samples
- Query ENA for metagenomic studies from similar environments
- Download and compare to own samples
- Place samples in context of global microbiome diversity
- Identify unique vs ubiquitous taxa
Step 10: Environmental parameter correlation
- Correlate community composition with metadata:
* Temperature, pH, salinity
* Nutrient concentrations (N, P)
* Pollutant levels (heavy metals, hydrocarbons)
- Use Mantel test to correlate distance matrices
- Identify environmental drivers of community structure
Step 11: Biomarker discovery
- Identify taxa or pathways that correlate with environmental condition
- Use Random Forest to find predictive features
- Validate biomarkers:
* Sensitivity and specificity
* Cross-validation across samples
- Propose taxa as bioindicators of environmental health
Step 12: Generate environmental microbiome report
- Taxonomic composition bar charts (stacked by phylum/class)
- Alpha and beta diversity plots (boxplots, PCoA)
- Phylogenetic tree with environmental context
- Co-occurrence network visualization
- Functional pathway heatmaps
- Environmental correlation plots
- Statistical comparison tables
- Biological interpretation:
* Dominant taxa and their ecological roles
* Functional potential of the community
* Environmental factors shaping the microbiome
* Biomarker taxa for monitoring
- Recommendations:
* Biomarkers for environmental monitoring
* Functional guilds for restoration
* Further sampling or sequencing strategies
- Export comprehensive PDF report
Expected Output:
- Taxonomic profiles for all samples
- Diversity metrics and statistical comparisons
- Phylogenetic tree
- Co-occurrence network
- Functional annotation and pathway analysis
- Comprehensive microbiome report
Infectious Disease Research
Example 16: Antimicrobial Resistance Surveillance and Prediction
Objective: Track antimicrobial resistance trends and predict resistance phenotypes from genomic data.
Skills Used:
- biopython - Sequence analysis
- pysam - Genome assembly analysis
- ena-database - Public genomic data
- uniprot-database - Resistance protein annotation
- gene-database - Resistance gene catalogs
- etetoolkit - Phylogenetic analysis
- scikit-learn - Resistance prediction
- networkx - Transmission networks
- statsmodels - Trend analysis
- matplotlib - Epidemiological plots
Workflow:
Step 1: Collect bacterial genome sequences
- Isolates from hospital surveillance program
- Load FASTA assemblies with BioPython
- Basic QC:
* Assess assembly quality (N50, completeness)
* Estimate genome size and coverage
* Remove contaminated assemblies
Step 2: Species identification and MLST typing
- Perform in silico MLST (multi-locus sequence typing)
- Extract housekeeping gene sequences
- Assign sequence types (ST)
- Classify isolates into clonal complexes
- Identify high-risk clones (e.g., ST131 E. coli, ST258 K. pneumoniae)
Step 3: Antimicrobial resistance (AMR) gene detection
- Query NCBI Gene and UniProt for AMR gene databases
- Screen assemblies for resistance genes:
* Beta-lactamases (blaTEM, blaCTX-M, blaKPC, blaNDM)
* Aminoglycoside resistance (aac, aph, ant)
* Fluoroquinolone resistance (gyrA, parC mutations)
* Colistin resistance (mcr-1 to mcr-10)
* Efflux pumps
- Calculate gene presence/absence matrix
Step 4: Resistance mechanism annotation
- Map detected genes to resistance classes:
* Enzymatic modification (e.g., beta-lactamases)
* Target modification (e.g., ribosomal methylation)
* Target mutation (e.g., fluoroquinolone resistance)
* Efflux pumps
- Query UniProt for detailed mechanism descriptions
- Link genes to antibiotic classes affected
Step 5: Build phylogenetic tree with ETE Toolkit
- Extract core genome SNPs
- Concatenate SNP alignments
- Build maximum likelihood tree
- Root with outgroup or midpoint rooting
- Annotate tree with:
* Resistance profiles
* Sequence types
* Collection date and location
Step 6: Genotype-phenotype correlation
- Match genomic data with phenotypic susceptibility testing
- For each antibiotic, correlate:
* Presence of resistance genes with MIC values
* Target mutations with resistance phenotype
- Calculate sensitivity/specificity of genetic markers
- Identify discordant cases (false positives/negatives)
Step 7: Machine learning resistance prediction
- Train classification models with scikit-learn:
* Features: presence/absence of resistance genes + mutations
* Target: resistance phenotype (susceptible/intermediate/resistant)
* Models: Logistic Regression, Random Forest, Gradient Boosting
- Train separate models for each antibiotic
- Cross-validate (stratified 5-fold)
- Calculate accuracy, precision, recall, F1 score
- Feature importance: which genes are most predictive?
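A minimal sketch of one per-antibiotic model, assuming hypothetical files amr_features.csv (binary gene/mutation matrix, isolates × features) and ciprofloxacin_phenotype.csv (0 = susceptible, 1 = resistant):
```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = pd.read_csv("amr_features.csv", index_col=0)                        # hypothetical input
y = pd.read_csv("ciprofloxacin_phenotype.csv", index_col=0)["resistant"]

clf = RandomForestClassifier(n_estimators=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
f1 = cross_val_score(clf, X, y, cv=cv, scoring="f1")
print(f"Mean F1 across folds: {f1.mean():.2f}")

# Fit on all isolates to inspect which genes drive the prediction
clf.fit(X, y)
importance = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importance.head(10))
```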
Step 8: Temporal trend analysis
- Track resistance rates over time
- Use statsmodels for:
* Mann-Kendall trend test
* Joinpoint regression (identify change points)
* Forecast future resistance rates (ARIMA)
- Analyze trends for each antibiotic class
- Identify emerging resistance mechanisms
Step 9: Transmission network inference
- Identify closely related isolates (< 10 SNP differences)
- Build transmission network with NetworkX:
* Nodes: isolates
* Edges: putative transmission links
- Incorporate temporal and spatial data
- Identify outbreak clusters
- Detect super-spreaders (high degree nodes)
- Analyze network topology
Step 10: Search ENA for global context
- Query ENA for same species from other regions/countries
- Download representative genomes
- Integrate into phylogenetic analysis
- Assess whether local isolates are globally distributed clones
- Identify region-specific vs international resistance genes
Step 11: Plasmid and mobile element analysis
- Identify plasmid contigs
- Detect insertion sequences and transposons
- Track mobile genetic elements carrying resistance genes
- Identify conjugative plasmids facilitating horizontal gene transfer
- Build plasmid similarity networks
Step 12: Generate AMR surveillance report
- Summary statistics:
* Number of isolates by species, ST, location
* Resistance rates for each antibiotic
- Phylogenetic tree annotated with resistance profiles
- Temporal trend plots (resistance % over time)
- Transmission network visualizations
- Prediction model performance metrics
- Heatmap: resistance genes by isolate
- Geographic distribution map (if spatial data available)
- Interpretation:
* Predominant resistance mechanisms
* High-risk clones circulating
* Temporal trends and emerging threats
* Transmission clusters and outbreaks
- Recommendations:
* Infection control measures for clusters
* Antibiotic stewardship priorities
* Resistance genes to monitor
* Laboratories to perform confirmatory testing
- Export comprehensive PDF for public health reporting
Expected Output:
- AMR gene profiles for all isolates
- Phylogenetic tree with resistance annotations
- Temporal trends in resistance rates
- ML models for resistance prediction from genomes
- Transmission networks
- Comprehensive AMR surveillance report for public health
Multi-Omics Integration
Example 17: Integrative Analysis of Cancer Multi-Omics Data
Objective: Integrate genomics, transcriptomics, proteomics, and clinical data to identify cancer subtypes and therapeutic strategies.
Skills Used:
- pydeseq2 - RNA-seq DE analysis
- pysam - Variant calling
- ensembl-database - Gene annotation
- cosmic-database - Cancer mutations
- string-database - Protein interactions
- reactome-database - Pathway analysis
- opentargets-database - Drug targets
- scikit-learn - Clustering and classification
- torch_geometric - Graph neural networks
- umap-learn - Dimensionality reduction
- statsmodels - Survival analysis
- pymoo - Multi-objective optimization
Workflow:
Step 1: Load and preprocess genomic data (WES/WGS)
- Parse VCF files with pysam
- Filter high-quality variants (QUAL > 30, DP > 20)
- Annotate with Ensembl VEP (missense, nonsense, frameshift)
- Query COSMIC for known cancer mutations
- Create mutation matrix: samples × genes (binary: mutated or not)
- Focus on cancer genes from COSMIC Cancer Gene Census
Step 2: Process transcriptomic data (RNA-seq)
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare tumor vs normal (if paired samples available)
- Normalize counts (TPM or FPKM)
- Identify highly variable genes
- Create expression matrix: samples × genes (log2 TPM)
Step 3: Load proteomic data (Mass spec)
- Protein abundance matrix from LC-MS/MS
- Normalize protein abundances (median normalization)
- Log2-transform
- Filter out proteins detected in < 50% of samples
- Create protein matrix: samples × proteins
Step 4: Load clinical data
- Demographics: age, sex, race
- Tumor characteristics: stage, grade, histology
- Treatment: surgery, chemo, radiation, targeted therapy
- Outcome: overall survival (OS), progression-free survival (PFS)
- Response: complete/partial response, stable/progressive disease
Step 5: Data integration and harmonization
- Match sample IDs across omics layers
- Ensure consistent gene/protein identifiers
- Handle missing data:
* Impute with KNN or median (for moderate missingness)
* Remove features with > 50% missing
- Create multi-omics data structure (dictionary of matrices)
Step 6: Multi-omics dimensionality reduction
- Concatenate all omics features (genes + proteins + mutations)
- Apply UMAP with umap-learn for visualization
- Alternative: PCA or t-SNE
- Visualize samples in 2D space colored by:
* Histological subtype
* Stage
* Survival (high vs low)
- Identify patterns or clusters
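A sketch of the multi-omics embedding, assuming expression, protein, and mutation DataFrames share the same sample index and that a clinical DataFrame holds a numeric stage_code column (all names are illustrative):
```python
import matplotlib.pyplot as plt
import pandas as pd
import umap
from sklearn.preprocessing import StandardScaler

features = pd.concat([expression, protein, mutation], axis=1)      # samples x all omics features
X = StandardScaler().fit_transform(features.fillna(features.median()))

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(X)

plt.scatter(embedding[:, 0], embedding[:, 1],
            c=clinical.loc[features.index, "stage_code"], cmap="viridis", s=20)
plt.xlabel("UMAP 1"); plt.ylabel("UMAP 2")
plt.title("Multi-omics sample embedding")
plt.colorbar(label="Tumor stage")
plt.savefig("umap_multiomics.png", dpi=300)
```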
Step 7: Unsupervised clustering to identify subtypes
- Perform consensus clustering with scikit-learn
- Test k = 2 to 10 clusters
- Evaluate cluster stability and optimal k
- Assign samples to clusters (subtypes)
- Visualize clustering in UMAP space
Step 8: Characterize molecular subtypes
For each subtype:
- Differential expression analysis:
* Compare subtype vs all others with PyDESeq2
* Extract top differentially expressed genes and proteins
- Mutation enrichment:
* Fisher's exact test for each gene
* Identify subtype-specific mutations
- Pathway enrichment:
* Query Reactome for enriched pathways
* Query KEGG for metabolic pathway differences
* Identify hallmark biological processes
Step 9: Build protein-protein interaction networks
- Query STRING database for interactions among:
* Differentially expressed proteins
* Products of mutated genes
- Construct PPI network with NetworkX
- Identify network modules (community detection)
- Calculate centrality metrics to find hub proteins
- Overlay fold changes on network for visualization
Step 10: Survival analysis by subtype
- Use statsmodels or lifelines for survival analysis
- Kaplan-Meier curves for each subtype
- Log-rank test for significance
- Cox proportional hazards model:
* Covariates: subtype, stage, age, treatment
* Estimate hazard ratios
- Identify prognostic subtypes
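A minimal sketch using lifelines, assuming a DataFrame df with columns os_months, event (1 = death), subtype, stage, and age (column names are illustrative):
```python
import matplotlib.pyplot as plt
import pandas as pd
from lifelines import KaplanMeierFitter, CoxPHFitter
from lifelines.statistics import multivariate_logrank_test

# Kaplan-Meier curve per molecular subtype
kmf = KaplanMeierFitter()
ax = plt.subplot(111)
for name, grp in df.groupby("subtype"):
    kmf.fit(grp["os_months"], event_observed=grp["event"], label=f"Subtype {name}")
    kmf.plot_survival_function(ax=ax)
plt.xlabel("Months"); plt.ylabel("Overall survival")

# Log-rank test across subtypes
print(multivariate_logrank_test(df["os_months"], df["subtype"], df["event"]).p_value)

# Cox proportional hazards model with subtype, stage, and age as covariates
covars = df[["os_months", "event", "stage", "age"]].join(
    pd.get_dummies(df["subtype"], prefix="subtype", drop_first=True, dtype=int))
cph = CoxPHFitter()
cph.fit(covars, duration_col="os_months", event_col="event")
cph.print_summary()
```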
Step 11: Predict therapeutic response
- Train machine learning models with scikit-learn:
* Features: multi-omics data
* Target: response to specific therapy (responder/non-responder)
* Models: Random Forest, XGBoost, SVM
- Cross-validation to assess performance
- Identify features predictive of response
- Calculate AUC and feature importance
Step 12: Graph neural network for integrated prediction
- Build heterogeneous graph with Torch Geometric:
* Nodes: samples, genes, proteins, pathways
* Edges: gene-protein, protein-protein, gene-pathway
* Node features: expression, mutation status
- Train GNN to predict:
* Subtype classification
* Survival risk
* Treatment response
- Extract learned embeddings for interpretation
Step 13: Identify therapeutic targets with Open Targets
- For each subtype, query Open Targets:
* Input: upregulated genes/proteins
* Extract target-disease associations
* Prioritize by tractability score
- Search for FDA-approved drugs targeting identified proteins
- Identify clinical trials for relevant targets
- Propose subtype-specific therapeutic strategies
Step 14: Multi-objective optimization of treatment strategies
- Use PyMOO to optimize treatment selection:
* Objectives:
1. Maximize predicted response probability
2. Minimize predicted toxicity
3. Minimize cost
* Constraints: patient eligibility, drug availability
- Generate Pareto-optimal treatment strategies
- Personalized treatment recommendations per patient
Step 15: Generate comprehensive multi-omics report
- Sample clustering and subtype assignments
- UMAP visualization colored by subtype, survival, mutations
- Subtype characterization:
* Molecular signatures (genes, proteins, mutations)
* Enriched pathways
* PPI networks
- Kaplan-Meier survival curves by subtype
- ML model performance (AUC, confusion matrices)
- Feature importance plots
- Therapeutic target tables with supporting evidence
- Personalized treatment recommendations
- Clinical implications:
* Prognostic biomarkers
* Predictive biomarkers for therapy selection
* Novel drug targets
- Export publication-quality PDF with all figures and tables
Expected Output:
- Integrated multi-omics dataset
- Cancer subtype classification
- Molecular characterization of subtypes
- Survival analysis and prognostic markers
- Predictive models for treatment response
- Therapeutic target identification
- Personalized treatment strategies
- Comprehensive integrative genomics report
Experimental Physics & Data Analysis
Example 18: Analysis of Particle Physics Detector Data
Objective: Analyze experimental data from particle detector to identify signal events and measure physical constants.
Skills Used:
- astropy - Units and constants
- sympy - Symbolic mathematics
- scipy - Statistical analysis
- scikit-learn - Classification
- stable-baselines3 - Reinforcement learning for optimization
- matplotlib - Visualization
- seaborn - Statistical plots
- statsmodels - Hypothesis testing
- dask - Large-scale data processing
- vaex - Out-of-core dataframes
Workflow:
Step 1: Load and inspect detector data
- Load ROOT files or HDF5 with raw detector signals
- Use Vaex for out-of-core processing (TBs of data)
- Inspect data structure: event IDs, timestamps, detector channels
- Extract key observables:
* Energy deposits in calorimeters
* Particle trajectories from tracking detectors
* Time-of-flight measurements
* Trigger information
Step 2: Apply detector calibration and corrections
- Load calibration constants
- Apply energy calibrations to convert ADC to physical units
- Correct for detector efficiency variations
- Apply geometric corrections (alignment)
- Use Astropy units for unit conversions (eV, GeV, MeV)
- Account for dead time and detector acceptance
Step 3: Event reconstruction
- Cluster energy deposits to form particle candidates
- Reconstruct particle trajectories (tracks)
- Match tracks to calorimeter clusters
- Calculate invariant masses for particle identification
- Compute momentum and energy for each particle
- Use Dask for parallel processing across events
Step 4: Event selection and filtering
- Define signal region based on physics hypothesis
- Apply quality cuts:
* Track quality (chi-squared, number of hits)
* Fiducial volume cuts
* Timing cuts (beam window)
* Particle identification cuts
- Estimate trigger efficiency
- Calculate event weights for corrections
Step 5: Background estimation
- Identify background sources:
* Cosmic rays
* Beam-related backgrounds
* Detector noise
* Physics backgrounds (non-signal processes)
- Simulate backgrounds using Monte Carlo (if available)
- Estimate background from data in control regions
- Use sideband subtraction method
Step 6: Signal extraction
- Fit invariant mass distributions to extract signal
- Use scipy for likelihood fitting:
* Signal model: Gaussian or Breit-Wigner
* Background model: polynomial or exponential
* Combined fit with maximum likelihood
- Calculate signal significance (S/√B or Z-score)
- Estimate systematic uncertainties
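A minimal sketch of the binned signal-plus-background fit, assuming inv_mass is an array of reconstructed invariant masses; the mass window and initial-guess values are hypothetical:
```python
import numpy as np
from scipy.optimize import curve_fit

def model(m, amp, mu, sigma, a, tau):
    """Gaussian signal peak on an exponential background (per-bin counts)."""
    return amp * np.exp(-0.5 * ((m - mu) / sigma) ** 2) + a * np.exp(-m / tau)

counts, edges = np.histogram(inv_mass, bins=100, range=(0.4, 0.6))   # GeV, hypothetical window
centers = 0.5 * (edges[:-1] + edges[1:])
bin_width = edges[1] - edges[0]

p0 = [counts.max(), 0.498, 0.005, counts.max(), 0.2]                 # illustrative initial guesses
popt, pcov = curve_fit(model, centers, counts, p0=p0,
                       sigma=np.sqrt(np.clip(counts, 1, None)))      # Poisson-like bin errors

amp, mu, sigma = popt[:3]
n_sig = amp * sigma * np.sqrt(2 * np.pi) / bin_width                 # integrate the Gaussian
n_bkg = counts.sum() - n_sig
print(f"Signal yield: {n_sig:.0f}, significance ~ {n_sig / np.sqrt(max(n_bkg, 1)):.1f} sigma")
```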
Step 7: Machine learning event classification
- Train classifier with scikit-learn to separate signal from background
- Features: kinematic variables, topology, detector response
- Models: Boosted Decision Trees (XGBoost), Neural Networks
- Cross-validate with k-fold CV
- Optimize selection criteria using ROC curves
- Calculate signal efficiency and background rejection
Step 8: Reinforcement learning for trigger optimization
- Use Stable-Baselines3 to optimize trigger thresholds
- Environment: detector simulator
- Action: adjust trigger thresholds
- Reward: maximize signal efficiency while controlling rate
- Train PPO or SAC agent
- Validate on real data
Step 9: Calculate physical observables
- Measure cross-sections:
* σ = N_signal / (ε × L × BR)
* N_signal: number of signal events
* ε: detection efficiency
* L: integrated luminosity
* BR: branching ratio
- Use Sympy for symbolic error propagation
- Calculate with Astropy for proper unit handling
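A sketch of the symbolic error propagation for the cross-section formula above; the numerical values plugged in at the end are hypothetical:
```python
import sympy as sp

# sigma = N / (eps * L * BR)
N, eps, L, BR = sp.symbols("N epsilon L BR", positive=True)
sigma = N / (eps * L * BR)

# First-order propagation of uncorrelated uncertainties:
# delta_sigma^2 = sum_i (d sigma / d x_i)^2 * delta_x_i^2
dN, deps, dL, dBR = sp.symbols("dN deps dL dBR", positive=True)
err = sp.sqrt((sp.diff(sigma, N) * dN) ** 2 + (sp.diff(sigma, eps) * deps) ** 2
              + (sp.diff(sigma, L) * dL) ** 2 + (sp.diff(sigma, BR) * dBR) ** 2)
print(sp.simplify(err))

# Hypothetical inputs: N = 1200 +/- 40, eps = 0.35 +/- 0.02, L = 10 +/- 0.3, BR = 0.69 (taken as exact)
vals = {N: 1200, eps: 0.35, L: 10, BR: 0.69, dN: 40, deps: 0.02, dL: 0.3, dBR: 0}
print(sigma.subs(vals).evalf(), "+/-", err.subs(vals).evalf())
```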
Step 10: Statistical analysis and hypothesis testing
- Perform hypothesis tests with statsmodels:
* Likelihood ratio test for signal vs background-only
* Calculate p-values and significance levels
* Set confidence limits (CLs method)
- Bayesian analysis for parameter estimation
- Calculate confidence intervals and error bands
Step 11: Systematic uncertainty evaluation
- Identify sources of systematic uncertainty:
* Detector calibration uncertainties
* Background estimation uncertainties
* Theoretical uncertainties (cross-sections, PDFs)
* Monte Carlo modeling uncertainties
- Propagate uncertainties through analysis chain
- Combine statistical and systematic uncertainties
- Present as error budget
Step 12: Create comprehensive physics report
- Event displays showing candidate signal events
- Kinematic distributions (momentum, energy, angles)
- Invariant mass plots with fitted signal
- ROC curves for ML classifiers
- Cross-section measurements with error bars
- Comparison with theoretical predictions
- Systematic uncertainty breakdown
- Statistical significance calculations
- Interpretation:
* Consistency with Standard Model
* Constraints on new physics parameters
* Discovery potential or exclusion limits
- Recommendations:
* Detector improvements
* Additional data needed
* Future analysis strategies
- Export publication-ready PDF formatted for physics journal
Expected Output:
- Reconstructed physics events
- Signal vs background classification
- Measured cross-sections and branching ratios
- Statistical significance of observations
- Systematic uncertainty analysis
- Comprehensive experimental physics paper
Chemical Engineering & Process Optimization
Example 19: Optimization of Chemical Reactor Design and Operation
Objective: Design and optimize a continuous chemical reactor for maximum yield and efficiency while meeting safety and economic constraints.
Skills Used:
- sympy - Symbolic equations and reaction kinetics
- scipy - Numerical integration and optimization
- pymoo - Multi-objective optimization
- simpy - Process simulation
- pymc - Bayesian parameter estimation
- scikit-learn - Process modeling
- stable-baselines3 - Real-time control optimization
- matplotlib - Process diagrams
- reportlab - Engineering reports
Workflow:
Step 1: Define reaction system and kinetics
- Chemical reaction: A + B → C + D
- Use Sympy to define symbolic rate equations:
* Arrhenius equation: k = A × exp(-Ea/RT)
* Rate law: r = k × [A]^α × [B]^β
- Define material and energy balances symbolically
- Include equilibrium constants and thermodynamics
- Account for side reactions and byproducts
Step 2: Develop reactor model
- Select reactor type: CSTR, PFR, batch, or semi-batch
- Write conservation equations:
* Mass balance: dC/dt = (F_in × C_in - F_out × C)/V + r
* Energy balance: ρCp × dT/dt = Q - ΔH_rxn × r × V
* Momentum balance (pressure drop)
- Include heat transfer correlations
- Model mixing and mass transfer limitations
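A minimal sketch of the transient mass balance for an isothermal CSTR, integrated with scipy; all parameter values are hypothetical placeholders:
```python
import numpy as np
from scipy.integrate import solve_ivp

# Hypothetical parameters for A + B -> C in a CSTR
V = 1.0                          # reactor volume, m^3
q = 0.01                         # volumetric flow rate, m^3/s
CA_in, CB_in = 2000.0, 2200.0    # feed concentrations, mol/m^3
A_pre, Ea, R = 1.0e5, 7.0e4, 8.314
T = 350.0                        # operating temperature, K (isothermal sketch)

def cstr(t, y):
    CA, CB, CC = y
    k = A_pre * np.exp(-Ea / (R * T))    # Arrhenius rate constant
    r = k * CA * CB                      # second-order rate law (alpha = beta = 1)
    dCA = q / V * (CA_in - CA) - r
    dCB = q / V * (CB_in - CB) - r
    dCC = -q / V * CC + r
    return [dCA, dCB, dCC]

sol = solve_ivp(cstr, t_span=(0, 1000), y0=[0.0, 0.0, 0.0], dense_output=True)
CA_ss = sol.y[0, -1]
print(f"Steady-state conversion of A: {1 - CA_ss / CA_in:.2%}")
```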
Step 3: Parameter estimation with PyMC
- Load experimental data from pilot reactor
- Bayesian inference to estimate kinetic parameters:
* Pre-exponential factor (A)
* Activation energy (Ea)
* Reaction orders (α, β)
- Use MCMC sampling with PyMC
- Incorporate prior knowledge from literature
- Calculate posterior distributions and credible intervals
- Assess parameter uncertainty and correlation
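A sketch of the Bayesian Arrhenius fit with PyMC, assuming hypothetical pilot-plant rate constants measured at several temperatures; priors and data values are illustrative:
```python
import arviz as az
import numpy as np
import pymc as pm

T_obs = np.array([320., 340., 360., 380., 400.])            # K, hypothetical measurements
k_obs = np.array([1.2e-4, 5.8e-4, 2.3e-3, 8.1e-3, 2.5e-2])  # 1/s
R = 8.314

with pm.Model() as kinetics:
    lnA = pm.Normal("lnA", mu=15.0, sigma=5.0)       # prior on ln(pre-exponential factor)
    Ea = pm.Normal("Ea", mu=8.0e4, sigma=2.0e4)      # prior on activation energy, J/mol
    noise = pm.HalfNormal("noise", sigma=0.5)        # log-scale measurement error
    mu_lnk = lnA - Ea / (R * T_obs)                  # Arrhenius equation in log form
    pm.Normal("obs", mu=mu_lnk, sigma=noise, observed=np.log(k_obs))
    trace = pm.sample(2000, tune=1000, chains=4, random_seed=0)

print(az.summary(trace, var_names=["lnA", "Ea"]))    # posterior means and credible intervals
```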
Step 4: Model validation
- Simulate reactor with estimated parameters using scipy.integrate
- Compare predictions with experimental data
- Calculate goodness of fit (R², RMSE)
- Perform sensitivity analysis:
* Which parameters most affect yield?
* Identify critical operating conditions
- Refine model if needed
Step 5: Machine learning surrogate model
- Train fast surrogate model with scikit-learn
- Generate training data from detailed model (1000+ runs)
- Features: T, P, residence time, feed composition, catalyst loading
- Target: yield, selectivity, conversion
- Models: Gaussian Process Regression, Random Forest
- Validate surrogate accuracy (R² > 0.95)
- Use for rapid optimization
Step 6: Single-objective optimization
- Maximize yield with scipy.optimize:
* Decision variables: T, P, feed ratio, residence time
* Objective: maximize Y = (moles C produced) / (moles A fed)
* Constraints:
- Temperature: 300 K ≤ T ≤ 500 K (safety)
- Pressure: 1 bar ≤ P ≤ 50 bar (equipment limits)
- Residence time: 1 min ≤ τ ≤ 60 min
- Conversion: X_A ≥ 90%
- Use Sequential Least Squares Programming (SLSQP)
- Identify optimal operating point
Step 7: Multi-objective optimization with PyMOO
- Competing objectives:
1. Maximize product yield
2. Minimize energy consumption (heating/cooling)
3. Minimize operating cost (raw materials, utilities)
4. Maximize reactor productivity (throughput)
- Constraints:
- Safety: temperature and pressure limits
- Environmental: waste production limits
- Economic: minimum profitability
- Run NSGA-II or NSGA-III
- Generate Pareto front of optimal solutions
- Select operating point based on preferences
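A sketch of the multi-objective setup, assuming the pymoo ≥ 0.6 API and hypothetical surrogate functions (surrogate_yield, heating_duty, surrogate_conversion) trained in Step 5:
```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

class ReactorProblem(ElementwiseProblem):
    def __init__(self):
        # decision variables: temperature (K), pressure (bar), residence time (min)
        super().__init__(n_var=3, n_obj=2, n_ieq_constr=1,
                         xl=np.array([300.0, 1.0, 1.0]),
                         xu=np.array([500.0, 50.0, 60.0]))

    def _evaluate(self, x, out, *args, **kwargs):
        T, P, tau = x
        y_hat = surrogate_yield(T, P, tau)        # hypothetical surrogate from Step 5
        energy = heating_duty(T, tau)             # hypothetical energy-cost model
        out["F"] = [-y_hat, energy]               # pymoo minimizes, so negate yield
        out["G"] = [0.90 - surrogate_conversion(T, P, tau)]   # enforce X_A >= 90%

res = minimize(ReactorProblem(), NSGA2(pop_size=100), ("n_gen", 200), seed=1, verbose=False)
pareto_designs, pareto_objectives = res.X, res.F   # Pareto-optimal operating points
```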
Step 8: Dynamic process simulation with SimPy
- Model complete plant:
* Reactors, separators, heat exchangers
* Pumps, compressors, valves
* Storage tanks and buffers
- Simulate startup, steady-state, and shutdown
- Include disturbances:
* Feed composition variations
* Equipment failures
* Demand fluctuations
- Evaluate dynamic stability
- Calculate time to steady state
Step 9: Control system design
- Design feedback control loops:
* Temperature control (PID controller)
* Pressure control
* Flow control
* Level control
- Tune PID parameters using Ziegler-Nichols or optimization
- Implement cascade control for improved performance
- Add feedforward control for disturbance rejection
Step 10: Reinforcement learning for advanced control
- Use Stable-Baselines3 to train RL agent:
* Environment: reactor simulation (SimPy-based)
* State: T, P, concentrations, flow rates
* Actions: adjust setpoints, flow rates, heating/cooling
* Reward: yield minus energy cost minus a penalty for deviation from setpoint
- Train PPO or TD3 agent
- Compare with conventional PID control
- Evaluate performance under disturbances
- Implement model-free adaptive control
Step 11: Economic analysis
- Calculate capital costs (CAPEX):
* Reactor vessel cost (function of size, pressure rating)
* Heat exchanger costs
* Pumps and instrumentation
* Installation costs
- Calculate operating costs (OPEX):
* Raw materials (A, B, catalyst)
* Utilities (steam, cooling water, electricity)
* Labor and maintenance
- Revenue from product sales
- Calculate economic metrics:
* Net present value (NPV)
* Internal rate of return (IRR)
* Payback period
* Levelized cost of production
Step 12: Safety analysis
- Identify hazards:
* Exothermic runaway reactions
* Pressure buildup
* Toxic or flammable materials
- Perform HAZOP-style analysis
- Calculate safe operating limits:
* Maximum temperature of synthesis reaction (MTSR)
* Adiabatic temperature rise
* Relief valve sizing
- Design emergency shutdown systems
- Implement safety interlocks
Step 13: Uncertainty quantification
- Propagate parameter uncertainties from PyMC:
* How does kinetic parameter uncertainty affect yield?
* Monte Carlo simulation with parameter distributions
- Evaluate robustness of optimal design
- Calculate confidence intervals on economic metrics
- Identify critical uncertainties for further study
Step 14: Generate comprehensive engineering report
- Executive summary of project objectives and results
- Process flow diagram (PFD) with material and energy streams
- Reaction kinetics and model equations
- Parameter estimation results with uncertainties
- Optimization results:
* Pareto front for multi-objective optimization
* Recommended operating conditions
* Trade-off analysis
- Dynamic simulation results (startup curves, response to disturbances)
- Control system design and tuning
- Economic analysis with sensitivity to key assumptions
- Safety analysis and hazard mitigation
- Scale-up considerations:
* Pilot to commercial scale
* Heat and mass transfer limitations
* Equipment sizing
- Recommendations:
* Optimal reactor design (size, type, materials of construction)
* Operating conditions for maximum profitability
* Control strategy
* Further experimental studies needed
- Technical drawings and P&ID (piping and instrumentation diagram)
- Export as professional engineering report (PDF)
Expected Output:
- Validated reactor model with parameter uncertainties
- Optimal reactor design and operating conditions
- Pareto-optimal solutions for multi-objective optimization
- Dynamic process simulation results
- Advanced control strategies (RL-based)
- Economic feasibility analysis
- Safety assessment
- Comprehensive chemical engineering design report
Scientific Illustration & Visual Communication
Example 20: Creating Publication-Ready Scientific Figures
Objective: Generate and refine scientific illustrations, diagrams, and graphical abstracts for publications and presentations.
Skills Used:
- generate-image - AI image generation and editing
- matplotlib - Data visualization
- scientific-visualization - Best practices
- scientific-writing - Figure caption creation
- reportlab - PDF report generation
Workflow:
Step 1: Plan visual communication strategy
- Identify key concepts that need visual representation:
* Experimental workflow diagrams
* Molecular structures and interactions
* Data visualization (handled by matplotlib)
* Conceptual illustrations for mechanisms
* Graphical abstract for paper summary
- Determine appropriate style for target journal/audience
- Sketch rough layouts for each figure
Step 2: Generate experimental workflow diagram
- Use generate-image skill with detailed prompt:
"Scientific illustration showing a step-by-step experimental
workflow for CRISPR gene editing: (1) guide RNA design at computer,
(2) cell culture in petri dish, (3) electroporation device,
(4) selection with antibiotics, (5) sequencing validation.
Clean, professional style with numbered steps, white background,
suitable for scientific publication."
- Save as workflow_diagram.png
- Review and iterate on prompt if needed
Step 3: Create molecular interaction schematic
- Generate detailed molecular visualization:
"Scientific diagram of protein-ligand binding mechanism:
show receptor protein (blue ribbon structure) with binding pocket,
small molecule ligand (ball-and-stick, orange) approaching,
key hydrogen bonds indicated with dashed lines, water molecules
in binding site. Professional biochemistry illustration style,
clean white background, publication quality."
- Generate multiple versions with different angles/styles
- Select best representation
Step 4: Edit existing figures for consistency
- Load existing figure that needs modification:
python scripts/generate_image.py "Change the background to white
and make the protein blue instead of green" --input figure1.png
- Standardize color schemes across all figures
- Edit to match journal style guidelines:
python scripts/generate_image.py "Remove the title text and
increase contrast for print publication" --input diagram.png
Step 5: Generate graphical abstract
- Create comprehensive visual summary:
"Graphical abstract for cancer immunotherapy paper: left side
shows tumor cells (irregular shapes, red) being attacked by
T cells (round, blue). Center shows the drug molecule structure.
Right side shows healthy tissue (green). Arrow flow from left
to right indicating treatment progression. Modern, clean style
with minimal text, high contrast, suitable for journal TOC."
- Ensure dimensions meet journal requirements
- Iterate to highlight key findings
Step 6: Create conceptual mechanism illustrations
- Generate mechanism diagrams:
"Scientific illustration of enzyme catalysis mechanism:
Show substrate entering active site (step 1), transition state
formation with electron movement arrows (step 2), product
release (step 3). Use standard biochemistry notation,
curved arrows for electron movement, clear labeling."
- Generate alternative representations for supplementary materials
Step 7: Produce presentation-ready figures
- Create high-impact visuals for talks:
"Eye-catching scientific illustration of DNA double helix
unwinding during replication, with DNA polymerase (large
green structure) adding nucleotides. Dynamic composition,
vibrant but professional colors, dark background for
presentation slides."
- Adjust style for poster vs slide format
- Create versions at different resolutions
Step 8: Generate figure panels for multi-part figures
- Create consistent series of related images:
"Panel A: Normal cell with intact membrane (green outline)
Panel B: Cell under oxidative stress with damaged membrane
Panel C: Cell treated with antioxidant, membrane recovering
Consistent style across all panels, same scale, white background,
scientific illustration style suitable for publication."
- Ensure visual consistency across panels
- Annotate with panel labels
Step 9: Edit for accessibility
- Modify figures for colorblind accessibility:
python scripts/generate_image.py "Change the red and green
elements to blue and orange for colorblind accessibility,
maintain all other aspects" --input figure_v1.png
- Add patterns or textures for additional differentiation
- Verify contrast meets accessibility standards
Step 10: Create supplementary visual materials
- Generate additional context figures:
"Anatomical diagram showing location of pancreatic islets
within the pancreas, cross-section view with labeled structures:
alpha cells, beta cells, blood vessels. Medical illustration
style, educational, suitable for supplementary materials."
- Create protocol flowcharts and decision trees
- Generate equipment setup diagrams
Step 11: Compile figure legends and captions
- Use scientific-writing skill to create descriptions:
* Figure number and title
* Detailed description of what is shown
* Explanation of symbols, colors, and abbreviations
* Scale bars and measurement units
* Statistical information if applicable
- Format according to journal guidelines
Step 12: Assemble final publication package
- Organize all figures in publication order
- Create high-resolution exports (300+ DPI for print)
- Generate both RGB (web) and CMYK (print) versions
- Compile into PDF using ReportLab:
* Title page with graphical abstract
* All figures with captions
* Supplementary figures section
- Create separate folder with individual figure files
- Document all generation prompts for reproducibility
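A minimal sketch of the PDF assembly with ReportLab; the file names, captions, and image dimensions are placeholders:
```python
from reportlab.lib.pagesizes import letter
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.lib.units import inch
from reportlab.platypus import Image, PageBreak, Paragraph, SimpleDocTemplate

styles = getSampleStyleSheet()

# Hypothetical figure list: (file path, caption) in publication order
figures = [
    ("graphical_abstract.png", "Graphical abstract."),
    ("workflow_diagram.png", "Figure 1. Experimental workflow."),
    ("mechanism.png", "Figure 2. Proposed mechanism."),
]

story = []
for path, caption in figures:
    story.append(Image(path, width=6 * inch, height=4 * inch))  # scale to page width
    story.append(Paragraph(caption, styles["Normal"]))
    story.append(PageBreak())

SimpleDocTemplate("figure_package.pdf", pagesize=letter).build(story)
```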
Expected Output:
- Complete set of publication-ready scientific illustrations
- Graphical abstract for table of contents
- Mechanism diagrams and workflow figures
- Edited versions meeting journal style guidelines
- Accessibility-compliant figure versions
- Figure package with captions and metadata
- Documentation of prompts used for reproducibility
Summary
These examples demonstrate:
- Cross-domain applicability: Skills are useful across many scientific fields
- Skill integration: Complex workflows combine multiple databases, packages, and analysis methods
- Real-world relevance: Examples address actual research questions and clinical needs
- End-to-end workflows: From data acquisition to publication-ready reports
- Best practices: QC, statistical rigor, visualization, interpretation, and documentation
How to Use These Examples
- Adapt to your needs: Modify parameters, datasets, and objectives for your specific research question
- Combine skills creatively: Mix and match skills from different categories
- Follow the structure: Each example provides a clear step-by-step workflow
- Generate comprehensive output: Aim for publication-quality figures and professional reports
- Cite your sources: Always verify data and provide proper citations
Additional Notes
- Always start with: "Always use available 'skills' when possible. Keep the output organized."
- For complex projects, break into manageable steps and validate intermediate results
- Save checkpoints and intermediate data files
- Document parameters and decisions for reproducibility
- Generate README files explaining methodology
- Create PDFs for stakeholder communication
These examples showcase the power of combining the skills in this repository to tackle complex, real-world scientific challenges across multiple domains.