Real-World Scientific Examples
This document provides comprehensive, practical examples demonstrating how to combine Claude Scientific Skills to solve real scientific problems across multiple domains.
📋 Table of Contents
- Drug Discovery & Medicinal Chemistry
- Cancer Genomics & Precision Medicine
- Single-Cell Transcriptomics
- Protein Structure & Function
- Chemical Safety & Toxicology
- Clinical Trial Analysis
- Metabolomics & Systems Biology
- Materials Science & Chemistry
- Digital Pathology
- Lab Automation & Protocol Design
- Agricultural Genomics
- Neuroscience & Brain Imaging
- Environmental Microbiology
- Infectious Disease Research
- Multi-Omics Integration
- Computational Chemistry & Synthesis
- Clinical Research & Real-World Evidence
- Experimental Physics & Data Analysis
- Chemical Engineering & Process Optimization
Drug Discovery & Medicinal Chemistry
Example 1: Discovery of Novel EGFR Inhibitors for Lung Cancer
Objective: Identify novel small molecule inhibitors of EGFR with improved properties compared to existing drugs.
Skills Used:
- `chembl-database`: Query bioactivity data
- `pubchem-database`: Search compound libraries
- `rdkit`: Analyze molecular properties
- `datamol`: Generate analogs
- `diffdock`: Molecular docking
- `alphafold-database`: Retrieve protein structure
- `pubmed-database`: Literature review
- `cosmic-database`: Query mutations
- `deepchem`: Property prediction
- `scientific-visualization`: Create figures
Workflow:
# Always use available 'skills' when possible. Keep the output organized.
Step 1: Query ChEMBL for known EGFR inhibitors with high potency
- Search for compounds targeting EGFR (CHEMBL203)
- Filter: IC50 < 50 nM, pChEMBL value > 7
- Extract SMILES strings and activity data
- Export to DataFrame for analysis
Step 2: Analyze structure-activity relationships
- Load compounds into RDKit
- Calculate molecular descriptors (MW, LogP, TPSA, HBD, HBA)
- Generate Morgan fingerprints (radius=2, 2048 bits)
- Perform hierarchical clustering to identify scaffolds
- Visualize top scaffolds with activity annotations
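A minimal sketch of the Step 2 descriptor and fingerprint calculations; `smiles_list` stands in for the SMILES exported from ChEMBL in Step 1, and the two molecules shown are placeholders:

```python
from rdkit import Chem
from rdkit.Chem import Descriptors, AllChem
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

smiles_list = ["CC(=O)Oc1ccccc1C(=O)O", "Cn1cnc2c1c(=O)n(C)c(=O)n2C"]  # placeholders
mols = [m for m in (Chem.MolFromSmiles(s) for s in smiles_list) if m is not None]

# Physicochemical descriptors: MW, LogP, TPSA, HBD, HBA
descriptors = [
    (Descriptors.MolWt(m), Descriptors.MolLogP(m), Descriptors.TPSA(m),
     Descriptors.NumHDonors(m), Descriptors.NumHAcceptors(m))
    for m in mols
]

# Morgan fingerprints (radius 2, 2048 bits) and a simple hierarchical clustering
fps = np.array([list(AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)) for m in mols])
clusters = fcluster(linkage(fps, method="average"), t=10, criterion="maxclust")
```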
Step 3: Identify resistance mutations from COSMIC
- Query COSMIC for EGFR mutations in lung cancer
- Focus on gatekeeper mutations (T790M, C797S)
- Extract mutation frequencies and clinical significance
- Cross-reference with literature in PubMed
Step 4: Retrieve EGFR structure from AlphaFold
- Download AlphaFold prediction for EGFR kinase domain
- Alternatively, use experimental structure from PDB (if available)
- Prepare structure for docking (add hydrogens, optimize)
Step 5: Generate novel analogs using datamol
- Select top 5 scaffolds from ChEMBL analysis
- Use scaffold decoration to generate 100 analogs per scaffold
- Apply Lipinski's Rule of Five filtering
- Ensure synthetic accessibility (SA score < 4)
- Check for PAINS and unwanted substructures
Step 6: Predict properties with DeepChem
- Train graph convolutional model on ChEMBL EGFR data
- Predict pIC50 for generated analogs
- Predict ADMET properties (solubility, permeability, hERG)
- Rank candidates by predicted potency and drug-likeness
Step 7: Virtual screening with DiffDock
- Perform molecular docking on top 50 candidates
- Dock into wild-type EGFR and T790M mutant
- Calculate binding energies and interaction patterns
- Identify compounds with favorable binding to both forms
Step 8: Search PubChem for commercial availability
- Query PubChem for top 10 candidates by InChI key
- Check supplier information and purchasing options
- Identify close analogs if exact matches unavailable
Step 9: Literature validation with PubMed
- Search for any prior art on top scaffolds
- Query: "[scaffold_name] AND EGFR AND inhibitor"
- Summarize relevant findings and potential liabilities
Step 10: Create comprehensive report
- Generate 2D structure visualizations of top hits
- Create scatter plots: MW vs LogP, TPSA vs potency
- Produce binding pose figures for top 3 compounds
- Generate table comparing properties to approved drugs (gefitinib, erlotinib)
- Write scientific summary with methodology, results, and recommendations
- Export to PDF with proper citations
Expected Output:
- Ranked list of 10-20 novel EGFR inhibitor candidates
- Predicted activity and ADMET properties
- Docking poses and binding analysis
- Comprehensive scientific report with publication-quality figures
Example 2: Drug Repurposing for Rare Diseases
Objective: Identify FDA-approved drugs that could be repurposed for treating a rare metabolic disorder.
Skills Used:
- `drugbank-database`: Query approved drugs
- `opentargets-database`: Target-disease associations
- `string-database`: Protein interactions
- `kegg-database`: Pathway analysis
- `reactome-database`: Pathway enrichment
- `clinicaltrials-database`: Check ongoing trials
- `fda-database`: Drug approvals and safety
- `networkx`: Network analysis
- `literature-review`: Systematic review
Workflow:
Step 1: Define disease pathway
- Query KEGG and Reactome for disease-associated pathways
- Identify key proteins and enzymes involved
- Map upstream and downstream pathway components
Step 2: Find protein-protein interactions
- Query STRING database for interaction partners
- Build protein interaction network around key disease proteins
- Identify hub proteins and bottlenecks using NetworkX
- Calculate centrality metrics (betweenness, closeness)
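A sketch of the Step 2 NetworkX analysis; the `edges` list is assumed to be parsed from a STRING interaction export, and the protein names and scores are illustrative:

```python
import networkx as nx

# (protein_a, protein_b, combined_score) tuples from a STRING export (illustrative)
edges = [("GBA1", "SCARB2", 0.95), ("GBA1", "PSAP", 0.92), ("PSAP", "CTSD", 0.81)]

G = nx.Graph()
G.add_weighted_edges_from(edges)

betweenness = nx.betweenness_centrality(G, weight="weight")   # bottleneck proteins
closeness = nx.closeness_centrality(G)
hubs = sorted(dict(G.degree(weight="weight")).items(), key=lambda kv: kv[1], reverse=True)
```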
Step 3: Query Open Targets for druggable targets
- Search for targets associated with disease phenotype
- Filter by clinical precedence and tractability
- Prioritize targets with existing approved drugs
Step 4: Search DrugBank for drugs targeting identified proteins
- Query for approved drugs and their targets
- Filter by mechanism of action relevant to disease
- Retrieve drug properties and safety information
Step 5: Query FDA databases for safety profiles
- Check FDA adverse event database (FAERS)
- Review drug labels and black box warnings
- Assess risk-benefit for rare disease population
Step 6: Search ClinicalTrials.gov for prior repurposing attempts
- Query for disease name + drug names
- Check for failed trials (and reasons for failure)
- Identify ongoing trials that may compete
Step 7: Perform pathway enrichment analysis
- Map drug targets to disease pathways
- Calculate enrichment scores with Reactome
- Identify drugs affecting multiple pathway nodes
Step 8: Conduct systematic literature review
- Search PubMed for drug name + disease associations
- Include bioRxiv for recent unpublished findings
- Document any case reports or off-label use
- Use literature-review skill to generate comprehensive review
Step 9: Prioritize candidates
- Rank by: pathway relevance, safety profile, existing evidence
- Consider factors: oral availability, blood-brain barrier penetration
- Assess commercial viability and patent status
Step 10: Generate repurposing report
- Create network visualization of drug-target-pathway relationships
- Generate comparison table of top 5 candidates
- Write detailed rationale for each candidate
- Include mechanism of action diagrams
- Provide recommendations for preclinical validation
- Format as professional PDF with citations
Expected Output:
- Ranked list of 5-10 repurposing candidates
- Network analysis of drug-target-disease relationships
- Safety and efficacy evidence summary
- Repurposing strategy report with next steps
Cancer Genomics & Precision Medicine
Example 3: Clinical Variant Interpretation Pipeline
Objective: Analyze a patient's tumor sequencing data to identify actionable mutations and therapeutic recommendations.
Skills Used:
- `pysam`: Parse VCF files
- `ensembl-database`: Variant annotation
- `clinvar-database`: Clinical significance
- `cosmic-database`: Somatic mutations
- `gene-database`: Gene information
- `uniprot-database`: Protein impact
- `drugbank-database`: Drug-gene associations
- `clinicaltrials-database`: Matching trials
- `opentargets-database`: Target validation
- `pubmed-database`: Literature evidence
- `reportlab`: Generate clinical report
Workflow:
Step 1: Parse and filter VCF file
- Use pysam to read tumor VCF
- Filter for high-quality variants (QUAL > 30, DP > 20)
- Extract variant positions, alleles, and VAF (variant allele frequency)
- Separate SNVs, indels, and structural variants
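A hedged sketch of the Step 1 filtering with pysam; the file name is a placeholder, and the DP/AD field names depend on the variant caller, so adjust them to the actual VCF header:

```python
import pysam

vcf = pysam.VariantFile("tumor.vcf.gz")  # hypothetical input file
passing = []
for rec in vcf:
    depth = rec.info.get("DP", 0)
    if rec.qual is None or rec.qual <= 30 or depth <= 20:
        continue
    sample = next(iter(rec.samples.values()))
    ad = sample["AD"] if "AD" in sample else None      # allelic depths, if emitted by the caller
    vaf = ad[1] / sum(ad) if ad and sum(ad) else None
    passing.append((rec.chrom, rec.pos, rec.ref, rec.alts, vaf))
```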
Step 2: Annotate variants with Ensembl
- Query Ensembl VEP API for functional consequences
- Classify variants: missense, nonsense, frameshift, splice site
- Extract transcript information and protein changes
- Identify canonical transcripts for each gene
Step 3: Query ClinVar for known pathogenic variants
- Search ClinVar by genomic coordinates
- Extract clinical significance classifications
- Note conflicting interpretations and review status
- Prioritize variants with "Pathogenic" or "Likely Pathogenic" labels
Step 4: Query COSMIC for somatic cancer mutations
- Search COSMIC for each variant
- Extract mutation frequency across cancer types
- Identify hotspot mutations (high recurrence)
- Note drug resistance mutations
Step 5: Retrieve gene information from NCBI Gene
- Get detailed gene descriptions
- Extract associated phenotypes and diseases
- Identify oncogene vs tumor suppressor classification
- Note gene function and biological pathways
Step 6: Assess protein-level impact with UniProt
- Query UniProt for protein domain information
- Map variants to functional domains (kinase domain, binding site)
- Check if variant affects active sites or protein stability
- Retrieve post-translational modification sites
Step 7: Search DrugBank for targetable alterations
- Query for drugs targeting mutated genes
- Filter for FDA-approved and investigational drugs
- Extract mechanism of action and indications
- Prioritize variants with approved targeted therapies
Step 8: Query Open Targets for target-disease associations
- Validate therapeutic hypotheses
- Assess target tractability scores
- Review clinical precedence for each gene-disease pair
Step 9: Search ClinicalTrials.gov for matching trials
- Build query with: cancer type + gene names + variants
- Filter for: recruiting status, phase II/III trials
- Extract trial eligibility criteria
- Note geographic locations and contact information
Step 10: Literature search for clinical evidence
- PubMed query: "[gene] AND [variant] AND [cancer type]"
- Focus on: case reports, clinical outcomes, resistance mechanisms
- Extract relevant prognostic or predictive information
Step 11: Classify variants by actionability
Tier 1: FDA-approved therapy for this variant
Tier 2: Clinical trial available for this variant
Tier 3: Therapy approved for variant in different cancer
Tier 4: Biological evidence but no approved therapy
Step 12: Generate clinical genomics report
- Executive summary of key findings
- Table of actionable variants with evidence levels
- Therapeutic recommendations with supporting evidence
- Clinical trial options with eligibility information
- Prognostic implications based on mutation profile
- References to guidelines (NCCN, ESMO, AMP/ASCO/CAP)
- Generate professional PDF using ReportLab
Expected Output:
- Annotated variant list with clinical significance
- Tiered list of actionable mutations
- Therapeutic recommendations with evidence levels
- Matching clinical trials
- Comprehensive clinical genomics report (PDF)
Example 4: Cancer Subtype Classification from Gene Expression
Objective: Classify breast cancer subtypes using RNA-seq data and identify subtype-specific therapeutic vulnerabilities.
Skills Used:
- `pydeseq2`: Differential expression
- `scanpy`: Clustering and visualization
- `scikit-learn`: Machine learning classification
- `gene-database`: Gene annotation
- `reactome-database`: Pathway analysis
- `opentargets-database`: Drug targets
- `pubmed-database`: Literature validation
- `matplotlib`: Visualization
- `seaborn`: Heatmaps
Workflow:
Step 1: Load and preprocess RNA-seq data
- Load count matrix (genes × samples)
- Filter low-expression genes (mean counts < 10)
- Normalize with DESeq2 size factors
- Apply variance-stabilizing transformation (VST)
Step 2: Classify samples using PAM50 genes
- Query NCBI Gene for PAM50 classifier gene list
- Extract expression values for PAM50 genes
- Train Random Forest classifier on labeled training data
- Predict subtypes: Luminal A, Luminal B, HER2-enriched, Basal-like, Normal-like
- Validate with published markers (ESR1, PGR, ERBB2, MKI67)
Step 3: Perform differential expression for each subtype
- Use PyDESeq2 to compare each subtype vs all others
- Apply multiple testing correction (FDR < 0.05)
- Filter by log2 fold change (|LFC| > 1.5)
- Identify subtype-specific signature genes
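A minimal PyDESeq2 sketch for the one-vs-rest comparisons in Step 3; `counts_df` (samples × genes) and `meta_df` (with a `subtype` column coded as "Basal" vs "rest") are assumed from Steps 1-2, and the exact constructor arguments vary slightly between PyDESeq2 releases:

```python
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats

# counts_df and meta_df are assumed to share the same sample index
dds = DeseqDataSet(counts=counts_df, metadata=meta_df, design_factors="subtype")
dds.deseq2()

stat_res = DeseqStats(dds, contrast=["subtype", "Basal", "rest"])
stat_res.summary()
res = stat_res.results_df
hits = res[(res.padj < 0.05) & (res.log2FoldChange.abs() > 1.5)]
```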
Step 4: Annotate differentially expressed genes
- Query NCBI Gene for detailed annotations
- Classify as oncogene, tumor suppressor, or other
- Extract biological process and molecular function terms
Step 5: Pathway enrichment analysis
- Submit gene lists to Reactome API
- Identify enriched pathways for each subtype (p < 0.01)
- Focus on druggable pathways (kinase signaling, metabolism)
- Compare pathway profiles across subtypes
Step 6: Identify therapeutic targets with Open Targets
- Query Open Targets for each upregulated gene
- Filter by tractability score > 5
- Prioritize targets with clinical precedence
- Extract associated drugs and development phase
Step 7: Create comprehensive visualization
- Generate UMAP projection of all samples colored by subtype
- Create heatmap of PAM50 genes across subtypes
- Produce volcano plots for each subtype comparison
- Generate pathway enrichment dot plots
- Create drug target-pathway network diagrams
Step 8: Literature validation
- Search PubMed for each predicted therapeutic target
- Query: "[gene] AND [subtype] AND breast cancer AND therapy"
- Summarize clinical evidence and ongoing trials
- Note any resistance mechanisms reported
Step 9: Generate subtype-specific recommendations
For each subtype:
- List top 5 differentially expressed genes
- Identify enriched biological pathways
- Recommend therapeutic strategies based on vulnerabilities
- Cite supporting evidence from literature
Step 10: Create comprehensive report
- Classification results with confidence scores
- Differential expression tables for each subtype
- Pathway enrichment summaries
- Therapeutic target recommendations
- Publication-quality figures
- Export to PDF with citations
Expected Output:
- Sample classification into molecular subtypes
- Subtype-specific gene signatures
- Pathway enrichment profiles
- Prioritized therapeutic targets for each subtype
- Scientific report with visualizations and recommendations
Single-Cell Transcriptomics
Example 5: Single-Cell Atlas of Tumor Microenvironment
Objective: Characterize immune cell populations in tumor microenvironment and identify immunotherapy biomarkers.
Skills Used:
- `scanpy`: Single-cell analysis
- `scvi-tools`: Batch correction and integration
- `cellxgene-census`: Reference data
- `gene-database`: Cell type markers
- `anndata`: Data structure
- `arboreto`: Gene regulatory networks
- `pytorch-lightning`: Deep learning
- `matplotlib`: Visualization
- `statistical-analysis`: Hypothesis testing
Workflow:
Step 1: Load and QC 10X Genomics data
- Use Scanpy to read 10X h5 files
- Calculate QC metrics: n_genes, n_counts, pct_mitochondrial
- Identify mitochondrial genes (MT- prefix)
- Filter cells: 200 < n_genes < 5000, pct_mt < 20%
- Filter genes: expressed in at least 10 cells
- Document filtering criteria and cell retention rate
Step 2: Normalize and identify highly variable genes
- Normalize to 10,000 counts per cell
- Log-transform data (log1p)
- Store raw counts in adata.raw
- Identify 3,000 highly variable genes
- Regress out technical variation (n_counts, pct_mt)
- Scale to unit variance, clip at 10 standard deviations
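A condensed Scanpy sketch covering the QC and normalization in Steps 1-2; the file path is a placeholder and the thresholds mirror the values above:

```python
import scanpy as sc

adata = sc.read_10x_h5("sample_filtered_feature_bc_matrix.h5")   # hypothetical path
adata.var_names_make_unique()

# QC metrics, including mitochondrial fraction
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
adata = adata[(adata.obs.n_genes_by_counts > 200)
              & (adata.obs.n_genes_by_counts < 5000)
              & (adata.obs.pct_counts_mt < 20)].copy()
sc.pp.filter_genes(adata, min_cells=10)

# Normalization, HVG selection, regression, and scaling
adata.raw = adata
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=3000)
sc.pp.regress_out(adata, ["total_counts", "pct_counts_mt"])
sc.pp.scale(adata, max_value=10)
```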
Step 3: Integrate with reference atlas using scVI
- Download reference tumor microenvironment data from Cellxgene Census
- Train scVI model on combined dataset for batch correction
- Use scVI latent representation for downstream analysis
- Generate batch-corrected expression matrix
Step 4: Dimensionality reduction and clustering
- Compute neighborhood graph (n_neighbors=15, n_pcs=50)
- Calculate UMAP embedding for visualization
- Perform Leiden clustering at multiple resolutions (0.3, 0.5, 0.8)
- Select optimal resolution based on silhouette score
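The corresponding graph, embedding, and clustering calls for Step 4, continuing from the `adata` object above (in practice the scVI latent space from Step 3 would replace raw PCs):

```python
import scanpy as sc

sc.tl.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)   # or use_rep="X_scVI" after Step 3
sc.tl.umap(adata)
for res in (0.3, 0.5, 0.8):
    sc.tl.leiden(adata, resolution=res, key_added=f"leiden_{res}")
```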
Step 5: Identify cell type markers
- Run differential expression for each cluster (Wilcoxon test)
- Calculate marker scores (log fold change, p-value, pct expressed)
- Query NCBI Gene for canonical immune cell markers:
* T cells: CD3D, CD3E, CD4, CD8A
* B cells: CD19, MS4A1 (CD20), CD79A
* Myeloid: CD14, CD68, CD163
* NK cells: NKG7, GNLY, NCAM1
* Dendritic: CD1C, CLEC9A, LILRA4
Step 6: Annotate cell types
- Assign cell type labels based on marker expression
- Refine annotations with CellTypist or manual curation
- Identify T cell subtypes: CD4+, CD8+, Tregs, exhausted T cells
- Characterize myeloid cells: M1/M2 macrophages, dendritic cells
- Create cell type proportion tables by sample/condition
Step 7: Identify tumor-specific features
- Compare tumor samples vs normal tissue (if available)
- Identify expanded T cell clones (high proliferation markers)
- Detect exhausted T cells (PDCD1, CTLA4, LAG3, HAVCR2)
- Characterize immunosuppressive populations (Tregs, M2 macrophages)
Step 8: Gene regulatory network inference
- Use Arboreto/GRNBoost2 on each major cell type
- Identify transcription factors driving cell states
- Focus on exhaustion TFs: TOX, TCF7, EOMES
- Build regulatory networks for visualization
Step 9: Statistical analysis of cell proportions
- Calculate cell type frequencies per sample
- Test for significant differences between groups (responders vs non-responders)
- Use statistical-analysis skill for appropriate tests (t-test, Mann-Whitney)
- Calculate effect sizes and confidence intervals
Step 10: Biomarker discovery for immunotherapy response
- Correlate cell type abundances with clinical response
- Identify gene signatures associated with response
- Test signatures: T cell exhaustion, antigen presentation, inflammation
- Validate with published immunotherapy response signatures
Step 11: Create comprehensive visualizations
- UMAP plots colored by: cell type, sample, treatment, key genes
- Dot plots of canonical markers across cell types
- Cell type proportion bar plots by condition
- Heatmap of top differentially expressed genes per cell type
- Gene regulatory network diagrams
- Volcano plots for differentially abundant cell types
Step 12: Generate scientific report
- Methods: QC, normalization, batch correction, clustering
- Results: Cell type composition, differential abundance, markers
- Biomarker analysis: Predictive signatures and validation
- High-quality figures suitable for publication
- Export processed h5ad file and PDF report
Expected Output:
- Annotated single-cell atlas with cell type labels
- Cell type composition analysis
- Biomarker signatures for immunotherapy response
- Gene regulatory networks for key cell states
- Comprehensive report with publication-quality figures
Protein Structure & Function
Example 6: Structure-Based Design of Protein-Protein Interaction Inhibitors
Objective: Design small molecules to disrupt a therapeutically relevant protein-protein interaction.
Skills Used:
- `alphafold-database`: Protein structures
- `pdb-database`: Experimental structures
- `uniprot-database`: Protein information
- `biopython`: Structure analysis
- `pyrosetta`: Protein design (if available)
- `rdkit`: Chemical library generation
- `diffdock`: Molecular docking
- `zinc-database`: Screening library
- `deepchem`: Property prediction
- `pymol`: Visualization (external)
Workflow:
Step 1: Retrieve protein structures
- Query AlphaFold Database for both proteins in the interaction
- Download PDB files and confidence scores
- If available, get experimental structures from PDB database
- Compare AlphaFold predictions with experimental structures (if any)
Step 2: Analyze protein interaction interface
- Load structures with BioPython
- Identify interface residues (distance < 5Å between proteins)
- Calculate interface area and binding energy contribution
- Identify hot spot residues (key for binding)
- Map to UniProt to get functional annotations
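A small Biopython sketch for the interface-residue search in Step 2; the file name and chain IDs are assumptions about the downloaded complex:

```python
from Bio.PDB import PDBParser, NeighborSearch

structure = PDBParser(QUIET=True).get_structure("complex", "complex.pdb")  # hypothetical file
model = structure[0]
chain_a, chain_b = model["A"], model["B"]   # assumed chain IDs for the two partners

ns = NeighborSearch(list(chain_b.get_atoms()))
interface_residues = {
    atom.get_parent().get_id()[1]
    for atom in chain_a.get_atoms()
    if ns.search(atom.coord, 5.0)            # any chain-B atom within 5 Å
}
```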
Step 3: Characterize binding pocket
- Identify cavities at the protein-protein interface
- Calculate pocket volume and surface area
- Assess druggability: depth, hydrophobicity, shape
- Identify hydrogen bond donors/acceptors
- Note any known allosteric sites
Step 4: Query UniProt for known modulators
- Search UniProt for both proteins
- Extract information on known inhibitors or modulators
- Review PTMs that affect interaction
- Check disease-associated mutations in interface
Step 5: Search ZINC15 for fragment library
- Query ZINC for fragments matching pocket criteria:
* Molecular weight: 150-300 Da
* LogP: 0-3 (appropriate for PPI inhibitors)
* Exclude PAINS and aggregators
- Download 1,000-5,000 fragment SMILES
Step 6: Virtual screening with fragment library
- Use DiffDock to dock fragments into interface pocket
- Rank by predicted binding affinity
- Identify fragments binding to hot spot residues
- Select top 50 fragments for elaboration
Step 7: Fragment elaboration with RDKit
- For each fragment hit, generate elaborated molecules:
* Add substituents to core scaffold
* Merge fragments binding to adjacent pockets
* Apply medicinal chemistry filters
- Generate 20-50 analogs per fragment
- Apply property filters adapted to PPI inhibitors (MW 400-700), relaxing Lipinski's Ro5 where appropriate for this target class
Step 8: Second round of virtual screening
- Dock elaborated molecules with DiffDock
- Calculate binding energies and interaction patterns
- Prioritize molecules with:
* Strong binding to hot spot residues
* Multiple H-bonds and hydrophobic contacts
* Favorable predicted ΔG
Step 9: Predict ADMET properties with DeepChem
- Train models on ChEMBL data
- Predict: solubility, permeability, hERG liability
- Filter for drug-like properties
- Rank by overall score (affinity + ADMET)
Step 10: Literature and patent search
- PubMed: "[protein A] AND [protein B] AND inhibitor"
- USPTO: Check for prior art on top scaffolds
- Assess freedom to operate
- Identify any reported PPI inhibitors for this target
Step 11: Prepare molecules for synthesis
- Assess synthetic accessibility (SA score < 4)
- Identify commercial building blocks
- Propose synthetic routes for top 10 candidates
- Calculate estimated synthesis cost
Step 12: Generate comprehensive design report
- Interface analysis with hot spot identification
- Fragment screening results
- Top 10 designed molecules with predicted properties
- Docking poses and interaction diagrams
- Synthetic accessibility assessment
- Comparison to known PPI inhibitors
- Recommendations for experimental validation
- Publication-quality figures and PDF report
Expected Output:
- Interface characterization and hot spot analysis
- Ranked library of designed PPI inhibitors
- Predicted binding modes and affinities
- ADMET property predictions
- Synthetic accessibility assessment
- Comprehensive drug design report
Chemical Safety & Toxicology
Example 7: Predictive Toxicology Assessment
Objective: Assess potential toxicity and safety liabilities of drug candidates before synthesis.
Skills Used:
- `rdkit`: Molecular descriptors
- `deepchem`: Toxicity prediction
- `chembl-database`: Toxicity data
- `pubchem-database`: Bioassay data
- `drugbank-database`: Known drug toxicities
- `fda-database`: Adverse events
- `hmdb-database`: Metabolite prediction
- `scikit-learn`: Classification models
- `shap`: Model interpretability
Workflow:
Step 1: Calculate molecular descriptors
- Load candidate molecules with RDKit
- Calculate physicochemical properties:
* MW, LogP, TPSA, rotatable bonds, H-bond donors/acceptors
* Aromatic rings, sp3 fraction, formal charge
- Calculate structural alerts:
* PAINS patterns
* Toxic functional groups (nitroaromatics, epoxides, etc.)
* Genotoxic alerts (Ames mutagenicity)
Step 2: Screen for known toxicophores
- Search for structural alerts using SMARTS patterns:
* Michael acceptors
* Aldehyde/ketone reactivity
* Quinones and quinone-like structures
* Thioureas and isocyanates
- Flag molecules with high-risk substructures
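A sketch of the SMARTS-based alert screen in Step 2; the patterns below are illustrative examples rather than a validated alert set:

```python
from rdkit import Chem

alerts = {
    "michael_acceptor": Chem.MolFromSmarts("C=CC(=O)"),
    "aldehyde":         Chem.MolFromSmarts("[CX3H1](=O)[#6]"),
    "quinone":          Chem.MolFromSmarts("O=C1C=CC(=O)C=C1"),
    "nitroaromatic":    Chem.MolFromSmarts("[c][N+](=O)[O-]"),
}

def flag_alerts(smiles: str) -> list[str]:
    """Return the names of any structural alerts matched by the molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return []
    return [name for name, patt in alerts.items() if mol.HasSubstructMatch(patt)]
```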
Step 3: Query ChEMBL for similar compounds with toxicity data
- Perform similarity search (Tanimoto > 0.7)
- Extract toxicity assay results:
* Cytotoxicity (IC50 values)
* Hepatotoxicity markers
* Cardiotoxicity (hERG inhibition)
* Genotoxicity (Ames test results)
- Analyze structure-toxicity relationships
Step 4: Search PubChem BioAssays for toxicity screening
- Query relevant assays:
* Tox21 panel (cell viability, stress response, genotoxicity)
* Liver toxicity assays
* hERG channel inhibition
- Extract activity data for similar compounds
- Calculate hit rates for concerning assays
Step 5: Train toxicity prediction models with DeepChem
- Load Tox21 dataset from DeepChem
- Train graph convolutional models for:
* Nuclear receptor signaling
* Stress response pathways
* Genotoxicity endpoints
- Validate models with cross-validation
- Predict toxicity for candidate molecules
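A minimal DeepChem sketch for Step 5, following the classic MoleculeNet/GraphConv workflow; API details shift between DeepChem releases, so treat this as an outline:

```python
import numpy as np
import deepchem as dc

tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="GraphConv")
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="classification")
model.fit(train, nb_epoch=50)

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(test, [metric], transformers))   # mean ROC-AUC across the 12 Tox21 tasks
```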
Step 6: Predict hERG cardiotoxicity liability
- Train DeepChem model on hERG inhibition data from ChEMBL
- Predict IC50 for hERG channel
- Flag compounds with predicted IC50 < 10 μM
- Identify structural features associated with hERG liability
Step 7: Predict hepatotoxicity risk
- Train models on DILI (drug-induced liver injury) datasets
- Extract features: reactive metabolites, mitochondrial toxicity
- Predict hepatotoxicity risk class (low/medium/high)
- Use SHAP values to explain predictions
Step 8: Predict metabolic stability and metabolites
- Identify sites of metabolism using RDKit SMARTS patterns
- Predict CYP450 interactions
- Query HMDB for potential metabolite structures
- Assess if metabolites contain toxic substructures
- Predict metabolic stability (half-life)
Step 9: Check FDA adverse event database
- Query FAERS for approved drugs similar to candidates
- Extract common adverse events
- Identify target organ toxicities
- Calculate reporting odds ratios for serious events
Step 10: Literature review of toxicity mechanisms
- PubMed search: "[scaffold] AND (toxicity OR hepatotoxicity OR cardiotoxicity)"
- Identify mechanistic studies on similar compounds
- Note any case reports of adverse events
- Review preclinical and clinical safety data
Step 11: Assess ADME liabilities
- Predict solubility, permeability, plasma protein binding
- Identify potential drug-drug interaction risks
- Assess blood-brain barrier penetration (desired for CNS candidates, a potential liability otherwise)
- Evaluate metabolic stability
Step 12: Generate safety assessment report
- Executive summary of safety profile for each candidate
- Red flags: structural alerts, predicted toxicities
- Yellow flags: moderate concerns requiring testing
- Green light: acceptable predicted safety profile
- Comparison table of all candidates
- Recommendations for risk mitigation:
* Structural modifications to reduce toxicity
* Priority in vitro assays to run
* Preclinical study design recommendations
- Comprehensive PDF report with:
* Toxicophore analysis
* Prediction model results with confidence
* SHAP interpretation plots
* Literature evidence
* Risk assessment matrix
Expected Output:
- Toxicity predictions for all candidates
- Structural alert analysis
- hERG, hepatotoxicity, and genotoxicity risk scores
- Metabolite predictions
- Prioritized list with safety rankings
- Comprehensive toxicology assessment report
Clinical Trial Analysis
Example 8: Competitive Landscape Analysis for New Indication
Objective: Analyze the clinical trial landscape for a specific indication to inform development strategy.
Skills Used:
- `clinicaltrials-database`: Trial registry
- `fda-database`: Drug approvals
- `pubmed-database`: Published results
- `drugbank-database`: Approved drugs
- `opentargets-database`: Target validation
- `polars`: Data manipulation
- `matplotlib`: Visualization
- `seaborn`: Statistical plots
- `reportlab`: Report generation
Workflow:
Step 1: Search ClinicalTrials.gov for all trials in indication
- Query: "[disease/indication]"
- Filter: All phases, all statuses
- Extract fields:
* NCT ID, title, phase, status
* Start date, completion date, enrollment
* Intervention/drug names
* Primary/secondary outcomes
* Sponsor and collaborators
- Export to structured JSON/CSV
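A hedged sketch of the Step 1 registry query against the ClinicalTrials.gov v2 REST API; the condition string is an example, and the `fields` values should be verified against the current API schema:

```python
import requests

params = {
    "query.cond": "idiopathic pulmonary fibrosis",   # example indication
    "pageSize": 100,
    "fields": "NCTId,BriefTitle,Phase,OverallStatus,StartDate,EnrollmentCount,LeadSponsorName",
}
resp = requests.get("https://clinicaltrials.gov/api/v2/studies", params=params, timeout=30)
resp.raise_for_status()
studies = resp.json().get("studies", [])
```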
Step 2: Categorize trials by mechanism of action
- Extract drug names and intervention types
- Query DrugBank for mechanism of action
- Query Open Targets for target information
- Classify into categories:
* Small molecules vs biologics
* Target class (kinase inhibitor, antibody, etc.)
* Novel vs repurposing
Step 3: Analyze trial phase progression
- Calculate success rates by phase (I → II, II → III)
- Identify terminated trials and reasons for termination
- Track time from phase I start to NDA submission
- Calculate median development timelines
Step 4: Search FDA database for recent approvals
- Query FDA drug approvals in the indication (last 10 years)
- Extract approval dates, indications, priority review status
- Note any accelerated approvals or breakthroughs
- Review FDA drug labels for safety information
Step 5: Extract outcome measures
- Compile all primary endpoints used
- Identify most common endpoints:
* Survival (OS, PFS, DFS)
* Response rates (ORR, CR, PR)
* Biomarker endpoints
* Patient-reported outcomes
- Note emerging or novel endpoints
Step 6: Analyze competitive dynamics
- Identify leading companies and their pipelines
- Map trials by phase for each major competitor
- Note partnership and licensing deals
- Assess crowded vs underserved patient segments
Step 7: Search PubMed for published trial results
- Query: "[NCT ID]" for each completed trial
- Extract published outcomes and conclusions
- Identify trends in efficacy and safety
- Note any unmet needs highlighted in discussions
Step 8: Analyze target validation evidence
- Query Open Targets for target-disease associations
- Extract genetic evidence scores
- Review tractability assessments
- Compare targets being pursued across trials
Step 9: Identify unmet needs and opportunities
- Analyze trial failures for common patterns
- Identify patient populations excluded from trials
- Note resistance mechanisms or limitations mentioned
- Assess gaps in current therapeutic approaches
Step 10: Perform temporal trend analysis
- Plot trial starts over time (by phase, mechanism)
- Identify increasing or decreasing interest in targets
- Correlate with publication trends and scientific advances
- Predict future trends in the space
Step 11: Create comprehensive visualizations
- Timeline of all trials (Gantt chart style)
- Phase distribution pie chart
- Mechanism of action breakdown
- Geographic distribution of trials
- Enrollment trends over time
- Success rate funnels (Phase I → II → III → Approval)
- Sponsor/company market share
Step 12: Generate competitive intelligence report
- Executive summary of competitive landscape
- Total number of active programs by phase
- Key players and their development stage
- Standard of care and approved therapies
- Emerging approaches and novel targets
- Identified opportunities and white space
- Risk analysis (crowded targets, high failure rates)
- Strategic recommendations:
* Patient population to target
* Differentiation strategies
* Partnership opportunities
* Regulatory pathway considerations
- Export as professional PDF with citations and data tables
Expected Output:
- Comprehensive trial database for indication
- Success rate and timeline statistics
- Competitive landscape mapping
- Unmet need analysis
- Strategic recommendations
- Publication-ready report with visualizations
Metabolomics & Systems Biology
Example 9: Multi-Omics Integration for Metabolic Disease
Objective: Integrate transcriptomics, proteomics, and metabolomics to identify dysregulated pathways in metabolic disease.
Skills Used:
- `pydeseq2`: RNA-seq analysis
- `pyopenms`: Mass spectrometry
- `hmdb-database`: Metabolite identification
- `metabolomics-workbench-database`: Public datasets
- `kegg-database`: Pathway mapping
- `reactome-database`: Pathway analysis
- `string-database`: Protein interactions
- `statsmodels`: Multi-omics correlation
- `networkx`: Network analysis
- `pymc`: Bayesian modeling
Workflow:
Step 1: Process RNA-seq data
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare disease vs control (adjusted p < 0.05, |LFC| > 1)
- Extract gene symbols and fold changes
- Map to KEGG gene IDs
Step 2: Process proteomics data
- Load LC-MS/MS results with PyOpenMS
- Perform peptide identification and quantification
- Normalize protein abundances
- Run statistical testing (t-test or limma)
- Extract significant proteins (p < 0.05, |FC| > 1.5)
Step 3: Process metabolomics data
- Load untargeted metabolomics data (mzML format) with PyOpenMS
- Perform peak detection and alignment
- Match features to HMDB database by accurate mass
- Annotate metabolites with MS/MS fragmentation
- Extract putative identifications (Level 2/3)
- Perform statistical analysis (FDR < 0.05, |FC| > 2)
Step 4: Search Metabolomics Workbench for public data
- Query for same disease or tissue type
- Download relevant studies
- Reprocess for consistency with own data
- Use as validation cohort
Step 5: Map all features to KEGG pathways
- Map genes to KEGG orthology (KO) terms
- Map proteins to KEGG identifiers
- Map metabolites to KEGG compound IDs
- Identify pathways with multi-omics coverage
Step 6: Perform pathway enrichment analysis
- Test for enrichment in KEGG pathways
- Test for enrichment in Reactome pathways
- Apply Fisher's exact test with multiple testing correction
- Focus on pathways with hits in ≥2 omics layers
Step 7: Build protein-metabolite networks
- Query STRING for protein-protein interactions
- Map proteins to KEGG reactions
- Connect enzymes to their substrates/products
- Build integrated network with genes → proteins → metabolites
Step 8: Network topology analysis with NetworkX
- Calculate node centrality (degree, betweenness)
- Identify hub metabolites and key enzymes
- Find bottleneck reactions
- Detect network modules with community detection
- Identify dysregulated subnetworks
Step 9: Correlation analysis across omics layers
- Calculate Spearman correlations between:
* Gene expression and protein abundance
* Protein abundance and metabolite levels
* Gene expression and metabolites (for enzyme-product pairs)
- Use statsmodels for significance testing
- Focus on enzyme-metabolite pairs with expected relationships
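A sketch of the enzyme-metabolite correlation test in Step 9; `prot` and `metab` are assumed to be sample-indexed DataFrames from the earlier steps, and `pairs` is a hypothetical list of enzyme-metabolite pairs derived from KEGG reactions:

```python
from scipy.stats import spearmanr
from statsmodels.stats.multitest import multipletests

pairs = [("HMGCR", "mevalonate"), ("IDH1", "citrate")]   # illustrative enzyme-metabolite pairs

results = []
for enzyme, metabolite in pairs:
    rho, p = spearmanr(prot[enzyme], metab[metabolite])
    results.append({"enzyme": enzyme, "metabolite": metabolite, "rho": rho, "p": p})

# Benjamini-Hochberg FDR correction across all tested pairs
reject, qvals, _, _ = multipletests([r["p"] for r in results], method="fdr_bh")
for r, q in zip(results, qvals):
    r["q"] = q
```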
Step 10: Bayesian network modeling with PyMC
- Build probabilistic graphical model of pathway
- Model causal relationships: gene → protein → metabolite
- Incorporate prior knowledge from KEGG/Reactome
- Perform inference to identify key regulatory nodes
- Estimate effect sizes and uncertainties
Step 11: Identify therapeutic targets
- Prioritize enzymes with:
* Significant changes in all three omics layers
* High network centrality
* Druggable target class (kinases, transporters, etc.)
- Query DrugBank for existing inhibitors
- Search PubMed for validation in disease models
Step 12: Create comprehensive multi-omics report
- Summary statistics for each omics layer
- Venn diagram of overlapping pathway hits
- Pathway enrichment dot plots
- Integrated network visualization (color by fold change)
- Correlation heatmaps (enzyme-metabolite pairs)
- Bayesian network structure
- Table of prioritized therapeutic targets
- Biological interpretation and mechanistic insights
- Generate publication-quality figures
- Export PDF report with all results
Expected Output:
- Integrated multi-omics dataset
- Dysregulated pathway identification
- Multi-omics network model
- Prioritized list of therapeutic targets
- Comprehensive systems biology report
Materials Science & Chemistry
Example 10: High-Throughput Materials Discovery for Battery Applications
Objective: Discover novel solid electrolyte materials for lithium-ion batteries using computational screening.
Skills Used:
- `pymatgen`: Materials analysis
- `matminer`: Feature engineering
- `scikit-learn`: Machine learning
- `pymoo`: Multi-objective optimization
- `ase`: Atomic simulation
- `sympy`: Symbolic math
- `vaex`: Large dataset handling
- `matplotlib`: Visualization
- `scientific-writing`: Report generation
Workflow:
Step 1: Generate candidate materials library
- Use Pymatgen to enumerate compositions:
* Li-containing compounds (Li₁₋ₓM₁₊ₓX₂)
* M = transition metals (Zr, Ti, Ta, Nb)
* X = O, S, Se
- Generate ~10,000 candidate compositions
- Apply charge neutrality constraints
Step 2: Filter by thermodynamic stability
- Query Materials Project database via Pymatgen
- Calculate formation energy from elements
- Calculate energy above convex hull (E_hull)
- Filter: E_hull < 50 meV/atom (likely stable)
- Retain ~2,000 thermodynamically plausible compounds
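A sketch of the stability screen in Step 2 using the legacy pymatgen `MPRester` client (the newer `mp-api` client has a different interface); the chemical system and API key are placeholders:

```python
from pymatgen.ext.matproj import MPRester
from pymatgen.analysis.phase_diagram import PhaseDiagram

with MPRester("YOUR_API_KEY") as mpr:                      # placeholder key
    entries = mpr.get_entries_in_chemsys(["Li", "Zr", "O"])

pd = PhaseDiagram(entries)
plausible = [
    (e.composition.reduced_formula, pd.get_e_above_hull(e))
    for e in entries
    if pd.get_e_above_hull(e) < 0.050                      # 50 meV/atom threshold
]
```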
Step 3: Predict crystal structures
- Use Pymatgen structure predictor
- Generate most likely crystal structures for each composition
- Consider common structure types: LISICON, NASICON, garnet, perovskite
- Calculate structural descriptors
Step 4: Calculate material properties with Pymatgen
- Lattice parameters and volume
- Density
- Packing fraction
- Ionic radii and bond lengths
- Coordination environments
Step 5: Feature engineering with matminer
- Calculate compositional features:
* Elemental property statistics (electronegativity, ionic radius)
* Valence electron concentrations
* Stoichiometric attributes
- Calculate structural features:
* Pore size distribution
* Site disorder parameters
* Partial radial distribution functions
Step 6: Build ML models for Li⁺ conductivity prediction
- Collect training data from literature (experimental conductivities)
- Train ensemble models with scikit-learn:
* Random Forest
* Gradient Boosting
* Neural Network
- Use 5-fold cross-validation
- Predict ionic conductivity for all candidates
Step 7: Predict additional properties
- Electrochemical stability window (ML model)
- Mechanical properties (bulk modulus, shear modulus)
- Interfacial resistance (estimate from structure)
- Synthesis temperature (ML prediction from similar compounds)
Step 8: Multi-objective optimization with PyMOO
Define optimization objectives:
- Maximize: ionic conductivity (>10⁻³ S/cm target)
- Maximize: electrochemical window (>4.5V target)
- Minimize: synthesis temperature (<800°C preferred)
- Minimize: cost (based on elemental abundance)
Run NSGA-II to find Pareto optimal solutions
Extract top 50 candidates from Pareto front
Step 9: Analyze Pareto optimal materials
- Identify composition trends (which elements appear frequently)
- Analyze structure-property relationships
- Calculate trade-offs between objectives
- Identify "sweet spot" compositions
Step 10: Validate predictions with DFT calculations
- Select top 10 candidates for detailed study
- Set up DFT calculations (VASP-like, if available via ASE)
- Calculate:
* Accurate formation energies
* Li⁺ migration barriers (NEB calculations)
* Electronic band gap
* Elastic constants
- Compare DFT results with ML predictions
Step 11: Literature and patent search
- Search for prior art on top candidates
- PubMed and Google Scholar: "[composition] AND electrolyte"
- USPTO: Check for existing patents on similar compositions
- Identify any experimental reports on related materials
Step 12: Generate materials discovery report
- Summary of screening workflow and statistics
- Pareto front visualization (conductivity vs stability vs cost)
- Structure visualization of top candidates
- Property comparison table
- Composition-property trend analysis
- DFT validation results
- Predicted performance vs state-of-art materials
- Synthesis recommendations
- IP landscape summary
- Prioritized list of 5-10 materials for experimental validation
- Export as publication-ready PDF
Expected Output:
- Screened library of 10,000+ materials
- ML models for property prediction
- Pareto-optimal set of 50 candidates
- Detailed analysis of top 10 materials
- DFT validation results
- Comprehensive materials discovery report
Digital Pathology
Example 11: Automated Tumor Detection in Whole Slide Images
Objective: Develop and validate a deep learning model for automated tumor detection in histopathology images.
Skills Used:
- `histolab`: Whole slide image processing
- `pathml`: Computational pathology
- `pytorch-lightning`: Deep learning
- `torchvision`: Image models
- `scikit-learn`: Model evaluation
- `pydicom`: DICOM handling
- `omero-integration`: Image management
- `matplotlib`: Visualization
- `shap`: Model interpretability
Workflow:
Step 1: Load whole slide images with HistoLab
- Load WSI files (SVS, TIFF formats)
- Extract slide metadata and magnification levels
- Visualize slide thumbnails
- Inspect tissue area vs background
Step 2: Tile extraction and preprocessing
- Use HistoLab to extract tiles (256×256 pixels at 20× magnification)
- Filter tiles:
* Remove background (tissue percentage > 80%)
* Apply color normalization (Macenko or Reinhard method)
* Filter out artifacts and bubbles
- Extract ~100,000 tiles per slide across all slides
Step 3: Create annotations (if training from scratch)
- Load pathologist annotations (if available via OMERO)
- Convert annotations to tile-level labels
- Categories: tumor, stroma, necrosis, normal
- Balance classes through stratified sampling
Step 4: Set up PathML pipeline
- Create PathML SlideData objects
- Define preprocessing pipeline:
* Stain normalization
* Color augmentation (HSV jitter)
* Rotation and flipping
- Split data: 70% train, 15% validation, 15% test
Step 5: Build deep learning model with PyTorch Lightning
- Architecture: ResNet50 or EfficientNet backbone
- Add custom classification head for tissue types
- Define training pipeline:
* Loss function: Cross-entropy or Focal loss
* Optimizer: Adam with learning rate scheduling
* Augmentations: rotation, flip, color jitter, elastic deformation
* Batch size: 32
* Mixed precision training
Step 6: Train model
- Train on tile-level labels
- Monitor metrics: accuracy, F1 score, AUC
- Use early stopping on validation loss
- Save best model checkpoint
- Training time: ~6-12 hours on GPU
Step 7: Evaluate model performance
- Test on held-out test set
- Calculate metrics with scikit-learn:
* Accuracy, precision, recall, F1 per class
* Confusion matrix
* ROC curves and AUC
- Compute confidence intervals with bootstrapping
Step 8: Slide-level aggregation
- Apply model to all tiles in each test slide
- Aggregate predictions:
* Majority voting
* Weighted average by confidence
* Spatial smoothing with convolution
- Generate probability heatmaps overlaid on WSI
Step 9: Model interpretability with SHAP
- Apply GradCAM or SHAP to explain predictions
- Visualize which regions contribute to tumor classification
- Generate attention maps showing model focus
- Validate that model attends to relevant histological features
Step 10: Clinical validation
- Compare model predictions with pathologist diagnosis
- Calculate inter-rater agreement (kappa score)
- Identify discordant cases for review
- Analyze error types: false positives, false negatives
Step 11: Integration with OMERO
- Upload processed slides and heatmaps to OMERO server
- Attach model predictions as slide metadata
- Enable pathologist review interface
- Store annotations and corrections for model retraining
Step 12: Generate clinical validation report
- Model architecture and training details
- Performance metrics with confidence intervals
- Slide-level accuracy vs pathologist ground truth
- Heatmap visualizations for representative cases
- Analysis of failure modes
- Comparison with published methods
- Discussion of clinical applicability
- Recommendations for deployment and monitoring
- Export PDF report for regulatory submission (if needed)
Expected Output:
- Trained deep learning model for tumor detection
- Tile-level and slide-level predictions
- Probability heatmaps for visualization
- Performance metrics and validation results
- Model interpretation visualizations
- Clinical validation report
Lab Automation & Protocol Design
Example 12: Automated High-Throughput Screening Protocol
Objective: Design and execute an automated compound screening workflow using liquid handling robots.
Skills Used:
- `pylabrobot`: Lab automation
- `opentrons-integration`: Opentrons protocols
- `benchling-integration`: Sample tracking
- `protocolsio-integration`: Protocol documentation
- `simpy`: Process simulation
- `polars`: Data processing
- `matplotlib`: Plate visualization
- `reportlab`: Report generation
Workflow:
Step 1: Define screening campaign in Benchling
- Create compound library in Benchling registry
- Register all compounds with structure, concentration, location
- Define plate layouts (384-well format)
- Track compound source plates in inventory
- Set up ELN entry for campaign documentation
Step 2: Design assay protocol
- Define assay steps:
* Dispense cells (5000 cells/well)
* Add compounds (dose-response curve, 10 concentrations)
* Incubate 48 hours at 37°C
* Add detection reagent (cell viability assay)
* Read luminescence signal
- Calculate required reagent volumes
- Document protocol in Protocols.io
- Share with team for review
Step 3: Simulate workflow with SimPy
- Model liquid handler, incubator, plate reader as resources
- Simulate timing for 20 plates (7,680 wells)
- Identify bottlenecks (plate reader reads take 5 min/plate)
- Optimize scheduling: stagger plate processing
- Validate that throughput goal is achievable (20 plates/day)
Step 4: Design plate layout
- Use PyLabRobot to generate plate maps:
* Columns 1-2: positive controls (DMSO)
* Columns 3-22: compound titrations (10 concentrations in duplicate)
* Columns 23-24: negative controls (cytotoxic control)
- Randomize compound positions across plates
- Account for edge effects (avoid outer wells for samples)
- Export plate maps to CSV
Step 5: Create Opentrons protocol for cell seeding
- Write Python protocol using Opentrons API 2.0
- Steps:
* Aspirate cells from reservoir
* Dispense 40 μL cell suspension per well
* Tips: use P300 multi-channel for speed
* Include mixing steps to prevent settling
- Simulate protocol in Opentrons app
- Test on one plate before full run
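A skeleton of the Step 5 protocol in the Opentrons Python API 2; the labware load names and deck slots are assumptions and should be matched to the actual hardware, and a multichannel head on a 384-well plate needs column-wise addressing in practice:

```python
from opentrons import protocol_api

metadata = {"protocolName": "Cell seeding - 384 well", "apiLevel": "2.13"}

def run(protocol: protocol_api.ProtocolContext):
    plate = protocol.load_labware("corning_384_wellplate_112ul_flat", "1")   # assumed labware
    reservoir = protocol.load_labware("nest_12_reservoir_15ml", "2")
    tips = protocol.load_labware("opentrons_96_tiprack_300ul", "3")
    p300 = protocol.load_instrument("p300_multi_gen2", "left", tip_racks=[tips])

    # Dispense 40 µL of cell suspension per well, mixing before each aspirate
    p300.distribute(40, reservoir.wells_by_name()["A1"], plate.wells(),
                    mix_before=(3, 150), new_tip="once")
```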
Step 6: Create Opentrons protocol for compound addition
- Acoustic liquid handler (Echo) or pin tool for nanoliter transfers
- If using Opentrons:
* Source: 384-well compound plates
* Transfer 100 nL compound (in DMSO) to assay plates
* Use P20 for precision
* Prepare serial dilutions on deck if needed
- Account for DMSO normalization (1% final)
Step 7: Integrate with Benchling for sample tracking
- Use Benchling API to:
* Retrieve compound information (structure, batch, concentration)
* Log plate creation in inventory
* Create transfer records for audit trail
* Link assay plates to ELN entry
Step 8: Execute automated workflow
- Day 1: Seed cells with Opentrons
- Day 1 (4h later): Add compounds with Opentrons
- Day 3: Add detection reagent (manual or automated)
- Day 3 (2h later): Read plates on plate reader
- Store plates at 4°C between steps
Step 9: Collect and process data
- Export raw luminescence data from plate reader
- Load data with Polars for fast processing
- Normalize data:
* Subtract background (media-only wells)
* Calculate % viability relative to DMSO control
* Apply plate-wise normalization to correct systematic effects
- Quality control:
* Z' factor calculation (> 0.5 for acceptable assay)
* Coefficient of variation for controls (< 10%)
* Flag plates with poor QC metrics
Step 10: Dose-response curve fitting
- Fit 4-parameter logistic curves for each compound
- Calculate IC50, Hill slope, max/min response
- Use scipy's curve_fit for the nonlinear least-squares fit (see the sketch after this step)
- Compute 95% confidence intervals
- Flag compounds with poor curve fits (R² < 0.8)
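A sketch of the four-parameter logistic fit in Step 10 with scipy's `curve_fit`; the concentration and response vectors are illustrative values for a single compound:

```python
import numpy as np
from scipy.optimize import curve_fit

def four_pl(x, bottom, top, ic50, hill):
    """Four-parameter logistic dose-response curve."""
    return bottom + (top - bottom) / (1.0 + (x / ic50) ** hill)

conc = np.array([0.01, 0.03, 0.1, 0.3, 1, 3, 10, 30, 100, 300])   # µM (illustrative)
viability = np.array([99, 98, 96, 90, 76, 55, 34, 21, 13, 11])    # % of DMSO control (illustrative)

popt, pcov = curve_fit(four_pl, conc, viability,
                       p0=[10, 100, np.median(conc), 1.0], maxfev=10000)
bottom, top, ic50, hill = popt
perr = np.sqrt(np.diag(pcov))    # standard errors, used for the confidence intervals
```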
Step 11: Hit identification and triage
- Define hit criteria:
* IC50 < 10 μM
* Max inhibition > 50%
* Curve quality: R² > 0.8
- Prioritize hits by potency
- Check for PAINS patterns with RDKit
- Cross-reference with known aggregators/frequent hitters
Step 12: Visualize results and generate report
- Create plate heatmaps showing % viability
- Dose-response curve plots for hits
- Scatter plot: potency vs max effect
- QC metric summary across plates
- Structure visualization of top 20 hits
- Generate campaign summary report:
* Screening statistics (compounds tested, hit rate)
* QC metrics and data quality assessment
* Hit list with structures and IC50 values
* Protocol documentation from Protocols.io
* Raw data files and analysis code
* Recommendations for confirmation assays
- Update Benchling ELN with results
- Export PDF report for stakeholders
Expected Output:
- Automated screening protocols (Opentrons Python files)
- Executed screen of 384-well plates
- Quality-controlled dose-response data
- Hit list with IC50 values
- Comprehensive screening report
Agricultural Genomics
Example 13: GWAS for Crop Yield Improvement
Objective: Identify genetic markers associated with drought tolerance and yield in a crop species.
Skills Used:
- `biopython`: Sequence analysis
- `pysam`: VCF processing
- `gwas-database`: Public GWAS data
- `ensembl-database`: Plant genomics
- `gene-database`: Gene annotation
- `scanpy`: Population structure (adapted for genetic data)
- `scikit-learn`: PCA and clustering
- `statsmodels`: Association testing
- `matplotlib`: Manhattan plots
- `seaborn`: Visualization
Workflow:
Step 1: Load and QC genotype data
- Load VCF file with pysam
- Filter variants:
* Call rate > 95%
* Minor allele frequency (MAF) > 5%
* Hardy-Weinberg equilibrium p > 1e-6
- Convert to numeric genotype matrix (0, 1, 2)
- Retain ~500,000 SNPs after QC
Step 2: Assess population structure
- Calculate genetic relationship matrix
- Perform PCA with scikit-learn (use top 10 PCs)
- Visualize population structure (PC1 vs PC2)
- Identify distinct subpopulations or admixture
- Note: will use PCs as covariates in GWAS
Step 3: Load and process phenotype data
- Drought tolerance score (1-10 scale, measured under stress)
- Grain yield (kg/hectare)
- Days to flowering
- Plant height
- Quality control:
* Remove outliers (> 3 SD from mean)
* Transform if needed (log or rank-based for skewed traits)
* Adjust for environmental covariates (field, year)
Step 4: Calculate kinship matrix
- Compute genetic relatedness matrix
- Account for population structure and relatedness
- Will use in mixed linear model to control for confounding
Step 5: Run genome-wide association study
- For each phenotype, test association with each SNP
- Use mixed linear model (MLM) in statsmodels:
* Fixed effects: SNP genotype, PCs (top 10)
* Random effects: kinship matrix
* Bonferroni threshold: p < 5e-8 (genome-wide significance)
- Multiple testing correction: Bonferroni or FDR
- Calculate genomic inflation factor (λ) to check for inflation
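As a simplified stand-in for the mixed model in Step 5, the sketch below runs a per-SNP regression with PC covariates in statsmodels; a full MLM would additionally model the kinship matrix as a random effect. `geno`, `pheno`, and `pcs` are assumed from Steps 1-3:

```python
import numpy as np
import statsmodels.api as sm

# geno: (n_samples, n_snps) matrix of 0/1/2 genotypes; pheno: (n_samples,); pcs: (n_samples, 10)
n_snps = geno.shape[1]
pvals = np.empty(n_snps)
betas = np.empty(n_snps)
for j in range(n_snps):
    X = sm.add_constant(np.column_stack([geno[:, j], pcs]))
    fit = sm.OLS(pheno, X).fit()
    betas[j], pvals[j] = fit.params[1], fit.pvalues[1]    # SNP effect size and p-value

significant = np.where(pvals < 5e-8)[0]                   # genome-wide significance threshold
```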
Step 6: Identify significant associations
- Extract SNPs passing significance threshold
- Determine lead SNPs (most significant in each locus)
- Define loci: extend ±500 kb around lead SNP
- Identify independent associations via conditional analysis
Step 7: Annotate significant loci
- Map SNPs to genes using Ensembl Plants API
- Identify genic vs intergenic SNPs
- For genic SNPs:
* Determine consequence (missense, synonymous, intronic, UTR)
* Extract gene names and descriptions
- Query NCBI Gene for gene function
- Prioritize genes with known roles in stress response or development
Step 8: Search GWAS Catalog for prior reports
- Query GWAS Catalog for similar traits in same or related species
- Check for replication of known loci
- Identify novel vs known associations
Step 9: Functional enrichment analysis
- Extract all genes within significant loci
- Perform GO enrichment analysis
- Test for enrichment in KEGG pathways
- Focus on pathways related to:
* Drought stress response (ABA signaling, osmotic adjustment)
* Photosynthesis and carbon fixation
* Root development
Step 10: Estimate SNP heritability and genetic architecture
- Calculate variance explained by significant SNPs
- Estimate SNP-based heritability (proportion of variance explained)
- Assess genetic architecture: few large-effect vs many small-effect loci
Step 11: Build genomic prediction model
- Train genomic selection model with scikit-learn:
* Ridge regression (GBLUP equivalent)
* Elastic net
* Random Forest
- Use all SNPs (not just significant ones)
- Cross-validate to predict breeding values
- Assess prediction accuracy
Step 12: Generate GWAS report
- Manhattan plots for each trait
- QQ plots to assess test calibration
- Regional association plots for significant loci
- Gene models overlaid on loci
- Table of significant SNPs with annotations
- Functional enrichment results
- Genomic prediction accuracy
- Biological interpretation:
* Candidate genes for drought tolerance
* Potential molecular mechanisms
* Implications for breeding programs
- Recommendations:
* SNPs to use for marker-assisted selection
* Genes for functional validation
* Crosses to generate mapping populations
- Export publication-quality PDF with all results
Expected Output:
- Significant SNP-trait associations
- Annotated candidate genes
- Functional enrichment analysis
- Genomic prediction models
- Comprehensive GWAS report
- Recommendations for breeding programs
Neuroscience & Brain Imaging
Example 14: Brain Connectivity Analysis from fMRI Data
Objective: Analyze resting-state fMRI data to identify altered brain connectivity patterns in disease.
Skills Used:
- `neurokit2`: Neurophysiological signal processing
- `nilearn` (external): Neuroimaging analysis
- `scikit-learn`: Classification and clustering
- `networkx`: Graph theory analysis
- `statsmodels`: Statistical testing
- `torch_geometric`: Graph neural networks
- `pymc`: Bayesian modeling
- `matplotlib`: Brain visualization
- `seaborn`: Connectivity matrices
Workflow:
Step 1: Load and preprocess fMRI data
# Note: Use nilearn or similar for fMRI-specific preprocessing
- Load 4D fMRI images (BOLD signal)
- Preprocessing:
* Motion correction (realignment)
* Slice timing correction
* Spatial normalization to MNI space
* Smoothing (6mm FWHM Gaussian kernel)
* Temporal filtering (0.01-0.1 Hz bandpass)
* Nuisance regression (motion, CSF, white matter)
Step 2: Define brain regions (parcellation)
- Apply brain atlas (e.g., AAL, Schaefer 200-region atlas)
- Extract average time series for each region
- Result: 200 time series per subject (one per brain region)
Step 3: Signal cleaning with NeuroKit2
- Denoise time series
- Remove physiological artifacts
- Apply additional bandpass filtering if needed
- Identify and handle outlier time points
Step 4: Calculate functional connectivity
- Compute pairwise Pearson correlations between all regions
- Result: 200×200 connectivity matrix per subject
- Fisher z-transform correlations for group statistics
- Threshold weak connections (|r| < 0.2)
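A sketch of the Step 4 connectivity computation; `ts` is assumed to be the (timepoints × regions) array of region-averaged time series from Step 2:

```python
import numpy as np

corr = np.corrcoef(ts.T)                 # regions × regions Pearson correlation matrix
np.fill_diagonal(corr, 0.0)
conn_z = np.arctanh(corr)                # Fisher z-transform for group-level statistics
conn_z[np.abs(corr) < 0.2] = 0.0         # drop weak connections
```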
Step 5: Graph theory analysis with NetworkX
- Convert connectivity matrices to graphs
- Calculate global network metrics:
* Clustering coefficient (local connectivity)
* Path length (integration)
* Small-worldness (balance of segregation and integration)
* Modularity (community structure)
- Calculate node-level metrics:
* Degree centrality
* Betweenness centrality
* Eigenvector centrality
* Participation coefficient (inter-module connectivity)
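The corresponding graph-metric calls for Step 5, building on the thresholded `conn_z` matrix above:

```python
import numpy as np
import networkx as nx

G = nx.from_numpy_array(np.abs(conn_z))                         # weighted, undirected graph

global_clustering = nx.average_clustering(G, weight="weight")
degree = dict(G.degree(weight="weight"))
betweenness = nx.betweenness_centrality(G, weight="weight")
eigenvector = nx.eigenvector_centrality_numpy(G, weight="weight")
communities = nx.algorithms.community.greedy_modularity_communities(G, weight="weight")
modularity = nx.algorithms.community.modularity(G, communities, weight="weight")
```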
Step 6: Statistical comparison between groups
- Compare patients vs healthy controls
- Use statsmodels for group comparisons:
* Paired or unpaired t-tests for connectivity edges
* FDR correction for multiple comparisons across all edges
* Identify edges with significantly different connectivity
- Compare global and node-level network metrics
- Calculate effect sizes (Cohen's d)
Step 7: Identify altered subnetworks
- Threshold statistical maps (FDR < 0.05)
- Identify clusters of altered connectivity
- Map to functional brain networks:
* Default mode network (DMN)
* Salience network (SN)
* Central executive network (CEN)
* Sensorimotor network
- Visualize altered connections on brain surfaces
Step 8: Machine learning classification
- Train classifier to distinguish patients from controls
- Use scikit-learn Random Forest or SVM
- Features: connectivity values or network metrics
- Cross-validation (10-fold)
- Calculate accuracy, sensitivity, specificity, AUC
- Identify most discriminative features (connectivity edges)
Step 9: Graph neural network analysis with Torch Geometric
- Build graph neural network (GCN or GAT)
- Input: connectivity matrices as adjacency matrices
- Train to predict diagnosis
- Extract learned representations
- Visualize latent space (UMAP)
- Interpret which brain regions are most important
Step 10: Bayesian network modeling with PyMC
- Build directed graphical model of brain networks
- Estimate effective connectivity (directional influence)
- Incorporate prior knowledge about anatomical connections
- Perform posterior inference
- Identify key driver regions in disease
Step 11: Clinical correlation analysis
- Correlate network metrics with clinical scores:
* Symptom severity
* Cognitive performance
* Treatment response
- Use Spearman or Pearson correlation
- Identify brain-behavior relationships
Step 12: Generate comprehensive neuroimaging report
- Brain connectivity matrices (patients vs controls)
- Statistical comparison maps on brain surface
- Network metric comparison bar plots
- Graph visualizations (circular or force-directed layout)
- Machine learning ROC curves
- Brain-behavior correlation plots
- Clinical interpretation:
* Which networks are disrupted?
* Relationship to symptoms
* Potential biomarker utility
- Recommendations:
* Brain regions for therapeutic targeting (TMS, DBS)
* Network metrics as treatment response predictors
- Export publication-ready PDF with brain visualizations
Expected Output:
- Functional connectivity matrices for all subjects
- Statistical maps of altered connectivity
- Graph theory metrics
- Machine learning classification model
- Brain-behavior correlations
- Comprehensive neuroimaging report
Environmental Microbiology
Example 15: Metagenomic Analysis of Environmental Samples
Objective: Characterize microbial community composition and functional potential from environmental DNA samples.
Skills Used:
- `biopython` - Sequence processing
- `pysam` - BAM file handling
- `ena-database` - Sequence data
- `uniprot-database` - Protein annotation
- `kegg-database` - Pathway analysis
- `etetoolkit` - Phylogenetic trees
- `scikit-bio` - Microbial ecology
- `networkx` - Co-occurrence networks
- `statsmodels` - Diversity statistics
- `matplotlib` - Visualization
Workflow:
Step 1: Load and QC metagenomic reads
- Load FASTQ files with BioPython
- Quality control with FastQC-equivalent:
* Remove adapters and low-quality bases (Q < 20)
* Filter short reads (< 50 bp)
* Remove host contamination (if applicable)
- Subsample to even depth if comparing samples
Step 2: Taxonomic classification
- Use a Kraken2-like approach or query the ENA database
- Classify reads to taxonomic lineages
- Generate abundance table:
* Rows: taxa (species or OTUs)
* Columns: samples
* Values: read counts or relative abundance
- Summarize at different levels: phylum, class, order, family, genus, species
Step 3: Calculate diversity metrics with scikit-bio
- Alpha diversity (within-sample):
* Richness (number of species)
* Shannon entropy
* Simpson diversity
* Chao1 estimated richness
- Beta diversity (between-sample):
* Bray-Curtis dissimilarity
* Weighted/unweighted UniFrac distance
* Jaccard distance
- Rarefaction curves to assess sampling completeness
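The diversity calculations might be run with scikit-bio roughly as below; the count table is synthetic and the chosen metrics are examples rather than an exhaustive list:

```python
import numpy as np
import pandas as pd
from skbio.diversity import alpha_diversity, beta_diversity

# Hypothetical taxon count table: rows = samples, columns = taxa
counts = pd.DataFrame(np.random.poisson(5, size=(6, 50)),
                      index=[f"sample_{i}" for i in range(6)])

shannon = alpha_diversity("shannon", counts.values, ids=counts.index)
simpson = alpha_diversity("simpson", counts.values, ids=counts.index)
chao1 = alpha_diversity("chao1", counts.values, ids=counts.index)

bray_curtis = beta_diversity("braycurtis", counts.values, ids=counts.index)
print(shannon.head())
print(bray_curtis)  # DistanceMatrix, usable for PCoA and PERMANOVA
```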
Step 4: Statistical comparison of communities
- Compare diversity between groups (e.g., polluted vs pristine)
- Use statsmodels, scipy.stats, and scikit-bio for:
  * Mann-Whitney or Kruskal-Wallis tests (alpha diversity)
  * PERMANOVA for beta diversity (scikit-bio's permanova, analogous to vegan's adonis)
  * LEfSe-style differential abundance testing
- Identify taxa enriched or depleted in each condition
Step 5: Build phylogenetic tree with ETE Toolkit
- Extract 16S rRNA sequences (or marker genes)
- Align sequences (MUSCLE/MAFFT equivalent)
- Build phylogenetic tree (neighbor-joining or maximum likelihood)
- Visualize tree colored by sample or environment
- Root tree with outgroup
Step 6: Co-occurrence network analysis
- Calculate pairwise correlations between taxa
- Use Spearman correlation to identify co-occurrence patterns
- Filter significant correlations (p < 0.01, |r| > 0.6)
- Build co-occurrence network with NetworkX
- Identify modules (communities of co-occurring taxa)
- Calculate network topology metrics
- Visualize network (nodes = taxa, edges = correlations)
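One possible sketch of the co-occurrence network construction, assuming a samples × taxa abundance array and a matching list of taxon names; the thresholds mirror the values above:

```python
import networkx as nx
from scipy.stats import spearmanr

def cooccurrence_network(abund, taxa, p_cut=0.01, r_cut=0.6):
    """Build a taxon co-occurrence graph from an abundance table (samples x taxa)."""
    rho, pval = spearmanr(abund)            # pairwise Spearman correlations between taxa
    G = nx.Graph()
    G.add_nodes_from(taxa)
    for i in range(len(taxa)):
        for j in range(i + 1, len(taxa)):
            if pval[i, j] < p_cut and abs(rho[i, j]) > r_cut:
                G.add_edge(taxa[i], taxa[j], weight=float(rho[i, j]))
    modules = nx.algorithms.community.greedy_modularity_communities(G)
    return G, modules
```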
Step 7: Functional annotation
- Assemble contigs from reads (if performing assembly)
- Predict genes with Prodigal-like tools
- Annotate genes using UniProt and KEGG
- Map proteins to KEGG pathways
- Generate functional profile:
* Abundance of metabolic pathways
* Key enzymes (nitrification, denitrification, methanogenesis)
* Antibiotic resistance genes
* Virulence factors
Step 8: Functional diversity analysis
- Compare functional profiles between samples
- Calculate pathway richness and evenness
- Identify enriched pathways with statistical testing
- Link taxonomy to function:
* Which taxa contribute to which functions?
* Use shotgun data to assign functions to taxa
Step 9: Search ENA for related environmental samples
- Query ENA for metagenomic studies from similar environments
- Download and compare to own samples
- Place samples in context of global microbiome diversity
- Identify unique vs ubiquitous taxa
Step 10: Environmental parameter correlation
- Correlate community composition with metadata:
* Temperature, pH, salinity
* Nutrient concentrations (N, P)
* Pollutant levels (heavy metals, hydrocarbons)
- Use Mantel test to correlate distance matrices
- Identify environmental drivers of community structure
Step 11: Biomarker discovery
- Identify taxa or pathways that correlate with environmental condition
- Use Random Forest to find predictive features
- Validate biomarkers:
* Sensitivity and specificity
* Cross-validation across samples
- Propose taxa as bioindicators of environmental health
Step 12: Generate environmental microbiome report
- Taxonomic composition bar charts (stacked by phylum/class)
- Alpha and beta diversity plots (boxplots, PCoA)
- Phylogenetic tree with environmental context
- Co-occurrence network visualization
- Functional pathway heatmaps
- Environmental correlation plots
- Statistical comparison tables
- Biological interpretation:
* Dominant taxa and their ecological roles
* Functional potential of the community
* Environmental factors shaping the microbiome
* Biomarker taxa for monitoring
- Recommendations:
* Biomarkers for environmental monitoring
* Functional guilds for restoration
* Further sampling or sequencing strategies
- Export comprehensive PDF report
Expected Output:
- Taxonomic profiles for all samples
- Diversity metrics and statistical comparisons
- Phylogenetic tree
- Co-occurrence network
- Functional annotation and pathway analysis
- Comprehensive microbiome report
Infectious Disease Research
Example 16: Antimicrobial Resistance Surveillance and Prediction
Objective: Track antimicrobial resistance trends and predict resistance phenotypes from genomic data.
Skills Used:
- `biopython` - Sequence analysis
- `pysam` - Genome assembly analysis
- `ena-database` - Public genomic data
- `uniprot-database` - Resistance protein annotation
- `gene-database` - Resistance gene catalogs
- `etetoolkit` - Phylogenetic analysis
- `scikit-learn` - Resistance prediction
- `networkx` - Transmission networks
- `statsmodels` - Trend analysis
- `matplotlib` - Epidemiological plots
Workflow:
Step 1: Collect bacterial genome sequences
- Isolates from hospital surveillance program
- Load FASTA assemblies with BioPython
- Basic QC:
* Assess assembly quality (N50, completeness)
* Estimate genome size and coverage
* Remove contaminated assemblies
Step 2: Species identification and MLST typing
- Perform in silico MLST (multi-locus sequence typing)
- Extract housekeeping gene sequences
- Assign sequence types (ST)
- Classify isolates into clonal complexes
- Identify high-risk clones (e.g., ST131 E. coli, ST258 K. pneumoniae)
Step 3: Antimicrobial resistance (AMR) gene detection
- Query NCBI Gene and UniProt for AMR gene databases
- Screen assemblies for resistance genes:
* Beta-lactamases (blaTEM, blaCTX-M, blaKPC, blaNDM)
* Aminoglycoside resistance (aac, aph, ant)
* Fluoroquinolone resistance (gyrA, parC mutations)
* Colistin resistance (mcr-1 to mcr-10)
* Efflux pumps
- Calculate gene presence/absence matrix
Step 4: Resistance mechanism annotation
- Map detected genes to resistance classes:
* Enzymatic modification (e.g., beta-lactamases)
* Target modification (e.g., ribosomal methylation)
* Target mutation (e.g., fluoroquinolone resistance)
* Efflux pumps
- Query UniProt for detailed mechanism descriptions
- Link genes to antibiotic classes affected
Step 5: Build phylogenetic tree with ETE Toolkit
- Extract core genome SNPs
- Concatenate SNP alignments
- Build maximum likelihood tree
- Root with outgroup or midpoint rooting
- Annotate tree with:
* Resistance profiles
* Sequence types
* Collection date and location
Step 6: Genotype-phenotype correlation
- Match genomic data with phenotypic susceptibility testing
- For each antibiotic, correlate:
* Presence of resistance genes with MIC values
* Target mutations with resistance phenotype
- Calculate sensitivity/specificity of genetic markers
- Identify discordant cases (false positives/negatives)
Step 7: Machine learning resistance prediction
- Train classification models with scikit-learn:
* Features: presence/absence of resistance genes + mutations
* Target: resistance phenotype (susceptible/intermediate/resistant)
* Models: Logistic Regression, Random Forest, Gradient Boosting
- Train separate models for each antibiotic
- Cross-validate (stratified 5-fold)
- Calculate accuracy, precision, recall, F1 score
- Feature importance: which genes are most predictive?
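A minimal sketch of this prediction step with scikit-learn; the gene presence/absence matrix and phenotype labels are random placeholders standing in for real surveillance data:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# X: isolates x resistance genes (binary); y: 1 = resistant, 0 = susceptible (one antibiotic)
X = pd.DataFrame(np.random.randint(0, 2, size=(200, 40)),
                 columns=[f"gene_{i}" for i in range(40)])
y = np.random.randint(0, 2, size=200)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("logreg", LogisticRegression(max_iter=1000)),
                    ("rf", RandomForestClassifier(n_estimators=300, random_state=0))]:
    f1 = cross_val_score(model, X, y, cv=cv, scoring="f1")
    print(f"{name}: F1 = {f1.mean():.2f} +/- {f1.std():.2f}")

# Which genes drive the random-forest predictions?
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
```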
Step 8: Temporal trend analysis
- Track resistance rates over time
- Use statsmodels and companion trend-analysis packages for:
  * Mann-Kendall trend test
  * Joinpoint regression (identify change points)
  * Forecasting future resistance rates (ARIMA)
- Analyze trends for each antibiotic class
- Identify emerging resistance mechanisms
Step 9: Transmission network inference
- Identify closely related isolates (< 10 SNPs difference)
- Build transmission network with NetworkX:
* Nodes: isolates
* Edges: putative transmission links
- Incorporate temporal and spatial data
- Identify outbreak clusters
- Detect super-spreaders (high degree nodes)
- Analyze network topology
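The transmission-network step might be sketched as follows, assuming a precomputed pairwise core-genome SNP distance matrix; the 10-SNP cutoff follows the rule of thumb above and should be tuned per organism:

```python
import networkx as nx

def transmission_network(snp_dist, isolate_ids, snp_cut=10):
    """Link isolates whose pairwise SNP distance falls below the cutoff.

    snp_dist : square array of pairwise SNP distances; isolate_ids : labels.
    """
    G = nx.Graph()
    G.add_nodes_from(isolate_ids)
    n = len(isolate_ids)
    for i in range(n):
        for j in range(i + 1, n):
            if snp_dist[i, j] < snp_cut:
                G.add_edge(isolate_ids[i], isolate_ids[j], snps=int(snp_dist[i, j]))
    clusters = [c for c in nx.connected_components(G) if len(c) > 1]  # putative outbreaks
    degree = dict(G.degree())  # high-degree isolates are candidate super-spreaders
    return G, clusters, degree
```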
Step 10: Search ENA for global context
- Query ENA for same species from other regions/countries
- Download representative genomes
- Integrate into phylogenetic analysis
- Assess whether local isolates are globally distributed clones
- Identify region-specific vs international resistance genes
Step 11: Plasmid and mobile element analysis
- Identify plasmid contigs
- Detect insertion sequences and transposons
- Track mobile genetic elements carrying resistance genes
- Identify conjugative plasmids facilitating horizontal gene transfer
- Build plasmid similarity networks
Step 12: Generate AMR surveillance report
- Summary statistics:
* Number of isolates by species, ST, location
* Resistance rates for each antibiotic
- Phylogenetic tree annotated with resistance profiles
- Temporal trend plots (resistance % over time)
- Transmission network visualizations
- Prediction model performance metrics
- Heatmap: resistance genes by isolate
- Geographic distribution map (if spatial data available)
- Interpretation:
* Predominant resistance mechanisms
* High-risk clones circulating
* Temporal trends and emerging threats
* Transmission clusters and outbreaks
- Recommendations:
* Infection control measures for clusters
* Antibiotic stewardship priorities
* Resistance genes to monitor
* Laboratories to perform confirmatory testing
- Export comprehensive PDF for public health reporting
Expected Output:
- AMR gene profiles for all isolates
- Phylogenetic tree with resistance annotations
- Temporal trends in resistance rates
- ML models for resistance prediction from genomes
- Transmission networks
- Comprehensive AMR surveillance report for public health
Multi-Omics Integration
Example 17: Integrative Analysis of Cancer Multi-Omics Data
Objective: Integrate genomics, transcriptomics, proteomics, and clinical data to identify cancer subtypes and therapeutic strategies.
Skills Used:
- `pydeseq2` - RNA-seq DE analysis
- `pysam` - Variant calling
- `ensembl-database` - Gene annotation
- `cosmic-database` - Cancer mutations
- `string-database` - Protein interactions
- `reactome-database` - Pathway analysis
- `opentargets-database` - Drug targets
- `scikit-learn` - Clustering and classification
- `torch_geometric` - Graph neural networks
- `umap-learn` - Dimensionality reduction
- `statsmodels` - Survival analysis
- `pymoo` - Multi-objective optimization
Workflow:
Step 1: Load and preprocess genomic data (WES/WGS)
- Parse VCF files with pysam
- Filter high-quality variants (QUAL > 30, DP > 20)
- Annotate with Ensembl VEP (missense, nonsense, frameshift)
- Query COSMIC for known cancer mutations
- Create mutation matrix: samples × genes (binary: mutated or not)
- Focus on cancer genes from COSMIC Cancer Gene Census
Step 2: Process transcriptomic data (RNA-seq)
- Load gene count matrix
- Run differential expression with PyDESeq2
- Compare tumor vs normal (if paired samples available)
- Normalize counts (TPM or FPKM)
- Identify highly variable genes
- Create expression matrix: samples × genes (log2 TPM)
Step 3: Load proteomic data (Mass spec)
- Protein abundance matrix from LC-MS/MS
- Normalize protein abundances (median normalization)
- Log2-transform
- Filter proteins detected in < 50% of samples
- Create protein matrix: samples × proteins
Step 4: Load clinical data
- Demographics: age, sex, race
- Tumor characteristics: stage, grade, histology
- Treatment: surgery, chemo, radiation, targeted therapy
- Outcome: overall survival (OS), progression-free survival (PFS)
- Response: complete/partial response, stable/progressive disease
Step 5: Data integration and harmonization
- Match sample IDs across omics layers
- Ensure consistent gene/protein identifiers
- Handle missing data:
* Impute with KNN or median (for moderate missingness)
* Remove features with > 50% missing
- Create multi-omics data structure (dictionary of matrices)
Step 6: Multi-omics dimensionality reduction
- Concatenate all omics features (genes + proteins + mutations)
- Apply UMAP with umap-learn for visualization
- Alternative: PCA or t-SNE
- Visualize samples in 2D space colored by:
* Histological subtype
* Stage
* Survival (high vs low)
- Identify patterns or clusters
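A brief example of the embedding step with umap-learn, assuming the omics layers have already been concatenated into one samples × features matrix (synthetic here):

```python
import numpy as np
import umap
from sklearn.preprocessing import StandardScaler

# Hypothetical concatenated matrix: samples x (genes + proteins + mutations)
X = np.random.randn(120, 3000)

embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42).fit_transform(
    StandardScaler().fit_transform(X))
print(embedding.shape)  # (120, 2) coordinates to color by subtype, stage, or survival
```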
Step 7: Unsupervised clustering to identify subtypes
- Perform consensus clustering with scikit-learn
- Test k = 2 to 10 clusters
- Evaluate cluster stability and optimal k
- Assign samples to clusters (subtypes)
- Visualize clustering in UMAP space
Step 8: Characterize molecular subtypes
For each subtype:
- Differential expression analysis:
* Compare subtype vs all others with PyDESeq2
* Extract top differentially expressed genes and proteins
- Mutation enrichment:
* Fisher's exact test for each gene
* Identify subtype-specific mutations
- Pathway enrichment:
* Query Reactome for enriched pathways
* Query KEGG for metabolic pathway differences
* Identify hallmark biological processes
Step 9: Build protein-protein interaction networks
- Query STRING database for interactions among:
* Differentially expressed proteins
* Products of mutated genes
- Construct PPI network with NetworkX
- Identify network modules (community detection)
- Calculate centrality metrics to find hub proteins
- Overlay fold changes on network for visualization
Step 10: Survival analysis by subtype
- Use statsmodels or lifelines for survival analysis
- Kaplan-Meier curves for each subtype
- Log-rank test for significance
- Cox proportional hazards model:
* Covariates: subtype, stage, age, treatment
* Estimate hazard ratios
- Identify prognostic subtypes
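Survival analysis by subtype might look like this lifelines sketch; the clinical table is synthetic, and the column names (os_months, event, subtype) are assumptions about how the data are organized:

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter, KaplanMeierFitter
from lifelines.statistics import multivariate_logrank_test

rng = np.random.default_rng(0)
n = 150
df = pd.DataFrame({
    "os_months": rng.exponential(30, n).round(1),
    "event": rng.integers(0, 2, n),          # 1 = death observed, 0 = censored
    "subtype": rng.choice(["A", "B", "C"], n),
    "age": rng.integers(40, 80, n),
    "stage": rng.integers(1, 4, n),
})

# Kaplan-Meier curve per subtype plus a log-rank test across subtypes
kmf = KaplanMeierFitter()
for name, grp in df.groupby("subtype"):
    kmf.fit(grp["os_months"], grp["event"], label=name)
    # kmf.plot_survival_function()  # uncomment to draw the curves
print(multivariate_logrank_test(df["os_months"], df["subtype"], df["event"]).p_value)

# Cox proportional hazards with subtype, stage, and age as covariates
cph = CoxPHFitter()
cph.fit(pd.get_dummies(df, columns=["subtype"], drop_first=True, dtype=float),
        duration_col="os_months", event_col="event")
cph.print_summary()
```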
Step 11: Predict therapeutic response
- Train machine learning models with scikit-learn:
* Features: multi-omics data
* Target: response to specific therapy (responder/non-responder)
* Models: Random Forest, XGBoost, SVM
- Cross-validation to assess performance
- Identify features predictive of response
- Calculate AUC and feature importance
Step 12: Graph neural network for integrated prediction
- Build heterogeneous graph with Torch Geometric:
* Nodes: samples, genes, proteins, pathways
* Edges: gene-protein, protein-protein, gene-pathway
* Node features: expression, mutation status
- Train GNN to predict:
* Subtype classification
* Survival risk
* Treatment response
- Extract learned embeddings for interpretation
Step 13: Identify therapeutic targets with Open Targets
- For each subtype, query Open Targets:
* Input: upregulated genes/proteins
* Extract target-disease associations
* Prioritize by tractability score
- Search for FDA-approved drugs targeting identified proteins
- Identify clinical trials for relevant targets
- Propose subtype-specific therapeutic strategies
Step 14: Multi-objective optimization of treatment strategies
- Use PyMOO to optimize treatment selection:
* Objectives:
1. Maximize predicted response probability
2. Minimize predicted toxicity
3. Minimize cost
* Constraints: patient eligibility, drug availability
- Generate Pareto-optimal treatment strategies
- Personalized treatment recommendations per patient
Step 15: Generate comprehensive multi-omics report
- Sample clustering and subtype assignments
- UMAP visualization colored by subtype, survival, mutations
- Subtype characterization:
* Molecular signatures (genes, proteins, mutations)
* Enriched pathways
* PPI networks
- Kaplan-Meier survival curves by subtype
- ML model performance (AUC, confusion matrices)
- Feature importance plots
- Therapeutic target tables with supporting evidence
- Personalized treatment recommendations
- Clinical implications:
* Prognostic biomarkers
* Predictive biomarkers for therapy selection
* Novel drug targets
- Export publication-quality PDF with all figures and tables
Expected Output:
- Integrated multi-omics dataset
- Cancer subtype classification
- Molecular characterization of subtypes
- Survival analysis and prognostic markers
- Predictive models for treatment response
- Therapeutic target identification
- Personalized treatment strategies
- Comprehensive integrative genomics report
Experimental Physics & Data Analysis
Example 18: Analysis of Particle Physics Detector Data
Objective: Analyze experimental data from particle detector to identify signal events and measure physical constants.
Skills Used:
- `astropy` - Units and constants
- `sympy` - Symbolic mathematics
- `scipy` - Statistical analysis
- `scikit-learn` - Classification
- `stable-baselines3` - Reinforcement learning for optimization
- `matplotlib` - Visualization
- `seaborn` - Statistical plots
- `statsmodels` - Hypothesis testing
- `dask` - Large-scale data processing
- `vaex` - Out-of-core dataframes
Workflow:
Step 1: Load and inspect detector data
- Load ROOT files or HDF5 with raw detector signals
- Use Vaex for out-of-core processing (TBs of data)
- Inspect data structure: event IDs, timestamps, detector channels
- Extract key observables:
* Energy deposits in calorimeters
* Particle trajectories from tracking detectors
* Time-of-flight measurements
* Trigger information
Step 2: Apply detector calibration and corrections
- Load calibration constants
- Apply energy calibrations to convert ADC to physical units
- Correct for detector efficiency variations
- Apply geometric corrections (alignment)
- Use Astropy units for unit conversions (eV, GeV, MeV)
- Account for dead time and detector acceptance
Step 3: Event reconstruction
- Cluster energy deposits to form particle candidates
- Reconstruct particle trajectories (tracks)
- Match tracks to calorimeter clusters
- Calculate invariant masses for particle identification
- Compute momentum and energy for each particle
- Use Dask for parallel processing across events
Step 4: Event selection and filtering
- Define signal region based on physics hypothesis
- Apply quality cuts:
* Track quality (chi-squared, number of hits)
* Fiducial volume cuts
* Timing cuts (beam window)
* Particle identification cuts
- Estimate trigger efficiency
- Calculate event weights for corrections
Step 5: Background estimation
- Identify background sources:
* Cosmic rays
* Beam-related backgrounds
* Detector noise
* Physics backgrounds (non-signal processes)
- Simulate backgrounds using Monte Carlo (if available)
- Estimate background from data in control regions
- Use sideband subtraction method
Step 6: Signal extraction
- Fit invariant mass distributions to extract signal
- Use scipy for likelihood fitting:
* Signal model: Gaussian or Breit-Wigner
* Background model: polynomial or exponential
* Combined fit with maximum likelihood
- Calculate signal significance (S/√B or Z-score)
- Estimate systematic uncertainties
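A hedged sketch of the signal-extraction fit: a Gaussian peak on an exponential background fitted to a toy invariant-mass histogram with scipy; the resonance mass, yields, and binning are illustrative only:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(m, n_sig, mu, sigma, n_bkg, slope):
    """Per-bin expected counts: Gaussian signal plus exponential background."""
    return (n_sig * np.exp(-0.5 * ((m - mu) / sigma) ** 2)
            + n_bkg * np.exp(-slope * m))

# Toy invariant-mass sample (GeV) with a narrow resonance near 3.10 GeV
rng = np.random.default_rng(1)
masses = np.concatenate([rng.normal(3.10, 0.02, 500),        # signal
                         rng.exponential(1.0, 5000) + 2.5])   # background
counts, edges = np.histogram(masses, bins=100, range=(2.8, 3.4))
centers = 0.5 * (edges[:-1] + edges[1:])

popt, pcov = curve_fit(model, centers, counts, p0=[100, 3.1, 0.02, 100, 1.0],
                       sigma=np.sqrt(np.maximum(counts, 1)), absolute_sigma=True)
perr = np.sqrt(np.diag(pcov))
print(f"fitted mass = {popt[1]:.4f} +/- {perr[1]:.4f} GeV, width = {popt[2]:.4f} GeV")
```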
Step 7: Machine learning event classification
- Train classifier with scikit-learn to separate signal from background
- Features: kinematic variables, topology, detector response
- Models: Boosted Decision Trees (XGBoost), Neural Networks
- Cross-validate with k-fold CV
- Optimize selection criteria using ROC curves
- Calculate signal efficiency and background rejection
Step 8: Reinforcement learning for trigger optimization
- Use Stable-Baselines3 to optimize trigger thresholds
- Environment: detector simulator
- Action: adjust trigger thresholds
- Reward: maximize signal efficiency while controlling rate
- Train PPO or SAC agent
- Validate on real data
Step 9: Calculate physical observables
- Measure cross-sections:
* σ = N_signal / (ε × L × BR)
* N_signal: number of signal events
* ε: detection efficiency
* L: integrated luminosity
* BR: branching ratio
- Use Sympy for symbolic error propagation
- Calculate with Astropy for proper unit handling
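The cross-section formula and its first-order error propagation can be handled symbolically with SymPy, assuming uncorrelated uncertainties; all numerical inputs below are placeholders:

```python
import sympy as sp

# sigma = N_signal / (efficiency * integrated luminosity * branching ratio)
N, eps, L, BR = sp.symbols("N_signal epsilon L BR", positive=True)
sigma = N / (eps * L * BR)

# Gaussian (first-order) error propagation for uncorrelated uncertainties
dN, deps, dL, dBR = sp.symbols("dN deps dL dBR", positive=True)
dsigma = sp.sqrt(sum((sp.diff(sigma, v) * dv) ** 2
                     for v, dv in [(N, dN), (eps, deps), (L, dL), (BR, dBR)]))
print(sp.simplify(dsigma / sigma))  # relative uncertainty on the cross-section

# Numerical evaluation with placeholder values
values = {N: 1200, eps: 0.35, L: 5.0e3, BR: 0.06,
          dN: 40, deps: 0.02, dL: 1.0e2, dBR: 0.003}
print(sigma.subs(values).evalf(), "+/-", dsigma.subs(values).evalf())
```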
Step 10: Statistical analysis and hypothesis testing
- Perform hypothesis tests with statsmodels:
* Likelihood ratio test for signal vs background-only
* Calculate p-values and significance levels
* Set confidence limits (CLs method)
- Bayesian analysis for parameter estimation
- Calculate confidence intervals and error bands
Step 11: Systematic uncertainty evaluation
- Identify sources of systematic uncertainty:
* Detector calibration uncertainties
* Background estimation uncertainties
* Theoretical uncertainties (cross-sections, PDFs)
* Monte Carlo modeling uncertainties
- Propagate uncertainties through analysis chain
- Combine statistical and systematic uncertainties
- Present as error budget
Step 12: Create comprehensive physics report
- Event displays showing candidate signal events
- Kinematic distributions (momentum, energy, angles)
- Invariant mass plots with fitted signal
- ROC curves for ML classifiers
- Cross-section measurements with error bars
- Comparison with theoretical predictions
- Systematic uncertainty breakdown
- Statistical significance calculations
- Interpretation:
* Consistency with Standard Model
* Constraints on new physics parameters
* Discovery potential or exclusion limits
- Recommendations:
* Detector improvements
* Additional data needed
* Future analysis strategies
- Export publication-ready PDF formatted for physics journal
Expected Output:
- Reconstructed physics events
- Signal vs background classification
- Measured cross-sections and branching ratios
- Statistical significance of observations
- Systematic uncertainty analysis
- Comprehensive experimental physics paper
Chemical Engineering & Process Optimization
Example 19: Optimization of Chemical Reactor Design and Operation
Objective: Design and optimize a continuous chemical reactor for maximum yield and efficiency while meeting safety and economic constraints.
Skills Used:
- `sympy` - Symbolic equations and reaction kinetics
- `scipy` - Numerical integration and optimization
- `pymoo` - Multi-objective optimization
- `simpy` - Process simulation
- `pymc` - Bayesian parameter estimation
- `scikit-learn` - Process modeling
- `stable-baselines3` - Real-time control optimization
- `matplotlib` - Process diagrams
- `reportlab` - Engineering reports
Workflow:
Step 1: Define reaction system and kinetics
- Chemical reaction: A + B → C + D
- Use Sympy to define symbolic rate equations:
* Arrhenius equation: k = A × exp(-Ea/RT)
* Rate law: r = k × [A]^α × [B]^β
- Define material and energy balances symbolically
- Include equilibrium constants and thermodynamics
- Account for side reactions and byproducts
Step 2: Develop reactor model
- Select reactor type: CSTR, PFR, batch, or semi-batch
- Write conservation equations:
* Mass balance: dC/dt = (F_in × C_in - F_out × C)/V + r
* Energy balance: V × ρ × Cp × dT/dt = Q - ΔH_rxn × r × V
* Momentum balance (pressure drop)
- Include heat transfer correlations
- Model mixing and mass transfer limitations
Step 3: Parameter estimation with PyMC
- Load experimental data from pilot reactor
- Bayesian inference to estimate kinetic parameters:
* Pre-exponential factor (A)
* Activation energy (Ea)
* Reaction orders (α, β)
- Use MCMC sampling with PyMC
- Incorporate prior knowledge from literature
- Calculate posterior distributions and credible intervals
- Assess parameter uncertainty and correlation
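A minimal PyMC sketch of the Arrhenius parameter estimation, assuming rate constants measured at several temperatures; priors and data values are illustrative, not pilot-plant results:

```python
import arviz as az
import numpy as np
import pymc as pm

R = 8.314                                                # J/(mol K)
T_data = np.array([320.0, 340.0, 360.0, 380.0, 400.0])   # K
k_obs = np.array([0.012, 0.031, 0.070, 0.148, 0.290])    # 1/s, illustrative

with pm.Model() as arrhenius:
    lnA = pm.Normal("lnA", mu=10.0, sigma=5.0)           # prior on ln(pre-exponential factor)
    Ea = pm.Normal("Ea", mu=60_000, sigma=20_000)        # prior on activation energy (J/mol)
    noise = pm.HalfNormal("noise", sigma=0.5)

    ln_k = lnA - Ea / (R * T_data)                       # Arrhenius equation in log space
    pm.Normal("obs", mu=ln_k, sigma=noise, observed=np.log(k_obs))

    trace = pm.sample(2000, tune=2000, target_accept=0.9, random_seed=0)

print(az.summary(trace, var_names=["lnA", "Ea"]))        # posterior means and credible intervals
```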
Step 4: Model validation
- Simulate reactor with estimated parameters using scipy.integrate
- Compare predictions with experimental data
- Calculate goodness of fit (R², RMSE)
- Perform sensitivity analysis:
* Which parameters most affect yield?
* Identify critical operating conditions
- Refine model if needed
Step 5: Machine learning surrogate model
- Train fast surrogate model with scikit-learn
- Generate training data from detailed model (1000+ runs)
- Features: T, P, residence time, feed composition, catalyst loading
- Target: yield, selectivity, conversion
- Models: Gaussian Process Regression, Random Forest
- Validate surrogate accuracy (R² > 0.95)
- Use for rapid optimization
Step 6: Single-objective optimization
- Maximize yield with scipy.optimize:
* Decision variables: T, P, feed ratio, residence time
* Objective: maximize Y = (moles C produced) / (moles A fed)
* Constraints:
- Temperature: 300 K ≤ T ≤ 500 K (safety)
- Pressure: 1 bar ≤ P ≤ 50 bar (equipment limits)
- Residence time: 1 min ≤ τ ≤ 60 min
- Conversion: X_A ≥ 90%
- Use Sequential Least Squares Programming (SLSQP)
- Identify optimal operating point
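The single-objective optimization could be set up with scipy.optimize as in this sketch; the objective and conversion functions are toy stand-ins for the surrogate or kinetic model, and the bounds mirror the constraints listed above:

```python
import numpy as np
from scipy.optimize import minimize

def negative_yield(x):
    """Toy response surface standing in for the reactor model; x = [T, P, feed_ratio, tau]."""
    T, P, ratio, tau = x
    return -(0.9 - 1e-5 * (T - 430) ** 2 - 0.002 * (ratio - 1.2) ** 2 + 0.001 * np.log(tau))

def conversion_constraint(x):
    """Inequality constraint X_A - 0.90 >= 0, with an illustrative conversion model."""
    T, P, ratio, tau = x
    X_A = 1.0 - np.exp(-0.05 * tau * (T / 400.0))
    return X_A - 0.90

result = minimize(negative_yield,
                  x0=[400.0, 10.0, 1.0, 30.0],
                  method="SLSQP",
                  bounds=[(300, 500), (1, 50), (0.5, 3.0), (1, 60)],
                  constraints=[{"type": "ineq", "fun": conversion_constraint}])
print(result.x, -result.fun)   # optimal [T, P, ratio, tau] and the corresponding yield
```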
Step 7: Multi-objective optimization with PyMOO
- Competing objectives:
1. Maximize product yield
2. Minimize energy consumption (heating/cooling)
3. Minimize operating cost (raw materials, utilities)
4. Maximize reactor productivity (throughput)
- Constraints:
- Safety: temperature and pressure limits
- Environmental: waste production limits
- Economic: minimum profitability
- Run NSGA-II or NSGA-III
- Generate Pareto front of optimal solutions
- Select operating point based on preferences
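A sketch of the multi-objective step with pymoo (NSGA-II), again using toy stand-ins for yield, energy cost, and conversion; in practice the surrogate model from Step 5 would supply these functions:

```python
import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

class ReactorProblem(ElementwiseProblem):
    """Minimize [-yield, energy cost] over decision variables [T, P, tau]."""

    def __init__(self):
        super().__init__(n_var=3, n_obj=2, n_ieq_constr=1,
                         xl=np.array([300.0, 1.0, 1.0]),
                         xu=np.array([500.0, 50.0, 60.0]))

    def _evaluate(self, x, out, *args, **kwargs):
        T, P, tau = x
        yield_ = 0.9 - 1e-5 * (T - 430) ** 2 + 0.001 * np.log(tau)   # stand-in for surrogate
        energy = 0.002 * (T - 300) + 0.01 * P                        # stand-in for utility cost
        out["F"] = [-yield_, energy]                                 # both objectives minimized
        out["G"] = [0.90 - (1.0 - np.exp(-0.05 * tau * T / 400.0))]  # conversion >= 90% as G <= 0

res = minimize(ReactorProblem(), NSGA2(pop_size=80), ("n_gen", 100), seed=1, verbose=False)
pareto_front = res.F     # yield vs energy trade-off
pareto_designs = res.X   # corresponding [T, P, tau] settings
```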
Step 8: Dynamic process simulation with SimPy
- Model complete plant:
* Reactors, separators, heat exchangers
* Pumps, compressors, valves
* Storage tanks and buffers
- Simulate startup, steady-state, and shutdown
- Include disturbances:
* Feed composition variations
* Equipment failures
* Demand fluctuations
- Evaluate dynamic stability
- Calculate time to steady state
Step 9: Control system design
- Design feedback control loops:
* Temperature control (PID controller)
* Pressure control
* Flow control
* Level control
- Tune PID parameters using Ziegler-Nichols or optimization
- Implement cascade control for improved performance
- Add feedforward control for disturbance rejection
Step 10: Reinforcement learning for advanced control
- Use Stable-Baselines3 to train RL agent:
* Environment: reactor simulation (SimPy-based)
* State: T, P, concentrations, flow rates
* Actions: adjust setpoints, flow rates, heating/cooling
* Reward: +yield -energy cost -deviation from setpoint
- Train PPO or TD3 agent
- Compare with conventional PID control
- Evaluate performance under disturbances
- Implement model-free adaptive control
Step 11: Economic analysis
- Calculate capital costs (CAPEX):
* Reactor vessel cost (function of size, pressure rating)
* Heat exchanger costs
* Pumps and instrumentation
* Installation costs
- Calculate operating costs (OPEX):
* Raw materials (A, B, catalyst)
* Utilities (steam, cooling water, electricity)
* Labor and maintenance
- Revenue from product sales
- Calculate economic metrics:
* Net present value (NPV)
* Internal rate of return (IRR)
* Payback period
* Levelized cost of production
Step 12: Safety analysis
- Identify hazards:
* Exothermic runaway reactions
* Pressure buildup
* Toxic or flammable materials
- Perform HAZOP-style analysis
- Calculate safe operating limits:
* Maximum temperature of synthesis reaction (MTSR)
* Adiabatic temperature rise
* Relief valve sizing
- Design emergency shutdown systems
- Implement safety interlocks
Step 13: Uncertainty quantification
- Propagate parameter uncertainties from PyMC:
* How does kinetic parameter uncertainty affect yield?
* Monte Carlo simulation with parameter distributions
- Evaluate robustness of optimal design
- Calculate confidence intervals on economic metrics
- Identify critical uncertainties for further study
Step 14: Generate comprehensive engineering report
- Executive summary of project objectives and results
- Process flow diagram (PFD) with material and energy streams
- Reaction kinetics and model equations
- Parameter estimation results with uncertainties
- Optimization results:
* Pareto front for multi-objective optimization
* Recommended operating conditions
* Trade-off analysis
- Dynamic simulation results (startup curves, response to disturbances)
- Control system design and tuning
- Economic analysis with sensitivity to key assumptions
- Safety analysis and hazard mitigation
- Scale-up considerations:
* Pilot to commercial scale
* Heat and mass transfer limitations
* Equipment sizing
- Recommendations:
* Optimal reactor design (size, type, materials of construction)
* Operating conditions for maximum profitability
* Control strategy
* Further experimental studies needed
- Technical drawings and P&ID (piping and instrumentation diagram)
- Export as professional engineering report (PDF)
Expected Output:
- Validated reactor model with parameter uncertainties
- Optimal reactor design and operating conditions
- Pareto-optimal solutions for multi-objective optimization
- Dynamic process simulation results
- Advanced control strategies (RL-based)
- Economic feasibility analysis
- Safety assessment
- Comprehensive chemical engineering design report
Summary
These examples demonstrate:
- Cross-domain applicability: Skills are useful across many scientific fields
- Skill integration: Complex workflows combine multiple databases, packages, and analysis methods
- Real-world relevance: Examples address actual research questions and clinical needs
- End-to-end workflows: From data acquisition to publication-ready reports
- Best practices: QC, statistical rigor, visualization, interpretation, and documentation
How to Use These Examples
- Adapt to your needs: Modify parameters, datasets, and objectives for your specific research question
- Combine skills creatively: Mix and match skills from different categories
- Follow the structure: Each example provides a clear step-by-step workflow
- Generate comprehensive output: Aim for publication-quality figures and professional reports
- Cite your sources: Always verify data and provide proper citations
Additional Notes
- Always start with: "Always use available 'skills' when possible. Keep the output organized."
- For complex projects, break into manageable steps and validate intermediate results
- Save checkpoints and intermediate data files
- Document parameters and decisions for reproducibility
- Generate README files explaining methodology
- Create PDFs for stakeholder communication
These examples showcase the power of combining the skills in this repository to tackle complex, real-world scientific challenges across multiple domains.