# Scientific Packages

## Bioinformatics & Genomics

- **AnnData** - Annotated data matrices for single-cell genomics and h5ad files
- **Arboreto** - Gene regulatory network inference using GRNBoost2 and GENIE3
- **BioPython** - Sequence manipulation, NCBI database access, BLAST searches, alignments, and phylogenetics
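  A minimal BioPython sketch (the FASTA path is hypothetical):

  ```python
  from Bio import SeqIO
  from Bio.Seq import Seq

  # Basic sequence manipulation
  seq = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
  print(seq.reverse_complement())
  print(seq.translate())  # protein translation, stop codons shown as '*'

  # Iterate records in a FASTA file (path is an assumption)
  for record in SeqIO.parse("example.fasta", "fasta"):
      print(record.id, len(record.seq))
  ```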
- **BioServices** - Programmatic access to 40+ biological web services (KEGG, UniProt, ChEBI, ChEMBL)
- **Cellxgene Census** - Query and analyze large-scale single-cell RNA-seq data
- **gget** - Efficient genomic database queries (Ensembl, UniProt, NCBI, PDB, COSMIC)
- **geniml** - Genomic interval machine learning toolkit providing unsupervised methods for building ML models on BED files. Key capabilities include Region2Vec (word2vec-style embeddings of genomic regions and region sets using tokenization and neural language modeling), BEDspace (joint embeddings of regions and metadata labels using StarSpace for cross-modal queries), scEmbed (Region2Vec applied to single-cell ATAC-seq data generating cell-level embeddings for clustering and annotation with scanpy integration), consensus peak building (four statistical methods CC/CCF/ML/HMM for creating reference universes from BED collections), and comprehensive utilities (BBClient for BED caching, BEDshift for genomic randomization preserving context, evaluation metrics for embedding quality, Text2BedNN for neural search backends). Part of the BEDbase ecosystem. Supports Python API and CLI workflows, pre-trained models on Hugging Face, and integration with gtars for tokenization. Use cases: region similarity searches, dimension reduction of chromatin accessibility data, scATAC-seq clustering and cell-type annotation, metadata-aware genomic queries, universe construction for standardized references, and any ML task requiring genomic region feature vectors
- **gtars** - High-performance Rust toolkit for genomic interval analysis providing specialized tools for overlap detection using IGD (Integrated Genome Database) indexing, coverage track generation (uniwig module for WIG/BigWig formats), genomic tokenization for machine learning applications (TreeTokenizer for deep learning models), reference sequence management (refget protocol compliance), fragment processing for single-cell genomics (barcode-based splitting and cluster analysis), and fragment scoring against reference datasets. Offers Python bindings with NumPy integration, command-line tools (gtars-cli), and a Rust library. Key modules include: tokenizers (convert genomic regions to ML tokens), overlaprs (efficient overlap computation), uniwig (ATAC-seq/ChIP-seq/RNA-seq coverage profiles), refget (GA4GH-compliant sequence digests), bbcache (BEDbase.org integration), scoring (fragment enrichment metrics), and fragsplit (single-cell fragment manipulation). Supports parallel processing, memory-mapped files, and streaming for large datasets, and serves as the foundation for the geniml genomic ML package. Ideal for genomic ML preprocessing, regulatory element analysis, variant annotation, chromatin accessibility profiling, and computational genomics workflows
- **pysam** - Read, write, and manipulate genomic data files (SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences) with pileup analysis, coverage calculations, and bioinformatics workflows
- **PyDESeq2** - Differential gene expression analysis for bulk RNA-seq data
- **Scanpy** - Single-cell RNA-seq analysis with clustering, marker genes, and UMAP/t-SNE visualization
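  A minimal Scanpy clustering sketch (the .h5ad path is hypothetical):

  ```python
  import scanpy as sc

  adata = sc.read_h5ad("pbmc.h5ad")            # AnnData object (path is an assumption)
  sc.pp.normalize_total(adata, target_sum=1e4)  # library-size normalization
  sc.pp.log1p(adata)
  sc.pp.highly_variable_genes(adata, n_top_genes=2000)
  sc.tl.pca(adata)
  sc.pp.neighbors(adata)                        # kNN graph on PCA space
  sc.tl.umap(adata)
  sc.tl.leiden(adata)                           # graph-based clustering
  sc.pl.umap(adata, color="leiden")
  ```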
- **scvi-tools** - Probabilistic deep learning models for single-cell omics analysis. PyTorch-based framework providing variational autoencoders (VAEs) for dimensionality reduction, batch correction, differential expression, and data integration across modalities. Includes 25+ models: scVI/scANVI (RNA-seq integration and cell type annotation), totalVI (CITE-seq protein+RNA), MultiVI (multiome RNA+ATAC integration), PeakVI (ATAC-seq analysis), DestVI/Stereoscope/Tangram (spatial transcriptomics deconvolution), MethylVI (methylation), CytoVI (flow/mass cytometry), VeloVI (RNA velocity), contrastiveVI (perturbation studies), and Solo (doublet detection). Supports seamless integration with the Scanpy/AnnData ecosystem, GPU acceleration, reference mapping (scArches), and probabilistic differential expression with uncertainty quantification

## Data Management & Infrastructure

- **LaminDB** - Open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR (Findable, Accessible, Interoperable, Reusable). Provides a unified platform combining lakehouse architecture, lineage tracking, feature stores, biological ontologies (via Bionty plugin with 20+ ontologies: genes, proteins, cell types, tissues, diseases, pathways), LIMS, and ELN capabilities through a single Python API. Key features include: automatic data lineage tracking (code, inputs, outputs, environment), versioned artifacts (DataFrame, AnnData, SpatialData, Parquet, Zarr), schema validation and data curation with standardization/synonym mapping, queryable metadata with feature-based filtering, cross-registry traversal, and streaming for large datasets. Supports integrations with workflow managers (Nextflow, Snakemake, Redun), MLOps platforms (Weights & Biases, MLflow, HuggingFace, scVI-tools), cloud storage (S3, GCS, S3-compatible), array stores (TileDB-SOMA, DuckDB), and visualization (Vitessce). Deployment options: local SQLite, cloud storage with SQLite, or cloud storage with PostgreSQL for production. Use cases: scRNA-seq standardization and analysis, flow cytometry/spatial data management, multi-modal dataset integration, computational workflow tracking with reproducibility, biological ontology-based annotation, data lakehouse construction for unified queries, ML pipeline integration with experiment tracking, and FAIR-compliant dataset publishing

## Cheminformatics & Drug Discovery

- **Datamol** - Molecular manipulation and featurization with enhanced RDKit workflows
- **DeepChem** - Molecular machine learning, graph neural networks, and MoleculeNet benchmarks
- **DiffDock** - Diffusion-based molecular docking for protein-ligand binding prediction
- **MedChem** - Medicinal chemistry analysis, ADMET prediction, and drug-likeness assessment
- **Molfeat** - 100+ molecular featurizers including fingerprints, descriptors, and pretrained models
- **PyTDC** - Therapeutics Data Commons for drug discovery datasets and benchmarks
- **RDKit** - Cheminformatics toolkit for molecular I/O, descriptors, fingerprints, and SMARTS
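  A minimal RDKit sketch:

  ```python
  from rdkit import Chem
  from rdkit.Chem import AllChem, Descriptors

  mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
  print(Descriptors.MolWt(mol))         # molecular weight descriptor
  print(Chem.MolToSmiles(mol))          # canonical SMILES

  # 2048-bit Morgan (ECFP4-like) fingerprint
  fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

  # SMARTS substructure query: carboxylic acid
  print(mol.HasSubstructMatch(Chem.MolFromSmarts("C(=O)[OH]")))
  ```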
- **TorchDrug** - PyTorch-based machine learning platform for drug discovery with 40+ datasets, 20+ GNN models for molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning

## Proteomics & Mass Spectrometry

- **matchms** - Processing and similarity matching of mass spectrometry data with 40+ filters, spectral library matching (Cosine, Modified Cosine, Neutral Losses), metadata harmonization, molecular fingerprint comparison, and support for multiple file formats (MGF, MSP, mzML, JSON)
- **pyOpenMS** - Comprehensive mass spectrometry data analysis for proteomics and metabolomics (LC-MS/MS processing, peptide identification, feature detection, quantification, chemical calculations, and integration with search engines like Comet, Mascot, MSGF+)

## Medical Imaging & Digital Pathology

- **histolab** - Digital pathology toolkit for whole slide image (WSI) processing and analysis. Provides automated tissue detection, tile extraction for deep learning pipelines, and preprocessing for gigapixel histopathology images. Key features include: multi-format WSI support (SVS, TIFF, NDPI), three tile extraction strategies (RandomTiler for sampling, GridTiler for complete coverage, ScoreTiler for quality-driven selection), automated tissue masks with customizable filters, built-in scorers (NucleiScorer, CellularityScorer), pyramidal image handling, visualization tools (thumbnails, mask overlays, tile previews), and H&E stain decomposition. Supports multiple tissue sections, artifact removal, pen annotation exclusion, and reproducible extraction with seeding. Use cases: creating training datasets for computational pathology, extracting informative tiles for tumor classification, whole-slide tissue characterization, quality assessment of histology samples, automated nuclei density analysis, and preprocessing for digital pathology deep learning workflows
- **PathML** - Comprehensive computational pathology toolkit for whole slide image analysis, tissue segmentation, and machine learning on pathology data. Provides end-to-end workflows for digital pathology research including data loading, preprocessing, feature extraction, and model deployment
- **pydicom** - Pure Python package for working with DICOM (Digital Imaging and Communications in Medicine) files. Provides comprehensive support for reading, writing, and manipulating medical imaging data from CT, MRI, X-ray, ultrasound, PET scans, and other modalities. Key features include: pixel data extraction and manipulation with automatic decompression (JPEG/JPEG 2000/RLE), metadata access and modification with 1000+ standardized DICOM tags, image format conversion (PNG/JPEG/TIFF), anonymization tools for removing Protected Health Information (PHI), windowing and display transformations (VOI LUT application), multi-frame and 3D volume processing, DICOM sequence handling, and support for multiple transfer syntaxes. Use cases: medical image analysis, PACS system integration, radiology workflows, research data processing, DICOM anonymization, format conversion, image preprocessing for machine learning, multi-slice volume reconstruction, and clinical imaging pipelines
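  A minimal pydicom sketch (file paths are hypothetical):

  ```python
  import pydicom

  ds = pydicom.dcmread("slice_001.dcm")   # path is an assumption
  print(ds.PatientName, ds.Modality)      # standard DICOM tag access
  arr = ds.pixel_array                    # NumPy array of the pixel data
  print(arr.shape, arr.dtype)

  # Basic anonymization: overwrite PHI-bearing elements before saving
  ds.PatientName = "ANONYMOUS"
  ds.PatientID = "000000"
  ds.save_as("slice_001_anon.dcm")
  ```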
## Healthcare AI & Clinical Machine Learning

- **NeuroKit2** - Comprehensive biosignal processing toolkit for analyzing physiological data including ECG, EEG, EDA, RSP, PPG, EMG, and EOG signals. Use this skill when processing cardiovascular signals, brain activity, electrodermal responses, respiratory patterns, muscle activity, or eye movements. Key features include: automated signal processing pipelines (cleaning, peak detection, delineation, quality assessment), heart rate variability analysis across time/frequency/nonlinear domains (SDNN, RMSSD, LF/HF, DFA, entropy measures), EEG analysis (frequency band power, microstates, source localization), autonomic nervous system assessment (sympathetic indices, respiratory sinus arrhythmia), comprehensive complexity measures (25+ entropy types, 15+ fractal dimensions, Lyapunov exponents), event-related and interval-related analysis modes, epoch creation and averaging for stimulus-locked responses, multi-signal integration with unified workflows, and extensive signal processing utilities (filtering, decomposition, peak correction, spectral analysis). Includes modular reference documentation across 12 specialized domains. Use cases: heart rate variability for cardiovascular health assessment, EEG microstates for consciousness studies, electrodermal activity for emotion research, respiratory variability analysis, psychophysiology experiments, affective computing, stress monitoring, sleep staging, autonomic dysfunction assessment, biofeedback applications, and multi-modal physiological signal integration for comprehensive human state monitoring
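  A minimal NeuroKit2 HRV sketch on simulated data (output column names follow the library's HRV_* convention):

  ```python
  import neurokit2 as nk

  # Simulate 60 s of ECG, then run the automated processing pipeline
  ecg = nk.ecg_simulate(duration=60, sampling_rate=250, heart_rate=70)
  signals, info = nk.ecg_process(ecg, sampling_rate=250)

  # Time-domain heart rate variability from the detected R-peaks
  hrv = nk.hrv_time(info["ECG_R_Peaks"], sampling_rate=250)
  print(hrv[["HRV_SDNN", "HRV_RMSSD"]])
  ```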
- **PyHealth** - Comprehensive healthcare AI toolkit for developing, testing, and deploying machine learning models with clinical data. Provides specialized tools for electronic health records (EHR), physiological signals, medical imaging, and clinical text analysis. Key features include: 10+ healthcare datasets (MIMIC-III/IV, eICU, OMOP, sleep EEG, COVID-19 CXR), 20+ predefined clinical prediction tasks (mortality, hospital readmission, length of stay, drug recommendation, sleep staging, EEG analysis), 33+ models (Logistic Regression, MLP, CNN, RNN, Transformer, GNN, plus healthcare-specific models like RETAIN, SafeDrug, GAMENet, StageNet), comprehensive data processing (sequence processors, signal processors, medical code translation between ICD-9/10, NDC, RxNorm, ATC systems), training/evaluation utilities (Trainer class, fairness metrics, calibration, uncertainty quantification), and interpretability tools (attention visualization, SHAP, ChEFER). 3x faster than pandas for healthcare data processing. Use cases: ICU mortality prediction, hospital readmission risk assessment, safe medication recommendation with drug-drug interaction constraints, sleep disorder diagnosis from EEG signals, medical code standardization and translation, clinical text to ICD coding, length of stay estimation, and any clinical ML application requiring interpretability, fairness assessment, and calibrated predictions for healthcare deployment

## Protein Engineering & Design

- **ESM (Evolutionary Scale Modeling)** - State-of-the-art protein language models from EvolutionaryScale for protein design, structure prediction, and representation learning. Includes ESM3 (1.4B-98B parameter multimodal generative models for simultaneous reasoning across sequence, structure, and function with chain-of-thought generation, inverse folding, and function-conditioned design) and ESM C (300M-6B parameter efficient embedding models 3x faster than ESM2 for similarity analysis, classification, and feature extraction). Supports local inference with open weights and cloud-based Forge API for scalable batch processing. Use cases: novel protein design, structure prediction from sequence, sequence design from structure, protein embeddings, function annotation, variant generation, and directed evolution workflows

## Machine Learning & Deep Learning

- **aeon** - Comprehensive scikit-learn compatible Python toolkit for time series machine learning providing state-of-the-art algorithms across 7 domains: classification (13 algorithm categories including ROCKET variants, deep learning with InceptionTime/ResNet/FCN, distance-based with DTW/ERP/LCSS, shapelet-based, dictionary methods like BOSS/WEASEL, and hybrid ensembles HIVECOTE), regression (9 categories mirroring classification approaches), clustering (k-means/k-medoids with temporal distances, deep learning autoencoders, spectral methods), forecasting (ARIMA, ETS, Theta, Threshold Autoregressive, TCN, DeepAR), anomaly detection (STOMP/MERLIN matrix profile, clustering-based CBLOF/KMeans, isolation methods, copula-based COPOD), segmentation (ClaSP, FLUSS, HMM, binary segmentation), and similarity search (MASS algorithm, STOMP motif discovery, approximate nearest neighbors). Includes 40+ distance metrics (elastic: DTW/DDTW/WDTW/Shape-DTW, edit-based: ERP/EDR/LCSS/TWE/MSM, lock-step: Euclidean/Manhattan), extensive transformations (ROCKET/MiniRocket/MultiRocket for features, Catch22/TSFresh for statistics, SAX/PAA for symbolic representation, shapelet transforms, wavelets, matrix profile), 20+ deep learning architectures (FCN, ResNet, InceptionTime, TCN, autoencoders with attention mechanisms), comprehensive benchmarking tools (UCR/UEA archives with 100+ datasets, published results repository, statistical testing), and performance-optimized implementations using numba. Features progressive model complexity from fast baselines (MiniRocket: <1 second training, 0.95+ accuracy on many benchmarks) to state-of-the-art ensembles (HIVECOTE V2), GPU acceleration support, and extensive visualization utilities. Use cases: physiological signal classification (ECG, EEG), industrial sensor monitoring, financial forecasting, change point detection, pattern discovery, activity recognition from wearables, predictive maintenance, climate time series analysis, and any sequential data requiring specialized temporal modeling beyond standard ML
- **PufferLib** - High-performance reinforcement learning library achieving 1M-4M steps/second through optimized vectorization, native multi-agent support, and efficient PPO training (PuffeRL). Use this skill for RL training on any environment (Gymnasium, PettingZoo, Atari, Procgen), creating custom PufferEnv environments, developing policies (CNN, LSTM, multi-input architectures), optimizing parallel simulation performance, or scaling multi-agent systems. Includes Ocean suite (20+ environments), seamless framework integration with automatic space flattening, zero-copy vectorization with shared memory buffers, distributed training support, and comprehensive reference guides for training workflows, environment development, vectorization optimization, policy architectures, and third-party integrations
- **PyMC** - Bayesian statistical modeling and probabilistic programming
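  A minimal PyMC sketch (Bayesian inference for the mean of synthetic data):

  ```python
  import numpy as np
  import pymc as pm

  rng = np.random.default_rng(0)
  data = rng.normal(loc=2.0, scale=1.0, size=100)

  with pm.Model():
      mu = pm.Normal("mu", mu=0, sigma=10)       # prior on the mean
      sigma = pm.HalfNormal("sigma", sigma=5)    # prior on the spread
      pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
      idata = pm.sample(1000, tune=1000)         # NUTS posterior sampling

  print(idata.posterior["mu"].mean())            # should sit near 2.0
  ```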
- **PyMOO** - Multi-objective optimization with evolutionary algorithms
- **PyTorch Lightning** - Deep learning framework that organizes PyTorch code to eliminate boilerplate while maintaining full flexibility. Automates training workflows (40+ tasks including epoch/batch iteration, optimizer steps, gradient management, checkpointing), supports multi-GPU/TPU training with DDP/FSDP/DeepSpeed strategies, includes LightningModule for model organization, Trainer for automation, LightningDataModule for data pipelines, callbacks for extensibility, and integrations with TensorBoard, Wandb, MLflow for experiment tracking
- **scikit-learn** - Industry-standard Python library for classical machine learning providing comprehensive supervised learning (classification: Logistic Regression, SVM, Decision Trees, Random Forests with 17+ variants, Gradient Boosting with XGBoost-compatible HistGradientBoosting, Naive Bayes, KNN, Neural Networks/MLP; regression: Linear, Ridge, Lasso, ElasticNet, SVR, ensemble methods), unsupervised learning (clustering: K-Means, DBSCAN, HDBSCAN, OPTICS, Agglomerative/Hierarchical, Spectral, Gaussian Mixture Models, BIRCH, MeanShift; dimensionality reduction: PCA, Kernel PCA, t-SNE, Isomap, LLE, NMF, TruncatedSVD, FastICA, LDA; outlier detection: IsolationForest, LocalOutlierFactor, OneClassSVM), data preprocessing (scaling: StandardScaler, MinMaxScaler, RobustScaler; encoding: OneHotEncoder, OrdinalEncoder, LabelEncoder; imputation: SimpleImputer, KNNImputer, IterativeImputer; feature engineering: PolynomialFeatures, KBinsDiscretizer, text vectorization with CountVectorizer/TfidfVectorizer), model evaluation (cross-validation: KFold, StratifiedKFold, TimeSeriesSplit, GroupKFold; hyperparameter tuning: GridSearchCV, RandomizedSearchCV, HalvingGridSearchCV; metrics: 30+ evaluation metrics for classification/regression/clustering including accuracy, precision, recall, F1, ROC-AUC, MSE, R², silhouette score), and Pipeline/ColumnTransformer for production-ready workflows. Features consistent API (fit/predict/transform), extensive documentation, integration with NumPy/pandas/SciPy, joblib persistence, and scikit-learn-compatible ecosystem (XGBoost, LightGBM, CatBoost, imbalanced-learn). Optimized implementations using Cython/OpenMP for performance. Use cases: predictive modeling, customer segmentation, anomaly detection, feature engineering, model selection/validation, text classification, image classification (with feature extraction), time series forecasting (with preprocessing), medical diagnosis, fraud detection, recommendation systems, and any tabular data ML task requiring interpretable models or established algorithms
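  A minimal scikit-learn sketch combining Pipeline and GridSearchCV on a bundled dataset:

  ```python
  from sklearn.datasets import load_breast_cancer
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import GridSearchCV, train_test_split
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  X, y = load_breast_cancer(return_X_y=True)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

  # Scaling and model chained so cross-validation never leaks test statistics
  pipe = Pipeline([("scale", StandardScaler()),
                   ("clf", LogisticRegression(max_iter=1000))])
  grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5)
  grid.fit(X_train, y_train)
  print(grid.best_params_, grid.score(X_test, y_test))
  ```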
- **scikit-survival** - Survival analysis and time-to-event modeling with censored data. Built on scikit-learn, provides Cox proportional hazards models (CoxPHSurvivalAnalysis, CoxnetSurvivalAnalysis with elastic net regularization), ensemble methods (Random Survival Forests, Gradient Boosting), Survival Support Vector Machines (linear and kernel), non-parametric estimators (Kaplan-Meier, Nelson-Aalen), competing risks analysis, and specialized evaluation metrics (concordance index, time-dependent AUC, Brier score). Handles right-censored data, integrates with scikit-learn pipelines, and supports feature selection and hyperparameter tuning via cross-validation
- **SHAP** - Model interpretability and explainability using Shapley values from game theory. Provides unified approach to explain any ML model with TreeExplainer (fast exact explanations for XGBoost/LightGBM/Random Forest), DeepExplainer (TensorFlow/PyTorch neural networks), KernelExplainer (model-agnostic), and LinearExplainer. Includes comprehensive visualizations (waterfall plots for individual predictions, beeswarm plots for global importance, scatter plots for feature relationships, bar/force/heatmap plots), supports model debugging, fairness analysis, feature engineering guidance, and production deployment
- **Stable Baselines3** - PyTorch-based reinforcement learning library providing reliable implementations of RL algorithms (PPO, SAC, DQN, TD3, DDPG, A2C, HER, RecurrentPPO). Use this skill for training RL agents on standard or custom Gymnasium environments, implementing callbacks for monitoring and control, using vectorized environments for parallel training, creating custom environments with proper Gymnasium API implementation, and integrating with deep RL workflows. Includes comprehensive training templates, evaluation utilities, algorithm selection guidance (on-policy vs off-policy, continuous vs discrete actions), support for multi-input policies (dict observations), goal-conditioned learning with HER, and integration with TensorBoard for experiment tracking
- **statsmodels** - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics)
- **Torch Geometric** - Graph Neural Networks for molecular and geometric data
- **Transformers** - State-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Provides 1M+ pre-trained models accessible via pipelines (text-classification, NER, QA, summarization, translation, text-generation, image-classification, object-detection, ASR, VQA), comprehensive training via Trainer API with distributed training and mixed precision, flexible text generation with multiple decoding strategies (greedy, beam search, sampling), and Auto classes for automatic architecture selection (BERT, GPT, T5, ViT, BART, etc.)
- **UMAP-learn** - Dimensionality reduction and manifold learning

## Materials Science, Chemistry & Astronomy

- **Astropy** - Comprehensive Python library for astronomy and astrophysics providing core functionality for astronomical research and data analysis. Includes coordinate system transformations (ICRS, Galactic, FK5, AltAz), physical units and quantities with automatic dimensional consistency, FITS file operations (reading, writing, manipulating headers and data), cosmological calculations (luminosity distance, lookback time, Hubble parameter, Planck/WMAP models), precise time handling across multiple time scales (UTC, TAI, TT, TDB) and formats (JD, MJD, ISO), table operations with unit support (FITS, CSV, HDF5, VOTable), WCS transformations between pixel and world coordinates, astronomical constants, modeling framework, visualization tools, and statistical functions. Use for celestial coordinate transformations, unit conversions, FITS image/table processing, cosmological distance calculations, barycentric time corrections, catalog cross-matching, and astronomical data analysis
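  A minimal Astropy sketch (coordinates and units):

  ```python
  from astropy import units as u
  from astropy.coordinates import SkyCoord

  # ICRS -> Galactic coordinate transformation (M31's position)
  c = SkyCoord(ra=10.684 * u.deg, dec=41.269 * u.deg, frame="icrs")
  print(c.galactic)

  # Quantities carry units and enforce dimensional consistency
  d = 780 * u.kpc
  print(d.to(u.lyr))   # kiloparsecs to light-years
  ```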
- **COBRApy** - Constraint-based metabolic modeling and flux balance analysis
- **Pymatgen** - Materials structure analysis, phase diagrams, and electronic structure

## Data Analysis & Visualization

- **Dask** - Parallel computing for larger-than-memory datasets with distributed DataFrames, Arrays, Bags, and Futures
- **Data Commons** - Programmatic access to public statistical data from global sources including census bureaus, health organizations, and environmental agencies. Provides unified Python API for querying demographic data, economic indicators, health statistics, and environmental datasets through a knowledge graph interface. Features three main endpoints: Observation (statistical time-series queries for population, GDP, unemployment rates, disease prevalence), Node (knowledge graph exploration for entity relationships and hierarchies), and Resolve (entity identification from names, coordinates, or Wikidata IDs). Seamless Pandas integration for DataFrames, relation expressions for hierarchical queries, data source filtering for consistency, and support for custom Data Commons instances
- **Matplotlib** - Publication-quality plotting and visualization
- **NetworkX** - Comprehensive toolkit for creating, analyzing, and visualizing complex networks and graphs. Supports four graph types (Graph, DiGraph, MultiGraph, MultiDiGraph) with nodes as any hashable objects and rich edge attributes. Provides 100+ algorithms including shortest paths (Dijkstra, Bellman-Ford, A*), centrality measures (degree, betweenness, closeness, eigenvector, PageRank), clustering (coefficients, triangles, transitivity), community detection (modularity-based, label propagation, Girvan-Newman), connectivity analysis (components, cuts, flows), tree algorithms (MST, spanning trees), matching, graph coloring, isomorphism, and traversal (DFS, BFS). Includes 50+ graph generators for classic (complete, cycle, wheel), random (Erdős-Rényi, Barabási-Albert, Watts-Strogatz, stochastic block model), lattice (grid, hexagonal, hypercube), and specialized networks. Supports I/O across formats (edge lists, GraphML, GML, JSON, Pajek, GEXF, DOT) with Pandas/NumPy/SciPy integration. Visualization capabilities include 8+ layout algorithms (spring/force-directed, circular, spectral, Kamada-Kawai), customizable node/edge appearance, interactive visualizations with Plotly/PyVis, and publication-quality figure generation. Use cases: social network analysis, biological networks (protein-protein interactions, gene regulatory networks, metabolic pathways), transportation systems, citation networks, knowledge graphs, web structure analysis, infrastructure networks, and any domain involving pairwise relationships requiring structural analysis or graph-based modeling
- **Polars** - High-performance DataFrame operations with lazy evaluation
- **Seaborn** - Statistical data visualization with dataset-oriented interface, automatic confidence intervals, publication-quality themes, colorblind-safe palettes, and comprehensive support for exploratory analysis, distribution comparisons, correlation matrices, regression plots, and multi-panel figures
- **SimPy** - Process-based discrete-event simulation framework for modeling systems with processes, queues, and resource contention (manufacturing, service operations, network traffic, logistics). Supports generator-based process definition, multiple resource types (Resource, PriorityResource, PreemptiveResource, Container, Store), event-driven scheduling, process interaction mechanisms (signaling, interruption, parallel/sequential execution), real-time simulation synchronized with wall-clock time, and comprehensive monitoring capabilities for utilization, wait times, and queue statistics
- **SymPy** - Symbolic mathematics in Python for exact computation using mathematical symbols rather than numerical approximations. Provides comprehensive support for symbolic algebra (simplification, expansion, factorization), calculus (derivatives, integrals, limits, series), equation solving (algebraic, differential, systems of equations), matrices and linear algebra (eigenvalues, decompositions, solving linear systems), physics (classical mechanics with Lagrangian/Hamiltonian formulations, quantum mechanics, vector analysis, units), number theory (primes, factorization, modular arithmetic, Diophantine equations), geometry (2D/3D analytic geometry), combinatorics (permutations, combinations, partitions, group theory), logic and sets, statistics (probability distributions, random variables), special functions (gamma, Bessel, orthogonal polynomials), and code generation (lambdify to NumPy/SciPy functions, C/Fortran code generation, LaTeX output for documentation). Emphasizes exact arithmetic using rational numbers and symbolic representations, supports assumptions for improved simplification (positive, real, integer), integrates seamlessly with NumPy/SciPy through lambdify for fast numerical evaluation, and enables symbolic-to-numeric pipelines for scientific computing workflows
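  A minimal SymPy sketch (exact calculus, then handoff to NumPy via lambdify):

  ```python
  import sympy as sp

  x = sp.symbols("x", positive=True)
  expr = sp.sin(x) / x
  print(sp.limit(expr, x, 0))                        # 1, exactly
  print(sp.integrate(sp.exp(-x**2), (x, 0, sp.oo)))  # sqrt(pi)/2

  # lambdify compiles the symbolic expression to a fast NumPy function
  f = sp.lambdify(x, expr, "numpy")
  print(f(0.5))
  ```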
- **Vaex** - High-performance Python library for lazy, out-of-core DataFrames to process and visualize tabular datasets larger than available RAM. Processes over a billion rows per second through memory-mapped files (HDF5, Apache Arrow), lazy evaluation, and virtual columns (zero memory overhead). Provides instant file opening, efficient aggregations across billions of rows, interactive visualizations without sampling, machine learning pipelines with transformers (scalers, encoders, PCA), and seamless integration with pandas/NumPy/Arrow. Includes comprehensive ML framework (vaex.ml) with feature scaling, categorical encoding, dimensionality reduction, and integration with scikit-learn/XGBoost/LightGBM/CatBoost. Supports distributed computing via Dask, asynchronous operations, and state management for production deployment. Use cases: processing gigabyte to terabyte datasets, fast statistical aggregations on massive data, visualizing billion-row datasets, ML pipelines on big data, converting between data formats, and working with astronomical, financial, or scientific large-scale datasets
- **ReportLab** - Programmatic PDF generation for reports and documents

## Phylogenetics & Trees

- **ETE Toolkit** - Phylogenetic tree manipulation, visualization, and analysis

## Genomics Tools

- **deepTools** - NGS data analysis (ChIP-seq, RNA-seq, ATAC-seq) with BAM/bigWig files
- **FlowIO** - Flow Cytometry Standard (FCS) file reading and manipulation
- **scikit-bio** - Bioinformatics sequence analysis and diversity metrics
- **Zarr** - Chunked, compressed N-dimensional array storage
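  A minimal Zarr sketch using the long-standing `zarr.open` array API (the store path is hypothetical):

  ```python
  import numpy as np
  import zarr

  # Create a chunked, compressed on-disk array (path is an assumption)
  z = zarr.open("example.zarr", mode="w", shape=(10000, 10000),
                chunks=(1000, 1000), dtype="f4")
  z[0:1000, 0:1000] = np.random.rand(1000, 1000)

  print(z.shape, z.chunks)
  print(z[0, :5])   # reads back only the chunks it needs
  ```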
## Multi-omics & AI Agent Frameworks

- **BIOMNI** - Autonomous biomedical AI agent framework from Stanford SNAP lab for executing complex research tasks across genomics, drug discovery, molecular biology, and clinical analysis. Combines LLM reasoning with code execution and ~11GB of integrated biomedical databases (Ensembl, NCBI Gene, UniProt, PDB, AlphaFold, ClinVar, OMIM, HPO, PubMed, KEGG, Reactome, GO). Supports multiple LLM providers (Claude, GPT-4, Gemini, Groq, Bedrock). Includes A1 agent class for autonomous task decomposition, BiomniEval1 benchmark framework, and MCP server integration. Use cases: CRISPR screening design, single-cell RNA-seq analysis, ADMET prediction, GWAS interpretation, rare disease diagnosis, protein structure analysis, literature synthesis, and multi-omics integration
- **Denario** - Multiagent AI system for scientific research assistance that automates complete research workflows from data analysis through publication. Built on AG2 and LangGraph frameworks, orchestrates specialized agents for hypothesis generation, methodology development, computational analysis, and LaTeX paper writing. Supports multiple LLM providers (Google Vertex AI, OpenAI) with flexible pipeline stages allowing manual or automated inputs. Key features include: end-to-end research automation (data description → idea generation → methodology → results → paper), journal-specific formatting (APS and others), GUI interface via Streamlit, Docker deployment with LaTeX environment, reproducible research with version-controlled outputs, literature search integration, and integration with scientific Python stack (pandas, sklearn, scipy). Provides both programmatic Python API and web-based interface. Use cases: automated hypothesis generation from datasets, research methodology development, computational experiment execution with visualization, publication-ready manuscript generation, time-series analysis research, machine learning experiment automation, and accelerating the complete scientific research lifecycle from ideation to publication
- **HypoGeniC** - Automated hypothesis generation and testing using large language models to accelerate scientific discovery. Provides three frameworks: HypoGeniC (data-driven hypothesis generation from observational data), HypoRefine (synergistic approach combining literature insights with empirical patterns through an agentic system), and Union methods (mechanistic combination of literature and data-driven hypotheses). Features iterative refinement that improves hypotheses by learning from challenging examples, Redis caching for API cost reduction, and customizable YAML-based prompt templates. Includes command-line tools for generation (hypogenic_generation) and testing (hypogenic_inference). Research applications have demonstrated 14.19% accuracy improvement in AI-content detection and 7.44% in deception detection. Use cases: deception detection in reviews, AI-generated content identification, mental stress detection, exploratory research without existing literature, hypothesis-driven analysis in novel domains, and systematic exploration of competing explanations

## Scientific Communication & Publishing

- **Paper-2-Web** - Autonomous pipeline for transforming academic papers into multiple promotional formats using the Paper2All system. Converts LaTeX or PDF papers into: (1) Paper2Web - interactive, layout-aware academic homepages with responsive design, interactive figures, and mobile support; (2) Paper2Video - professional presentation videos with slides, narration, cursor movements, and optional talking-head generation using Hallo2; (3) Paper2Poster - print-ready conference posters with custom dimensions, professional layouts, and institution branding. Supports GPT-4/GPT-4.1 models, batch processing, QR code generation, multi-language content, and quality assessment metrics. Use cases: conference materials, video abstracts, preprint enhancement, research promotion, poster sessions, and academic website creation

## Document Processing & Conversion

- **MarkItDown** - Python utility for converting 20+ file formats to Markdown optimized for LLM processing. Converts Office documents (PDF, DOCX, PPTX, XLSX), images with OCR, audio with transcription, web content (HTML, YouTube transcripts, EPUB), and structured data (CSV, JSON, XML) while preserving document structure (headings, lists, tables, hyperlinks). Key features include: Azure Document Intelligence integration for enhanced PDF table extraction, LLM-powered image descriptions using GPT-4o, batch processing with ZIP archive support, modular installation for specific formats, streaming approach without temporary files, and plugin system for custom converters. Supports Python 3.10+. Use cases: preparing documents for RAG systems, extracting text from PDFs and Office files, transcribing audio to text, performing OCR on images and scanned documents, converting YouTube videos to searchable text, processing HTML and EPUB books, converting structured data to readable format, document analysis pipelines, and LLM training data preparation
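  A minimal MarkItDown sketch (the input path is hypothetical):

  ```python
  from markitdown import MarkItDown

  md = MarkItDown()
  result = md.convert("report.pdf")   # also accepts DOCX, PPTX, XLSX, HTML, ...
  print(result.text_content[:500])    # Markdown text suitable for LLM ingestion
  ```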
## Laboratory Automation & Equipment Control

- **PyLabRobot** - Hardware-agnostic, pure Python SDK for automated and autonomous laboratories. Provides unified interface for controlling liquid handling robots (Hamilton STAR/STARlet, Opentrons OT-2, Tecan EVO), plate readers (BMG CLARIOstar), heater shakers, incubators, centrifuges, pumps, and scales. Key features include: modular resource management system for plates, tips, and containers with hierarchical deck layouts and JSON serialization; comprehensive liquid handling operations (aspirate, dispense, transfer, serial dilutions, plate replication) with automatic tip and volume tracking; backend abstraction enabling hardware-agnostic protocols that work across different robots; ChatterboxBackend for protocol simulation and testing without hardware; browser-based visualizer for real-time 3D deck state visualization; cross-platform support (Windows, macOS, Linux, Raspberry Pi); and integration capabilities for multi-device workflows combining liquid handlers, analytical equipment, and material handling devices. Use cases: automated sample preparation, high-throughput screening, serial dilution protocols, plate reading workflows, laboratory protocol development and validation, robotic liquid handling automation, and reproducible laboratory automation with state tracking and persistence (a minimal usage sketch appears at the end of this document)

## Tool Discovery & Research Platforms

- **ToolUniverse** - Unified ecosystem providing standardized access to 600+ scientific tools, models, datasets, and APIs across bioinformatics, cheminformatics, genomics, structural biology, and proteomics. Enables AI agents to function as research scientists through: (1) Tool Discovery - natural language, semantic, and keyword-based search for finding relevant scientific tools (Tool_Finder, Tool_Finder_LLM, Tool_Finder_Keyword); (2) Tool Execution - standardized AI-Tool Interaction Protocol for running tools with consistent interfaces; (3) Tool Composition - sequential and parallel workflow chaining for multi-step research pipelines; (4) Model Context Protocol (MCP) integration for Claude Desktop/Code. Supports drug discovery workflows (disease→targets→structures→screening→candidates), genomics analysis (expression→differential analysis→pathways), clinical genomics (variants→annotation→pathogenicity→disease associations), and cross-domain research. Use cases: accessing scientific databases (OpenTargets, PubChem, UniProt, PDB, ChEMBL, KEGG), protein structure prediction (AlphaFold), molecular docking, pathway enrichment, variant annotation, literature searches, and automated scientific workflows
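A minimal PyLabRobot sketch for the simulation-only ChatterboxBackend mentioned in the entry above. This follows the library's documented quickstart pattern, but exact import paths and backend class names have shifted across PyLabRobot versions, so treat it as an assumption-laden sketch rather than exact API:

```python
import asyncio

from pylabrobot.liquid_handling import LiquidHandler
from pylabrobot.liquid_handling.backends import ChatterboxBackend  # name may vary by version
from pylabrobot.resources.hamilton import STARLetDeck

async def main():
    # ChatterboxBackend echoes commands instead of driving real hardware,
    # so protocols can be developed and validated without a robot attached.
    lh = LiquidHandler(backend=ChatterboxBackend(), deck=STARLetDeck())
    await lh.setup()
    # ... assign tip racks and plates to the deck, then aspirate/dispense ...
    await lh.stop()

asyncio.run(main())
```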