# Scientific Packages

## Bioinformatics & Genomics

- **AnnData** - Annotated data matrices for single-cell genomics and h5ad files
- **Arboreto** - Gene regulatory network inference using GRNBoost2 and GENIE3
- **BioPython** - Sequence manipulation, NCBI database access, BLAST searches, alignments, and phylogenetics
- **BioServices** - Programmatic access to 40+ biological web services (KEGG, UniProt, ChEBI, ChEMBL)
- **Cellxgene Census** - Query and analyze large-scale single-cell RNA-seq data
- **gget** - Efficient genomic database queries (Ensembl, UniProt, NCBI, PDB, COSMIC)
- **pysam** - Read, write, and manipulate genomic data files (SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences) with pileup analysis, coverage calculations, and bioinformatics workflows
- **PyDESeq2** - Differential gene expression analysis for bulk RNA-seq data
- **Scanpy** - Single-cell RNA-seq analysis with clustering, marker genes, and UMAP/t-SNE visualization

## Cheminformatics & Drug Discovery

- **Datamol** - Molecular manipulation and featurization with enhanced RDKit workflows
- **DeepChem** - Molecular machine learning, graph neural networks, and MoleculeNet benchmarks
- **DiffDock** - Diffusion-based molecular docking for protein-ligand binding prediction
- **MedChem** - Medicinal chemistry analysis, ADMET prediction, and drug-likeness assessment
- **Molfeat** - 100+ molecular featurizers including fingerprints, descriptors, and pretrained models
- **PyTDC** - Therapeutics Data Commons for drug discovery datasets and benchmarks
- **RDKit** - Cheminformatics toolkit for molecular I/O, descriptors, fingerprints, and SMARTS
- **TorchDrug** - PyTorch-based machine learning platform for drug discovery with 40+ datasets, 20+ GNN models for molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning

## Proteomics & Mass Spectrometry

- **matchms** - Processing and similarity matching of mass spectrometry data with 40+ filters, spectral library matching (Cosine, Modified Cosine, Neutral Losses), metadata harmonization, molecular fingerprint comparison, and support for multiple file formats (MGF, MSP, mzML, JSON)
- **pyOpenMS** - Comprehensive mass spectrometry data analysis for proteomics and metabolomics (LC-MS/MS processing, peptide identification, feature detection, quantification, chemical calculations, and integration with search engines like Comet, Mascot, MSGF+)

## Machine Learning & Deep Learning

- **PyMC** - Bayesian statistical modeling and probabilistic programming
- **PyMOO** - Multi-objective optimization with evolutionary algorithms
- **PyTorch Lightning** - Structured PyTorch training with automatic optimization
- **scikit-learn** - Machine learning algorithms, preprocessing, and model selection
- **statsmodels** - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics)
- **Torch Geometric** - Graph Neural Networks for molecular and geometric data
- **Transformers** - Hugging Face transformers for NLU, image classification, and generation
- **UMAP-learn** - Dimensionality reduction and manifold learning

## Materials Science & Chemistry

- **Astropy** - Astronomy and astrophysics (coordinates, cosmology, FITS files)
- **COBRApy** - Constraint-based metabolic modeling and flux balance analysis
- **Pymatgen** - Materials structure analysis, phase diagrams, and electronic structure

## Data Analysis & Visualization

- **Dask** - Parallel computing for larger-than-memory datasets with distributed DataFrames, Arrays, Bags, and Futures
- **Matplotlib** - Publication-quality plotting and visualization
- **Polars** - High-performance DataFrame operations with lazy evaluation
- **Seaborn** - Statistical data visualization with dataset-oriented interface, automatic confidence intervals, publication-quality themes, colorblind-safe palettes, and comprehensive support for exploratory analysis, distribution comparisons, correlation matrices, regression plots, and multi-panel figures
- **ReportLab** - Programmatic PDF generation for reports and documents

## Phylogenetics & Trees

- **ETE Toolkit** - Phylogenetic tree manipulation, visualization, and analysis

## Genomics Tools

- **deepTools** - NGS data analysis (ChIP-seq, RNA-seq, ATAC-seq) with BAM/bigWig files
- **FlowIO** - Flow Cytometry Standard (FCS) file reading and manipulation
- **scikit-bio** - Bioinformatics sequence analysis and diversity metrics
- **Zarr** - Chunked, compressed N-dimensional array storage

## Multi-omics & Integration

- **BIOMNI** - Multi-omics data integration with LLM-powered analysis