claude-scientific-skills/docs/scientific-packages.md
2025-10-21 10:19:15 -07:00

Scientific Packages

Bioinformatics & Genomics

  • AnnData - Annotated data matrices for single-cell genomics and h5ad files
  • Arboreto - Gene regulatory network inference using GRNBoost2 and GENIE3
  • BioPython - Sequence manipulation, NCBI database access, BLAST searches, alignments, and phylogenetics
  • BioServices - Programmatic access to 40+ biological web services (KEGG, UniProt, ChEBI, ChEMBL)
  • Cellxgene Census - Query and analyze large-scale single-cell RNA-seq data
  • gget - Efficient genomic database queries (Ensembl, UniProt, NCBI, PDB, COSMIC)
  • pysam - Read, write, and manipulate genomic data files (SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences) with pileup analysis, coverage calculations, and bioinformatics workflows
  • PyDESeq2 - Differential gene expression analysis for bulk RNA-seq data
  • Scanpy - Single-cell RNA-seq analysis with clustering, marker genes, and UMAP/t-SNE visualization
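
As a quick illustration of the sequence manipulation listed for BioPython above, here is a minimal sketch. It assumes Biopython is installed; the DNA string is an arbitrary example sequence chosen for illustration, not data from this repository.

```python
from Bio.Seq import Seq

# Arbitrary example coding sequence (13 codons), used only for illustration.
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Reverse complement of the strand.
rev = dna.reverse_complement()

# Translate with the standard genetic code, stopping at the first stop codon.
protein = dna.translate(to_stop=True)

# GC fraction computed by hand to avoid depending on a specific Biopython version.
gc_fraction = (dna.count("G") + dna.count("C")) / len(dna)

print(rev)
print(protein)
print(round(gc_fraction, 3))
```

The other packages in this section follow different data models (for example, AnnData objects for Scanpy), so this sketch should not be read as representative of their APIs.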

Cheminformatics & Drug Discovery

  • Datamol - Molecular manipulation and featurization with enhanced RDKit workflows
  • DeepChem - Molecular machine learning, graph neural networks, and MoleculeNet benchmarks
  • DiffDock - Diffusion-based molecular docking for protein-ligand binding prediction
  • MedChem - Medicinal chemistry analysis, ADMET prediction, and drug-likeness assessment
  • Molfeat - 100+ molecular featurizers including fingerprints, descriptors, and pretrained models
  • PyTDC - Therapeutics Data Commons for drug discovery datasets and benchmarks
  • RDKit - Cheminformatics toolkit for molecular I/O, descriptors, fingerprints, and SMARTS
  • TorchDrug - PyTorch-based machine learning platform for drug discovery with 40+ datasets, 20+ GNN models for molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning
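
The RDKit entry above mentions molecular I/O, descriptors, fingerprints, and SMARTS. A minimal sketch of the first, second, and fourth of these might look like the following. It assumes RDKit is installed; aspirin and the carboxylic acid pattern are illustrative choices, not examples taken from this repository.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Parse a molecule from a SMILES string (aspirin, chosen as an example).
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

# Compute a simple descriptor: average molecular weight.
mol_weight = Descriptors.MolWt(mol)

# Substructure search with a SMARTS pattern for a carboxylic acid group.
acid_pattern = Chem.MolFromSmarts("C(=O)[OH]")
has_acid = mol.HasSubstructMatch(acid_pattern)

print(mol.GetNumAtoms())      # heavy-atom count
print(round(mol_weight, 2))
print(has_acid)
```

Datamol and Molfeat build on RDKit molecule objects, so a `mol` created this way is typically the starting point for their workflows as well.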

Proteomics & Mass Spectrometry

  • matchms - Processing and similarity matching of mass spectrometry data with 40+ filters, spectral library matching (Cosine, Modified Cosine, Neutral Losses), metadata harmonization, molecular fingerprint comparison, and support for multiple file formats (MGF, MSP, mzML, JSON)
  • pyOpenMS - Comprehensive mass spectrometry data analysis for proteomics and metabolomics: LC-MS/MS processing, peptide identification, feature detection, quantification, chemical calculations, and integration with search engines such as Comet, Mascot, and MS-GF+

Machine Learning & Deep Learning

  • PyMC - Bayesian statistical modeling and probabilistic programming
  • pymoo - Multi-objective optimization with evolutionary algorithms
  • PyTorch Lightning - Deep learning framework that organizes PyTorch code to reduce boilerplate while keeping full flexibility. Automates training workflows (40+ tasks including epoch/batch iteration, optimizer steps, gradient management, and checkpointing) and supports multi-GPU/TPU training with DDP, FSDP, and DeepSpeed strategies. Provides LightningModule for model organization, Trainer for automation, LightningDataModule for data pipelines, callbacks for extensibility, and integrations with TensorBoard, Weights & Biases (wandb), and MLflow for experiment tracking
  • scikit-learn - Machine learning algorithms, preprocessing, and model selection
  • statsmodels - Statistical modeling and econometrics (OLS, GLM, logit/probit, ARIMA, time series forecasting, hypothesis testing, diagnostics)
  • PyTorch Geometric - Graph Neural Networks for molecular and geometric data
  • Transformers - Hugging Face transformers for NLU, image classification, and generation
  • UMAP-learn - Dimensionality reduction and manifold learning
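
To make the scikit-learn entry above concrete (algorithms, preprocessing, and model selection), here is a minimal sketch that combines all three. It assumes scikit-learn is installed; the Iris dataset, the logistic regression model, and the 5-fold setting are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Small bundled dataset, used here only as a stand-in for real data.
X, y = load_iris(return_X_y=True)

# Preprocessing and the estimator are chained so that scaling is
# re-fit on each training fold during cross-validation.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Model selection step: 5-fold cross-validated accuracy.
scores = cross_val_score(model, X, y, cv=5)

print(scores.mean())
```

Wrapping the scaler and the classifier in one pipeline keeps the held-out fold out of the scaling statistics, which is the usual reason to prefer this over scaling the full dataset up front.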

Materials Science & Chemistry

  • Astropy - Astronomy and astrophysics (coordinates, cosmology, FITS files)
  • COBRApy - Constraint-based metabolic modeling and flux balance analysis
  • Pymatgen - Materials structure analysis, phase diagrams, and electronic structure

Data Analysis & Visualization

  • Dask - Parallel computing for larger-than-memory datasets with distributed DataFrames, Arrays, Bags, and Futures
  • Matplotlib - Publication-quality plotting and visualization
  • Polars - High-performance DataFrame operations with lazy evaluation
  • Seaborn - Statistical data visualization with dataset-oriented interface, automatic confidence intervals, publication-quality themes, colorblind-safe palettes, and comprehensive support for exploratory analysis, distribution comparisons, correlation matrices, regression plots, and multi-panel figures
  • ReportLab - Programmatic PDF generation for reports and documents

Phylogenetics & Trees

  • ETE Toolkit - Phylogenetic tree manipulation, visualization, and analysis

Genomics Tools

  • deepTools - NGS data analysis (ChIP-seq, RNA-seq, ATAC-seq) with BAM/bigWig files
  • FlowIO - Flow Cytometry Standard (FCS) file reading and manipulation
  • scikit-bio - Bioinformatics sequence analysis and diversity metrics
  • Zarr - Chunked, compressed N-dimensional array storage

Multi-omics & Integration

  • BIOMNI - Multi-omics data integration with LLM-powered analysis