Mirror of https://github.com/K-Dense-AI/claude-scientific-skills.git
Synced 2026-03-27 07:09:27 +08:00

Commit: Apply best practices
@@ -7,7 +7,7 @@ description: "Manipulate AnnData objects for single-cell genomics. Load/save .h5

 ## Overview

-AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. This skill provides comprehensive guidance for working with AnnData objects, including data creation, manipulation, file I/O, concatenation, and best practices for memory-efficient workflows.
+AnnData (Annotated Data) is Python's standard for storing and manipulating annotated data matrices, particularly in single-cell genomics. Work with AnnData objects for data creation, manipulation, file I/O, concatenation, and memory-efficient workflows.

 ## Core Capabilities

@@ -11,7 +11,7 @@ Arboreto is a Python library for inferring gene regulatory networks (GRNs) from

 ## When to Use This Skill

-Apply this skill when:
+This skill should be used when:
 - Inferring regulatory relationships between genes from expression data
 - Analyzing single-cell or bulk RNA-seq data to identify transcription factor targets
 - Building the GRN inference component of a pySCENIC pipeline

@@ -7,7 +7,7 @@ description: "AI agent for autonomous biomedical task execution. CRISPR design,

 ## Overview

-Biomni is a general-purpose biomedical AI agent that autonomously executes research tasks across diverse biomedical subfields. It combines large language model reasoning with retrieval-augmented planning and code-based execution to enhance scientific productivity and hypothesis generation. The system operates with an ~11GB biomedical knowledge base covering molecular, genomic, and clinical domains.
+Biomni is a general-purpose biomedical AI agent that autonomously executes research tasks across diverse biomedical subfields. Use Biomni to combine large language model reasoning with retrieval-augmented planning and code-based execution for scientific productivity and hypothesis generation. The system operates with an ~11GB biomedical knowledge base covering molecular, genomic, and clinical domains.

 ## Quick Start

@@ -1,6 +1,6 @@

 ---
 name: biopython
-description: Work with Biopython for computational molecular biology tasks including sequence manipulation, file I/O, alignment analysis, BLAST searches, database access (NCBI/Entrez), protein structure analysis (PDB), phylogenetic tree operations, motif finding, population genetics, and other bioinformatics workflows. This skill should be used when working with biological sequences (DNA, RNA, protein), parsing biological file formats (FASTA, GenBank, FASTQ, PDB, etc.), accessing biological databases, running sequence analyses, or performing structural bioinformatics tasks.
+description: "Toolkit for computational molecular biology. Manipulate sequences, parse biological file formats (FASTA, GenBank, FASTQ, PDB), access NCBI databases, run BLAST, analyze structures, build phylogenetic trees, for bioinformatics workflows."
 ---

 # Biopython: Computational Molecular Biology in Python

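Not part of the commit — a minimal sketch of the sequence manipulation this description covers, using the classic Biopython tutorial sequence:

```python
from Bio.Seq import Seq

# Basic sequence manipulation: reverse complement and translation
dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
rc = dna.reverse_complement()
protein = dna.translate(to_stop=True)  # stop at the first stop codon

print(protein)  # MAIVMGR
```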
@@ -7,11 +7,11 @@ description: "Access 40+ bio web services (UniProt, KEGG, ChEMBL, PubChem, BLAST

 ## Overview

-BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Use this skill to retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.
+BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.

 ## When to Use This Skill

-Apply this skill when tasks involve:
+This skill should be used when:
 - Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam
 - Analyzing metabolic pathways and gene functions via KEGG or Reactome
 - Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information

@@ -18,7 +18,7 @@ The Census includes:

 ## When to Use This Skill

-This skill should be used when tasks involve:
+This skill should be used when:
 - Querying single-cell expression data by cell type, tissue, or disease
 - Exploring available single-cell datasets and metadata
 - Training machine learning models on single-cell data

@@ -7,7 +7,7 @@ description: "Constraint-based metabolic modeling (COBRA). FBA, FVA, gene knocko

 ## Overview

-COBRApy is a Python library for constraint-based reconstruction and analysis (COBRA) of metabolic models, essential for systems biology research. Use this skill to work with genome-scale metabolic models, perform computational simulations of cellular metabolism, conduct metabolic engineering analyses, and predict phenotypic behaviors.
+COBRApy is a Python library for constraint-based reconstruction and analysis (COBRA) of metabolic models, essential for systems biology research. Work with genome-scale metabolic models, perform computational simulations of cellular metabolism, conduct metabolic engineering analyses, and predict phenotypic behaviors.

 ## Core Capabilities

@@ -14,9 +14,9 @@ Dask is a Python library for parallel and distributed computing that enables thr

 Dask scales from laptops (processing ~100 GiB) to clusters (processing ~100 TiB) while maintaining familiar Python APIs.

-## When to Use Dask
+## When to Use This Skill

-Apply this skill when users need to:
+This skill should be used when:
 - Process datasets that exceed available RAM
 - Scale pandas or NumPy operations to larger datasets
 - Parallelize computations for performance improvements

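Not part of the commit — a minimal sketch of the lazy, chunked computation model described above, assuming `dask` is installed:

```python
import dask.array as da

# A 10,000 x 10,000 array of ones, split into 1,000 x 1,000 chunks;
# nothing is materialized until .compute() is called
x = da.ones((10_000, 10_000), chunks=(1_000, 1_000))
total = (x + 1).sum().compute()  # reduced chunk-by-chunk, never fully in RAM

print(total)  # 200000000.0
```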
@@ -7,7 +7,7 @@ description: "Pythonic RDKit wrapper for cheminformatics. SMILES parsing, standa

 ## Overview

-Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. It simplifies complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
+Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. Simplify complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.

 **Key capabilities**:
 - Molecular format conversion (SMILES, SELFIES, InChI)

@@ -11,7 +11,7 @@ DeepChem is a comprehensive Python library for applying machine learning to chem

 ## When to Use This Skill

-Apply this skill when:
+This skill should be used when:
 - Loading and processing molecular data (SMILES strings, SDF files, protein sequences)
 - Predicting molecular properties (solubility, toxicity, binding affinity, ADMET properties)
 - Training models on chemical/biological datasets

@@ -7,7 +7,7 @@ description: "NGS analysis toolkit. BAM to bigWig conversion, QC (correlation, P

 ## Overview

-deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. This skill provides guidance for using deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.
+deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. Use deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.

 **Core capabilities:**
 - Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph)

@@ -18,7 +18,7 @@ deepTools is a comprehensive suite of Python command-line tools designed for pro

 ## When to Use This Skill

-This skill should be used when users request tasks involving:
+This skill should be used when:

 - **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data"
 - **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis"

@@ -18,9 +18,9 @@ DiffDock is a diffusion-based deep learning tool for molecular docking that pred

 **Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.

-## When to Use DiffDock
+## When to Use This Skill

-Invoke this skill when users request:
+This skill should be used when:

 - "Dock this ligand to a protein" or "predict binding pose"
 - "Run molecular docking" or "perform protein-ligand docking"

@@ -7,7 +7,7 @@ description: "Phylogenetic tree toolkit (ETE). Tree manipulation (Newick/NHX), e

 ## Overview

-Provide comprehensive support for phylogenetic and hierarchical tree analysis using the ETE (Environment for Tree Exploration) toolkit. Enable tree manipulation, evolutionary analysis, visualization, and integration with biological databases for phylogenomic research and clustering analysis.
+ETE (Environment for Tree Exploration) is a toolkit for phylogenetic and hierarchical tree analysis. Manipulate trees, analyze evolutionary events, visualize results, and integrate with biological databases for phylogenomic research and clustering analysis.

 ## Core Capabilities

@@ -7,11 +7,11 @@ description: "Parse FCS (Flow Cytometry Standard) files v2.0-3.1. Extract events

 ## Overview

-FlowIO is a lightweight Python library for reading and writing Flow Cytometry Standard (FCS) files. It excels at parsing FCS metadata, extracting event data, and creating new FCS files with minimal dependencies. The library supports FCS versions 2.0, 3.0, and 3.1, making it ideal for backend services, data pipelines, and basic cytometry file operations.
+FlowIO is a lightweight Python library for reading and writing Flow Cytometry Standard (FCS) files. Parse FCS metadata, extract event data, and create new FCS files with minimal dependencies. The library supports FCS versions 2.0, 3.0, and 3.1, making it ideal for backend services, data pipelines, and basic cytometry file operations.

 ## When to Use This Skill

-Apply this skill when working with:
+This skill should be used when:

 - FCS files requiring parsing or metadata extraction
 - Flow cytometry data needing conversion to NumPy arrays

@@ -7,7 +7,7 @@ description: "Bioinformatics query toolkit. Gene info (Ensembl/UniProt), BLAST,

 ## Overview

-gget is a command-line bioinformatics tool and Python package providing unified access to 20+ genomic databases and analysis methods. Execute queries for gene information, sequence analysis, protein structures, expression data, and disease associations through a consistent interface. All gget modules work both as command-line tools and as Python functions.
+gget is a command-line bioinformatics tool and Python package providing unified access to 20+ genomic databases and analysis methods. Query gene information, sequence analysis, protein structures, expression data, and disease associations through a consistent interface. All gget modules work both as command-line tools and as Python functions.

 **Important**: The databases queried by gget are continuously updated, which sometimes changes their structure. gget modules are tested automatically on a biweekly basis and updated to match new database structures when necessary.

@@ -7,7 +7,7 @@ description: "Mass spectrometry analysis. Process mzML/MGF/MSP, spectral similar

 ## Overview

-Matchms is an open-source Python library for mass spectrometry data processing and analysis. It provides tools for importing spectra from various formats, standardizing metadata, filtering peaks, calculating spectral similarities, and building reproducible analytical workflows. The library democratizes mass spectrometry informatics through accessible, standardized Python tools.
+Matchms is an open-source Python library for mass spectrometry data processing and analysis. Import spectra from various formats, standardize metadata, filter peaks, calculate spectral similarities, and build reproducible analytical workflows.

 ## Core Capabilities

@@ -11,7 +11,7 @@ Matplotlib is Python's foundational visualization library for creating static, a

 ## When to Use This Skill

-Apply this skill when:
+This skill should be used when:
 - Creating any type of plot or chart (line, scatter, bar, histogram, heatmap, contour, etc.)
 - Generating scientific or statistical visualizations
 - Customizing plot appearance (colors, styles, labels, legends)

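Not part of the commit — a minimal sketch of the plotting and customization described above, using the non-interactive Agg backend so it runs in scripts and CI:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; safe without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(x, np.sin(x), label="sin(x)")
ax.plot(x, np.cos(x), label="cos(x)", linestyle="--")
ax.set_xlabel("x")
ax.set_ylabel("amplitude")
ax.legend()
fig.savefig("waves.png", dpi=150)  # hypothetical output filename
```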
@@ -7,9 +7,17 @@ description: "Medicinal chemistry filters. Apply drug-likeness rules (Lipinski,

 ## Overview

-Medchem is a Python library for molecular filtering and prioritization in drug discovery workflows. It provides hundreds of well-established and novel molecular filters, structural alerts, and medicinal chemistry rules to efficiently triage and prioritize compound libraries at scale.
-
-**Key Principle:** Rules and filters are always context-specific. Avoid blindly applying filters—marketed drugs often don't pass standard medchem filters, and prodrugs may intentionally violate rules. Use these tools as guidelines combined with domain expertise.
+Medchem is a Python library for molecular filtering and prioritization in drug discovery workflows. Apply hundreds of well-established and novel molecular filters, structural alerts, and medicinal chemistry rules to efficiently triage and prioritize compound libraries at scale. Rules and filters are context-specific—use as guidelines combined with domain expertise.
+
+## When to Use This Skill
+
+This skill should be used when:
+- Applying drug-likeness rules (Lipinski, Veber, etc.) to compound libraries
+- Filtering molecules by structural alerts or PAINS patterns
+- Prioritizing compounds for lead optimization
+- Assessing compound quality and medicinal chemistry properties
+- Detecting reactive or problematic functional groups
+- Calculating molecular complexity metrics

 ## Installation

@@ -7,18 +7,11 @@ description: "Molecular featurization for ML (100+ featurizers). ECFP, MACCS, de

 ## Overview

-Molfeat is a comprehensive Python library for molecular featurization that unifies pre-trained embeddings and hand-crafted featurizers into a single, fast, and user-friendly package. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations suitable for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications.
-
-**Key Capabilities:**
-- 100+ featurizers including fingerprints, descriptors, and pretrained models
-- Fast parallel processing with simple API
-- Scikit-learn compatible transformers
-- Built-in caching and state persistence
-- Integration with PyTorch, TensorFlow, and graph neural networks
+Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications. Features fast parallel processing, scikit-learn compatible transformers, and built-in caching.

 ## When to Use This Skill

-Apply molfeat when working with:
+This skill should be used when working with:
 - **Molecular machine learning**: Building QSAR/QSPR models, property prediction
 - **Virtual screening**: Ranking compound libraries for biological activity
 - **Similarity searching**: Finding structurally similar molecules

@@ -7,7 +7,7 @@ description: "Fast DataFrame library (Apache Arrow). Select, filter, group_by, j

 ## Overview

-Polars is a lightning-fast DataFrame library for Python (and Rust) built on Apache Arrow. This skill provides guidance for working with Polars, including its expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities. Use this skill when helping users write efficient data processing code, migrate from pandas, or optimize data pipelines.
+Polars is a lightning-fast DataFrame library for Python and Rust built on Apache Arrow. Work with Polars' expression-based API, lazy evaluation framework, and high-performance data manipulation capabilities for efficient data processing, pandas migration, and data pipeline optimization.

 ## Quick Start

@@ -7,15 +7,7 @@ description: "Differential gene expression analysis (Python DESeq2). Identify DE

 ## Overview

-PyDESeq2 is a Python implementation of the DESeq2 method for differential expression analysis (DEA) with bulk RNA-seq data. This skill provides comprehensive support for designing and executing PyDESeq2 workflows, from data loading through result interpretation.
-
-**Key capabilities:**
-- Single-factor and multi-factor experimental designs
-- Statistical testing using Wald tests with multiple testing correction
-- Optional apeGLM log-fold-change shrinkage
-- Data preprocessing and quality control
-- Result export and visualization
-- Integration with pandas, AnnData, and the Python data science ecosystem
+PyDESeq2 is a Python implementation of DESeq2 for differential expression analysis with bulk RNA-seq data. Design and execute complete workflows from data loading through result interpretation, including single-factor and multi-factor designs, Wald tests with multiple testing correction, optional apeGLM shrinkage, and integration with pandas and AnnData.

 ## When to Use This Skill

@@ -7,15 +7,7 @@ description: "Materials science toolkit. Crystal structures (CIF, POSCAR), phase

 ## Overview

-Pymatgen is a comprehensive Python library for materials analysis that powers the Materials Project. This skill provides guidance for using pymatgen's extensive capabilities in computational materials science, including:
-
-- **Structure manipulation**: Creating, reading, writing, and transforming crystal structures and molecules
-- **Materials analysis**: Symmetry, coordination environments, bonding, and structure comparison
-- **Thermodynamics**: Phase diagrams, Pourbaix diagrams, reaction energies, and stability analysis
-- **Electronic structure**: Band structures, density of states, and Fermi surfaces
-- **Surfaces and interfaces**: Slab generation, Wulff shapes, adsorption sites, and interface construction
-- **Materials Project integration**: Programmatic access to hundreds of thousands of computed materials
-- **File I/O**: Support for 100+ file formats from various computational codes
+Pymatgen is a comprehensive Python library for materials analysis that powers the Materials Project. Create, analyze, and manipulate crystal structures and molecules, compute phase diagrams and thermodynamic properties, analyze electronic structure (band structures, DOS), generate surfaces and interfaces, and access Materials Project's database of computed materials. Supports 100+ file formats from various computational codes.

 ## When to Use This Skill

@@ -7,7 +7,7 @@ description: "Bayesian modeling with PyMC. Build hierarchical models, MCMC (NUTS

 ## Overview

-PyMC is a Python library for Bayesian modeling and probabilistic programming. This skill provides comprehensive guidance for building, fitting, validating, and comparing Bayesian models using PyMC's modern API (version 5.x+). It includes workflows for common model types, diagnostic procedures, and best practices for Bayesian inference.
+PyMC is a Python library for Bayesian modeling and probabilistic programming. Build, fit, validate, and compare Bayesian models using PyMC's modern API (version 5.x+), including hierarchical models, MCMC sampling (NUTS), variational inference, and model comparison (LOO, WAIC).

 ## When to Use This Skill

@@ -7,11 +7,11 @@ description: "Multi-objective optimization framework. NSGA-II, NSGA-III, MOEA/D,

 ## Overview

-Pymoo is a comprehensive Python framework for solving optimization problems with emphasis on multi-objective optimization. The library provides state-of-the-art single-objective and multi-objective algorithms, extensive benchmark problems, customizable genetic operators, advanced visualization tools, and multi-criteria decision making methods. Pymoo excels at finding trade-off solutions (Pareto fronts) for problems with conflicting objectives.
+Pymoo is a comprehensive Python framework for optimization with emphasis on multi-objective problems. Solve single and multi-objective optimization using state-of-the-art algorithms (NSGA-II/III, MOEA/D), benchmark problems (ZDT, DTLZ), customizable genetic operators, and multi-criteria decision making methods. Excels at finding trade-off solutions (Pareto fronts) for problems with conflicting objectives.

 ## When to Use This Skill

-Apply this skill when:
+This skill should be used when:
 - Solving optimization problems with one or multiple objectives
 - Finding Pareto-optimal solutions and analyzing trade-offs
 - Implementing evolutionary algorithms (GA, DE, PSO, NSGA-II/III)

@@ -7,9 +7,18 @@ description: "Mass spectrometry toolkit (OpenMS Python). Process mzML/mzXML, pea

 ## Overview

-pyOpenMS is an open-source Python library providing comprehensive tools for mass spectrometry data analysis in proteomics and metabolomics research. It offers Python bindings to the OpenMS C++ library, enabling efficient processing of LC-MS/MS data, peptide identification, feature detection, quantification, and integration with common proteomics tools like Comet, Mascot, MSGF+, Percolator, and MSstats.
-
-Use this skill when working with mass spectrometry data analysis tasks, processing proteomics or metabolomics datasets, or implementing computational workflows for biomolecular identification and quantification.
+pyOpenMS is an open-source Python library for mass spectrometry data analysis in proteomics and metabolomics. Process LC-MS/MS data, perform peptide identification, detect and quantify features, and integrate with common proteomics tools (Comet, Mascot, MSGF+, Percolator, MSstats) using Python bindings to the OpenMS C++ library.
+
+## When to Use This Skill
+
+This skill should be used when:
+- Processing mass spectrometry data (mzML, mzXML files)
+- Performing peak picking and feature detection in LC-MS data
+- Conducting peptide and protein identification workflows
+- Quantifying metabolites or proteins
+- Integrating proteomics or metabolomics tools into Python pipelines
+- Working with OpenMS tools and file formats

 ## Core Capabilities

@@ -7,18 +7,11 @@ description: "Genomic file toolkit. Read/write SAM/BAM/CRAM alignments, VCF/BCF

 ## Overview

-Pysam is a Python module for reading, manipulating, and writing genomic datasets. It provides a Pythonic interface to the htslib C-API, supporting multiple genomic file formats commonly used in bioinformatics and computational biology.
-
-**Key capabilities:**
-- Read/write SAM/BAM/CRAM alignment files (aligned sequencing reads)
-- Read/write VCF/BCF variant call files (genetic variants)
-- Access FASTA reference sequences with random access
-- Read FASTQ files (raw sequencing reads with quality scores)
-- Query tabix-indexed files (BED, GTF, GFF)
-- Perform pileup analysis for coverage calculations
-- Execute samtools and bcftools commands from Python
+Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.

 ## When to Use This Skill

-**When to use this skill:**
+This skill should be used when:
 - Working with sequencing alignment files (BAM/CRAM)
 - Analyzing genetic variants (VCF/BCF)
 - Extracting reference sequences or gene regions

@@ -7,9 +7,18 @@ description: "Therapeutics Data Commons. AI-ready drug discovery datasets (ADME,

 ## Overview

-PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. It offers curated datasets spanning the entire therapeutics pipeline, from target discovery through clinical development, with standardized evaluation metrics and meaningful data splits.
-
-The platform organizes therapeutic tasks into three major categories: single-instance prediction for properties of individual biomedical entities, multi-instance prediction for relationships between multiple entities, and generation for creating new therapeutic molecules.
+PyTDC is an open-science platform providing AI-ready datasets and benchmarks for drug discovery and development. Access curated datasets spanning the entire therapeutics pipeline with standardized evaluation metrics and meaningful data splits, organized into three categories: single-instance prediction (molecular/protein properties), multi-instance prediction (drug-target interactions, DDI), and generation (molecule generation, retrosynthesis).
+
+## When to Use This Skill
+
+This skill should be used when:
+- Working with drug discovery or therapeutic ML datasets
+- Benchmarking machine learning models on standardized pharmaceutical tasks
+- Predicting molecular properties (ADME, toxicity, bioactivity)
+- Predicting drug-target or drug-drug interactions
+- Generating novel molecules with desired properties
+- Accessing curated datasets with proper train/test splits (scaffold, cold-split)
+- Using molecular oracles for property optimization

 ## Installation & Setup

@@ -1,13 +1,23 @@

 ---
 name: pytorch-lightning
-description: Work with PyTorch Lightning for deep learning model training and research. This skill should be used when building, training, or deploying neural networks using PyTorch Lightning, organizing PyTorch code into LightningModules, configuring Trainers for multi-GPU/TPU training, implementing data pipelines with LightningDataModules, or working with callbacks, logging, and distributed training strategies (DDP, FSDP, DeepSpeed).
+description: "Deep learning framework (PyTorch Lightning). Organize PyTorch code into LightningModules, configure Trainers for multi-GPU/TPU, implement data pipelines, callbacks, logging (W&B, TensorBoard), distributed training (DDP, FSDP, DeepSpeed), for scalable neural network training."
 ---

 # PyTorch Lightning

 ## Overview

-PyTorch Lightning is a deep learning framework that organizes PyTorch code to eliminate boilerplate while maintaining full flexibility. It automates training workflows, multi-device orchestration, and best practices from research labs. Use this skill when working with neural network training, scaling models across multiple GPUs/TPUs, or structuring deep learning projects professionally.
+PyTorch Lightning is a deep learning framework that organizes PyTorch code to eliminate boilerplate while maintaining full flexibility. Automate training workflows, multi-device orchestration, and implement best practices for neural network training and scaling across multiple GPUs/TPUs.
+
+## When to Use This Skill
+
+This skill should be used when:
+- Building, training, or deploying neural networks using PyTorch Lightning
+- Organizing PyTorch code into LightningModules
+- Configuring Trainers for multi-GPU/TPU training
+- Implementing data pipelines with LightningDataModules
+- Working with callbacks, logging, and distributed training strategies (DDP, FSDP, DeepSpeed)
+- Structuring deep learning projects professionally

 ## Core Capabilities

@@ -7,11 +7,11 @@ description: "Single-cell RNA-seq analysis. Load .h5ad/10X data, QC, normalizati

 ## Overview

-This skill provides comprehensive support for analyzing single-cell RNA-seq data using scanpy, a scalable Python toolkit built on AnnData. Use this skill for complete single-cell workflows including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis.
+Scanpy is a scalable Python toolkit for analyzing single-cell RNA-seq data, built on AnnData. Apply this skill for complete single-cell workflows including quality control, normalization, dimensionality reduction, clustering, marker gene identification, visualization, and trajectory analysis.

 ## When to Use This Skill

-Activate this skill when:
+This skill should be used when:
 - Analyzing single-cell RNA-seq data (.h5ad, 10X, CSV formats)
 - Performing quality control on scRNA-seq datasets
 - Creating UMAP, t-SNE, or PCA visualizations

@@ -7,9 +7,7 @@ description: "Biological data toolkit. Sequence analysis, alignments, phylogenet

 ## Overview

-scikit-bio is a comprehensive Python library for working with biological data. Provide assistance with bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics. This skill enables efficient work with common biological file formats and computational workflows in genomics, metagenomics, and ecological research.
-
-**Key applications:** Sequence analysis, phylogenetic tree construction, microbiome diversity analysis, ecological statistics, biological data manipulation, and format conversion.
+scikit-bio is a comprehensive Python library for working with biological data. Apply this skill for bioinformatics analyses spanning sequence manipulation, alignment, phylogenetics, microbial ecology, and multivariate statistics.

 ## When to Use This Skill

@@ -7,11 +7,11 @@ description: "ML toolkit. Classification, regression, clustering, PCA, preproces
|
||||
|
||||
## Overview
|
||||
|
||||
This skill provides comprehensive guidance for using scikit-learn, Python's premier machine learning library. Scikit-learn offers simple, efficient tools for predictive data analysis, including classification, regression, clustering, dimensionality reduction, model selection, and preprocessing. This skill should be used when implementing machine learning workflows, building predictive models, analyzing datasets using supervised or unsupervised learning, preprocessing data for ML tasks, evaluating model performance, or optimizing hyperparameters.
|
||||
Scikit-learn is Python's premier machine learning library, offering simple and efficient tools for predictive data analysis. Apply this skill for classification, regression, clustering, dimensionality reduction, model selection, preprocessing, and hyperparameter optimization.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Activate this skill when:
|
||||
This skill should be used when:
|
||||
- Building classification models (spam detection, image recognition, medical diagnosis)
|
||||
- Creating regression models (price prediction, forecasting, trend analysis)
|
||||
- Performing clustering analysis (customer segmentation, pattern discovery)
|
||||
|
||||
@@ -7,7 +7,7 @@ description: "Statistical visualization. Scatter, box, violin, heatmaps, pair pl
|
||||
|
||||
## Overview
|
||||
|
||||
Seaborn is a Python visualization library providing a high-level, dataset-oriented interface for creating publication-quality statistical graphics. Built on matplotlib, seaborn emphasizes declarative syntax that allows focus on data relationships rather than visual implementation details. The library excels at multivariate analysis, automatic statistical estimation, and creating complex multi-panel figures with minimal code.
|
||||
Seaborn is a Python visualization library for creating publication-quality statistical graphics. Use this skill for dataset-oriented plotting, multivariate analysis, automatic statistical estimation, and complex multi-panel figures with minimal code.
|
||||
|
||||
## Design Philosophy
|
||||
|
||||
|
||||
@@ -7,11 +7,11 @@ description: "Statistical modeling toolkit. OLS, GLM, logistic, ARIMA, time seri
|
||||
|
||||
## Overview
|
||||
|
||||
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. This skill provides comprehensive guidance for conducting rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
|
||||
Statsmodels is Python's premier library for statistical modeling, providing tools for estimation, inference, and diagnostics across a wide range of statistical methods. Apply this skill for rigorous statistical analysis, from simple linear regression to complex time series models and econometric analyses.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Activate this skill when:
|
||||
This skill should be used when:
|
||||
- Fitting regression models (OLS, WLS, GLS, quantile regression)
|
||||
- Performing generalized linear modeling (logistic, Poisson, Gamma, etc.)
|
||||
- Analyzing discrete outcomes (binary, multinomial, count, ordinal)
|
||||
|
||||
@@ -7,7 +7,7 @@ description: "Graph Neural Networks (PyG). Node/graph classification, link predi
|
||||
|
||||
## Overview
|
||||
|
||||
PyTorch Geometric is a library built on PyTorch that enables development and training of Graph Neural Networks (GNNs) for applications involving structured data. It provides comprehensive tools for deep learning on graphs and other irregular structures (geometric deep learning), including mini-batch processing, multi-GPU support, and extensive benchmark datasets.
|
||||
PyTorch Geometric is a library built on PyTorch for developing and training Graph Neural Networks (GNNs). Apply this skill for deep learning on graphs and irregular structures, including mini-batch processing, multi-GPU training, and geometric deep learning applications.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
|
||||
@@ -1,26 +1,17 @@
|
||||
---
|
||||
name: torchdrug
|
||||
description: Toolkit for graph-based drug discovery and molecular machine learning using TorchDrug. This skill should be used when working with molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis prediction, or implementing graph neural networks for drug discovery and chemical/biological data. Use when tasks involve SMILES strings, molecular graphs, protein structures, drug datasets, binding prediction, or any PyTorch-based graph ML for life sciences.
|
||||
description: "Graph-based drug discovery toolkit. Molecular property prediction (ADMET), protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis, GNNs (GIN, GAT, SchNet), 40+ datasets, for PyTorch-based ML on molecules, proteins, and biomedical graphs."
|
||||
---
|
||||
|
||||
# TorchDrug
|
||||
|
||||
## Overview
|
||||
|
||||
TorchDrug is a comprehensive PyTorch-based machine learning toolbox for drug discovery and molecular science. It provides graph neural networks, pre-trained models, datasets, and task definitions for working with molecules, proteins, and biological knowledge graphs.
|
||||
|
||||
**Core Capabilities:**
|
||||
- Molecular property prediction (drug-likeness, toxicity, quantum properties)
|
||||
- Protein modeling (function, structure, interactions)
|
||||
- Knowledge graph reasoning (drug-disease associations, biomedical knowledge)
|
||||
- Molecular generation (de novo drug design)
|
||||
- Retrosynthesis planning (synthetic route prediction)
|
||||
- 40+ curated datasets for chemistry and biology
|
||||
- 20+ state-of-the-art model architectures
|
||||
TorchDrug is a comprehensive PyTorch-based machine learning toolbox for drug discovery and molecular science. Apply graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs, including molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis planning, with 40+ curated datasets and 20+ model architectures.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when working with:
|
||||
This skill should be used when working with:
|
||||
|
||||
**Data Types:**
|
||||
- SMILES strings or molecular structures
|
||||
|
||||
@@ -7,9 +7,7 @@ description: Work with state-of-the-art machine learning models for NLP, compute
|
||||
|
||||
## Overview
|
||||
|
||||
The Transformers library provides state-of-the-art machine learning models for natural language processing (NLP), computer vision, audio processing, and multimodal tasks. It offers over 1 million pre-trained model checkpoints and supports quick inference through pipelines, comprehensive training via the Trainer API, and flexible text generation with various decoding strategies.
|
||||
|
||||
This skill provides comprehensive guidance on working with Transformers across all major task types and modalities.
|
||||
The Transformers library provides state-of-the-art machine learning models for NLP, computer vision, audio, and multimodal tasks. Apply this skill for quick inference through pipelines, comprehensive training via the Trainer API, and flexible text generation with various decoding strategies.
|
||||
|
||||
## Core Capabilities
|
||||
|
||||
|
||||
@@ -7,15 +7,7 @@ description: "UMAP dimensionality reduction. Fast nonlinear manifold learning fo
|
||||
|
||||
## Overview
|
||||
|
||||
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique designed for both visualization and general non-linear dimensionality reduction. It is faster than t-SNE while producing comparable or superior results, and uniquely scales well to higher embedding dimensions (beyond 2D/3D). UMAP preserves both local and global structure in data and supports supervised learning, making it versatile for visualization, clustering preprocessing, and feature engineering.
|
||||
|
||||
**Key capabilities:**
|
||||
- Fast, scalable dimensionality reduction for visualization
|
||||
- Supervised and semi-supervised learning with label information
|
||||
- Effective preprocessing for density-based clustering (HDBSCAN)
|
||||
- Transform new data using trained models
|
||||
- Parametric embeddings via neural networks
|
||||
- Inverse transforms for data reconstruction
|
||||
UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique for visualization and general non-linear dimensionality reduction. Apply this skill for fast, scalable embeddings that preserve local and global structure, supervised learning, and clustering preprocessing.
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
||||
@@ -7,15 +7,7 @@ description: "Chunked N-D arrays for cloud storage. Compressed arrays, parallel
|
||||
|
||||
## Overview
|
||||
|
||||
Zarr is a Python library for storage of large N-dimensional arrays that are chunked and compressed. It provides a NumPy-like API but divides data into manageable chunks stored separately, enabling efficient parallel I/O, cloud-native workflows, and seamless integration with the scientific Python ecosystem (NumPy, Dask, Xarray).
|
||||
|
||||
**Key capabilities:**
|
||||
- Create and manipulate N-dimensional arrays with NumPy-like semantics
|
||||
- Configure chunking strategies for optimal parallel access and performance
|
||||
- Apply compression algorithms (Blosc, Zstandard, Gzip, etc.) to reduce storage
|
||||
- Use flexible storage backends: local filesystem, memory, ZIP files, or cloud storage (S3, GCS)
|
||||
- Organize data hierarchically using groups (similar to HDF5)
|
||||
- Integrate seamlessly with Dask for parallel computing and Xarray for labeled arrays
|
||||
Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.
|
||||
|
||||
## Quick Start
|
||||
|
||||
|
||||